Unit I - BE v-ACA - 2024 - Lecture Notes

The document outlines the syllabus and lecture notes for an Advanced Computer Architecture course at Osmania University, focusing on parallelism, hardware technologies, memory systems, and software for parallel programming. It covers various topics including parallel computer models, performance metrics, and classifications of parallel architectures based on instruction and data streams. Additionally, it discusses high-performance computing and the advantages of parallel processors in enhancing computational speed and efficiency.

University College of Engg,

Osmania University
BE - Computer Science & Engineering

Subject: Advanced Computer Architecture

Prof. Krishnaiah Dayyala


( Krish)

Unit I- Lecture Notes

1
Advanced Computer Architecture - Syllabus University College of Engg,
Osmania University

UNIT -1 Theory of Parallelism


Theory of Parallelism: Parallel Computer Models, The State of Computing, Multiprocessors and Multicomputers, Multivector and SIMD
Computers, PRAM and VLSI Models, Program and Network Properties, Conditions of Parallelism, Program Partitioning and Scheduling, Program
Flow Mechanisms, System Interconnect Architectures, Principles of Scalable Performance, Performance Metrics and Measures, Parallel
Processing Applications, Speedup Performance Laws, Scalability Analysis and Approaches.
UNIT-2 Hardware Technologies
Hardware Technologies: Processors and Memory Hierarchy, Advanced Processor Technology, Superscalar and Vector Processors, Memory
Hierarchy Technology, Virtual Memory Technology.
UNIT-3 Bus, Cache, and Shared Memory
Bus, Cache, and Shared Memory: Bus Systems, Cache Memory Organizations, Shared Memory Organizations, Sequential and Weak Consistency
Models, Pipelining and Superscalar Techniques, Linear Pipeline Processors, Nonlinear Pipeline Processors, Instruction Pipeline Design,
Arithmetic Pipeline Design.
UNIT-4 Parallel and Scalable Architectures
Parallel and Scalable Architectures: Multiprocessors and Multicomputers, Multiprocessor System Interconnects, Cache Coherence and
Synchronization Mechanisms, Three Generations of Multicomputers, Message-Passing Mechanisms, Multivector and SIMD Computers, Vector
Processing Principles, Multivector Multiprocessors, Compound Vector Processing, SIMD Computer Organizations, Scalable, Multithreaded, and
Dataflow Architectures, Latency-Hiding Techniques, Principles of Multithreading, Fine-Grain Multicomputers, Scalable and Multithreaded
Architectures, Dataflow and Hybrid Architectures.
UNIT-5 Software for Parallel Programming
Software for Parallel Programming: Parallel Models, Languages, and Compilers, Parallel Programming Models, Parallel Languages and
Compilers, Dependence Analysis of Data Arrays, Parallel Program Development and Environments, Synchronization and Multiprocessing
Modes. Instruction and System Level Parallelism, Instruction Level Parallelism, Computer Architecture, Contents, Basic Design Issues, Problem
Definition, Model of a Typical Processor, Compiler-detected Instruction Level Parallelism, Operand Forwarding, Reorder Buffer, Register
Renaming, Tomasulo's Algorithm, Branch Prediction, Limitations in Exploiting Instruction Level Parallelism, Thread Level Parallelism.
2
UNIT -1 Theory of Parallelism University College of Engg,
Osmania University

▪ Theory of Parallelism
▪ Parallel Computer Models
▪ The State of Computing
▪ Multiprocessors and Multicomputers
▪ Multi-vector and SIMD Computers
▪ PRAM and VLSI Models
▪ Program and Network Properties
▪ Conditions of Parallelism
▪ Program Partitioning and Scheduling
▪ Program Flow Mechanisms
▪ System Interconnect Architectures
▪ Principles of Scalable Performance
▪ Performance Metrics and Measures
▪ Parallel Processing Applications
▪ Speedup Performance Laws
▪ Scalability Analysis and Approaches

3
University College of Engg,
Osmania University

UNIT-I

4
University College of Engg,
Osmania University

Parallel Processing
overview

5
Computer - Basics University College of Engg,
Osmania University

Application Software

Operating System

Firmware/BIOS

Hardware

6
High Level Arch University College of Engg,
Osmania University

7
High Performance Computing University College of Engg,
Osmania University

High-performance computing (HPC) is the practice of using parallel data processing to improve computing
performance and perform complex calculations. HPC achieves these goals by aggregating computing power,
so even advanced applications can run efficiently, reliably and quickly as per user needs and expectations. It
thus delivers much higher power and better performance than traditional computers, workstations and servers

HPC building blocks: COMPUTE + NETWORK + STORAGE

Forms of high-performance computing: supercomputing, parallel computing, cloud computing, distributed computing, quantum computing.

8
Parallel Processing University College of Engg,
Osmania University

❑ Parallel processing is a term used to denote a large class of techniques that are used to provide simultaneous data
processing tasks for the purpose of increasing the computational speed of a computer system.

❑ Instead of processing each instruction sequentially as in a conventional computer, a parallel processing system is able to
perform concurrent data processing to achieve a faster execution time.

❑ For example, while an instruction is being executed in the ALU, the next instruction can be read from memory. The
system may have two or more ALUs and be able to execute two or more instructions at the same time. Furthermore,
the system may have two or more processors operating concurrently.

Parallel Processing is defined as “Execution of Concurrent Events in

the computing process to achieve faster Computational Speed”

9
Why Parallel Processors? University College of Engg,
Osmania University

Parallel processors :

It is defined as computer systems consisting of multiple processing units connected via some interconnection network plus the software
needed to make the processing units work together.

There are two major factors used to categorize such systems:

1. The processing units themselves,


2. The interconnection network that ties them together.

➢ The processing units can communicate and interact with each other using either shared memory or message passing
methods. The interconnection network for shared memory systems can be classified as bus-based versus switch-based.

➢ In message passing systems, the interconnection network is divided into static and dynamic. Static connections have a fixed
topology that does not change while programs are running. Dynamic connections create links on the fly as the program
executes.

➢ The main Objective: is to create powerful computers by simply connecting multiple processors. A multiprocessor is expected
to reach faster speed than the fastest single-processor system. In addition, a multiprocessor consisting of a number of single
processors is expected to be more cost-effective than building a high-performance single processor.

➢ Another advantage of a multiprocessor is fault tolerance.

➢ If a processor fails, the remaining processors should be able to provide continued service, albeit with degraded performance.
10
Why Parallel Processors? University College of Engg,
Osmania University

The key factors that influence the performance of computer/processor

11
University College of Engg,
Osmania University

Parallel Computer
Models

12
University College of Engg,
Classification of Parallel Computers Osmania University

The most universally accepted method of classifying computer systems is Flynn's taxonomy.
Flynn’s classification scheme is based on the notion of a stream of information. Two types of information flow into a
processor: Instructions and Data.

➢ The instruction stream is defined as the sequence of instructions performed by the processing unit.
➢ The data stream is defined as the data traffic exchanged between the memory and the processing unit.
➢ According to Flynn’s classification, either of the instruction or data streams can be single or multiple.

As per Flynn's methodology, computer architecture can be classified into the following 4 distinct categories:

1. Single- Instruction Single- Data streams (SISD)


2. Single- Instruction Multiple- Data streams (SIMD)
3. Multiple- Instruction Single- Data streams (MISD)
4. Multiple- Instruction Multiple- Data streams (MIMD)

13
Parallel Processing- Classification of Computers University College of Engg,
Osmania University

Architectural Classification

– Flynn's classification
» Based on the multiplicity of Instruction Streams and
Data Streams
» Instruction Stream
• Sequence of Instructions read from memory
» Data Stream
• Operations performed on the data in the processor

                                 Number of Data Streams
                                 Single          Multiple
Number of          Single       SISD            SIMD
Instruction
Streams            Multiple     MISD            MIMD

14
Parallel Processing- Classification University College of Engg,
Osmania University
SISD:

❑ SISD represents the organization of a single computer containing a control unit, a processor unit, and a memory unit.
❑ Instructions are executed sequentially and the system may or may not have internal parallel processing capabilities.
❑ Parallel processing in this case may be achieved by means of multiple functional units or by pipeline processing.

SIMD :

❑ SIMD represents an organization that includes many processing units under the supervision of a common control unit.
❑ All processors receive the same instruction from the control unit but operate on different items of data.
❑ The shared memory unit must contain multiple modules so that it can communicate with all the processors
simultaneously.

MISD:

MISD structure is only of theoretical interest since no practical system has been constructed using this organization.

MIMD :

❑ MIMD organization refers to a computer system capable of processing several programs at the same time.
❑ Most multiprocessor and multi computer systems can be classified in this category.
15
University College of Engg,
Osmania University

SISD: Single Instruction Single Data stream

Figure: SISD organization – a single processing element (PE) exchanges instructions and data with main memory (M);
Control Unit --(IS)--> PE --(DS)--> Memory, with the instruction stream (IS) also driving the control unit.

16
University College of Engg,
SIMD: Single Instruction Multiple Data Streams Osmania University

Fine-grained
• Image processing application
• Large number of PEs
• Minimum complexity PEs
• Programming language is a simple extension of a sequential
language

Coarse-grained
• Each PE is of higher complexity and it is usually built with
commercial devices
• Each PE has local memory

Applications:
• Image processing
• Matrix manipulations
• Sorting

17
MIMD
University College of Engg,
Osmania University

• Multiple-instruction multiple-data streams (MIMD) parallel


architectures are made of multiple processors and multiple memory
modules connected together via some interconnection network.

• They fall into two broad categories:


• shared memory
• message passing

• Processors exchange information through their central shared


memory in shared memory systems, and exchange information
through their interconnection network in message passing systems.

• A shared memory system typically accomplishes Interprocessor


coordination through a global memory shared by all processors.
These are typically server systems that communicate through a bus
and cache memory controller.

18
MISD
University College of Engg,
Osmania University

Applications:
• Classification
• Robot vision

• In the MISD category, the same stream of data


flows through a linear array of processors
executing different instruction streams.
• In practice, there is no viable MISD machine
• Not very practical and it is only theoretical

19
Flynn taxonomy
University College of Engg,
Osmania University

– Advantages of Flynn
» Universally accepted
» Compact Notation
» Easy to classify a system (?)
– Disadvantages of Flynn
» Very coarse-grain differentiation among machine systems
» Comparison of different systems is limited
» Interconnections, I/O, memory not considered in the scheme

20
Further classification University College of Engg,
Osmania University

Shared Memory Multiprocessors

21
Further classification University College of Engg,
Osmania University

1. Classification based on the memory arrangement

2. Classification based on communication

3. Classification based on the kind of parallelism


• Data-parallel
• Function-parallel

22
University College of Engg,
Osmania University
1. Classification based on memory arrangement

Figure: two memory arrangements.
– Shared memory – multiprocessors: processors P1 … Pn access a common shared memory (and I/O1 … I/On) through an interconnection network.
– Message passing – multi-computers: each processing element PE1 … PEn has its own local memory M1 … Mn and communicates with the others over an interconnection network.

23
University College of Engg,
Shared-memory multiprocessors Osmania University

1. Uniform Memory Access (UMA)


2. Non-Uniform Memory Access (NUMA)
3. Cache-only Memory Architecture (COMA)

Memory is common to all the processors.


Processors easily communicate by means of shared variables.

24
The UMA Model University College of Engg,
Osmania University

• Physical memory is uniformly shared by all processors


• All processors (PE1….PEn) take equal access time to memory – Thus its termed as Uniform Memory Access Computers
• Each PE can have its own private Cache
• High degree of resource sharing(memory and I/O ) – Tightly Coupled
• Interconnection Network can be – Common bus, cross bar switch or Multistage n/w ( discussed later)
• When all PE’s have equal access to all peripheral devices – Symmetric Multiprocessor
• In Asymmetric multiprocessor only one subset of processors have peripheral access. Master Processors control Slave
(attached) processors.

25
The UMA Model University College of Engg,
Osmania University

Applications of UMA Model

• Suitable for general purpose and time sharing application by multiple


users
• Can be used to speed up execution of a single program in time critical
application

Limitations of the Model

• Interacting processes cause simultaneous access to the same locations – this causes a
problem when an update is followed by a read operation (the old value may be
read).
• Poor scalability – as the number of processors increases, the shared-memory traffic
increases, and the network becomes a bottleneck.
• The number of processors is usually in the range 10–100.

26
THE NUMA MODEL University College of Engg,
Osmania University
• Access time varies with location of memory
• Shared memory is distributed to all processors – Local memories
• Collection of all local memories forms global memory space accessible by all
processors.
• It is faster to access content within the local memory of a processor than to
access remote memory attached to another processor (extra delay through the
interconnection network).
• Thus named Non-Uniform Memory Access – because the access time
depends on whether the data is available in the local memory of the processor
itself or not.

Advantage: reduces the network bottleneck issue that occurs in UMA, as
processors have a direct access path to their attached local memory.

3 types of Memory access pattern

I. Fastest to access – local memory of PE itself


II. Next fastest – global memory shared by PE/Cluster
III. Slowest to access – remote memory(local memory of PE from another Cluster)

27
Cache-only Memory Architecture (COMA) University College of Engg,
Osmania University

• Multiprocessor + Cache Memory = COMA Model


• Multiprocessor using cache only memory
• Ex: Data diffusion Machine , KSR-1 machine
• Special case of NUMA machine in which distributed main
memories are converted to caches – all caches together form
a global address space
• Remote cache access is assisted by Distributed Cache
directories (D )

• Application of COMA

• - General purpose multiuser applications

28
University College of Engg,
Osmania University
Symmetric and asymmetric multiprocessors

Symmetric:
- all processors have equal access to all peripheral devices.
- all processors are identical.
Asymmetric:
- one processor (master) executes the operating system
- other processors may be of different types and may be dedicated to special tasks.

29
Further classification University College of Engg,
Osmania University

System Interconnect Architectures

30
University College of Engg,
Osmania University

How do we connect multiple processors?

Physical media and scope: LAN, WAN, copper, fiber, wireless.

Design considerations:
• Bandwidth
• Latency
• Data routing functionality (permutations, broadcast, multicast)
• Scalability
• Reliability
University College of Engg,
Osmania University
2. Classification based on type of interconnections

Static networks

Dynamic networks

32
Further classification University College of Engg,
Osmania University

Speedup Performance Laws

33
Unit I- Performance Metrics and Measures University College of Engg,
Osmania University

Performance of a machine is determined by:

• CPI (cycles per instruction)
• Clock cycle time
• Instruction count

34
Unit I- Performance Metrics and Measures University College of Engg,
Osmania University

Latency: How long does it take to get a particular task


done?
Throughput: How many tasks can you perform in a
unit of time?
Performance

Performance ∝ 1 / Execution time

Execution time (clock time) includes:
User time
System time
Other time

35
Program Execution by a CPU- Flow Chart University College of Engg,
Osmania University

CPU Execution Time = number of Instructions * CPI * clock Cycle Time


36
Unit I- Performance Metrics and Measures University College of Engg,
Osmania University

Example:
A program is written and the details are as follows:

5 Memory instructions
10 Addition instructions
5 Square root instructions

• CPU Clock Speed is : 5 MHz


• MI- 2 cycles
• Addition: 5 cycles
• Square root: 50 Cycles
Calculate the CPU time for executing this program.
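A worked solution, using total cycles divided by the clock rate:

Total cycles = 5 × 2 + 10 × 5 + 5 × 50 = 10 + 50 + 250 = 310 cycles
Clock cycle time = 1 / 5 MHz = 0.2 µs
CPU time = 310 × 0.2 µs = 62 µs   (average CPI = 310 / 20 = 15.5)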
37
Unit I- Performance Metrics and Measures University College of Engg,
Osmania University

Clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)

CPI Cycles per instruction – smaller is better

IPC Instruction per cycle – bigger is better

CPU time = (Instruction count * CPI) / Clock rate

Weighted CPI:
CPU time = ( Σ i=1..n (CPIi * Ii) ) / clock rate
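A minimal C sketch of the weighted-CPI calculation, using the instruction mix from the example on the previous slide (the variable names are illustrative):

#include <stdio.h>

int main(void) {
    /* Instruction classes from the earlier example: memory, addition, square root */
    double counts[] = {5, 10, 5};      /* Ii: instruction count per class */
    double cpi[]    = {2, 5, 50};      /* CPIi: cycles per instruction per class */
    double clock_rate = 5e6;           /* 5 MHz */

    double cycles = 0, instructions = 0;
    for (int i = 0; i < 3; i++) {
        cycles += counts[i] * cpi[i];  /* accumulate CPIi * Ii */
        instructions += counts[i];
    }

    printf("Weighted CPI = %.2f\n", cycles / instructions);
    printf("CPU time     = %.2e s\n", cycles / clock_rate);
    return 0;
}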

38
University College of Engg,
Performance Laws Osmania University

1. Amdahl's Law for fixed workload
2. Gustafson's Law for scaled problems
3. Sun & Ni's Law (memory-bounded speedup model)
UNIT -1 Theory of Parallelism University College of Engg,
Osmania University

▪ Theory of Parallelism
▪ Parallel Computer Models
▪ The State of Computing
▪ Multiprocessors and Multicomputers
▪ Multi-vector and SIMD Computers
▪ PRAM and VLSI Models
▪ Program and Network Properties
▪ Conditions of Parallelism
▪ Program Partitioning and Scheduling
▪ Program Flow Mechanisms
▪ System Interconnect Architectures
▪ Principles of Scalable Performance
▪ Performance Metrics and Measures
▪ Parallel Processing Applications
▪ Speedup Performance Laws
▪ Scalability Analysis and Approaches

40
Further classification University College of Engg,
Osmania University

Amdahl’s Law

41
Amdahl's law- background University College of Engg,
Osmania University

In the mid-1900s, digital computers were the bleeding edge of technology. It was around this
time, in 1952, that Gene Amdahl graduated from the University of Wisconsin and joined IBM.
Amdahl had already made a splash at UW by helping design and build the WISC, the Wisconsin
Integrally Synchronized Computer, which was the first digital computer to be built in the state.

Amdahl's law, or argument, was first presented in 1967 at the AFIPS Joint Computer
Conference. During the presentation, Amdahl showed how the performance improvement from
parallel processing for a sequence of operations was limited.
Even if certain operations could be sped-up by being performed in parallel, other operations that
could not, such as reading or writing data, would limit how fast the system could be improved.
Today, the law has been used to demonstrate that the components that make the larger
proportion of the system are the best to focus on for improvements and parallelization.

42
Parallel Processing Performance Laws University College of Engg,
Osmania University
Amdahl's law was first presented in 1967 by Gene Amdahl. The law says: “Components
of a system that cannot be improved will limit the overall possible improvement of the
system, such as from parallel processing.”

In computer architecture, Amdahl's law (or Amdahl's argument[1]) is a formula which


gives the theoretical speedup in latency of the execution of a task at fixed workload that
can be expected of a system whose resources are improved.

The performance improvement to be gained from using some faster mode of


execution is limited by the fraction of the time the faster mode can be used

43
University College of Engg,
Osmania University
The theoretical speedup of the latency (via a reduction of latency, ie: latency as a metric is elapsed time between an input and
output in a system) of the execution of a program as a function of the number of processors executing it, according to
Amdahl's law. The speedup is limited by the serial part of the program. For example, if 95% of the program can be
parallelized, the theoretical maximum speedup using parallel computing would be 20 times.
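The 20x figure follows directly from the fixed-load speedup formula with serial fraction α = 0.05:

S(N) = 1 / (α + (1 − α)/N)
As N → ∞, S → 1/α = 1/0.05 = 20   (e.g., with N = 1024, S ≈ 1/(0.05 + 0.95/1024) ≈ 19.6)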

44
Example University College of Engg,
Osmania University

20 hours

A B
must walk 200 miles

Walk 4 miles /hour


Bike 10 miles / hour
Car-1 50 miles / hour
Car-2 120 miles / hour
Car-3 600 miles /hour

45
Example University College of Engg,
Osmania University

20 hours

A B
must walk 200 miles

Walk 4 miles /hour → 50 + 20 = 70 hours S=1


Bike 10 miles / hour → 20 + 20 = 40 hours S = 1.8
Car-1 50 miles / hour → 4 + 20 = 24 hours S = 2.9
Car-2 120 miles / hour → 1.67 + 20 = 21.67 hours S = 3.2
Car-3 600 miles /hour → 0.33 + 20 = 20.33 hours S = 3.4

46
Speedup - Derivation University College of Engg,
Osmania University

• Time (one CPU): T(1)


• Time (n CPUs): T(n)
• Speedup: S
• S = T(1)/T(n)
• α : the fraction of the program that is naturally serial
• (1 - α) : the fraction of the program that is naturally parallel

S = T(1)/T(N)
T(N) = α·T(1) + (1 - α)·T(1)/N
S = 1 / (α + (1 - α)/N) = N / (αN + (1 - α))
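A small C sketch (illustrative, not from the slides) that evaluates the derived formula for a 5% serial fraction:

#include <stdio.h>

/* Amdahl's law: S(N) = 1 / (alpha + (1 - alpha)/N), alpha = serial fraction */
static double amdahl_speedup(double alpha, int n) {
    return 1.0 / (alpha + (1.0 - alpha) / n);
}

int main(void) {
    double alpha = 0.05;                 /* 5% of the program is serial */
    int procs[] = {1, 2, 8, 64, 1024};
    for (int i = 0; i < 5; i++)
        printf("N = %4d  S = %5.2f\n", procs[i], amdahl_speedup(alpha, procs[i]));
    return 0;
}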

47
University College of Engg,
Osmania University

48
University College of Engg,
Osmania University

49
University College of Engg,
Osmania University

50
University College of Engg,
Osmania University
Problem:
In an enhancement of a design of a CPU, the speed of a floating point unit has been increased by 20%
and the fixed point unit has been increased by 10%. What is overall speedup achieved if the ratio of
the no. of floating point operations to no. of fixed point operation is 2:3 and floating point operation
used to take twice the time taken by the fixed point operation in original design.

Solution:
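A worked solution, reading a 20% speed increase as dividing the original execution time by 1.2 and a 10% increase as dividing it by 1.1:

Let a fixed-point operation take 1 time unit in the original design, so a floating-point operation takes 2 units.
For every 2 floating-point and 3 fixed-point operations:
  Original time  = 2×2 + 3×1 = 7
  Enhanced time  = 2×(2/1.2) + 3×(1/1.1) ≈ 3.33 + 2.73 = 6.06
  Overall speedup = 7 / 6.06 ≈ 1.155 (about 15.5%)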

51
Problem: University College of Engg,
Osmania University
University College of Engg,
Osmania University
Problem:

53
Further classification University College of Engg,
Osmania University

Gustafson’s Law

54
University College of Engg,
Osmania University

Gustafson’s Law - Definition

Gustafson's Law says that if you apply N processors to a

task that has serial fraction s, scaling the task to take the
same amount of time as before, the speedup is
Speedup = s + N(1 − s) = N − s(N − 1).

55
University College of Engg,
Osmania University

• The execution time of a program running on a parallel system can be split into two parts:

• a part that does not benefit from the increasing number of processors (serial part);

• a part that benefits from the increasing number of processors (parallel part).

Example. — A computer program that processes files from disk. A part of that program may scan the
directory of the disk and create a list of files internally in memory. After that, another part of the program
passes each file to a separate thread for processing. The part that scans the directory and creates the file
list cannot be sped up on a parallel computer, but the part that processes the files can.

• Without loss of generality, let the total execution time on the parallel system be
• T =1
• Denote the serial time as s and the parallel time as p
• where
• s+p=1
• Denote the number of processors as N

56
University College of Engg,
Gustafson’s Law Osmania University

Let:
• s – time spent in the serial portion
• p – time spent in the parallel portion
• N – number of parallel processors

T1 = s + Np   (time the scaled workload would take with no parallelism)
T2 = s + p    (time on the parallel system)

Speedup = T1 / T2 = (s + Np) / (s + p)

Taking T2 = s + p = 1, so that p = 1 − s:

Speedup = s + Np
        = s + N(1 − s)
        = s + N − Ns
S = N − s(N − 1)
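A matching C sketch of the scaled-speedup formula (illustrative); with s = 0.05 and N = 8 it reproduces the value 7.65 worked out in Problem 1 on a later slide:

#include <stdio.h>

/* Gustafson's law: S(N) = s + N * (1 - s) = N - s * (N - 1), s = serial fraction */
static double gustafson_speedup(double s, int n) {
    return n - s * (n - 1);
}

int main(void) {
    double s = 0.05;                 /* serial fraction of the scaled workload */
    int procs[] = {8, 16, 64, 1024};
    for (int i = 0; i < 4; i++)
        printf("N = %4d  S = %8.2f\n", procs[i], gustafson_speedup(s, procs[i]));
    return 0;
}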

57
University College of Engg,
Osmania University

Key Note:

if the workload is scaled up to


maintain a fixed execution time
as the number of processors
increases, the speedup increases
linearly. What Gustafson’s law
says is that
the true parallel power of a large
multiprocessor system is only
achievable when a large
parallel problem is applied.

58
University College of Engg,
Osmania University

Problem 1:

Let’s assume that 5% of a program is sequential (i.e., s = 0.05) and
the rest is parallelizable. Compute the speedup for 8 processors.
Solution:
s = 0.05
N = 8
Using Gustafson's Law:
S(8) = 8 − 0.05 × (8 − 1)
     = 8 − 0.05 × 7
     = 8 − 0.35 = 7.65

So the speedup with 8 processors is 7.65.

59
University College of Engg,
Osmania University

Problem 2:

If 20% of the program is sequential , calculate the speedup on a system with 16 processors.

Solution:
s = 0.20
N = 16

Using the formula:

S(16) = 16 − 0.20 × (16 − 1)
      = 16 − 0.20 × 15
      = 16 − 3 = 13
The speedup with 16 processors is 13.

60
Further classification University College of Engg,
Osmania University

Sun & Ni’s Law


(Memory Bounded Speed up law)

61
University College of Engg,
Osmania University

The memory-bounded speedup model is the


first work to reveal that memory is the
performance constraint for high-end computing
and presents a quantitative mathematical
formulation for the trade-off between memory
and computing.
University College of Engg,
Osmania University
Sun & Ni’s Law (1993)

This law defines a memory bounded speedup model which generalizes both Amdahl’s Law
and Gustafson’s Law to maximize the use of both processor and memory capacities.

The idea is to solve maximum possible size of problem, limited by memory capacity

This inherently demands an increased or scaled work load, providing higher speedup,
Higher efficiency, and better resource (processor & memory) utilization. But may result in
slight increase in execution time to achieve this scalable speedup performance!
Speedup University College of Engg,
Performance Laws Osmania University

Sun & Ni’s Law

According to this law, the speedup S*(n) can be defined in terms of the following quantities (see the formula sketched below):

▪ α = the portion of the workload that can be parallelized
▪ G(n) = a factor representing the influence of the memory-capacity change on the change in problem size
▪ n = number of parallel processors
▪ Assumptions made while deriving the expression:
▪ A global address space is formed from all individual memory spaces, i.e. there is a distributed shared
memory space
▪ All available memory capacity is used up for solving the scaled problem
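The slide refers to the speedup expression without reproducing it; as a sketch (the exact form used on the original slide is not shown here), a commonly cited statement of the memory-bounded speedup, using α and G(n) as defined above, is:

S*(n) = [ (1 − α) + α·G(n) ] / [ (1 − α) + α·G(n)/n ]

With G(n) = 1 (problem size independent of memory) this reduces to Amdahl's fixed-load speedup, and with G(n) = n it reduces to Gustafson's scaled speedup.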
University College of Engg,
Osmania University

Multiprocessors and
Multicomputers

65
University College of Engg,
Multiprocessors and Multicomputer - Summary Osmania University

Multiprocessors:
1. A single computer with multiple processors.
2. The PEs (CPUs/processing elements) do not have their own individual memories – memory and I/O resources are shared – thus called shared-memory multiprocessors.
3. Communication between PEs is a must.
4. Tightly coupled – due to the high degree of resource sharing.
5. Use a dynamic network – communication links can be reconfigured.
6. Three types: UMA model, NUMA model, COMA model.

Multicomputers:
1. Multiple autonomous computers.
2. Each PE has its own memory and resources – no sharing – thus called distributed-memory multicomputers.
3. Communication between PEs is not mandatory.
4. Loosely coupled, as there is no resource sharing.
5. Use a static network – the connection of switching units is fixed.
6. NORMA model / distributed-memory multicomputer.

66
3. Classification based on the kind of parallelism University College of Engg,
Osmania University

Parallel architectures (PAs)

• Data-parallel architectures (DPs):
  – Vector architectures
  – Associative and neural architectures
  – SIMDs
  – Systolic architectures

• Function-parallel architectures:
  – Instruction-level PAs (ILPs): pipelined processors, VLIWs, superscalar processors
  – Thread-level PAs
  – Process-level PAs (MIMDs): distributed-memory MIMD (multi-computers), shared-memory MIMD (multiprocessors)
67
Multicomputers University College of Engg,
Osmania University

Multicomputer: The system consists of multiple computers ( Nodes) interconnected by a message-passing


network. Each node is an autonomous computer consisting of a processor, local memory, and sometimes
attached disks or I/Os.

68
Parallel Computer Models University College of Engg,
Osmania University

How do they Operate?


• The message-passing network provides point-to-point static connections among the nodes. All local memories are private
and are accessible only by local processors.
• lnternode communication is carried out by passing messages through the static connection network.
• With advances in interconnection and network technologies, this model of computing has gained importance, because of
its suitability for certain applications, scalability, and fault-tolerance.

• Modern Multicomputers use hardware routers to pass messages. A computer node is attached to each router. The
boundary router may be connected to I/O and peripheral devices.

• Message passing between any two nodes involves a sequence of routers and channels.
• Mixed types of nodes are allowed in a heterogeneous multicomputer. The inter node communication in a heterogeneous
multicomputer is achieved through compatible data representations and message-passing protocols.

• Commonly used topologies include the ring, tree, mesh, hypercube, etc.

• Various communication patterns are demanded among the nodes, such as one-to-one, broadcasting, permutations, and
multicast patterns

69
University College of Engg,
Osmania University

Multi-vector and
SIMD Computers

70
SIMD computers University College of Engg,
Osmania University

SIMD computers

71
SIMD Computer University College of Engg,
Osmania University

• Single instruction, multiple data (SIMD) is a type of parallel processing in


Flynn's taxonomy.

• SIMD describes computers with multiple processing elements that perform


the same operation on multiple data points simultaneously.

• Such machines exploit data level parallelism, but not concurrency: there are
simultaneous (parallel) computations, but each unit performs the exact same
instruction at any given moment (just with different data).

• Most modern CPU designs include SIMD instructions to improve the


performance of multimedia use.

• Most SIMD computers use a single control unit and distributed memories
• The instruction set of an SIMD computer is decoded by the array control unit.
• The processing elements (PEs) in the SIMD array are passive ALUs executing instructions broadcast from the control unit.
• All PEs must operate in lockstep, synchronized by the same array controller.

72
SIMD Computers University College of Engg,
Osmania University

• SIMD computer have One Control Processor and several Processing Elements.

• All Processing Elements execute the same instruction at the same time.

• Interconnection network between PEs determines memory access and PE Interaction.

• SIMD computer ( array processor) is normally interfaced to a host computer through the
control unit. The host computer is a general purpose machine which serves as the
operating manager of the entire system.

• Each PEi is essentially an ALU with attached working registers and local memory PEMi
for the storage of distributed data. The CU also has its own main memory for storage of
programs. The function of CU is to decode all instruction and determine where the
decoded instructions should be executed. Scalar or control type instructions are directly
executed inside the CU. Vector instructions are broadcasted to the PEs for distributed
execution.

• Masking schemes are used to control the status of each PE during the execution of a
vector instruction.

• Each PE may be either active or disabled during an instruction cycle. A masking vector is
used to control the status of all PEs. Only enabled PEs perform computation. Data
exchanges among the PEs are done via an inter-PE communication network, which
performs all necessary data routing and manipulation functions. This interconnection
network is under the control of the control unit.
73
SIMD Computers University College of Engg,
Osmania University

An operational model of an SIMD computer is defined by the 5-tuple:

M = { N, C, I, M, R }

Where:

1. N is the number of processing elements (PEs) in the machine. For example, the Illiac IV has 64 PEs
and the Connection Machine CM-2 uses 65,536 PEs.
2. C is the set of instructions directly executed by the control unit (CU), including scalar and program
flow control instructions.
3. I is the set of instructions broadcast by the CU to all PEs for parallel execution. These include
arithmetic, logic, data routing, masking, and other local operations executed by each active PE over
data within that PE.
4. M is the set of masking schemes, where each mask partitions the set of PEs into enabled and
disabled subsets.
5. R is the set of data-routing functions, specifying various patterns to be set up in the interconnection
network for inter-PE communications.

74
SIMD Computers University College of Engg,
Osmania University
• A distributed-memory SIMD computer consists of an array of PEs which are
controlled by the same array control unit
• Program and data are loaded into the control memory through the host
computer. (Figure: distributed-memory SIMD model)
• An instruction is sent to the control unit for decoding. If it is a scalar or
program control operation, it will be directly executed by a scalar processor
attached to the control unit.
• If the decoded instruction is a vector operation, it will be broadcast to all the
PEs for parallel execution.
• Partitioned data sets are distributed to all the local memories attached to the
PEs through a vector data bus.
• The PEs are interconnected by a data-routing network which performs inter-
PE data communications such as shifting, permutation, and other routing
operations.
• The data-routing network is under program control through the control unit.
• The PEs are synchronized in hardware by the control unit.
• In other words, the same instruction is executed by all the PEs in the same
cycle.
• However, masking logic is provided to enable or disable any PE from
participation in a given instruction cycle.

75
SIMD Computers University College of Engg,
Osmania University

• Alignment network is used as the inter-PE


memory communication network. (Figure: shared-memory SIMD model)
• This network is controlled by the control unit.
• Example: The Burroughs Scientific Processor
(BSP) has adopted this architecture, with n =16
PEs updating m = 17 shared-memory modules
through a 16 x 17 alignment network.
• The alignment network must be properly set to
avoid access conflicts.

• Most SIMD computers are built with distributed


memories.

76
Multi-vector computers University College of Engg,
Osmania University

Multi-vector computers

77
Multi-vector computers University College of Engg,
Osmania University

Vector Processing:

A vector is a set of scalar data items, all of the same type, stored in memory. Usually, the
vector elements are ordered to have a fixed addressing increment between successive
elements called the stride.

• A vector processor is an ensemble of hardware resources, including vector registers,


functional pipelines, processing elements, and register counters, for performing
vector operations.
• Vector processing occurs when arithmetic or logical operations are applied to vectors.
• It is distinguished from scalar processing which operates on one or one pair of data.
• Vector processing is faster and more efficient than scalar processing.
• Both pipelined processors and SIMD computers can perform vector operations.
• Vector processing reduces software overhead incurred in the maintenance of looping
control, reduces memory-access conflicts, and above all matches nicely with the
pipelining and segmentation concepts to generate one result per each clock cycle
continuously.
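As an illustration (not from the slides), the loop below is the kind of operation a vector processor executes with a few vector instructions; the arrays are unit-stride vectors and each element computes y[i] = a*x[i] + y[i]:

#include <stddef.h>

/* SAXPY: y = a*x + y over n elements. On a vector processor the whole loop
   maps to vector load, multiply-add, and store instructions, producing one
   result per clock cycle once the pipeline is full. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)      /* stride of 1 between successive elements */
        y[i] = a * x[i] + y[i];
}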

78
Vector Processors University College of Engg,
Osmania University

Vector Processors

Vector processors issue single instructions that operate on multiple data items (arrays). This is conducive to pipelining,
with one result produced per cycle.

A vector processor is a coprocessor designed to perform vector computations. A vector is a one-dimensional array of
data items (each of the same data type). Vector processors are often used in multipipelined supercomputers.
Architectural types include:
• register-to-register (with shorter instructions and register files)
• memory-to -memory (longer instructions with memory addresses)

79
Multi Vector computer University College of Engg,
Osmania University

- A vector computer is usually built on top of a scalar
processor – it is attached to the scalar processor as an
optional feature.
- Program and data are first loaded into main memory
through the host computer.
- Instructions are first decoded by the scalar control unit:

– if it is a scalar operation or a program control operation, it
will be directly executed using the scalar functional pipelines;
– if it is a vector operation, it will be sent to the vector control
unit.
The control unit supervises the flow of vector data
between main memory and vector functional pipelines.

Vector data flow is coordinated by the control unit. A


number of vector functional pipelines may be built into a
vector processor

80
Multi Vector computer University College of Engg,
Osmania University
VECTOR PROCESSOR MODELS

1. Register to Register architecture


2. Memory to memory architecture

REGISTER to REGISTER architecture

• In a register-to-register architecture, vector registers are used to hold vector operands, intermediate and
final vector results.

• All vector registers are programmable.

• The length of a vector register is usually fixed (e.g., sixty-four 64-bit components per vector register in the Cray series supercomputers). Some machines use reconfigurable
vector registers to dynamically match the register length (e.g., Fujitsu VP2000).

• Generally there is a fixed number of vector registers and functional pipelines in a vector processor, hence they must be reserved
in advance to avoid conflicts.

MEMORY to MEMORY architecture

• Differs from the register-to-register architecture in its use of a vector stream unit in place of vector registers.
• Vector operands and results are directly retrieved from and stored into main memory in superwords (e.g., 512 bits in the Cyber
205).

81
University College of Engg,
Osmania University

Pipelining

82
Pipelining- Methodology University College of Engg,
Osmania University

Pipeline processing is an implementation technique where arithmetic suboperations or the phases of a computer
instruction cycle overlap in execution.
❑ Pipelining is a technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special
dedicated segment that operates concurrently with all other segments.

❑ A pipeline can be visualized as a collection of processing segments through which binary information flows.

❑ Each segment performs partial processing dictated by the way the task is partitioned.

❑ The result obtained from the computation in each segment is transferred to the next segment in the pipeline. The final result is obtained after
the data have passed through all segments.

❑ It is characteristic of pipelines that several computations can be in progress in distinct segments at the same time. The overlapping of
computation is made possible by associating a register with each segment in the pipeline.

❑ The registers provide isolation between each segment so that each can operate on distinct data simultaneously.
❑ Perhaps the simplest way of viewing the pipeline structure is to imagine that each segment consists of an input register followed by a
combinational circuit.
❑ The register holds the data and the combinational circuit performs the suboperation in the particular segment. The output of the
combinational circuit in a given segment is applied to the input register of the next segment. A clock is applied to all registers after enough
time has elapsed to perform all segment activity. In this way the information flows through the pipeline one step at a time.
83
Pipelining- Example University College of Engg,
Osmania University

84
University College of Engg,
Osmania University

Arithmetic Pipelining

85
University College of Engg,
Segment pipeline Osmania University

❑ Any operation that can be decomposed into a sequence of
suboperations of about the same complexity can be implemented by
a pipeline processor.

❑ The technique is efficient for those applications that need to repeat
the same task many times with different sets of data.

❑ The operands pass through all four segments in a fixed sequence.

❑ Each segment consists of a combinational circuit Si that performs a
suboperation over the data stream flowing through the pipe.

❑ The segments are separated by registers Ri that hold the
intermediate results between the stages.

❑ Information flows between adjacent stages under the control of a
common clock applied to all the registers simultaneously.

❑ We define a task as the total operation performed going through all
the segments in the pipeline.

General Structure of a 4-Segment Pipeline (driven by a common clock):
Input -> S1 -> R1 -> S2 -> R2 -> S3 -> R3 -> S4 -> R4

Space-Time Diagram (tasks T1–T6 vs. clock cycles):

Clock cycle:  1    2    3    4    5    6    7    8    9
Segment 1:    T1   T2   T3   T4   T5   T6
Segment 2:         T1   T2   T3   T4   T5   T6
Segment 3:              T1   T2   T3   T4   T5   T6
Segment 4:                   T1   T2   T3   T4   T5   T6
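The diagram also illustrates the standard pipeline timing result, stated here for reference:

A k-segment pipeline completes n tasks in (k + n − 1) clock cycles; here k = 4 and n = 6, so 4 + 6 − 1 = 9 cycles, matching the diagram.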

86
University College of Engg,
Osmania University

PRAM and VLSI


Models

87
University College of Engg,
Osmania University

Parallel Random Access


Machines
(PRAM)
University College of Engg,
Conventional Computer Osmania University

These are uniprocessor

computers that are modeled as
a conventional (sequential) computer.
PRAM Introduction University College of Engg,
Osmania University

• An algorithm is a sequence of steps that take inputs from the user and after
some computation, produces an output.

• A parallel algorithm is an algorithm that can execute several instructions


simultaneously on different processing devices and then combine all the
individual outputs to produce the final result.

• Parallel Random Access Machines (PRAM) is a model, which is


considered for most of the parallel algorithms. In this case, multiple
processors are attached to a single block of memory.
PRAM Model University College of Engg,
Osmania University

• A set of similar type of processors.


• All the processors share a common
memory unit.
• Processors can communicate among
themselves through the shared memory
only.
• A memory access unit (MAU) connects the
processors with the single shared memory.
PRAM Variants University College of Engg,
Osmania University

Exclusive Read Exclusive Write (EREW) − Here no two processors are allowed to read from
or write to the same memory location at the same time.
Exclusive Read Concurrent Write (ERCW) − Here no two processors are allowed to read
from the same memory location at the same time, but are allowed to write to the same memory
location at the same time.
Concurrent Read Exclusive Write (CREW) − Here all the processors are allowed to read from
the same memory location at the same time, but are not allowed to write to the same memory
location at the same time.
Concurrent Read Concurrent Write (CRCW) − All the processors are allowed to read from or
write to the same memory location at the same time.
PRAM model approaches University College of Engg,
Osmania University

1. Shared memory model


2. Message passing model
3. Data parallel model
Shared memory PRAM model University College of Engg,
Osmania University

• Shared memory emphasizes control parallelism more than data
parallelism.
• In the shared memory model, multiple processes execute on different
processors independently, but they share a common memory space.
• If any processor changes any memory location, the change is visible to the rest of the processors.
• Suppose one processor is reading a location while another is writing to that location:
• it may create confusion (an inconsistent or stale value may be read).
• To avoid this, some control mechanism, like a lock / semaphore, is
implemented to ensure mutual exclusion (a minimal sketch follows below).
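A minimal POSIX-threads sketch of that control mechanism (illustrative names): two threads update a shared counter and a mutex serializes the conflicting writes.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                       /* shared memory location */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);             /* mutual exclusion around the update */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);        /* always 2000000 when the lock is held */
    return 0;
}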
Shared memory PRAM model University College of Engg,
Osmania University
Shared Memory Approach University College of Engg,
Osmania University

Thread libraries − The thread library allows multiple threads of control that run concurrently in the same memory
location. Thread library provides an interface that supports multithreading through a library of subroutine.
Distributed Shared Memory (DSM) Systems − DSM systems create an abstraction of shared memory on loosely
coupled architecture in order to implement shared memory programming without hardware support.
Program Annotation Packages − This is implemented on the architectures having uniform memory access
characteristics.
Merit and Demerits of shared memory University College of Engg,
Osmania University

Merits
•Global address space gives a user-friendly programming approach to memory.
•Due to the closeness of memory to CPU, data sharing among processes is fast and
uniform.
•There is no need to specify distinctly the communication of data among processes.
•Process-communication overhead is negligible.
•It is very easy to learn.
Demerits
•It is not portable.
•Managing data locality is very difficult.
Message Passing PRAM Model University College of Engg,
Osmania University

• Message passing is the most commonly used parallel programming approach in distributed memory
systems.
• Here, the programmer has to determine the parallelism.
• In this model, all the processors have their own local memory unit and they exchange data through a
communication network.
Processor Communication University College of Engg,
Osmania University

1. Point-to-Point Communication
2. Collective Communication
3. Message Passing Interface
4. Processors use message-passing libraries for communication , which contain

–The address of the processor from which the message is being sent;
–Starting address of the memory location of the data in the sending processor;
–Data type and size of the sending data;
–The address of the processor to which the message is being sent;
–Starting address of the memory location for the data in the receiving processor.
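A minimal MPI sketch (illustrative, not from the slides) showing the items listed above – buffer address, count, data type, source and destination ranks – in an actual send/receive pair:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data = 42;
    if (rank == 0) {
        /* buffer address, count, data type, destination rank, tag, communicator */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", data);
    }

    MPI_Finalize();
    return 0;
}

Run with two processes, e.g. mpirun -np 2 ./a.out.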
Merit & Demerit of Message Passing University College of Engg,
Osmania University

Merits
•Provides low-level control of parallelism;
•It is portable;
•Less error prone;
•Less overhead in parallel synchronization and data distribution.

Demerits
•As compared to parallel shared-memory code, message-passing code
generally needs more software overhead.
Data Parallel Programming University College of Engg,
Osmania University

The major focus of data parallel programming model is on performing


operations on a data set simultaneously.
The data set is organized into some structure like an array, hypercube, etc.
Processors perform operations collectively on the same data structure.
Each task is performed on a different partition of the same data structure.
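A short OpenMP sketch of this model (illustrative): every thread applies the same operation to its own partition of one array.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];

    /* The loop iterations are divided among threads; each thread performs
       the same operation on a different partition of the same array. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}

Compiled with OpenMP enabled (e.g., gcc -fopenmp), the iterations are split across the available processors.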
University College of Engg,
Osmania University

Instruction Pipelining

102
University College of Engg,
Osmania University

Timing of Instruction pipeline

4 Segment – Instruction Pipeline

103
Pipeline Conflicts University College of Engg,
Osmania University

In general, there are three major difficulties that cause the instruction pipeline to deviate from its normal
operation.

❑ Resource conflicts caused by access to memory by two segments at the same time. Most of these conflicts
can be resolved by using separate instruction and data memories.

❑ Data dependency conflicts arise when an instruction depends on the result of a previous instruction, but
this result is not yet available.

❑ Branch difficulties arise from branch and other instructions that change the value of PC.

104
UNIT I - Instruction level Parallelism University College of Engg,
Osmania University

Instruction-level Parallelism (ILP) is defined as the parallel or simultaneous


execution of a sequence of instructions in a computer program. More specifically ILP
refers to the average number of instructions run per step of this parallel execution.

This is one of the important methods used to implement high-performance computing.

Example: Tflops, Gflops

105
Instruction level Parallelism University College of Engg,
Osmania University

Approaches to instruction-level parallelism:


1. Hardware Level
2. Software Level

• Hardware level works upon dynamic parallelism


• Software level works on static parallelism.
• Dynamic parallelism means the processor decides at run time which instructions to execute in parallel.
• Static parallelism means the compiler decides which instructions to execute in parallel.
• The Pentium processor works on dynamic sequencing of parallel execution, while the Itanium processor works on static-level parallelism.

Consider the following program:
X = a + b
Y = c + d
Z = X*Y
If we assume that each operation can be completed in one unit of time, then these three instructions can be
completed in a total of two units of time (X and Y in parallel, then Z), giving an ILP of 3/2. 106
UNIT I - Instruction level Parallelism University College of Engg,
Osmania University

1. Static Scheduling
2. Dynamic Scheduling
Caution !!!!
Pipeline processing breaks
instruction execution down
into stages, whereas ILP focuses on
executing multiple instructions at
the same time.

107
Instruction level Parallelism University College of Engg,
Osmania University

Data Dependencies and Hazards


1. How much parallelism exists in a program and how it can be exploited
2. If two instructions are parallel, they can execute simultaneously in a pipeline without causing
any stalls (assuming no structural hazards exist)
3. There are no dependencies in parallel instructions
4. If two instructions are not parallel and must be executed in order, they may often be partially
overlapped.
Pipeline Hazards
1. Some instructions in the pipeline are allowed to proceed while others are delayed
2. For this example pipeline approach, when an instruction is stalled, all instructions further back in
the pipeline are also stalled
3. No new instructions are fetched during the stall
4. Instructions issued earlier in the pipeline must continue

108
Instruction level Parallelism University College of Engg,
Osmania University

Data Dependencies and Hazards- Example

LOOP: L.D F0,0(R1) ;F0=array element


ADD.D F4,F0,F2 ;add scalar in F2
S.D F4,0(R1) ;store result
DADDUI R1,R1,#-8 ;decrement pointer 8
BNE R1,R2,LOOP;
The above dependencies are in floating point data for the first two actions and integer
data in the last two instructions
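For reference, a hedged reconstruction of the loop this assembly implements (x, n and s are illustrative names): within one iteration L.D, ADD.D and S.D form a true (RAW) dependence chain, while separate iterations are independent.

/* Roughly equivalent C: add a scalar to every element of an array,
   walking the pointer downward as the DADDUI/BNE pair does. */
void add_scalar(double *x, long n, double s) {
    for (long i = n - 1; i >= 0; i--)
        x[i] = x[i] + s;    /* load, add, store: a RAW chain in each iteration */
}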

109
Data Hazards: RAW, WAW and WAR Hazards University College of Engg,
Osmania University

Consider two instructions i and j. Let us assume i instruction is occurring before j instruction

RAW (Read After Write):


j tries to read a source before i writes it, so j incorrectly gets the old value. This is the most common
type of hazard and the kind that we use forwarding to overcome. To avoid this, program sequence to be
preserved

WAW (Write After Write):


j tries to write an operand before it is written by i. This results in the wrong ordering of the data.

WAR (Write After Read) :


j tries to write the destination before it is read by i, so i incorrectly gets the new value.
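Tiny illustrative examples of the three cases, written as C assignments (the variable x plays the role of the contended register):

void hazards(int a, int b, int c, int d) {
    int x, y;

    /* RAW (read after write): the second statement reads x after the first writes it. */
    x = a + b;
    y = x + c;

    /* WAW (write after write): both statements write x; their order fixes the final value. */
    x = a + b;
    x = c + d;

    /* WAR (write after read): y = x + a must read x before x is overwritten below. */
    y = x + a;
    x = c + d;

    (void)x; (void)y;    /* silence unused-value warnings */
}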

110
Dynamic Scheduling- OOO University College of Engg,
Osmania University

We split the ID stage into two parts

• Issue: decode and check for structural hazards

• Read operands: wait until there are no data hazards, then read the operands

An instruction fetch stage precedes the ID stage and may fetch into a single-entry latch or a queue.
After that comes the EX stage.

All instructions pass through the issue stage in order.


Scoreboard was first used in CDC 6600
Have to check for WAW and RAW hazards
Assume we have one integer unit, two multipliers, one adder, and one divide unit.
The scoreboard keeps information about every instruction from fetch to execution. The scoreboard controls when an
instruction can read its operands, start execution, and write its result.
There is no forwarding. Notice that there is no dedicated write-back stage, which means results can
be written to the registers as soon as the scoreboard allows it.
111
Name Dependences; Two Categories University College of Engg,
Osmania University

Two instructions use the same register or memory location, called a name, but
there is actually no flow of data between the instructions associated with that
name. In cases where i precedes j.
• 1. An antidependence between instructions i and j occurs when instruction j writes a
register or memory location that instruction i reads. The original ordering must be
preserved
• 2. An output dependence occurs when instruction i and instruction j write the same
register or memory location, the order again must be preserved

January 2013 Instruction Level Parallelism 112


Name Dependences; Two Categories University College of Engg,
Osmania University

1. An antidependence
•i DADD R1,R2,#-8
•j DADD R2,R5,0
2. An output dependence
•i DADD R1,R2,#-8
•j DADD R1,R4,#10

January 2013 Instruction Level Parallelism 113


Name Dependences University College of Engg,
Osmania University

Not true data dependencies, and therefore we could execute them


simultaneously or reorder them if the name (register or memory location) used in
the instructions is changed so that the instructions do not conflict
Register renaming is easier
•i DADD R1,R2,#-8
•j DADD R2,R4,#10

i DADD R1,R2,#-8
•j DADD R5,R4,#10

January 2013 Instruction Level Parallelism 114


Data Hazards University College of Engg,
Osmania University

A hazard is created whenever there is a dependence between instructions, and


they are close enough that the overlap caused by pipelining or other reordering
of instructions would change the order of access to the operand involved in the
dependence.
We must preserve program order; the order the instructions would execute if
executed in a non-pipelined system
However, program order only need be maintained where it affects the outcome
of the program

January 2013 Instruction Level Parallelism 115


Data Hazards – Three Types University College of Engg,
Osmania University

Two instructions i and j, with i occurring before j in program order, possible


hazards are:
• RAW (read after write) – j tries to read a source before i writes it, so j incorrectly gets the
old value
– The most common type
– Program order must be preserved
– In a simple common static pipeline a load instruction followed by an integer ALU
instruction that directly uses the load result will lead to a RAW hazard

January 2013 Instruction Level Parallelism 116


Data Hazards – Three Types University College of Engg,
Osmania University

Second type:
• WAW (write after write) – j tries to write an operand before it is written by i, with the
writes ending up in the wrong order, leaving the value written by i in the destination
– Output dependence
– Present in pipelines that write in more than one pipe or allow an instruction to proceed
even when a previous instruction is stalled
– In the classic example, WB stage is used for write back, this class of hazards avoided.
– If reordering of instructions is allowed this is a possible hazard
– Suppose an integer instruction writes to a register after a floating point instruction
does

January 2013 Instruction Level Parallelism 117


Data Hazards – Three Types

Third type:
• WAR (write after read) – j tries to write an operand before it is read by i, so i incorrectly gets the new value.
– This corresponds to an antidependence.
– It cannot occur in most static pipelines, because reads happen early (in ID) and writes happen late (in WB).
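The three cases can be summarized by a small classifier that compares the registers read and written by two instructions i and j (i earlier in program order). This is an illustration over made-up instruction descriptors, not pipeline hardware.

/* Classify the data hazards possible between instruction i (earlier) and j (later),
   given only the registers each one reads and writes. Illustrative sketch. */
#include <stdio.h>

typedef struct {
    int write;          /* destination register, -1 if none */
    int read[2];        /* source registers, -1 if unused   */
} Instr;

static int reads(const Instr *x, int r)
{
    return r >= 0 && (x->read[0] == r || x->read[1] == r);
}

static void classify(const Instr *i, const Instr *j)
{
    if (reads(j, i->write))
        printf("RAW: j reads R%d before i has written it\n", i->write);
    if (i->write >= 0 && i->write == j->write)
        printf("WAW: both write R%d; the later value must win\n", i->write);
    if (reads(i, j->write))
        printf("WAR: j writes R%d that i still has to read\n", j->write);
}

int main(void)
{
    Instr i = { .write = 1, .read = { 2, -1 } };   /* e.g. DADD R1, R2, #-8 */
    Instr j = { .write = 2, .read = { 1,  4 } };   /* e.g. DADD R2, R1, R4  */
    classify(&i, &j);                              /* reports RAW (R1) and WAR (R2) */
    return 0;
}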
Control Dependencies

A control dependence determines the ordering of an instruction i with respect to a branch instruction, so that i is executed in correct program order and only when it should be.
Example:
if p1 {
    S1;
};
if p2 {
    S2;
}
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Control Dependencies

Two constraints are imposed by control dependences:
• An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch. For example, we cannot take a statement from the then portion of an if statement and move it before the if statement.
• An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution becomes controlled by the branch. For example, we cannot take a statement placed before the if and move it into the then portion.
Both constraints are illustrated by the if statements below and by the C sketch that follows them.
if p1 {
S1;
};
if p2 {
S2;
}
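A minimal C sketch of the two constraints, using hypothetical variables: the guarded division must not be hoisted above the branch, and the unguarded multiplication must not be sunk into the then portion.

/* Hypothetical example of the two code-motion constraints. */
int scale(int x, int d)
{
    int y = x * 2;      /* not control dependent on the branch: must not be
                           moved into the then portion (it must always run)  */
    if (d != 0) {
        y = x / d;      /* control dependent on (d != 0): must not be hoisted
                           above the branch, or it may divide by zero        */
    }
    return y;
}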
Control Dependencies

Two properties of our simple pipeline preserve control dependences:
• Instructions execute in program order.
• Detection of control (branch) hazards ensures that an instruction that is control dependent on a branch is not executed until the branch direction is known.
We can, however, execute instructions that should not have been executed (violating control dependences) if we can do so without affecting the correctness of the program.
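For example (a hypothetical function, not from the slides), a statement from the then portion can be executed unconditionally when its result is used only on the taken path and the statement cannot raise an exception; the extra work changes no visible outcome.

/* Executing a control-dependent statement early without affecting correctness
   (hypothetical example). */
int select_sum(int a, int b, int p)
{
    int t = a + b;      /* hoisted from the then portion; runs even when p == 0.
                           Safe: t is dead on the p == 0 path and + cannot trap. */
    int r = 0;
    if (p)
        r = t;          /* originally: r = a + b; inside the if                  */
    return r;
}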
Conditions of Parallelism
Conditions of Parallelism

To move parallel processing into the mainstream of computing, the following key areas are important to consider:
1. Computation models for parallel computing,
2. Interprocessor communication in parallel architectures,
3. System integration for incorporating parallel systems into general computing environments.

The ability to execute several program segments in parallel requires each segment to be independent of the other segments.
Data and Resource Dependencies

Program segments cannot be executed in parallel unless they are independent. Independence comes in several forms:
1. Data dependence: data modified by one segment must not be modified by another parallel
segment.

2. Control dependence: if the control flow of segments cannot be identified before run time,
then the data dependence between the segments is variable.

3. Resource dependence: even if several segments are independent in other ways, they
cannot be executed in parallel if there aren’t sufficient processing resources (e.g. functional
units).
Data Dependence

The five types of data dependence are:
• Flow Dependence
• Anti Dependence
• Output Dependence
• I/O Dependence
• Unknown Dependence
Sample Program

S1: Load R1, A      (R1 ← Memory(A))
S2: Add R2, R1      (R2 ← (R2) + (R1))
S3: Move R1, R3     (R1 ← (R3))
S4: Store B, R1     (Memory(B) ← (R1))
Dependencies in Data

• S2 is flow-dependent on S1 because the value of A is passed to S2 through register R1.
• S3 is antidependent on S2 because S2 reads R1, which S3 then overwrites.
• S3 is output-dependent on S1 because both write the same register R1.
Dependence graph

[Figure: dependence graph of S1–S4, with edges S1 → S2 (flow), S2 → S3 (anti), S1 → S3 (output), and S3 → S4 (flow, since S4 stores the value of R1 written by S3).]
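These edges can be derived mechanically by comparing what each statement writes with what later statements read or write. The sketch below does this for S1–S4; the encoding (one read and one write register per statement) is a simplification made for this illustration, so S2's read of its own destination R2 is not modeled.

/* Build the data-dependence edges of the sample program S1..S4 by comparing
   the register each statement writes with the registers later statements use. */
#include <stdio.h>

typedef struct {
    const char *name;
    int write;            /* register written, -1 if none */
    int read;             /* register read,    -1 if none */
} Stmt;

int main(void)
{
    /* S1: Load R1,A   S2: Add R2,R1   S3: Move R1,R3   S4: Store B,R1 */
    Stmt s[4] = {
        { "S1", 1, -1 }, { "S2", 2, 1 }, { "S3", 1, 3 }, { "S4", -1, 1 }
    };
    for (int i = 0; i < 4; i++)
        for (int j = i + 1; j < 4; j++) {
            /* flow: i writes a register that j reads, with no intervening write */
            if (s[i].write >= 0 && s[i].write == s[j].read) {
                int killed = 0;
                for (int k = i + 1; k < j; k++)
                    if (s[k].write == s[i].write) killed = 1;
                if (!killed)
                    printf("%s -> %s (flow on R%d)\n", s[i].name, s[j].name, s[i].write);
            }
            /* anti: i reads a register that j writes */
            if (s[i].read >= 0 && s[i].read == s[j].write)
                printf("%s -> %s (anti on R%d)\n", s[i].name, s[j].name, s[i].read);
            /* output: i and j write the same register */
            if (s[i].write >= 0 && s[i].write == s[j].write)
                printf("%s -> %s (output on R%d)\n", s[i].name, s[j].name, s[i].write);
        }
    return 0;
}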
Data Dependencies

S1: Read (4), A(I)      Read array A from tape unit 4
S2: Rewind (4)          Rewind tape unit 4
S3: Write (4), B(I)     Write array B onto tape unit 4
S4: Rewind (4)          Rewind tape unit 4

• The read and write statements, S1 and S3, are I/O-dependent on each other because they both access the same file on tape unit 4. These data dependence relations should not be arbitrarily violated during program execution.
• Otherwise, erroneous results may be produced by a changed program order.
• The order in which statements are executed in a program is therefore well defined.
• Repetitive runs should produce identical results.
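In C terms, the same constraint looks like the sketch below (the file name and array sizes are made up): both statements operate on the same stream, so exchanging the write and the read changes the result. The write is placed first here only so the example runs on an initially empty file.

/* I/O dependence sketch: both transfers use the same file ("tape unit"),
   so their order matters. File name and sizes are hypothetical. */
#include <stdio.h>

int main(void)
{
    double a[8] = {0}, b[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    FILE *unit4 = fopen("unit4.dat", "w+b");
    if (!unit4) return 1;

    fwrite(b, sizeof b[0], 8, unit4);   /* write array B to the unit     */
    rewind(unit4);                      /* rewind the unit               */
    fread(a, sizeof a[0], 8, unit4);    /* read the data back into A     */
    /* Reordering the fread before the fwrite would read nothing useful. */

    fclose(unit4);
    return 0;
}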
Control Dependence

Control dependence arises when the order of execution of statements cannot be determined before run time. For example, conditional statements (IF) are not resolved until run time. Different paths taken after a conditional branch may introduce or eliminate data dependences among instructions. Dependence may also exist between operations performed in successive iterations of a looping procedure.
Control Dependence

The following shows one loop whose successive iterations are control-independent and another whose iterations are control-dependent.

The successive iterations of this loop are control-independent, because the condition in iteration I depends only on A(I), which is set in the same iteration:

      Do 20 I = 1, N
        A(I) = C(I)
        IF (A(I) .LT. 0) A(I) = 1
20    Continue

The successive iterations of this loop are control-dependent, because whether the assignment in iteration I executes depends on A(I-1), which may have been set by the conditional assignment in iteration I-1:

      Do 10 I = 1, N
        IF (A(I-1) .EQ. 0) A(I) = 0
10    Continue
Control Dependence

The successive iterations of the following loop are control-dependent, since the condition tested in iteration i depends on a[i-1], which may be assigned by iteration i-1:

for (i = 1; i < n; i++) {
    if (a[i-1] < 0) a[i] = 1;
}

Compiler techniques are needed to get around such control dependence limitations; one such technique, if-conversion, is sketched below.
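A hedged sketch of if-conversion applied to the loop above (function names are invented): the branch inside the loop body is replaced by a conditional select, so the statement is no longer control dependent. Note that if-conversion turns the control dependence into a data dependence rather than removing it; the loop-carried dependence on a[i-1] remains.

/* If-conversion sketch: clip_predicated computes the same result as
   clip_branching but with no branch inside the loop body. */
void clip_branching(float *a, int n)
{
    for (int i = 1; i < n; i++)
        if (a[i - 1] < 0.0f)
            a[i] = 1.0f;                 /* control-dependent statement         */
}

void clip_predicated(float *a, int n)
{
    for (int i = 1; i < n; i++) {
        int p = (a[i - 1] < 0.0f);       /* predicate computed as ordinary data */
        a[i] = p ? 1.0f : a[i];          /* executes every iteration            */
    }
}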
Resource Dependence

Data and control dependences are based on the independence of the work to be done. Resource dependence, by contrast, is concerned with conflicts in using shared resources, such as registers, memory, and integer or floating-point functional units (ALUs).
• Conflicts over ALUs are called ALU dependence.
• Memory (storage) conflicts are called storage dependence.
Scalability Analysis and Approaches
Scalability

A computer system is said to be scalable if its performance increases linearly with the number of processors and other resources used for a given computer application.
Scalability Metrics

Identified below are the basic metrics affecting the scalability of a computer system for a given application:

• Machine size (n) - the number of processors employed in a parallel computer system. A large machine size implies more resources and more computing power.

• Clock rate (f) - the clock rate determines the basic machine cycle. We hope to build a machine whose components (processors, memory, bus or network, etc.) are driven by a clock that can scale up with better technology.

• Problem size (s) - the amount of computational workload, or the number of data points, used to solve a given problem. The problem size is directly proportional to the sequential execution time T(s, 1) on a uniprocessor system, because each data point may demand one or more operations.

• CPU time (T) - the actual CPU time (in seconds) elapsed in executing a given program on a parallel machine with n processors collectively. This is the parallel execution time, denoted T(s, n), and it is a function of both s and n.
Scalability Metrics

• I/O demand (d) - the input/output demand in moving the program, data, and results associated with a given application run. The I/O operations may overlap with the CPU operations in a multiprogrammed environment.

• Memory capacity (m) - the amount of main memory (in bytes or words) used in a program execution. The memory demand is affected by the problem size, the program size, the algorithms, and the data structures used. The memory demand varies dynamically during program execution; here we refer to the maximum number of memory words demanded. Virtual memory is almost unlimited with a 64-bit address space; it is the physical memory that is limited in capacity.

• Communication overhead (h) - the amount of time spent on interprocessor communication, synchronization, remote memory access, etc. This overhead includes all non-compute operations that do not involve the CPUs or I/O devices. The overhead h(s, n) is a function of s and n and is not part of T(s, n). For a uniprocessor system, the overhead h(s, 1) = 0.

• Computer cost (c) - the total cost of the hardware and software resources required.

• Programming overhead (p) - the development overhead associated with an application program. Programming overhead may slow down software productivity and hence implies higher cost in time and resources.
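As a worked illustration of how these metrics combine, the sketch below computes speedup and efficiency from T(s, 1), T(s, n), h(s, n) and n. The timing values are hypothetical, and charging the overhead h to the parallel elapsed time is one common modeling choice, not the only one.

/* Derived scalability figures from the metrics above. The numbers are
   hypothetical measurements used only to show the arithmetic. */
#include <stdio.h>

int main(void)
{
    double T1 = 512.0;   /* T(s,1): sequential execution time, seconds       */
    double Tn = 40.0;    /* T(s,n): parallel CPU time on n processors        */
    double h  = 8.0;     /* h(s,n): communication/synchronization overhead   */
    int    n  = 16;      /* machine size                                     */

    double speedup    = T1 / (Tn + h);   /* overhead charged to elapsed time */
    double efficiency = speedup / n;

    printf("speedup    S(s,%d) = %.2f\n", n, speedup);     /* 512/48   = 10.67 */
    printf("efficiency E(s,%d) = %.2f\n", n, efficiency);  /* 10.67/16 = 0.67  */
    return 0;
}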