Unit I - BE v-ACA - 2024 - Lecture Notes
Osmania University
BE - Computer Science & Engineering

Advanced Computer Architecture - Syllabus

UNIT-I

Parallel Processing - Overview
Computer - Basics
Application Software
Operating System
Firmware/BIOS
Hardware
High-Level Architecture
High Performance Computing
High-performance computing (HPC) is the practice of using parallel data processing to improve computing performance and perform complex calculations. HPC achieves these goals by aggregating computing power, so even advanced applications can run efficiently, reliably, and quickly as per user needs and expectations. It thus delivers much higher power and better performance than traditional computers, workstations, and servers.
Parallel Processing
❑ Parallel processing is a term used to denote a large class of techniques that provide simultaneous data-processing tasks for the purpose of increasing the computational speed of a computer system.
❑ Instead of processing each instruction sequentially as in a conventional computer, a parallel processing system is able to perform concurrent data processing to achieve faster execution time.
❑ For example, while an instruction is being executed in the ALU, the next instruction can be read from memory. The system may have two or more ALUs and be able to execute two or more instructions at the same time. Furthermore, the system may have two or more processors operating concurrently.
Why Parallel Processors?
Parallel processors: computer systems consisting of multiple processing units connected via some interconnection network, plus the software needed to make the processing units work together.
➢ The processing units can communicate and interact with each other using either shared memory or message passing
methods. The interconnection network for shared memory systems can be classified as bus-based versus switch-based.
➢ In message passing systems, the interconnection network is divided into static and dynamic. Static connections have a fixed
topology that does not change while programs are running. Dynamic connections create links on the fly as the program
executes.
➢ The main objective is to create powerful computers by simply connecting multiple processors. A multiprocessor is expected to reach higher speed than the fastest single-processor system. In addition, a multiprocessor consisting of a number of single processors is expected to be more cost-effective than building a high-performance single processor.
➢ If a processor fails, the remaining processors should be able to provide continued service, albeit with degraded performance.
Parallel Computer Models
Classification of Parallel Computers
The most universally accepted method of classifying computer systems is Flynn's taxonomy.
Flynn’s classification scheme is based on the notion of a stream of information. Two types of information flow into a
processor: Instructions and Data.
➢ The instruction stream is defined as the sequence of instructions performed by the processing unit.
➢ The data stream is defined as the data traffic exchanged between the memory and the processing unit.
➢ According to Flynn’s classification, either of the instruction or data streams can be single or multiple.
As per Fynn’s methodology, Computer architecture can be classified into the following 4 distinct categories:
Parallel Processing - Classification of Computers
Architectural Classification
– Flynn's classification
» Based on the multiplicity of Instruction Streams and Data Streams
» Instruction Stream
• Sequence of Instructions read from memory
» Data Stream
• Operations performed on the data in the processor
Parallel Processing - Classification
SISD:
❑ SISD represents the organization of a single computer containing a control unit, a processor unit, and a memory unit.
❑ Instructions are executed sequentially and the system may or may not have internal parallel processing capabilities.
❑ Parallel processing in this case may be achieved by means of multiple functional units or by pipeline processing.
SIMD :
❑ SIMD represents an organization that includes many processing units under the supervision of a common control unit.
❑ All processors receive the same instruction from the control unit but operate on different items of data.
❑ The shared memory unit must contain multiple modules so that it can communicate with all the processors
simultaneously.
MISD:
MISD structure is only of theoretical interest since no practical system has been constructed using this organization.
MIMD :
❑ MIMD organization refers to a computer system capable of processing several programs at the same time.
❑ Most multiprocessor and multicomputer systems can be classified in this category.
[Figure: SISD organization - the Control Unit sends the instruction stream (IS) to the Processing Element (PE), which exchanges the data stream (DS) with the Main Memory (M).]
SIMD: Single Instruction, Multiple Data Streams
Fine-grained
• Image processing application
• Large number of PEs
• Minimum complexity PEs
• Programming language is a simple extension of a sequential
language
Coarse-grained
• Each PE is of higher complexity and it is usually built with
commercial devices
• Each PE has local memory
Applications:
• Image processing
• Matrix manipulations
• Sorting
MIMD
MISD
Applications:
• Classification
• Robot vision
Flynn's taxonomy
– Advantages of Flynn
» Universally accepted
» Compact Notation
» Easy to classify a system (?)
– Disadvantages of Flynn
» Very coarse-grain differentiation among machine systems
» Comparison of different systems is limited
» Interconnections, I/O, memory not considered in the scheme
Further classification
1. Classification based on memory arrangement
[Figure: shared-memory organization (processors P1..Pn and I/O devices connected through an interconnection network to shared memory) versus message-passing organization (PE1..PEn, each with a local memory M1..Mn, connected by an interconnection network).]
Shared-memory multiprocessors
The UMA Model
In the UMA (Uniform Memory Access) model, the physical memory is uniformly shared by all the processors, and every processor has equal access time to all memory words.

The NUMA Model
• Access time varies with the location of the memory word.
• The shared memory is physically distributed to all processors as local memories.
• The collection of all local memories forms a global memory space accessible by all processors.
• It is faster to access content within the local memory of a processor than to access remote memory attached to another processor (due to the delay through the interconnection network).
• It is thus named Non-Uniform Memory Access, because the access time depends on whether the data is available in the local memory of the processor itself or not.
Cache-Only Memory Architecture (COMA)
• A COMA machine is a multiprocessor in which the distributed main memories are treated as caches: all local memories together form a global address space, and data migrates to wherever it is used.
• Application of COMA
Symmetric and asymmetric multiprocessors
Symmetric:
- all processors have equal access to all peripheral devices.
- all processors are identical.
Asymmetric:
- one processor (master) executes the operating system
- other processors may be of different types and may be dedicated to special tasks.
2. Classification based on the interconnection network
Static networks
Dynamic networks
Unit I - Performance Metrics and Measures

Program Execution by a CPU - Flow Chart
Example:
A program is written and the details are as follows:
5 memory instructions
10 addition instructions
5 square-root instructions

Weighted CPI:
CPU time = (Σ from i=1 to n of CPI_i × I_i) / clock rate
where CPI_i is the average cycles per instruction for instruction class i and I_i is the number of instructions in that class.
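A small numeric sketch of this formula; the CPI values and clock rate below are assumed for illustration, not taken from the notes:

    # Hypothetical instruction classes: (count Ii, cycles-per-instruction CPIi)
    mix = {"memory": (5, 4), "add": (10, 1), "sqrt": (5, 20)}
    clock_rate = 1e9  # assumed 1 GHz clock

    total_cycles = sum(count * cpi for count, cpi in mix.values())  # sum of CPIi * Ii
    cpu_time = total_cycles / clock_rate
    print(total_cycles, cpu_time)  # 130 cycles -> 1.3e-07 seconds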
Performance Laws

Amdahl's Law
Amdahl's law - background
In the mid-twentieth century, digital computers were the bleeding edge of technology. It was around this time, in 1952, that Gene Amdahl graduated from the University of Wisconsin and joined IBM. Amdahl had already made a splash at UW by helping design and build the WISC, the Wisconsin Integrally Synchronized Computer, which was the first digital computer to be built in the state.
Amdahl's law, or argument, was first presented in 1967 at the AFIPS Joint Computer
Conference. During the presentation, Amdahl showed how the performance improvement from
parallel processing for a sequence of operations was limited.
Even if certain operations could be sped up by being performed in parallel, other operations that could not, such as reading or writing data, would limit how fast the system could be improved.
Today, the law has been used to demonstrate that the components that make the larger
proportion of the system are the best to focus on for improvements and parallelization.
Parallel Processing Performance Laws
Amdahl's law was presented in 1967 by Gene Amdahl. The law says: “Components of a system that cannot be improved will limit the overall possible improvement of the system, such as from parallel processing.”
Amdahl's law gives the theoretical speedup in latency (latency being the elapsed time between an input and the corresponding output of a system) of the execution of a program as a function of the number of processors executing it. The speedup is limited by the serial part of the program. For example, if 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20 times.
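A minimal sketch of the law in code; the fraction and processor count are just the example above:

    def amdahl_speedup(p, n):
        """Amdahl's law: p = parallelizable fraction, n = number of processors."""
        return 1.0 / ((1.0 - p) + p / n)

    print(amdahl_speedup(0.95, 4096))  # ~19.9x, approaching the 1 / (1 - 0.95) = 20x ceiling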
Example
[Figure: a journey from A to B - "must walk 200 miles", "20 hours" - the portion of the trip that cannot be sped up bounds the overall improvement.]
Speedup - Derivation
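The derivation itself was on image slides; a standard reconstruction, assuming serial fraction s, parallel fraction p = 1 - s, and N processors:

    T(1) = s + p = 1
    T(N) = s + p/N
    Speedup S(N) = T(1)/T(N) = 1 / (s + (1 - s)/N), which approaches 1/s as N grows.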
Problem:
In an enhancement of a CPU design, the speed of the floating-point unit has been increased by 20% and the speed of the fixed-point unit by 10%. What is the overall speedup achieved if the ratio of the number of floating-point operations to fixed-point operations is 2:3, and a floating-point operation used to take twice the time of a fixed-point operation in the original design?
Solution:
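The worked solution was not in the extracted text; a sketch, assuming "speed increased by 20%" means the unit's execution time is divided by 1.2 (and by 1.1 for the fixed-point unit):

    Let a fixed-point operation take time t, so a floating-point operation takes 2t.
    Original time per 5 operations (2 FP : 3 FX):  2(2t) + 3(t) = 7t
    Enhanced time:  2(2t/1.2) + 3(t/1.1) ≈ 3.33t + 2.73t = 6.06t
    Overall speedup = 7t / 6.06t ≈ 1.155 (about 15.5%)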
Gustafson's Law
• The execution time of a program running on a parallel system can be split into two parts:
• a part that does not benefit from the increasing number of processors (serial part);
• a part that benefits from the increasing number of processors (parallel part).
Example: a computer program that processes files from disk. A part of that program may scan the
directory of the disk and create a list of files internally in memory. After that, another part of the program
passes each file to a separate thread for processing. The part that scans the directory and creates the file
list cannot be sped up on a parallel computer, but the part that processes the files can.
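A minimal sketch of this example in Python; process_file is a hypothetical stand-in for whatever per-file work the program does:

    import os
    from concurrent.futures import ThreadPoolExecutor

    def process_file(path):
        # Hypothetical per-file work: here, just count the bytes in the file.
        with open(path, "rb") as f:
            return len(f.read())

    def run(directory):
        # Serial part: scan the directory and build the file list.
        files = [os.path.join(directory, name) for name in os.listdir(directory)]
        # Parallel part: hand each file to a separate worker thread.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(process_file, files))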
• Without loss of generality, let the total execution time on the parallel system be
• T =1
• Denote the serial time as s and the parallel time as p
• where
• s+p=1
• Denote the number of processors as N
Gustafson's Law
Let s be the serial portion of the workload, p = 1 - s the parallel portion, and N the number of parallel processors.
T1 = s + Np (time to run the scaled workload with no parallelism)
T2 = s + p = 1 (time on the parallel system)
Speedup S = T1 / T2 = (s + Np) / 1
          = s + N(1 - s)
          = N - s(N - 1)
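The same formula as a one-line function (a sketch; the values preview Problem 2 below):

    def gustafson_speedup(s, n):
        """Gustafson's law: s = serial fraction, n = processors; S = n - s*(n - 1)."""
        return n - s * (n - 1)

    print(gustafson_speedup(0.20, 16))  # 13.0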
Problem 2:
If 20% of the program is sequential, calculate the speedup on a system with 16 processors.
Solution:
s = 0.20, N = 16
Speedup = N - s(N - 1) = 16 - 0.20 × 15 = 13
Sun and Ni's Law (memory-bounded speedup)
This law defines a memory-bounded speedup model which generalizes both Amdahl's Law and Gustafson's Law to maximize the use of both processor and memory capacities. The idea is to solve the maximum possible problem size, limited by memory capacity. This inherently demands an increased or scaled workload, providing higher speedup, higher efficiency, and better resource (processor & memory) utilization - but it may result in a slight increase in execution time to achieve this scalable speedup performance.
According to this law, the speedup S*(n) can be defined by:
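The formula itself was on an image slide; a standard statement of the memory-bounded speedup (an assumption based on the usual form of the law), with serial fraction s, parallel fraction 1 - s, and G(n) the factor by which the parallel workload grows when memory scales with n nodes:

    S*(n) = (s + (1 - s)·G(n)) / (s + (1 - s)·G(n)/n)

With G(n) = 1 this reduces to Amdahl's law; with G(n) = n it reduces to Gustafson's law.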
Multiprocessors and Multicomputers

Multiprocessors and Multicomputers - Summary
Multiprocessors:
1. A single computer with multiple processors.
2. PEs (CPUs/processing elements) do not have their own individual memories - memory and I/O resources are shared - thus called Shared-Memory Multiprocessors.
3. Communication between PEs is a must.
4. Tightly coupled, due to the high degree of resource sharing.
5. Use a dynamic network - communication links can be reconfigured.
6. Three types: UMA model, NUMA model, COMA model.

Multicomputers:
1. Multiple autonomous computers.
2. Each PE has its own memory and resources - no sharing - thus called Distributed-Memory Multicomputers.
3. Communication between PEs is not mandatory.
4. Loosely coupled, as there is no resource sharing.
5. Use a static network - the connection of switching units is fixed.
6. NORMA model (no remote memory access) - Distributed-Memory Multicomputer.
3. Classification based on the kind of parallelism
[Figure: taxonomy tree - parallel architectures (PAs) branch into data-parallel architectures (DPs), instruction-level-parallel architectures (ILPs), and MIMDs.]
Parallel Computer Models
• Modern Multicomputers use hardware routers to pass messages. A computer node is attached to each router. The
boundary router may be connected to I/O and peripheral devices.
• Message passing between any two nodes involves a sequence of routers and channels.
• Mixed types of nodes are allowed in a heterogeneous multicomputer. The inter node communication in a heterogeneous
multicomputer is achieved through compatible data representations and message-passing protocols.
• Commonly used topologies include the ring, tree, mesh, hypercube, etc.
• Various communication patterns are demanded among the nodes, such as one-to-one, broadcasting, permutations, and
multicast patterns
Multi-vector and SIMD Computers

SIMD Computers
• Such machines exploit data level parallelism, but not concurrency: there are
simultaneous (parallel) computations, but each unit performs the exact same
instruction at any given moment (just with different data).
• Most SIMD computers use a single control unit and distributed memories
• The instruction set of an SIMD computer is decoded by the array control unit.
• The processing elements (PEs) in the SIMD array are passive ALUs executing instructions broadcast from the control unit.
• All PEs must operate in lockstep, synchronized by the same array controller.
• A SIMD computer has one control processor (control unit) and several processing elements.
• All processing elements execute the same instruction at the same time.
• A SIMD computer (array processor) is normally interfaced to a host computer through the control unit. The host computer is a general-purpose machine which serves as the operating manager of the entire system.
• Each PEi is essentially an ALU with attached working registers and a local memory PEMi for the storage of distributed data. The CU also has its own main memory for the storage of programs. The function of the CU is to decode all instructions and determine where the decoded instructions should be executed. Scalar or control-type instructions are directly executed inside the CU. Vector instructions are broadcast to the PEs for distributed execution.
• Masking schemes are used to control the status of each PE during the execution of a vector instruction.
• Each PE may be either active or disabled during an instruction cycle. A masking vector is used to control the status of all PEs: only enabled PEs perform computation. Data exchanges among the PEs are done via an inter-PE communication network, which performs all necessary data routing and manipulation functions. This interconnection network is under the control of the control unit.
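A rough software analogy of broadcasting one instruction over many data items, with a masking vector; this is NumPy data parallelism, not the hardware mechanism itself:

    import numpy as np

    # One instruction, many data items: the same add is applied to every element.
    a = np.array([1, 2, 3, 4])
    b = np.array([10, 20, 30, 40])
    result = a + b                     # -> [11 22 33 44]

    # Masking vector: only "enabled PEs" (True entries) take the new value;
    # disabled elements keep their old contents.
    mask = np.array([True, False, True, False])
    masked = np.where(mask, a + b, a)  # -> [11  2 33  4]
    print(result, masked)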
An operational model of a SIMD computer can be specified by the 5-tuple (N, C, I, M, R), where:
1. N is the number of processing elements (PEs) in the machine. For example, the Illiac IV has 64 PEs and the Connection Machine CM-2 uses 65,536 PEs.
2. C is the set of instructions directly executed by the control unit (CU), including scalar and program
flow control instructions.
3. I is the set of instructions broadcast by the CU to all PEs for parallel execution. These include
arithmetic, logic, data routing, masking, and other local operations executed by each active PE over
data within that PE.
4. M is the set of masking schemes, where each mask partitions the set of PEs into enabled and
disabled subsets.
5. R is the set of data-routing functions, specifying various patterns to be set up in the interconnection
network for inter-PE communications.
Distributed-Memory Model:
• A distributed-memory SIMD computer consists of an array of PEs which are controlled by the same array control unit.
• Program and data are loaded into the control memory through the host computer.
• An instruction is sent to the control unit for decoding. If it is a scalar or
program control operation, it will be directly executed by a scalar processor
attached to the control unit.
• If the decoded instruction is a vector operation, it will be broadcast to all the
PEs for parallel execution.
• Partitioned data sets are distributed to all the local memories attached to the
PEs through a vector data bus.
• The PEs are interconnected by a data-routing network which performs inter-
PE data communications such as shifting, permutation, and other routing
operations.
• The data-routing network is under program control through the control unit.
• The PEs are synchronized in hardware by the control unit.
• In other words, the same instruction is executed by all the PEs in the same
cycle.
• However, masking logic is provided to enable or disable any PE from
participation in a given instruction cycle.
Multi-vector Computers
Vector Processing:
A vector is a set of scalar data items, all of the same type, stored in memory. Usually, the
vector elements are ordered to have a fixed addressing increment between successive
elements called the stride.
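A small illustration of stride using flat row-major storage (the 4x4 layout is assumed for the example):

    # A 4x4 matrix stored row-major as a flat list of 16 elements.
    matrix = list(range(16))
    row_1 = matrix[4:8]    # row elements are adjacent: stride 1 -> [4, 5, 6, 7]
    col_1 = matrix[1::4]   # column elements are 4 apart: stride 4 -> [1, 5, 9, 13]
    print(row_1, col_1)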
Vector Processors
Vector processors issue single instructions that operate on multiple data items (arrays). This is conducive to pipelining, with one result produced per cycle.
A vector processor is a coprocessor designed to perform vector computations. A vector is a one-dimensional array of data items (each of the same data type). Vector processors are often used in multipipelined supercomputers.
Architectural types include:
• register-to-register (with shorter instructions and register files)
• memory-to-memory (longer instructions with memory addresses)
Multi-Vector Computer
VECTOR PROCESSOR MODELS
• The figure (omitted here) shows a register-to-register architecture. Vector registers are used to hold vector operands and intermediate and final vector results.
• The length of a vector register is usually fixed (e.g., sixty-four 64-bit elements per vector register in the Cray series supercomputers). Some machines use reconfigurable vector registers to dynamically match the register length (e.g., Fujitsu VP2000).
• Generally there are a fixed number of vector registers and functional pipelines in a vector processor; hence, they must be reserved in advance to avoid conflicts.
Pipelining

Pipelining - Methodology
Pipeline processing is an implementation technique whereby arithmetic suboperations or the phases of a computer instruction cycle overlap in execution.
❑ Pipelining is a technique of decomposing a sequential process into suboperations, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments.
❑ A pipeline can be visualized as a collection of processing segments through which binary information flows.
❑ Each segment performs partial processing dictated by the way the task is partitioned.
❑ The result obtained from the computation in each segment is transferred to the next segment in the pipeline. The final result is obtained after the data have passed through all segments.
❑ It is characteristic of pipelines that several computations can be in progress in distinct segments at the same time. The overlapping of computation is made possible by associating a register with each segment in the pipeline.
❑ The registers provide isolation between segments so that each can operate on distinct data simultaneously.
❑ Perhaps the simplest way of viewing the pipeline structure is to imagine that each segment consists of an input register followed by a combinational circuit.
❑ The register holds the data and the combinational circuit performs the suboperation in the particular segment. The output of the combinational circuit in a given segment is applied to the input register of the next segment. A clock is applied to all registers after enough time has elapsed to perform all segment activity. In this way the information flows through the pipeline one step at a time.
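A back-of-the-envelope sketch of why this overlap pays off: a k-segment pipeline finishing n tasks needs about k + (n - 1) clock cycles, versus k*n cycles without overlap (assuming one task enters the pipeline per clock):

    def sequential_cycles(k, n):
        return k * n                # each task uses all k segments by itself

    def pipeline_cycles(k, n):
        return k + (n - 1)          # first result after k cycles, then one per cycle

    k, n = 4, 100
    print(sequential_cycles(k, n), pipeline_cycles(k, n))   # 400 vs 103
    print(sequential_cycles(k, n) / pipeline_cycles(k, n))  # ~3.88x, -> k as n grows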
Pipelining - Example
Arithmetic Pipelining
Segment Pipeline
❑ Any operation that can be decomposed into a sequence of suboperations of about the same complexity can be implemented by a pipeline processor.
[Figure: general structure of a 4-segment pipeline, with a common clock driving the segment registers.]
PRAM (Parallel Random Access Machine) Model
• An algorithm is a sequence of steps that takes inputs from the user and, after some computation, produces an output.
Exclusive Read Exclusive Write (EREW) − Here no two processors are allowed to read from
or write to the same memory location at the same time.
Exclusive Read Concurrent Write (ERCW) − Here no two processors are allowed to read
from the same memory location at the same time, but are allowed to write to the same memory
location at the same time.
Concurrent Read Exclusive Write (CREW) − Here all the processors are allowed to read from
the same memory location at the same time, but are not allowed to write to the same memory
location at the same time.
Concurrent Read Concurrent Write (CRCW) − All the processors are allowed to read from or
write to the same memory location at the same time.
PRAM model approaches
Thread libraries − A thread library allows multiple threads of control to run concurrently in the same memory space. The thread library provides an interface that supports multithreading through a library of subroutines.
Distributed Shared Memory (DSM) Systems − DSM systems create an abstraction of shared memory on loosely
coupled architecture in order to implement shared memory programming without hardware support.
Program Annotation Packages − This is implemented on the architectures having uniform memory access
characteristics.
Merits and Demerits of Shared Memory
Merits
•Global address space gives a user-friendly programming approach to memory.
•Due to the closeness of memory to CPU, data sharing among processes is fast and
uniform.
•There is no need to specify distinctly the communication of data among processes.
•Process-communication overhead is negligible.
•It is very easy to learn.
Demerits
•It is not portable.
•Managing data locality is very difficult.
Message Passing Model
• Message passing is the most commonly used parallel programming approach in distributed memory
systems.
• Here, the programmer has to determine the parallelism.
• In this model, all the processors have their own local memory unit and they exchange data through a
communication network.
Processor Communication
1. Point-to-Point Communication
2. Collective Communication
3. Message Passing Interface
4. Processors use message-passing libraries for communication; a message typically contains:
–The address of the processor from which the message is being sent;
–Starting address of the memory location of the data in the sending processor;
–Data type and size of the sending data;
–The address of the processor to which the message is being sent;
–Starting address of the memory location for the data in the receiving processor.
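A minimal point-to-point sketch using Python's multiprocessing module as a stand-in for a message-passing library (the worker and payload are illustrative):

    from multiprocessing import Process, Pipe

    def worker(conn):
        data = conn.recv()        # receive the message from the sending process
        conn.send(sum(data))      # send a result back
        conn.close()

    if __name__ == "__main__":
        parent, child = Pipe()    # point-to-point channel between two processes
        p = Process(target=worker, args=(child,))
        p.start()
        parent.send([1, 2, 3, 4]) # message: the data to be processed remotely
        print(parent.recv())      # -> 10
        p.join()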
Merits & Demerits of Message Passing
Merits
•Provides low-level control of parallelism;
•It is portable;
•Less error prone;
•Less overhead in parallel synchronization and data distribution.
Demerits
•As compared to parallel shared-memory code, message-passing code
generally needs more software overhead.
Data Parallel Programming

Instruction Pipelining
Pipeline Conflicts
In general, there are three major difficulties that cause the instruction pipeline to deviate from its normal
operation.
❑ Resource conflicts caused by access to memory by two segments at the same time. Most of these conflicts
can be resolved by using separate instruction and data memories.
❑ Data dependency conflicts arise when an instruction depends on the result of a previous instruction, but
this result is not yet available.
❑ Branch difficulties arise from branch and other instructions that change the value of the PC (program counter).
UNIT I - Instruction Level Parallelism
1. Static Scheduling
2. Dynamic Scheduling
Caution: pipeline processing breaks instruction execution down into stages, whereas ILP focuses on executing multiple instructions at the same time.
Data Hazards: RAW, WAW and WAR Hazards
Consider two instructions i and j, and assume that instruction i occurs before instruction j.
First type:
• RAW (read after write) - j tries to read a source operand before i writes it, so j incorrectly gets the old value. This corresponds to a true data dependence.
Dynamic Scheduling - OOO (Out-of-Order) Execution
An instruction fetch stage precedes the ID stage and may fetch into a single-entry latch or a queue; after that comes the EX stage.
Name dependence: two instructions use the same register or memory location, called a name, but there is actually no flow of data between the instructions associated with that name. Two cases, where i precedes j:
• 1. An antidependence between instructions i and j occurs when instruction j writes a register or memory location that instruction i reads. The original ordering must be preserved.
• 2. An output dependence occurs when instruction i and instruction j write the same register or memory location; the order again must be preserved.

1. An antidependence:
   i: DADD R1,R2,#-8
   j: DADD R2,R5,#0
2. An output dependence:
   i: DADD R1,R2,#-8
   j: DADD R1,R4,#10
For contrast, a pair with no name dependence:
   i: DADD R1,R2,#-8
   j: DADD R5,R4,#10
Second type:
• WAW (write after write) - j tries to write an operand before it is written by i, with the writes ending up in the wrong order, leaving the value written by i in the destination.
– Output dependence.
– Present in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled.
– In the classic five-stage pipeline, writes happen only in the WB stage, so this class of hazards is avoided.
– If reordering of instructions is allowed, this is a possible hazard.
– Example: an integer instruction writes to a register after a floating-point instruction does.
Third type:
• WAR (write after read) - j tries to write an operand before it is read by i, so i incorrectly gets the new value.
– Antidependence.
– Cannot occur in most static pipelines - note that reads happen early (in ID) and writes late (in WB).
Example
if p1 {
  S1;
}
if p2 {
  S2;
}
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Conditions of Parallelism
In order to move parallel processing into the mainstream of computing, the following key areas are important to consider.
The ability to execute several program segments in parallel requires each segment to be
independent of the other segments.
Data and Resource Dependencies
1. Data dependence: the ordering relationship between statements arises from the flow of data between them (the five types are listed below).
2. Control dependence: if the control flow of segments cannot be identified before run time,
then the data dependence between the segments is variable.
3. Resource dependence: even if several segments are independent in other ways, they
cannot be executed in parallel if there aren’t sufficient processing resources (e.g. functional
units).
Data Dependence
• Flow Dependence
• Anti Dependence
• Output Dependence
• I/O Dependence
• Unknown Dependence
Sample Program
[Figure: sample program with statements S1-S4 and its data dependence graph.]
Data Dependencies
• The read/write statements, S1 and S3, are I/O-dependent on each other because they both
access the same file from tape unit 4. The above data dependence relations should not be
arbitrarily violated during program execution.
• The order in which statements are executed in a program is often well defined.
Control Dependence: This refers to the situation where the order of execution of statements cannot be determined before run time.
• In Fortran and C programs, conditional statements (IF) will not be resolved until run time.
• Different paths taken after a conditional branch may introduce or eliminate data dependence among instructions.
• Dependence may also exist between operations performed in successive iterations of a looping procedure.
• Below are one loop example without and another with control-dependent iterations:

      Do 20 I = 1, N
        A(I) = C(I)
        IF (A(I) .LT. 0) A(I) = 1
   20 Continue

      Do 10 I = 1, N
        IF (A(I-1) .EQ. 0) A(I) = 0
   10 Continue

The iterations of the first loop are independent of one another, whereas in the second loop each iteration depends on the result of the previous one.
Resource Dependence
Data and control dependencies are based on the independence of the work to be done. Resource dependence is concerned with conflicts in using shared resources, such as registers and integer and floating-point ALUs.
• ALU conflicts are called ALU dependence.
• Memory (storage) conflicts are called storage dependence.
Scalability Analysis and Approaches
Scalability Metrics
Identified below are the basic metrics affecting the scalability of a computer system for a given application:
• Machine size (n) - the number of processors employed in a parallel computer system. A large machine size
implies more resources and more computing power.
• Clock rate (f) - the clock rate determines the basic machine cycle. We hope to build a machine with components
(processors, memory, bus or network, etc.) driven by a clock which can scale up with better technology.
• Problem size (s) - the amount of computational workload or the number of data points used to solve a given
problem. The problem size is directly proportional to the sequential execution time T(s, 1) for a uniprocessor
system because each data point may demand one or more operations.
• CPU time (T) - the actual CPU time (in seconds) elapsed in executing a given program on a parallel machine with n processors collectively. This is the parallel execution time, denoted T(s, n), and is a function of both s and n.
• I/O demand (d) - the input/output demand in moving the program, data, and results associated with a given application run. The I/O operations may overlap with the CPU operations in a multiprogrammed environment.
• Memory capacity (m) - the amount of main memory (in bytes or words) used in a program execution. Note that the
memory demand is affected by the problem size, the program size, the algorithms, and the data structures used.
The memory demand varies dynamically during program execution. Here, we refer to the maximum number of
memory words demanded. Virtual memory is almost unlimited with a 64-bit address space. It is the physical memory
which is limited in capacity.
• Communication overhead (h) - the amount of time spent on inter-processor communication, synchronization, remote memory access, etc. This overhead includes all non-compute operations that do not involve the CPUs or I/O devices. This overhead h(s, n) is a function of s and n and is not part of T(s, n). For a uniprocessor system, the overhead h(s, 1) = 0.
• Computer cost (e) - the total cost of hardware and software resources required.
• Programming overhead (p) - the development overhead associated with an application program. Programming overhead may slow down software productivity and hence causes higher cost and resource use.