Downloaded from www.rgpvnotes.
in
Chameli Devi Group of Institutions
Department of Computer Science and Engineering
Subject Notes
Subject Code: CS-6001 Subject Name: Advanced Computer Architecture
UNIT-1
Flynn's Classification
Flynn’s classification distinguishes multi-processor computer architectures according to two independent
dimensions of Instruction stream and Data stream. An instruction stream is sequence of instructions
executed by machine. And a data stream is a sequence of data including input, partial or temporary results
used by instruction stream. Each of these dimensions can have only one of two possible states: Single or
Multiple. Flynn’s classification depends on the distinction between the performance of control unit and the
data processing unit rather than its operational and structural interconnections. Following are the four
category of Flynn classification and characteristic feature of each of them.
a) Single Instruction Stream, Single Data Stream (SISD)
The figure 1.1 is represents an organization of simple SISD computer having one control unit, one processor
unit and single memory unit.
Control Instruction Processor Data Memory
Stream Stream
Figure 1.1: SISD processor organizations
They are also called scalar processor i.e., one instruction at a time and each instruction have
only one set of operands.
Single instruction: only one instruction stream is being acted on by the CPU during any one
clock cycle.
Single data: only one data stream is being used as input during any one clock cycle.
Deterministic execution.
Instructions are executed sequentially.
This is the oldest and until recently, the most prevalent form of computer.
Examples: most PCs, single CPU workstations and mainframes.
b) Single Instruction Stream, Multiple Data Stream (SIMD) processors
A type of parallel computer.
Single instruction: All processing units execute the same instruction issued by the control unit at
any given clock cycle as shown in figure where there is multiple processors executing instruction
given by one control unit.
Multiple data: Each processing unit can operate on a different data element a s shown if figure
below the processor are connected to shared memory or interconnection network providing
multiple data to processing unit.
This type of machine typically has an instruction dispatcher, a very high- bandwidth internal
network, and a very large array of very small-capacity instruction units.
Thus single instruction is executed by different processing unit on different set of data as shown in
figure 1.2.
Best suited for specialized problems characterized by a high degree of regularity, such as image
processing and vector computation.
Synchronous (lockstep) and deterministic execution.
By - Aniket Sugandhi
Page no: 1 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
Shared memory or interconnection network
Data
1 2 N
streams
Processor 1 Processor 2 Processor N
Instruction stream
Control
Figure 1.2: SIMD processor organizations
C) Multiple Instruction Stream, Single Data Stream (MISD)
A single data stream is feed into multiple processing units.
Each processing unit operates on the data independently via independent instruction streams
as shown in figure 1.3 a single data stream is forwarded to different processing unit which are
connected to different control unit and execute instruction given to it by control unit to which it is
attached.
Instruction
Data streams
stream
Processor 1 Control 1
1
Memory Processor 2 Control 2
2
Processor N Control N
N
Figure 1.3: MISD processor organizations
Thus in these computers same data flow through a linear array of processors executing different
instruction streams.
This architecture is also known as systolic arrays for pipelined execution of specific instructions.
Few actual examples of this class of parallel computer have ever existed. One is the experimental
Carnegie-Mellon C.mmp computer (1971).
d) Multiple Instruction Stream, Multiple Data Stream (MIMD)
• Multiple Instructions: Every Processor may be executing a different instruction stream.
Multiple Data: every processor may be working with a different data stream as shown in figure 1.4
multiple data stream is provided by shared memory.
Can be categorized as loosely coupled or tightly coupled depending on sharing of data and control.
Execution can be synchronous or asynchronous, deterministic or non-
deterministic.
• Examples: most current supercomputers, networked parallel computer "grids" and multi-processor
By - Aniket Sugandhi
Page no: 2 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
SMP computers - including some types of PCs.
Shared memory or interconnection network
Data 1 2 N
streams
Processor 1 Processor 2 Processor N
Instruction 1 N
2
streams
Control 1 Control 2 Control N
Figure 1.4: MIMD processor organizations
System Attributes to Performance
Performance of a system depends on
• Hardware technology
• Architectural features
• Efficient resource management
• Algorithm design
• Data structures
• Language efficiency
• Programmer skill
• Compiler technology
When we talk about performance of computer system we would describe how quickly a given system can
execute a program or programs. Thus we are interested in knowing the turnaround time. Turnaround time
depends on:
• Disk and memory accesses
• Input and output
• Compilation time
• Operating system overhead
• CPU time
An ideal performance of a computer system means a perfect match between the machine capability and
program behavior. The machine capability can be improved by using better hardware technology and
efficient resource management. But as far as program behavior is concerned it depends on code used,
compiler used and other run time conditions. Also a machine performance may vary from program to
program. Because there are too many programs and it is impractical to test a CPU's speed on all of them
benchmarks were developed. Computer architects have come up with a variety of metrics to describe the
computer performance.
Clock rate and CPI / IPC: Since I/O and system overhead frequently overlaps processing by other programs, it
is fair to consider only the CPU time used by a program, and the user CPU time is the most important
factor. CPU is driven by a clock with a constant cycle time (usually measured in nanoseconds, which controls
the rate of internal operations in the CPU. The clock mostly has the constant cycle time (t in nanoseconds).
The inverse of the cycle time is the clock rate (f = 1/τ, measured in megahertz). A shorter clock cycle time, or
equivalently a larger number of cycles per second, implies more operations can be performed per unit
time. The size of the program is determined by the instruction count (Ic). The size of a program is determined
by its instruction count, Ic, the number of machine instructions to be executed by the program. Different
machine instructions require different numbers of clock cycles to execute. CPI (cycles per instruction) is thus
By - Aniket Sugandhi
Page no: 3 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
an important parameter.
Average CPI
It is easy to determine the average number of cycles per instruction for a particular processor if we know the
frequency of occurrence of each instruction type. Any estimate is valid only for a specific set of programs
(which defines the instruction mix), and then only if there are sufficiently large number of instructions.
In general, the term CPI is used with respect to a particular instruction set and a given program mix. The time
required to execute a program containing Ic instructions is just T = Ic * CPI * τ.
Each instruction must be fetched from memory, decoded, then operands fetched from memory, the
instruction executed, and the results stored.
The time required to access memory is called the memory cycle time, which is usually k times the processor
cycle time τ. The value of k depends on the memory technology and the processor-memory interconnection
scheme. The processor cycles required for each instruction (CPI) can be attributed to cycles needed for
instruction decode and execution (p), and cycles needed for memory references (m* k).
The total time needed to execute a program can then be rewritten as
T = Ic* (p + m*k)*τ
Parallel Computer Models
Multiprocessor and Multicomputer
Two categories of parallel computers are discussed below namely shared common memory or unshared
distributed memory.
Shared Memory Multiprocessor
• Shared memory parallel computers vary widely, but generally have in common the ability for all
processors to access all memory as global address space.
• Multiple processors ca n operate independently but share the same memory resources.
• Changes in a memory location effected by one processor are visible to all other processors.
• Shared memory machines can be divided into two main classes based upon memory access times:
UMA, NUMA and COMA.
Uniform Memory Access (UMA):
• Most commonly represented today by Symmetric Multiprocessor (SMP) machines.
• Identical processors.
• Equal access and access times to memory.
• Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means if one processor updates
a location in shared memory, all the other processors know about the update. Cache coherency is
accomplished at the hardware level.
CPU
CPU Memory CPU
CPU
Figure 1.5: Shared Memory (UMA)
Non-Uniform Memory Access (NUMA):
• Often made by physically linking two or more SMPs
• One SMP can directly access memory of another SMP
• Not all processors have equal access time to all memories
• Memory access across link is slower
If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent NUMA
By - Aniket Sugandhi
Page no: 4 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
CPU CPU CPU CPU
Memory Memory
CPU CPU CPU CPU
Bus Interconnect
CPU CPU CPU CPU
Memory Memory
CPU CPU CPU CPU
Figure 1.6: Shared Memory (NUMA)
The COMA model (Cache only Memory Access): The COMA model is a special case of NUMA machine in
which the distributed main memories are converted to caches. All caches form a global address space and
there is no memory hierarchy at each processor node.
Advantages:
• Global address space provides a user-friendly programming perspective to memory
• Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
Disadvantages:
• Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can
geometrically increases traffic on the shared memory-CPU path, and for cache coherent systems,
geometrically increase traffic associated with cache/memory management.
• Programmer responsibility for synchronization constructs that insure "correct" access of global
memory.
• Expense: it becomes increasingly difficult and expensive to design and produce shared memory
machines with ever increasing numbers of processors.
Distributed Memory
• Like shared memory systems, distributed memory systems vary widely but share a common
characteristic. Distributed memory systems require a communication network to connect inter-
processor memory.
CPU Memory CPU Memory
CPU Memory CPU Memory
Figure 1.7: Distributed Memory Systems
• Processors have their own local memory. Memory addresses in one processor do not map to
another processor, so there is no concept of global address space across all processors.
• Because each processor has its own local memory, it operates independently.
• Changes it makes to its local memory have no effect on the memory of other processors. Hence,
the concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the task of the programmer
to explicitly define how and when data is communicated. Synchronization between tasks is likewise
the programmer's responsibility.
• Modern multicomputer use hardware routers to pass message.
By - Aniket Sugandhi
Page no: 5 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
Advantages:
• Memory is scalable with number of processors. Increase the number of processors and the size of
memory increases proportionately.
• Each processor can rapidly access its own memory without interference and without the overhead
incurred with trying to maintain cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Disadvantages:
• The programmer is responsible for many of the details associated with data communication
between processors.
• It may be difficult to map existing data structures, based on global memory, to this memory
organization.
Multi-vector and SIMD Computers
A vector operand contains an ordered set of n elements, where n is called the length of the vector. Each
element in a vector is a scalar quantity, which may be a floating point number, an integer, a logical value or a
character.
A vector processor consists of a scalar processor and a vector unit, which could be thought of as an
independent functional unit capable of efficient vector operations.
Vector Supercomputer
Vector computers have hardware to perform the vector operations efficiently. Operands cann ot be used
directly from memory but rather are loaded into registers and are put back in registers after the operation.
Vector hardware has the special ability to overlap or pipeline operand processing. Vector functional units
pipelined, fully segmented each stage of the pipeline performs a step of the function on different operand(s)
once pipeline is full; a new result is produced each clock period (cp).
Figure 1.8: Architecture of Vector Supercomputer
SIMD Computer
The Synchronous parallel architectures coordinate Concurrent operations in lockstep through global clocks,
central control units, or vector unit controllers. A synchronous array of parallel processors is called an
array processor. These processors are composed of N identical processing elements (PES) under the
supervision of a one control unit (CU) This Control unit is a computer with high speed registers, local memory
and arithmetic logic unit.. An array processor is basically a single instruction and multiple data (SIMD)
computers. There are N data streams; one per processor, so different data can be used in each processor.
By - Aniket Sugandhi
Page no: 6 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
The figure below show a typical SIMD or array processor
Shared memory or interconnection network
Data
1 2 N
Streams
Processor 1 Processor 2 Processor N
Instruction Steam
Control
Figure 1.9: Configurations of SIMD Computers
These processors consist of a number of memory modules which can be either global or dedicated to each
processor. Thus the main memory is the aggregate of the memory modules. These Processing elements and
memory unit communicate with each other through an interconnection network. SIMD processors are
especially designed for performing vector computations. SIMD has two basic architectural organizations
a. Array processor using random access memory
b. Associative processors using content addressable memory.
All N identical processors operate under the control of a single instruction stream issued by a central control
unit. The popular examples of this type of SIMD configuration is ILLIAC IV, CM-2, MP-1. Each PEi is
essentially an arithmetic logic unit (ALU) with attached working registers and local memory PEMi for the
storage of distributed data. The CU also has its own main memory for the storage of program. The function of
CU is to decode the instructions and determine where the decoded instruction should be executed. The PE
perform same function (same instruction) synchronously in a lock step fashion under command of CU.
Data and Resource Dependence
Data dependence: The ordering relationship between statements is indicated by the data dependence. Five
type of data dependence are defined below:
1. Flow dependence: A statement S2 is flow dependent on S1 if an execution path exists from s1 to S2 and if
at least one output (variables assigned) of S1feeds in as input (operands to be used) to S2 also called
RAW hazard and denoted as S1 -> S2
2. Anti-dependence: Statement S2 is anti-dependent on the statement S1 if S2 follows S1 in the program
order and if the output of S2 overlaps the input to S1 also called RAW hazard and denoted as S1 |->S2
3. Output dependence: two statements are output dependent if they produce (write) the same output
variable. Also called WAW hazard and denoted as S1 0->S2
4. I/O dependence: Read and write are I/O statements. I/O dependence occurs not because the same
variable is involved but because the same file referenced by both I/O statement.
5. Unknown dependence: The dependence relation between two statements cannot be determined in the
following situations:
• The subscript of a variable is itself subscribed (indirect addressing).
• The subscript does not contain the loop index variable.
• A variable appears more than once with subscripts having different coefficients of the loop variable.
• The subscript is non linear in the loop index variable.
• Parallel execution of program segments which do not have total data independence can produce
non-deterministic results.
Consider the following fragment of any program: S1 Load R1, A
S2 Add R2, R1
S3 Move R1, R3
S4 Store B, R1
By - Aniket Sugandhi
Page no: 7 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
• Here the Forward dependency S1to S2, S3 to S4, S2 to S2
• Anti-dependency from S2to S3
• Output dependency S1 toS3
Control Dependence: This refers to the situation where the order of the execution of statements cannot be
determined before run time. For example all condition statement, where the flow of statement depends on
the output. Different paths taken after a conditional branch may depend on the data hence we need to
eliminate this data dependence among the instructions. This dependence also exists between operations
performed in successive iterations of looping procedure. Control dependence often prohibits
parallelism from being exploited.
Control-independent example:
for (i=0;i<n;i++) {
a[i] = c[i];
if (a[i] < 0) a[i] = 1;
}
Control-dependent example:
for (i=1;i<n;i++) {
if (a[i-1] < 0) a[i] = 1;
}
Control dependence also avoids parallelism to being exploited. Compilers are used to eliminate this
control dependence and exploit the parallelism.
Resource dependence:
Data and control dependencies are based on the independence of the work to be done. Resource
independence is concerned with conflicts in using shared resources, such as registers, integer and floating
point ALUs, etc. ALU conflicts are called ALU dependence. Memory (storage) conflicts are called storage
dependence.
Bernstein’s Conditions - 1
Bernstein’s conditions are a set of conditions which must exist if two processes can execute in parallel.
Notation
Ii is the set of all input variables for a process Pi. Ii is also called the read set or domain of Pi. Oi is the set of
all output variables for a process Pi .Oi is also called write set
If P1 and P2 can execute in parallel (which is written as P1 || P2), then:
Bernstein’s Conditions - 2
In terms of data dependencies, Bernstein’s conditions imply that two processes can execute in parallel if
they are flow-independent, anti-independent, and output- independent. The parallelism relation || is
commutative (Pi || Pj implies Pj || Pi), but not transitive (Pi || Pj and Pj || Pk does not imply Pi || Pk).
Therefore, || is not an equivalence relation. Intersection of the input sets is allowed.
Hardware and Software Parallelism
Hardware parallelism is defined by machine architecture and hardware multiplicity i.e., functional parallelism
times the processor parallelism .It can be characterized by the number of instructions that can be issued per
machine cycle. If a processor issues k instructions per machine cycle, it is called a k-issue processor.
Conventional processors are one-issue machines. This provides the user the information about peak attainable
performance.
By - Aniket Sugandhi
Page no: 8 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
Software Parallelism
Software parallelism is defined by the control and data dependence of programs, and is revealed in the
program’s flow graph i.e., it is defined by dependencies within the code and is a function of algorithm,
programming style, and compiler optimization.
Program partitioning and scheduling
Scheduling and allocation is a highly important issue since an inappropriate scheduling of tasks can fail to
exploit the true potential of the system and can offset the gain from parallelization. In this paper we focus on
the scheduling aspect. The objective of scheduling is to minimize the completion time of a parallel
application by properly allocating the tasks to the processors. In a broad sense, the scheduling problem exists
in two forms: static and dynamic. In static scheduling, which is usually done at compile time, the
characteristics of a parallel program (such as task processing times, communication, data dependencies, and
synchronization requirements) are known before program execution.
A parallel program, therefore, can be represented by a node- and edge-weighted directed acyclic graph (DAG),
in which the node weights represent task processing times and the edge weights represent data dependencies
as well as the communication times between tasks. In dynamic scheduling only, a few assumptions about the
parallel program can be made before execution, and thus, scheduling decisions have to be made on-the-fly.
The goal of a dynamic scheduling algorithm as such includes not only the minimization of the program
completion time but also the minimization of the scheduling overhead which constitutes a significant portion
of the cost paid for running the scheduler. In general dynamic scheduling is an NP hard problem.
Grain size and latency
The size of the parts or pieces of a program that can be considered for parallel execution can vary. The sizes
are roughly classified using the term “granule size,” or simply “granularity.” The simplest measure, for
example, is the number of instructions in a program part. Grain sizes are usually described as fine,
medium or coarse, depending on the level of parallelism involved.
Latency
Latency is the time required for communication between different subsystems in a computer. Memory
latency, for example, is the time required by a processor to access memory. Synchronization latency is the
time required for two processes to synchronize their execution. Computational granularity and
communication latency are closely related. Latency and grain size are interrelated and some general
observation is
• As grain size decreases, potential parallelism increases, and overhead also increases.
• Overhead is the cost of parallelizing a task. The principle overhead is communication latency.
• As grain size is reduced, there are fewer operations between communications, and hence the impact
of latency increases.
• Surface to volume: inter to intra-node communication.
Levels of Parallelism
Instruction Level Parallelism
This fine-grained or smallest granularity level typically involves less than 20 instructions per grain. The number
of candidates for parallel execution varies from 2 to thousands, with about five instructions or statements (on
the average) being the average level of parallelism.
Advantages:
• There are usually many candidates for parallel execution
• Compilers can usually do a reasonable job of finding this parallelism
Loop-level Parallelism
Typical loop has less than 500 instructions. If a loop operation is independent between iterations, it can
be handled by a pipeline, or by a SIMD machine. Most optimized program construct to execute on a
By - Aniket Sugandhi
Page no: 9 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
parallel or vector machine. Some loops (e.g. recursive) are difficult to handle. Loop-level parallelism is still
considered fine grain computation.
Figure 1.10: Level of Parallelism in Program Execution on Modern Computers
Procedure-level Parallelism
Medium-sized grain, usually less than 2000 instructions. Detection of parallelism is more difficult than with
smaller grains; inter-procedural dependence analysis is difficult and history-sensitive. Communication
requirement less than instruction level SPMD (single procedure multiple data) is a special case Multitasking
belongs to this level.
Subprogram-level Parallelism
Job step level; grain typically has thousands of instructions; medium- or coarse-grain level. Job steps can
overlap across different jobs. Multi-programming conducted at this level No compilers available to exploit
medium- or coarse-grain parallelism at present.
Job or Program-Level Parallelism
It corresponds to execution of essentially independent jobs or programs on a parallel computer. This is
practical for a machine with a small number of powerful processors, but impractical for a machine with a
large number of simple processors (since each processor would take too long to process a single job).
Communication Latency
Balancing granularity and latency can yield better performance. Various latencies attributed to machine
architecture, technology, and communication patterns used. Latency imposes a limiting factor on machine
scalability. Ex. Memory latency increases as memory capacity increases, limiting the amount of memory that
can be used with a given tolerance for communication latency.
Inter-processor Communication Latency
• Needs to be minimized by system designer
• Affected by signal delays and communication patterns Ex. n communicating tasks may require n (n -
1)/2 communication links, and the complexity grows quadratically, effectively limiting the number of
processors in the system.
Communication Patterns
• Determined by algorithms used and architectural support provided
• Patterns include permutations broadcast multicast conference
• Tradeoffs often exist between granularity of parallelism and communication demand.
Program Graphs and Packing
A program graph is similar to a dependence graph Nodes = { (n,s) }, where n = node name, s = size (larger
By - Aniket Sugandhi
Page no: 10 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
s = larger grain size).
Edges = { (v,d) }, where v = variable being “communicated,” and d = communication delay.
Packing two (or more) nodes produces a node with a larger grain size and possibly more edges to other nodes.
Packing is done to eliminate unnecessary communication delays or reduce overall scheduling overhead.
Scheduling
A schedule is a mapping of nodes to processors and start times such that communication delay requirements
are observed, and no two nodes are executing on the same processor at the same time. Some general
scheduling goals:
• Schedule all fine-grain activities in a node to the same processor to minimize communication
delays.
• Select grain sizes for packing to achieve better schedules for a particular parallel machine.
• Node Duplication
Grain packing may potentially eliminate interprocessor communication, but it may not always produce a
shorter schedule. By duplicating nodes (that is, executing some instructions on multiple processors),
we may eliminate some interprocessor communication, and thus produce a shorter schedule.
Program flow mechanism
Conventional machines used control flow mechanism in which order of program execution explicitly
stated in user programs. Dataflow machines which instructions can be executed by determining operand
availability.
Reduction machines trigger an instruction’s execution based on the demand for its results.
Control Flow vs. Data Flow: In Control flow computers the next instruction is executed when the last
instruction as stored in the program has been executed where as in Data flow computers an instruction
executed when the data (operands) required for executing that instruction is available Control flow machines
used shared memory for instructions and data. Since variables are updated by many instructions, there may
be side effects on other instructions. These side effects frequently prevent parallel processing. Single
processor systems are inherently sequential.
Instructions in dataflow machines are unordered and can be executed as soon as their operands are available;
data is held in the instructions themselves. Data tokens are passed from an instruction to its dependents to
trigger execution.
Data Flow Features
No need for shared memory program counter control sequencer Special mechanisms are required to detect
data availability match data tokens with instructions needing them enable chain reaction of asynchronous
instruction execution
A Dataflow Architecture – 1 The Arvind machine (MIT) has N PEs and an N -by –N interconnection network.
Each PE has a token-matching mechanism that dispatches only instructions with data tokens available. Each
datum is tagged with
• address of instruction to which it belongs
• context in which the instruction is being executed
Tagged tokens enter PE through local path (pipelined), and can also be communicated to other PEs through
the routing network. Instruction addresses effectively replace the program counter in a control flow machine.
Context identifier effectively replaces the frame base register in a control flow machine. Since the dataflow
machine matches the data tags from one instruction with successors, synchronized instruction execution is
implicit.
An I-structure in each PE is provided to eliminate excessive copying of data structures. Each word of the I-
structure has a two-bit tag indicating whether the value is empty, full, or has pending read requests.
This is a retreat from the pure dataflow approach. Special compiler technology needed for dataflow machines.
By - Aniket Sugandhi
Page no: 11 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
Demand-Driven Mechanisms
Data-driven machines select instructions for execution based on the availability of their operands; this is
essentially a bottom-up approach.
Demand-driven machines take a top-down approach, attempting to execute the instruction (a
demander) that yields the final result. This triggers the execution of instructions that yield its operands, and so
forth. The demand-driven approach matches naturally with functional programming languages (e.g. LISP and
SCHEME).
Pattern driven computers: An instruction is executed when we obtain a particular data patterns as output.
There are two types of pattern driven computers
String-reduction model: each demander gets a separate copy of the expression string to evaluate each
reduction step has an operator and embedded reference to demand the corresponding operands each
operator is suspended while arguments are evaluated
Graph-reduction model: expression graph reduced by evaluation of branches or sub-graphs, possibly in
parallel, with demanders given pointers to results of reductions. Based on sharing of pointers to arguments;
traversal and reversal of pointers continues until constant arguments are encountered.
System interconnect architecture
Various types of interconnection networks have been suggested for SIMD computers. These are basically
classified have been classified on network topologies into two categories namely
• Static Networks
• Dynamic Networks
Static versus Dynamic Networks
The topological structure of an SIMD array processor is mainly characterized by the data routing network used
in interconnecting the processing elements.
The topological structure of an SIMD array processor is mainly characterized by the data routing network used
in the interconnecting the processing elements. To execute the communication the routing function f is
executed and via the interconnection network the PEi copies the content of its Ri register into the Rf(i)
register of PE f(i). The f(i) the processor identified by the mapping function f. The data routing operation
occurs in all active PEs simultaneously.
Network properties and routing
The goals of an interconnection network are to provide low-latency high data transfer rate wide
communication bandwidth. Analysis includes latency bisection bandwidth data- routing functions scalability of
parallel architecture These Network usually represented by a graph with a finite number of nodes linked by
directed or undirected edges.
Number of nodes in graph = network size.
Number of edges (links or channels) incident on a node = node degree d (also note in and out degrees when
edges are directed).
Node degree reflects number of I/O ports associated with a node, and should ideally be small and constant.
Network is symmetric if the topology is the same looking from any node; these are easier to implement or to
program.
Diameter: The maximum distance between any two processors in the network or in other words we can
say Diameter, is the maximum number of (routing) processors through which a message must pass on its way
from source to reach destination. Thus diameter measures the maximum delay for transmitting a message
from one processor to another as it determines communication time hence smaller the diameter better
will be the network topology.
By - Aniket Sugandhi
Page no: 12 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
Connectivity: How many paths are possible between any two processors i.e., the multiplicity of paths between
two processors. Higher connectivity is desirable as it minimizes contention.
Arch connectivity of the network: the minimum number of arcs that must be removed for the network to
break it into two disconnected networks. The arch connectivity of various network are as follows
• 1 for linear arrays and binary trees
• 2 for rings and 2-d meshes
• 4 for 2-d torus
• d for d-dimensional hypercubes
Larger the arch connectivity lesser the conjunctions and better will be network topology. Channel width :The
channel width is the number of bits that can communicated simultaneously by a interconnection bus
connecting two processors:
Bisection Width and Bandwidth: In order divide the network into equal halves we require the remove some
communication links. The minimum numbers of such communication links that have to be removed are called
the Bisection Width. Bisection width basically provide us the information about the largest number of
messages which can be sent simultaneously (without needing to use the same wire or routing processor at
the same time and so delaying one another), no matter which processors are sending to which other
processors. Thus larger the bisection width is the better the network topology is considered. Bisection
Bandwidth is the minimum volume of communication allowed between two halves of the network with
equal numbers of processors. This is important for the networks with weighted arcs where the weights
correspond to the link width i.e., (how much data it can transfer). The Larger bisection width the better
network topology is considered.
Cost the cost of networking can be estimated on variety of criteria where we consider the number of
communication links or wires used to design the network as the basis of cost estimation, smaller the better
the cost.
Data Routing Functions: A data routing network is used for inter –PE data exchange. It can be static as in case
of hypercube routing network or dynamic such as multistage network. Various type of data routing functions
are Shifting, Rotating, Permutation (one to one), Broadcast (one to all), Multicast (many to many),
Personalized broadcast (one to many), Shuffle, Exchange Etc.
Factors Affecting Performance
Functionality – how the network supports data routing, interrupt handling, synchronization, request/message
combining, and coherence.
Network latency – worst-case time for a unit message to be transferred
Bandwidth – maximum data rate.
Hardware complexity – implementation costs for wire, logic, switches, connectors, etc. Scalability – how
easily does the scheme adapt to an increasing number of processors, memories, etc.
Static interconnection networks
Static interconnection networks for elements of parallel systems (ex. processors, memories) are based on
fixed connections that cannot be modified without a physical re-designing of a system. Static interconnection
networks can have many structures such as a linear structure (pipeline), a matrix, a ring, a torus, a complete
connection structure, a tree, a star, a hyper-cube.
In linear and matrix structures, processors are interconnected with their neighbors in a regular structure on a
plane. A torus is a matrix structure in which elements at the matrix borders are connected in the frame of the
same lines and columns. In a complete connection structure, all elements (ex. processors) are directly
interconnected (point-to-point)
By - Aniket Sugandhi
Page no: 13 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
Figure 1.11: Linear structure (pipeline) of interconnections in a parallel system
Figure 1.12: 2D Torus Figure 1.13: Matrix
Figure 1.14: A complete interconnection Figure 1.15: A Ring Figure 1.16: A Chordal Ring
In a tree structure, system elements are set in a hierarchical structure from the root to the leaves, see the
figure below. All elements of the tree (nodes) can be processors or only leaves are processors and the rest of
nodes are linking elements, which intermediate in transmissions. If from one node, 2 or more connections go
to different nodes towards the leaves - we say about a binary or k-nary tree. If from one node, more than one
connection goes to the neighboring node, we speak about a fat tree. A binary tree, in which in the direction of
the root, the number of connections between neighboring nodes increases twice, provides a uniform
transmission throughput between the tree levels, a feature not available in a standard tree.
Figure 1.17: Binary tree Figure 1.18: Fat tree
In a hypercube structure, processors are interconnected in a network, in which connections between
processors correspond to edges of an n-dimensional cube. The hypercube structure is very advantageous since
it provides a low network diameter equal to the degree of the cube. The network diameter is the number of
edges between the most distant nodes. . The network diameter determines the number in intermediate
By - Aniket Sugandhi
Page no: 14 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
transfers that have to be done to send data between the most distant nodes of a network. In this respect the
hypercubes have very good properties, especially for a very large number of constituent nodes. Due to this
hypercubes are popular networks in existing parallel systems.
Dynamic interconnection networks
Dynamic interconnection networks between processors enable changing (reconfiguring) of the connection
structure in a system. It can be done before or during parallel program execution. So, we can speak about
static or dynamic connection reconfiguration.
The dynamic networks are those networks where the route through which data move from one PE to
another is established at the time communication has to be performed. Usually all processing elements are
equidistant and an interconnection path is established when two processing element want to communicate by
use of switches. Such systems are more difficult to expand as compared to static network. Examples: Bus-
based, Crossbar, Multistage Networks. Here the Routing is done by comparing the bit-level representation of
source and destination addresses. If there is a match goes to next stage via pass- through else in case of
it mismatch goes via cross-over using the switch.
There are two classes of dynamic networks namely
• single stage network
• multi stage
Single Stage Networks
A single stage switching network with N input selectors (IS) and N output selectors (OS). Here at each network
stage there is a 1- to-D demultiplexer corresponding to each IS such that 1<D<N and each OS is an M-to-1
multiplexer such that 1<M<=N. Cross bar network is a single stage network with D=M=N. In order to establish
a desired connecting path different path control signals will be applied to all IS and OS selectors. The single
stage network is also called as re-circulating network as in this network connection the single data items may
have to re-circulate several time through the single stage before reaching their final destinations. The number
of recirculation depends on the connectivity in the single stage network. In general higher the hardware
connectivity the lesser is the number of recirculation. In cross bar network only one circulation is needed to
establish the connection path. The cost of completed connected cross bar network is O(N2) which is very high
as compared to other most re-circulating networks which have cost O(N log N) or lower hence are more cost
effective for large value of N.
Multistage Networks
Many stages of interconnected switches form a multistage SIMD network. It is basicaaly consist of three
characteristic features
• The switch box,
• The network topology
• The control structure
Many stages of interconnected switches form a multistage SIMD networks. Eachbox is essentially an
interchange device with two inputs and two outputs. The four possible states of a switch box are which
are shown in figure.
• Straight
• Exchange
• Upper Broadcast
• Lower broadcast.
A two function switch can assume only two possible state namely state or exchange states. However a
four function switch box can be any of four possible states. A multistage network is capable of
connecting any input terminal to any output terminal. Multi-stage networks are basically constructed by so
called shuffle-exchange switching element, which is basically a 2 x 2 crossbar. Multiple layers of these
elements are connected and form the network.
By - Aniket Sugandhi
Page no: 15 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
Figure 1.19: A two-by-two switching box and its four interconnection states
A multistage network is capable of connecting an arbitrary input terminal to an arbitrary output terminal.
G enerally it is consist of n stages where N = 2n is the number of input and output lines. And each stage use
N/2 switch boxes. The interconnection patterns from one stage to another stage are determined by network
topology. Each stage is connected to the next stage by at least N paths. The total wait time is proportional to
the number stages i.e., n and the total cost depend on the total number of switches used and that are Nlog2N.
The control structure can be individual stage control i.e., the same control signal is used to set all switch
boxes in the same stages thus we need n control signal. The second control structure is individual box
control where a separate control signal is used to set the state of each switch box. This provide flexibility at
the same time require n2/2 control signal which increases the complexity of the control circuit. In between
path is use of partial stage control.
Bus networks
A bus is the simplest type of dynamic interconnection networks. It constitutes a common data transfer path
for many devices. Depending on the type of implemented transmissions we have serial busses and parallel
busses. The devices connected to a bus can be processors, memories, I/O units, as shown in the figure below.
Figure 1.20: A diagram of a system based on a single bus Figure 1.21: A diagram of a system based on a multibus
Only one device connected to a bus can transmits data. Many devices can receive data. In the last case we
speak about a multicast transmission. If data are meant for all devices connected to a bus we speak about a
broadcast transmission. Accessing the bus must be synchronized. It is done with the use of two methods: a
token method and a bus arbiter method. With the token method, a token (a special control message or signal)
is circulating between the devices connected to a bus and it gives the right to transmit to the bus to a single
device at a time. The bus arbiter receives data transmission requests from the devices connected to a bus. It
selects one device according to a selected strategy (ex. using a system of assigned priorities) and sends an
acknowledge message (signal) to one of the requesting devices that grants it the transmitting right. After the
selected device completes the transmission, it informs the arbiter that can select another request. The
receiver (s) address is usually given in the header of the message. Special header values are used for the
broadcast and multicasts. All receivers read and decode headers. These devices that are specified in the
header, read-in the data transmitted over the bus.
By - Aniket Sugandhi
Page no: 16 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
The throughput of the network based on a bus can be increased by the use of a multibus network shown in
the figure below. In this network, processors connected to the busses can transmit data in parallel (one for
each bus) and many processors can read data from many busses at a time.
Crossbar switches
A crossbar switch is a circuit that enables many interconnections between elements of a parallel system at a
time. A crossbar switch has a number of input and output data pins and a number of control pins. In response
to control instructions set to its control input, the crossbar switch implements a stable connection of a
determined input with a determined output. The diagrams of a typical crossbar switch are shown in the figure
below.
Figure 1.22: Crossbar switch
Figure 1.23: Crossbar switch a) general scheme, b) internal structure
Control instructions can request reading the state of specified input and output pins i.e. their current
connections in a crossbar switch. Crossbar switches are built with the use of multiplexer circuits, controlled by
latch registers, which are set by control instructions. Crossbar switches implement direct, single non-blocking
connections, but on the condition that the necessary input and output pins of the switch are free. The
connections between free pins can always be implemented independently on the status of other connections.
New connections can be set during data transmissions through other connections. The non-blocking
connections are a big advantage of crossbar switches. Some crossbar switches enable broadcast transmissions
but in a blocking manner for all other connections. The disadvantage of crossbar switches is that extending
their size, in the sense of the number of input/output pins, is costly in terms of hardware. Because of that,
crossbar switches are built up to the size of 100 input/output pins.
Multiport Memory
In the multiport memory system, different memory module and CPUs have separate buses. The module has
internal control logic to determine port which will access to memory at any given time. Priorities are assigned
to each memory port to resolve memory access conflicts.
Advantages:
Because of the multiple paths high transfer rate can be achieved.
Disadvantages:
By - Aniket Sugandhi
Page no: 17 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in
Downloaded from www.rgpvnotes.in
It requires expensive memory control logic and a large number of cables and connections.
Figure 1.24: Multiport memory organization
Multistage and combining networks
Multistage connection networks are designed with the use of small elementary crossbar switches (usually they
have two inputs) connected in multiple layers. The elementary crossbar switches can implement 4 types of
connections: straight, crossed, upper broadcast and lower broadcast. All elementary switches are controlled
simultaneously. The network like this is an alternative for crossbar switches if we have to switch a large
number of connections, over 100. The extension cost for such a network is relatively low.
In such networks, there is no full freedom in implementing arbitrary connections when some connections have
already been set in the switch. Because of this property, these networks belong to the category of so called
blocking networks.
However, if we increase the number of levels of elementary crossbar switches above the number necessary to
implement connections for all pairs of inputs and outputs, it is possible to implement all requested
connections at the same time but statically, before any communication is started in the switch. It can be
achieved at the cost of additional redundant hardware included into the switch. The block diagram of such a
network, called the Benes network, is shown in the figure below.
Figure 1.25: A multistage connection network for parallel systems
To obtain nonblocking properties of the multistage connection network, the redundancy level in the circuit
should be much increased. To build a nonblocking multistage network n x n, the elementary two-input
switches have to be replaced by 3 layers of switches n x m, r x r and m x n, where m ³ 2n - 1 and r is the
number of elementary switches in the layer 1 and 3. Such a switch was designed by a French mathematician
Clos and it is called the Clos network. This switch is commonly used to build large integrated crossbar switches.
The block diagram of the Clos network is shown in the figure below.
Figure 1.26: A nonblocking Clos interconnection network
By - Aniket Sugandhi
Page no: 18 Follow us on facebook to get instant updates: fb.me/rgpvnotes.in