Parallel Architecture
Classification
The following classifications of parallel computers have been identified:
1) Classification based on the instruction and data streams
2) Classification based on the structure of computers
3) Classification based on how the memory is accessed
4) Classification based on grain size
Instruction Stream and Data Stream
The term ‘stream’ refers to a sequence or flow of either instructions or data
operated on by the computer. In the complete cycle of instruction execution, a
flow of instructions from main memory to the CPU is established.
This flow of instructions is called the instruction stream. Similarly, there is a bi-directional
flow of operands between processor and memory. This flow of operands
is called the data stream.
Flynn’s Classification
Flynn’s classification is based on the multiplicity of instruction streams and data streams
observed by the CPU during program execution. Let Is and Ds be the minimum number of
instruction streams and data streams, respectively, flowing at any point in the execution.
The computer organisation can then be categorized as follows:
Single Instruction and Single Data stream (SISD)
In this organisation, sequential execution of instructions is performed by one CPU containing a
single processing element (PE), i.e., an ALU, under one control unit, as shown in Figure 4. Therefore,
SISD machines are conventional serial computers that process only one stream of instructions and
one stream of data.
Examples of SISD machines include:
• CDC 6600 which is unpipelined but has multiple functional units.
• CDC 7600 which has a pipelined arithmetic unit.
• Amdahl 470/6 which has pipelined instruction processing.
• Cray-1 which supports vector processing.
Single Instruction and Multiple Data stream (SIMD)
In this organisation, multiple processing elements work under the control of a single control unit. It
has one instruction stream and multiple data streams.
All the processing elements of this organization receive the same instruction broadcast from the CU.
Main memory can also be divided into modules for generating multiple data streams acting as a
distributed memory as shown in Figure 5. Therefore, all the processing elements simultaneously
execute the same instruction and are said to be 'lock-stepped' together.
Each processor takes the data from its own memory, and hence the processors operate on distinct data streams.
(Some systems also provide a shared global memory for communications.) Every processor must
be allowed to complete its instruction before the next instruction is taken for execution. Thus, the
execution of instructions is synchronous.
Examples of SIMD organisation are ILLIAC-IV, PEPE, BSP, STARAN, MPP, DAP and the
Connection Machine (CM-1).
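The lock-step behaviour described above can be illustrated with a small simulation. This is
only a sketch under stated assumptions: the names simd_step, run_simd and local_memories are
hypothetical, each inner list stands for the local memory of one processing element, and each
instruction is modelled as a Python function broadcast by the control unit.

```python
from typing import Callable, List

def simd_step(instruction: Callable[[int], int],
              local_memories: List[List[int]]) -> List[List[int]]:
    """Broadcast one instruction; every PE applies it to its own local data."""
    # The list is rebuilt completely before returning, which models the rule that
    # all PEs must finish the current instruction before the next one is issued.
    return [[instruction(x) for x in memory] for memory in local_memories]

def run_simd(program: List[Callable[[int], int]],
             local_memories: List[List[int]]) -> List[List[int]]:
    """Issue the instructions of a single instruction stream one at a time."""
    for instruction in program:
        local_memories = simd_step(instruction, local_memories)
    return local_memories

# One instruction stream: "add 1", then "multiply by 2".
program = [lambda x: x + 1, lambda x: x * 2]
# Three PEs, each with its own data stream.
memories = [[1, 2], [10, 20], [100, 200]]
result = run_simd(program, memories)
```

Here a single instruction stream (program) drives three distinct data streams, one per
processing element.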
Multiple Instruction and Single Data stream (MISD)
In this organization, multiple processing elements are organised under the control of multiple
control units.
Each control unit handles one instruction stream, which is processed through its corresponding
processing element.
However, the processing elements operate on only a single data stream at a time. Therefore, for
handling multiple instruction streams and a single data stream, multiple control units and multiple
processing elements are organised in this classification.
All processing elements interact with the common shared memory that provides the
single data stream, as shown in Figure 6.
The only known example of a computer capable of MISD operation is the C.mmp built by
Carnegie-Mellon University.
This classification is not popular in commercial machines, as the concept of a single data stream
being processed by multiple processors is rarely applied.
For specialized applications, however, MISD organisation can be very helpful.
For example, real-time computers need to be fault tolerant, so several processors operate on
the same data to produce redundant results. This is also known as N-version
programming.
These redundant results are compared and should be the same; otherwise the faulty unit
is replaced. Thus MISD machines can be applied to fault tolerant real time computers.
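The comparison of redundant results can be sketched as follows. This is an illustration, not a
description of any real machine: the function names are hypothetical, the three versions are toy
implementations, and majority voting is an assumed policy for deciding which unit is faulty (the
text only says that results are compared).

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

def version_a(values: Sequence[int]) -> int:
    return sum(v * v for v in values)

def version_b(values: Sequence[int]) -> int:
    total = 0
    for v in values:
        total += v ** 2
    return total

def version_c(values: Sequence[int]) -> int:
    return sum(values)  # deliberately faulty: forgets to square

def n_version_execute(versions: List[Callable[[Sequence[int]], int]],
                      data: Sequence[int]) -> Tuple[int, List[int]]:
    """Run every version on the same data and compare the results."""
    results = [version(data) for version in versions]
    winner, votes = Counter(results).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among redundant results")
    # Units that disagree with the majority are reported as faulty.
    faulty_units = [i for i, r in enumerate(results) if r != winner]
    return winner, faulty_units

accepted, faulty = n_version_execute([version_a, version_b, version_c], (1, 2, 3))
```

All three versions read the same single data stream, (1, 2, 3), while each follows its own
instruction stream.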
Multiple Instruction and Multiple Data stream (MIMD)
In this organization, multiple processing elements and multiple control units are organized as in MISD.
The difference is that in this organization multiple instruction streams operate on multiple data
streams.
Therefore, for handling multiple instruction streams, multiple control units and multiple processing
elements are organized such that the processing elements handle multiple data streams
from the main memory, as shown in Figure 7.
The processors work on their own data with their own instructions. Tasks executed by different
processors can start or finish at different times. They are not lock-stepped, as in SIMD computers, but
run asynchronously.
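A minimal sketch of this behaviour, assuming Python threads stand in for processors: each task
has its own instructions (a different function) and its own data, and the tasks start and finish
independently. The task functions are hypothetical, and threads in CPython illustrate the
asynchronous control flow rather than true hardware parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

def total(values):
    return sum(values)

def longest(words):
    return max(words, key=len)

def reverse(text):
    return text[::-1]

# Each entry pairs one instruction stream (a function) with its own data stream.
tasks = [
    (total, [1, 2, 3, 4]),
    (longest, ["bus", "crossbar", "ring"]),
    (reverse, "MIMD"),
]

with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    futures = [pool.submit(function, data) for function, data in tasks]
    # result() waits for each task separately; the tasks are not lock-stepped.
    results = [future.result() for future in futures]
```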
This classification describes the parallel computer proper: in the real sense, an MIMD
organisation is what is meant by a parallel computer. All multiprocessor systems fall under this classification.
Examples include: C.mmp, Burroughs D825, Cray-2, S1, Cray X-MP, HEP, Pluribus, IBM 370/168 MP,
Univac 1100/80, Tandem/16, IBM 3081/3084, C.m*, BBN Butterfly, Meiko Computing Surface (CS-1),
FPS T/40000, iPSC.
Of the classifications discussed above, MIMD organization is the most popular for a
parallel computer. In the real sense, parallel computers execute the instructions in
MIMD mode.
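The four categories follow directly from whether each stream count is one or more than one. A
minimal sketch, with the function name flynn_class chosen here for illustration:

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category for the given numbers of streams (Is, Ds)."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a computer needs at least one stream of each kind")
    instruction_part = "S" if instruction_streams == 1 else "M"
    data_part = "S" if data_streams == 1 else "M"
    return f"{instruction_part}I{data_part}D"

categories = [flynn_class(1, 1), flynn_class(1, 64),
              flynn_class(4, 1), flynn_class(4, 4)]
```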
2. HANDLER’S CLASSIFICATION
Handler's classification addresses the computer at three distinct levels:
• Processor control unit (PCU),
• Arithmetic logic unit (ALU),
• Bit-level circuit (BLC).
The PCU corresponds to a processor or CPU, the ALU corresponds to a functional
unit or a processing element and the BLC corresponds to the logic circuit needed to
perform one-bit operations in the ALU.
Handler's classification uses the following three pairs of integers to describe a
computer:
Computer = (p * p', a * a', b * b')
where
p  = number of PCUs
p' = number of PCUs that can be pipelined
a  = number of ALUs controlled by each PCU
a' = number of ALUs that can be pipelined
b  = number of bits in ALU or processing element (PE) word
b' = number of pipeline segments on all ALUs or in a single PE
The following rules and operators are used to show the relationship between
various elements of the computer:
• The '*' operator is used to indicate that the units are pipelined or macro-pipelined
with a stream of data running through all the units.
• The '+' operator is used to indicate that the units are not pipelined but work on
independent streams of data.
• The 'v' operator is used to indicate that the computer hardware can work in one of
several modes.
• The '~' symbol is used to indicate a range of values for any one of the
parameters.
• Peripheral processors are shown before the main processor using another three
pairs of integers. If the value of the second element of any pair is 1, it may be omitted
for brevity.
Handler's classification is best explained by showing how the rules and operators
are used to classify several machines.
The CDC 6600 has a single main processor supported by 10 I/O processors. One
control unit coordinates one ALU with a 60-bit word length. The ALU has 10
functional units which can be formed into a pipeline. The 10 peripheral I/O
processors may work in parallel with each other and with the CPU. Each I/O
processor contains one 12-bit ALU.
The description for the 10 I/O processors is: CDC 6600 (I/O) = (10, 1, 12)
The description for the main processor is: CDC 6600 (main) = (1, 1 * 10, 60)
The main processor and the I/O processors can be regarded as forming a macro-pipeline,
so the '*' operator is used to combine the two structures:
CDC 6600 = (I/O processors) * (central processor) = (10, 1, 12) * (1, 1 * 10, 60)
Texas Instruments' Advanced Scientific Computer (ASC) has one controller coordinating
four arithmetic units. Each arithmetic unit is an eight-stage pipeline with 64-bit words.
Thus we have:
ASC = (1, 4, 64 * 8)
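The notation can be reproduced mechanically from the six parameters. The following sketch is an
illustration only: the helper names pair and handler are hypothetical, and it covers just the '*'
operator and the rule that a second element equal to 1 may be omitted.

```python
def pair(count: int, pipelined: int = 1) -> str:
    """Format one pair, omitting the second element when it is 1."""
    return str(count) if pipelined == 1 else f"{count} * {pipelined}"

def handler(p: int, a: int, b: int,
            p_prime: int = 1, a_prime: int = 1, b_prime: int = 1) -> str:
    """Format the triple (p * p', a * a', b * b')."""
    return f"({pair(p, p_prime)}, {pair(a, a_prime)}, {pair(b, b_prime)})"

# CDC 6600: ten I/O processors, each with one 12-bit ALU.
cdc_io = handler(p=10, a=1, b=12)
# CDC 6600 main processor: one 60-bit ALU with 10 pipelinable functional units.
cdc_main = handler(p=1, a=1, a_prime=10, b=60)
# Peripheral processors are written first and joined by the macro-pipeline operator.
cdc_6600 = f"{cdc_io} * {cdc_main}"
# TI ASC: one controller, four arithmetic units, 64-bit words, eight pipeline stages.
asc = handler(p=1, a=4, b=64, b_prime=8)
```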
While Flynn's classification is easy to use, Handler's classification is cumbersome.
The direct use of numbers in the nomenclature of Handler’s classification makes it
much more abstract and hence difficult.
Handler's classification is highly geared towards the description of pipelines and
chains.
While it is well able to describe the parallelism in a single processor, the variety of
parallelism in multiprocessor computers is not addressed well.
3. STRUCTURAL CLASSIFICATION
Flynn’s classification discusses the behavioural concept and does not take into
consideration the computer’s structure.
As we have seen, a parallel computer (MIMD) can be characterised as a set of multiple
processors and shared memory or memory modules communicating via an interconnection
network.
When multiprocessors communicate through global shared memory modules, the
organisation is called a shared memory computer or tightly coupled system, as shown in
Figure 9.
Similarly, when every processor in a multiprocessor system has its own local memory and
the processors communicate via messages transmitted between their local memories, the
organisation is called a distributed memory computer or loosely coupled system, as
shown in Figure 10.
Shared Memory System / Tightly Coupled System
Shared memory multiprocessors have the following characteristics:
• Every processor communicates through a shared global memory
• For high speed real time processing, these systems are preferable as their throughput
is high as compared to loosely coupled systems.
The processors, memory modules and I/O channels are connected through the following
interconnection networks:
i) Processor-Memory Interconnection Network (PMIN)
This is a switch that connects various processors to different memory modules.
Connecting every processor to every memory module in a single stage, as a
crossbar switch does, may become complex. Therefore, a multistage network can be adopted.
Conflicts can arise when several processors attempt to access the same
memory module. Such conflicts are also resolved by the PMIN.
ii) Input-Output-Processor Interconnection Network (IOPIN)
This interconnection network is used for communication between processors and I/O
channels. All processors communicate with an I/O channel to interact with an I/O device
with the prior permission of IOPIN.
iii) Interrupt Signal Interconnection Network (ISIN)
When a processor wants to send an interrupt to another processor, the interrupt
first goes to the ISIN, through which it is passed to the destination processor. In this way,
synchronisation between processors is implemented by the ISIN. Moreover, if one processor
fails, the ISIN can broadcast a message about the failure to the other processors.
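The conflict resolution attributed to the PMIN above can be sketched as a simple arbiter. The
policy used here (the lowest-numbered processor wins, the others stall for the cycle) is an
assumption made for illustration; the text does not say which policy a real PMIN applies, and
the function name arbitrate is hypothetical.

```python
from typing import Dict

def arbitrate(requests: Dict[int, int]) -> Dict[int, int]:
    """Map each requested memory module to the one processor granted access.

    requests maps processor id -> requested memory module id for one cycle.
    """
    grants: Dict[int, int] = {}
    for processor in sorted(requests):      # assumed fixed priority: lowest id first
        module = requests[processor]
        grants.setdefault(module, processor)  # first requester keeps the module
    return grants

# Processors 0, 1 and 3 all want module 2; processor 2 wants module 0.
requests = {0: 2, 1: 2, 2: 0, 3: 2}
grants = arbitrate(requests)
# Processors that lost arbitration must retry in a later cycle.
stalled = sorted(p for p, module in requests.items() if grants[module] != p)
```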
Since every reference to memory in tightly coupled systems goes through the interconnection network,
there is a delay in executing instructions. To reduce this delay, every processor may use a cache
memory for its frequent references, as shown in Figure 12.
Uniform Memory Access Model (UMA)
In this model, main memory is uniformly shared by all processors in multiprocessor systems and each processor
has equal access time to shared memory. This model is used for time-sharing applications in a multi user
environment.
Non-Uniform Memory Access Model (NUMA)
In shared memory multiprocessor systems, a local memory can be attached to every processor. The
collection of all local memories forms the global shared memory. In this way, global memory is distributed among
all the processors. Access to a local memory is uniform for its own processor, since the memory is
attached to it. But if a reference is to the local memory of some other, remote processor, then
the access is not uniform: it depends on the location of the memory. Thus, not all memory words are accessed
uniformly.
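The difference between local and remote references can be sketched with a toy cost model. The
latency values, the contiguous partitioning of the global address space across nodes, and the
function names are all assumptions made for illustration; in a real NUMA machine the remote cost
also varies with the distance to the remote memory.

```python
LOCAL_ACCESS_TIME = 1    # assumed cost of a local reference, arbitrary units
REMOTE_ACCESS_TIME = 5   # assumed cost of a remote reference, arbitrary units

def home_node(address: int, words_per_node: int) -> int:
    """Node whose local memory holds this global address (assumed contiguous layout)."""
    return address // words_per_node

def numa_access_time(processor: int, address: int, words_per_node: int) -> int:
    """Cost of one reference: cheap if the word is local, expensive if remote."""
    if home_node(address, words_per_node) == processor:
        return LOCAL_ACCESS_TIME
    return REMOTE_ACCESS_TIME

# With 100 words per node, address 150 lives in the local memory of processor 1.
local_cost = numa_access_time(processor=1, address=150, words_per_node=100)
remote_cost = numa_access_time(processor=0, address=150, words_per_node=100)
```

In the UMA model, by contrast, the same function would return one constant for every processor
and address.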
Cache-Only Memory Access Model (COMA)
As discussed earlier, shared memory multiprocessor systems may use a cache memory with every
processor to reduce the execution time of an instruction. If, in the NUMA model, cache memories are used
instead of local memories, the result is the COMA model. The collection of cache memories forms a global
memory space. Remote cache access is also non-uniform in this model.
Loosely Coupled Systems
These systems do not share a global memory, because the shared memory concept gives rise to the
problem of memory conflicts, which in turn slows down the execution of instructions. To
alleviate this problem, each processor in a loosely coupled system has a large local memory (LM),
which is not shared by any other processor.
Thus, such systems have multiple processors, each with its own local memory and a set of I/O devices. Each
such set of processor, memory and I/O devices makes up a computer system.
Therefore, these systems are also called multi-computer systems. These computer systems are
connected together via a message passing interconnection network, through which processes
communicate by passing messages to one another.
Since every computer system or node in multicomputer systems has a separate memory, they are called
distributed multicomputer systems.
Since local memories are accessible to the attached processor only, no processor can access
remote memory. Therefore, these systems are also known as no-remote memory access (NORMA)
systems.
The message passing interconnection network provides a connection to every node, and the
inter-node communication by messages depends on the type of interconnection network.
For example, the interconnection network for a non-hierarchical system can be a shared bus.
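Message passing between nodes can be sketched with two threads that share no data structures
except the message channels. This is only an analogy under stated assumptions: threads stand in
for separate computers, queue.Queue stands in for the interconnection network, and the node
functions are hypothetical.

```python
import queue
import threading

def node_a(send: queue.Queue, receive: queue.Queue, report: dict) -> None:
    local_memory = [1, 2, 3]          # private to node A, never read by node B
    send.put(sum(local_memory))       # send a message with the partial sum
    report["a"] = receive.get()       # wait for the reply message

def node_b(receive: queue.Queue, send: queue.Queue, report: dict) -> None:
    local_memory = [10, 20]           # private to node B
    partial = receive.get()           # wait for the message from node A
    grand_total = partial + sum(local_memory)
    send.put(grand_total)             # reply with the combined result
    report["b"] = grand_total

a_to_b: queue.Queue = queue.Queue()
b_to_a: queue.Queue = queue.Queue()
report: dict = {}                     # used only by this harness to collect output

nodes = [
    threading.Thread(target=node_a, args=(a_to_b, b_to_a, report)),
    threading.Thread(target=node_b, args=(a_to_b, b_to_a, report)),
]
for node in nodes:
    node.start()
for node in nodes:
    node.join()
```

Neither node ever touches the other's local_memory; all cooperation happens through messages.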
4. CLASSIFICATION BASED ON GRAIN SIZE
This classification is based on recognizing the parallelism in a program to be executed on a
multiprocessor system. The idea is to identify the sub-tasks or instructions in a program that can be
executed in parallel. For example, suppose a program has three statements and statements S1 and S2 can
be exchanged, i.e., they need not run in sequence, as shown in Figure 15. Then S1 and S2 can be
executed in parallel.
The decision of parallelism also depends on the following factors:
• Number and types of processors available, i.e., architectural features of
host computer
• Memory organization
• Dependency of data, control and resources
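One common way to test whether two statements can be exchanged is to compare the variables
they read and write (Bernstein's conditions, which the text above does not name). The sketch
below assumes three hypothetical statements, S1: a = b + c, S2: d = e * f and S3: g = a + d, and
checks data dependency only, not control or resource dependency.

```python
from typing import Set, Tuple

Statement = Tuple[Set[str], Set[str]]  # (variables read, variables written)

def can_run_in_parallel(first: Statement, second: Statement) -> bool:
    """True if neither statement writes something the other reads or writes."""
    reads1, writes1 = first
    reads2, writes2 = second
    return (not (writes1 & reads2)       # no flow dependence
            and not (reads1 & writes2)   # no anti-dependence
            and not (writes1 & writes2)) # no output dependence

s1: Statement = ({"b", "c"}, {"a"})   # S1: a = b + c
s2: Statement = ({"e", "f"}, {"d"})   # S2: d = e * f
s3: Statement = ({"a", "d"}, {"g"})   # S3: g = a + d

s1_s2_parallel = can_run_in_parallel(s1, s2)
s1_s3_parallel = can_run_in_parallel(s1, s3)
```

S1 and S2 touch disjoint variables and can be exchanged, whereas S3 reads the results of both
and must wait for them.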
Parallelism based on Grain size
Grain size:
Grain size or Granularity is a measure which determines how much computation is involved in a
process.
Grain size is determined by counting the number of instructions in a program segment. The
following types of grain sizes have been identified:
1) Fine Grain: This type contains fewer than approximately 20 instructions.
2) Medium Grain: This type contains fewer than approximately 500 instructions.
3) Coarse Grain: This type contains approximately one thousand instructions or more.
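The thresholds above can be written as a small classifier. This is a sketch: the function name
grain_size is hypothetical, the thresholds are approximate as stated in the text, and segments of
500 to 999 instructions fall between the stated ranges, so the function returns None for them
rather than guessing.

```python
from typing import Optional

def grain_size(instruction_count: int) -> Optional[str]:
    """Classify a program segment by its (approximate) instruction count."""
    if instruction_count < 0:
        raise ValueError("instruction count cannot be negative")
    if instruction_count < 20:
        return "fine"
    if instruction_count < 500:
        return "medium"
    if instruction_count >= 1000:
        return "coarse"
    return None  # 500 to 999 instructions: not covered by the stated thresholds

labels = [grain_size(n) for n in (5, 120, 750, 20000)]
```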
Parallelism can be exploited at several levels, which form a hierarchy: the lower the level, the
finer the granularity of the process. The degree of parallelism decreases as the level increases.
Every level, whatever its grain size, incurs communication and scheduling overhead.
The parallelism levels (shown in Figure 18) are as follows:
1) Instruction level:
This is the lowest level and the degree of parallelism is highest at this level.
Fine grain size is used at the instruction or statement level, as only a few instructions form the
grain here.
The fine grain size may vary according to the type of the program.
For example, for scientific applications, the instruction-level grain size may be larger. Because a
higher degree of parallelism can be achieved at this level, the overhead for the programmer is also
greater.
2) Loop Level:
This is another level of parallelism, where the iterations of a loop can be parallelized. Fine
grain size is used at this level also.
Simple loops in a program are easy to parallelize, whereas recursive loops are difficult.
This type of parallelism can be achieved through compilers.
3) Procedure or SubProgram Level:
This level consists of procedures, subroutines or subprograms.
Medium grain size is used at this level, with a procedure containing some thousands of instructions.
Multiprogramming is implemented at this level.
Parallelism at this level has been exploited by programmers but not through compilers.
Parallelism through compilers has not been achieved at the medium and coarse grain sizes.
4) Program Level:
This is the highest level, consisting of independent programs that can run in parallel.
Coarse grain size is used at this level, with programs containing tens of thousands of instructions.
Time sharing is achieved at this level of parallelism. Parallelism at this level has been exploited
through the operating system.
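The loop level (item 2 above) can be illustrated by contrasting a loop whose iterations are
independent with one whose iterations depend on each other. This sketch reads "recursive loops"
as loops with such a dependence between iterations, which is an interpretation; the arrays and
the function body are hypothetical, and a thread pool stands in for multiple processors.

```python
from concurrent.futures import ThreadPoolExecutor

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

def body(i: int) -> int:
    """One iteration of the simple loop c[i] = a[i] + b[i]."""
    return a[i] + b[i]

# Simple loop: no iteration needs another's result, so all can be handed out at once.
with ThreadPoolExecutor() as pool:
    c = list(pool.map(body, range(len(a))))

# Dependent loop: each iteration needs the previous running total, so the
# iterations cannot simply be distributed and are executed in order here.
prefix = []
running = 0
for value in a:
    running += value
    prefix.append(running)
```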
The relation between grain sizes and parallelism levels has been shown in Table 1.
Table 1: Relation between grain sizes and parallelism
Grain Size      Parallelism Level
Fine Grain      Instruction or Loop Level
Medium Grain    Procedure or SubProgram Level
Coarse Grain    Program Level
Coarse grain parallelism is traditionally implemented in tightly coupled or shared memory
multiprocessors like the Cray Y-MP.
Loosely coupled systems are used to execute medium grain program segments.
Fine grain parallelism has been observed in SIMD organization of computers.