Module-1 Theory of Parallelism
Parallel Computer Models: The State of Computing, Multiprocessors and Multicomputers, Multivector and SIMD Computers, PRAM and VLSI Models.
Program and Network Properties: Conditions of Parallelism, Program Partitioning and Scheduling, Program Flow Mechanisms.
What is Computer Architecture?
• Computer Architecture refers to
✔ Instruction set design or Instruction set architecture (ISA)
✔ Design of various system building blocks: processor, main memory, cache, data paths
✔ Arrangement by which the various system building blocks are interconnected and
interoperated to achieve desired system performance.
Instruction set architecture (ISA) involves specifying
✔ registers,
✔ data types,
✔ instruction set,
✔ instruction formats,
✔ addressing modes of a machine or computer, which can be used by the compiler writer.
The ISA is the portion of the computer visible to the programmer or compiler writer.
THE STATE OF COMPUTING
Computer Development Milestones
• Prior to 1945, computers were made with mechanical or electromechanical parts. Modern
computers are marked by the introduction of electronic components. Over the past
several decades, electronic computers have gone through roughly five generations of
development.
The first generation (1945-1954)
• Vacuum tubes and magnetic drums for memories.
• A single central processing unit (CPU) which performed serial fixed-point arithmetic.
• Machine or assembly languages were used.
The second generation (1955-1964)
• The second generation (1955-1964) was marked by the use of
• discrete transistors, diodes, and magnetic ferrite cores, floating-point arithmetic,
• High level languages (HLLs), such as Fortran, Algol, and Cobol, were introduced along
with compilers.
The third generation (1965-1974)
• The third generation (1965-1974) began to use
• integrated circuits (ICs) for both logic and memory in small-scale or medium-scale
integration (SSI or MSI)
• Microprogrammed control, pipelining, and cache memory were introduced.
• Multiuser applications
The fourth generation (1974-1991)
The fourth generation (1974-1991) used
• large-scale or very large- scale integration (LSI or VLSI),
• Semiconductor memory replaced core memory
• Multiprocessors, multicomputers, supercomputers
• Multiprocessor OS, languages, compilers, and environments for parallel processing.
The fifth generation (1991-present)
• The fifth generation (1991-present) is highlighted by the use of advanced VLSI
processors and memory, superscalar processors, and heterogeneous processing.
Summary of the five generations of electronic computer development.
Elements of Modern Computers
• Hardware, software, and programming elements of a modern computer system are briefly
introduced below in the context of parallel processing.
• Computing problems have been labeled as numerical computing, tedious integer or
floating-point computation, transaction processing, and large database management and
information retrieval.
Algorithms and Data Structures
•Traditional algorithms and data structures are designed for sequential machines.
• New, specialized algorithms and data structures are needed to exploit the capabilities of
parallel architectures.
Hardware Resources
• Processors, memory, and peripheral devices form the hardware core of a computer
system.
• Key architectural considerations include the instruction set, memory organization,
multiprocessors, supercomputers, multicomputers, and massively parallel computers.
Operating System
• An effective operating system manages the allocation and deallocation of resources
during the execution of user programs.
• Mapping is a bidirectional process matching algorithmic structure with hardware
architecture, and vice versa. The mapping of algorithmic and data structures onto the
machine architecture includes processor scheduling, memory maps, interprocessor
communications, etc. These activities are usually architecture-dependent.
System Software Support
• The source code written in an HLL must first be translated into object code by an
optimizing compiler.
• The compiler assigns variables to registers or to memory words and reserves functional
units for operators.
• A loader is used to initiate the program execution through the OS kernel.
• Resource binding demands the use of the compiler, assembler, loader, and OS kernel to
commit physical machine resources to program execution.
• Parallel software can be developed using entirely new languages designed specifically
with parallel support as its goal, or by using extensions to existing sequential languages.
Compiler Support
• Parallelizing compilers require full detection of parallelism in the source code and
transformation of sequential code into parallel constructs.
• Compiler directives are often inserted into source code to aid compiler parallelizing
efforts
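As a hedged illustration of such directives, the sketch below uses the standard OpenMP "parallel for" directive to mark a loop whose iterations are independent. OpenMP is not named in these notes; it is assumed here only as a familiar directive-based example, and the array names are illustrative.

/* Directive-assisted parallelization sketch (assumes an OpenMP-capable compiler,
   e.g. compile with: gcc -fopenmp vecadd.c). */
#include <stdio.h>
#define N 1000

int main(void) {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) {        /* sequential initialization */
        b[i] = i;
        c[i] = 2.0 * i;
    }
    /* The directive tells the compiler that the loop iterations are independent,
       so it may distribute them across threads without proving it itself. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];
    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}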
Evolution of Computer Architecture
• Started with the von Neumann architecture built as a sequential machine executing scalar
data .
• The von Neumann architecture is slow due to sequential execution of instructions in
programs.
• Lookahead techniques were introduced to prefetch instructions in order to overlap I/E
(instruction fetch/decode and execution) operations and to enable functional parallelism.
• Functional parallelism was supported by two approaches: one is to use multiple
functional units simultaneously, and the other is to use pipelining.
• Pipelining has proven especially attractive in performing identical operations repeatedly
over vector data strings.
• Vector operations were originally carried out implicitly by software-controlled looping
using scalar pipeline processors.
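The kind of computation that suits such pipelined vector execution is one identical operation applied over a whole vector, as in the small sketch below (a SAXPY-style loop; the array names and values are illustrative):

/* Vector-style loop sketch: the same multiply-add is applied to every element,
   so a vector pipeline (or a vectorizing compiler) can stream the operands
   through one functional pipeline. */
#include <stdio.h>
#define N 8

int main(void) {
    double x[N], y[N], a = 2.5;
    for (int i = 0; i < N; i++) {   /* set up the vector operands */
        x[i] = i;
        y[i] = 10.0 - i;
    }
    for (int i = 0; i < N; i++)     /* identical operation over a vector */
        y[i] = a * x[i] + y[i];
    for (int i = 0; i < N; i++)
        printf("y[%d] = %.1f\n", i, y[i]);
    return 0;
}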
Flynn's Classification
• Michael Flynn (1972) introduced a classification of various computer architectures based
on notions of instruction and data streams processed simultaneously.
• SISD (single instruction stream over a single data stream)
• SIMD (single instruction stream over multiple data streams) machines
• MIMD(multiple instruction streams over multiple data streams)
• MISD (multiple instruction streams and a single data stream)
SISD (single instruction stream over a single data stream)
• Conventional sequential machines are called SISD (single instruction stream over a
single data stream) computers.
• Instructions are executed sequentially.
SIMD (single instruction stream over multiple data streams) machines
• SIMD (single instruction stream over multiple data streams) machines allow vector
processing.
• An SIMD machine uses an array of processing elements (PEs) synchronized by the same controller.
• A single instruction stream is executed by different processing elements on different sets of data.
(PE = processing element, LM = local memory)
MIMD(multiple instruction streams over multiple data streams)
• Parallel computers (multicomputers and multiprocessors)
• Multiple instruction: every processor may be executing a different instruction stream
• Multiple data: every processor may be working with a different data stream
• Different processors each work on a different task.
• In a shared-memory MIMD system, the multiple data streams are drawn from the shared memory.
MISD (multiple instruction streams and a single data stream) machines
• The same data stream flows through a linear array of processors
• They execute different instruction streams.
• This architecture is also known as systolic arrays.
• Most parallel computers are built with MIMD model for general purpose
computations.The SIMD and MISD models are more suitable for special-purpose
computations. For this reason, MIMD is the most popular model, SIMD next, and MISD
the least popular model being applied in commercial machines.
Development Layers
A layered development of parallel computers is illustrated in Fig. Hardware configurations differ
from machine to machine, even those of the same model. The address space of a processor in a
computer system varies among different architectures. It depends on the memory organization,
which is machine-dependent. These features are up to the designer and should match the target
application domains.
We want to develop application programs and programming environments which are machine-
independent. Independent of the machine architecture, user programs can then be ported to many
computers with minimum conversion costs.
High-level languages and communication models depend on the architectural choices made in a
computer system. From a programmer's viewpoint, these two layers should be architecture-
transparent.
However, the communication models, shared variables versus message passing, are mostly
machine-dependent. Application programmers prefer more architectural transparency. However,
kernel programmers have to explore the opportunities supported by hardware.
As a good computer architect, one has to approach the problem from both ends. The compilers
and OS support should be designed to remove as many architectural constraints as possible from
the programmer.
System Attributes to Performance
• Performance attributes can be used to guide system architects in designing better machines,
or to educate programmers and compiler writers in optimizing the code for more efficient
execution by the hardware.
The total CPU time needed for program execution
• The CPU (or the processor) is driven by a clock with a constant cycle time τ.
• The size of a program is its instruction count (Ic), in terms of the number of machine
instructions to be executed in the program.
• The CPU time (T, in seconds per program) needed to execute the program is:
T = Ic * CPI * τ
Clock Rate and CPI
• Different machine instructions may require different numbers of clock cycles to execute.
• Therefore, the term cycles per instruction (CPI) is used to mean the average number of cycles per instruction over all instruction types.
• The CPU of computer is driven by a clock with a constant cycle time τ.
• The inverse of the cycle time is the clock rate: f = 1/τ.
• The execution of an instruction requires access to the memory.
• The CPI of an instruction type can be divided into two component terms corresponding to
the total processor cycles and memory cycles needed to complete the execution of the
instruction.
• The total time needed to execute a program can then be rewritten as
T = Ic* (p + m*k)*τ
where
• p is the number of processor cycles needed for the instruction decode and execution,
• m is the number of memory references needed,
• k is the memory access latency,
• Ic is the instruction count, and
• τ is the processor cycle time.
System Attributes
T = Ic* (p + m*k)*τ
Five performance factors : Ic, p, m, k, τ
Where, p is the number of processor cycles needed for the instruction decode and execution, m is
the number of memory references needed, k is the memory access latency, Ic is the instruction
count, and τ is the processor cycle time.
These are influenced by four system attributes:
• Instruction-set architecture,
• Compiler technology,
• CPU implementation and control,
• Cache and memory hierarchy as specified in Table
The instruction-set architecture affects the program length (Ic) and processor cycle needed (p).
The compiler technology affects the values of Ic ,p, and the memory reference count (m). The
CPU implementation and control determine the total processor time (p * τ) needed. Finally, the
memory technology and hierarchy design affect the memory access latency (k * τ). The above
CPU time can be used as a basis in estimating the execution rate of a processor.
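A minimal worked sketch of this relationship is given below; the values of Ic, p, m, k, and the clock rate are hypothetical, chosen only to show how the five factors combine:

/* Sketch: evaluating T = Ic * (p + m*k) * tau with hypothetical parameters. */
#include <stdio.h>

int main(void) {
    double Ic  = 200000.0;   /* instruction count (hypothetical)             */
    double p   = 2.0;        /* processor cycles per instruction             */
    double m   = 0.4;        /* memory references per instruction            */
    double k   = 4.0;        /* cycles per memory reference (access latency) */
    double f   = 40e6;       /* clock rate in Hz, so tau = 1/f               */
    double tau = 1.0 / f;

    double cpi  = p + m * k;                 /* effective CPI                */
    double T    = Ic * cpi * tau;            /* CPU time in seconds          */
    double mips = f / (cpi * 1e6);           /* MIPS rate                    */

    printf("effective CPI = %.2f\n", cpi);
    printf("T = %.6f s, MIPS = %.2f\n", T, mips);
    return 0;
}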
MIPS Rate
• The processor speed is often measured in terms of million instructions per second
(MIPS).
• The MIPS rate of a given processor is given by
MIPS rate = Ic / (T * 10^6) = f / (CPI * 10^6) = (f * Ic) / (C * 10^6)
where C is the total number of clock cycles needed to execute a given program.
Floating Point Operations per Second
• Most compute-intensive applications in science and engineering make heavy use of
floating point operations.
• For such applications a more relevant measure of performance is floating-point operations
per second (flops).
• This is written as megaflops (Mflops), gigaflops (Gflops), teraflops, or petaflops.
Throughput Rate:
• The CPU throughput is a measure of how many programs can be executed per second
• unit is programs/second
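As a model for calculations of this kind, the following sketch works through effective CPI, MIPS rate, execution time, and throughput for a purely hypothetical instruction mix (the instruction counts, CPI values, and the 40 MHz clock are assumptions, not data from the exercises below):

/* Sketch: effective CPI, MIPS rate, execution time, and throughput for a
   hypothetical instruction mix on a hypothetical 40 MHz processor. */
#include <stdio.h>

int main(void) {
    /* hypothetical mix: count and cycles-per-instruction per instruction type */
    double count[4] = { 90000.0, 60000.0, 30000.0, 20000.0 };
    double cpi[4]   = {     1.0,     2.0,     3.0,     4.0 };
    double f = 40e6;                       /* clock rate in Hz               */

    double Ic = 0.0, cycles = 0.0;
    for (int i = 0; i < 4; i++) {
        Ic     += count[i];                /* total instruction count        */
        cycles += count[i] * cpi[i];       /* total clock cycles C           */
    }

    double eff_cpi = cycles / Ic;          /* effective (average) CPI        */
    double T       = cycles / f;           /* execution time in seconds      */
    double mips    = f / (eff_cpi * 1e6);  /* MIPS rate                      */
    double Ws      = 1.0 / T;              /* throughput in programs/second  */

    printf("C = %.0f cycles, effective CPI = %.2f\n", cycles, eff_cpi);
    printf("T = %.4f s, MIPS = %.2f, throughput = %.2f programs/s\n",
           T, mips, Ws);
    return 0;
}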
1. Consider the execution of an object code with 200,000 instructions on a 40 MHz
processor. The program consists of four major types of instructions. The instruction mix
and the number of cycles [CPI] needed for each instruction type are given below:
I. Find total number of cycles required to execute program.
II. Calculate the average CPl when the program is executed on a uniprocessor .
III. Calculate MIPS rate.
2. A 400 MHz processor was used to execute a benchmark program with the following
instruction mix and clock cycle counts:
Determine the effective CPI, MIPS rate and execution time for this program.
Programming Environments
• When using a parallel computer, one desires a parallel environment where parallelism is
automatically exploited.
• Two approaches to parallel programming
Implicit Parallelism and Explicit Parallelism
Implicit Parallelism
• An implicit approach uses a conventional language, such as C, C++ to write the source
program.
• The sequentially coded source program is translated into parallel object code by a
parallelizing compiler.
• This compiler must be able to detect parallelism and assign target machine resources.
• This compiler approach has been applied in programming shared-memory
multiprocessors.
Explicit Parallelism
• The second approach requires more effort by the programmer to develop a source
program using parallel dialects of C, C++.
• Parallelism is explicitly specified in the user programs.
• This reduces the burden on the compiler to detect parallelism.
• Instead, the compiler needs to preserve parallelism.
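A minimal sketch of explicitly specified parallelism is shown below, assuming a POSIX-threads environment; pthreads is used here only as one concrete way for the programmer to state the parallelism, and the array and function names are illustrative.

/* Sketch: explicit parallelism with POSIX threads. Each thread sums one half
   of an array; the programmer, not the compiler, states the parallelism. */
#include <pthread.h>
#include <stdio.h>
#define N 1000

static double a[N];
static double partial[2];

static void *sum_half(void *arg) {
    int id = *(int *)arg;                 /* 0 sums the lower half, 1 the upper */
    int lo = id * (N / 2), hi = lo + N / 2;
    double s = 0.0;
    for (int i = lo; i < hi; i++)
        s += a[i];
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int id[2] = { 0, 1 };

    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    for (int i = 0; i < 2; i++)           /* create the two worker threads */
        pthread_create(&t[i], NULL, sum_half, &id[i]);
    for (int i = 0; i < 2; i++)           /* wait for both to finish       */
        pthread_join(t[i], NULL);

    printf("sum = %.1f\n", partial[0] + partial[1]);
    return 0;
}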
Multiprocessors and Multicomputers
• Two categories of parallel computers.
• These physical models are distinguished by having a shared common memory or
unshared distributed memories.
Shared-Memory Multiprocessors
There are three shared-memory multiprocessor models, which differ in how the memory and
peripheral resources are shared or distributed:
• Uniform Memory-Access (UMA) model
• Non-Uniform Memory Access (NUMA) model
• Cache-only Memory Access (COMA) model
The UMA Model
• Uniform memory-access (UMA) model
• In a UMA multiprocessor model, the physical memory is uniformly shared by all the
processors. All processors have equal access time to all memory words. Each processor
may use a private cache. Peripherals are also shared in some fashion. Communication
among processors takes place using variables in the common memory.
• The system interconnect takes the form of a common bus, a crossbar switch, or a
multistage network. The UMA model is suitable for general purpose applications by
multiple users.
• It can be used to speed up the execution of a single large program in time-critical
applications. To coordinate parallel events, synchronization and communication among
processors are done through using shared variables in the common memory.
Non-Uniform Memory Access (NUMA) model
• A NUMA multiprocessor is a shared-memory system in which the access time varies
with the location of the memory word.
• Two NUMA machine models:
The shared local memory
• The shared memory is physically distributed to all processors, called local memories.
• Local memories are accessible by all processors. It is faster to access a local memory
with a local processor. The access of remote memory attached to other processors takes
longer due to the added delay through the interconnection network.
• [The BBN TC-2000 Butterfly multiprocessor (1980s) had this configuration; the machine had
up to 512 CPUs, each with local memory.]
A hierarchical cluster model
• Globally shared memory can be added to a multiprocessor system.
• In this case, there are three memory-access patterns: The fastest is local memory access.
The next is global memory access. The slowest is access of remote memory.
• The processors are divided into several clusters. Each cluster is itself a UMA or a
NUMA multiprocessor.
• The clusters are connected to global shared-memory modules. The entire system is
considered a NUMA multiprocessor. All processors belonging to the same cluster are
allowed to uniformly access the cluster shared-memory modules.
• [The Cedar multiprocessor, built at the University of Illinois, assumes such a structure in
which each cluster is an Alliant FX/80 multiprocessor.]
The COMA model
• A multiprocessor using cache-only memory assumes the COMA model.
• The COMA model is a special case of a NUMA machine, in which the distributed main
memories are converted to caches.
• All the caches form a global address space.
• Remote cache access is assisted by the distributed cache directories. Depending on the
interconnection network used, sometimes hierarchical directories may be used to help
locate copies of cache blocks.
Distributed-Memory Multicomputers
• The system consists of multiple computers, often called nodes, interconnected by a
message-passing network.
• Each node is an autonomous computer consisting of a processor, local memory, and
sometimes attached disks or I/O peripherals.
• All local memories are private and are accessible only by local processors.
• For this reason, traditional multicomputers have been called no-remote-memory-access
(NORMA) machines.
• Internode communication is carried out by passing messages through the static
connection network.
• With advances in interconnection and network technologies, this model of computing has
gained importance because of its suitability for certain applications, its scalability, and its
fault tolerance.
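A minimal sketch of internode message passing is given below, assuming an MPI installation; MPI is not prescribed by these notes, it simply illustrates explicit send/receive between nodes with private memories.

/* Sketch: message passing between two nodes of a multicomputer, assuming MPI.
   Compile with mpicc and run with: mpirun -np 2 ./a.out                     */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;                   /* data in node 0's local memory   */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("node 1 received %d from node 0\n", value);
    }

    MPI_Finalize();
    return 0;
}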
Multivector and SIMD Computers
• We classify supercomputers either as pipelined vector machines or as SIMD computers
emphasizing massive data parallelism.
[A vector operand contains an ordered set of n elements, where n is called the length of the
vector. Each element in a vector is a scalar quantity, which may be a floating point number, an
integer, a logical value or a character.]
Vector Supercomputers
• A vector computer is often built on top of a scalar processor. The vector processor is
attached to the scalar processor. Program and data are first loaded into the main memory
through a host computer. All instructions are first decoded by the scalar control unit. If
the decoded instruction is a scalar operation, it will be directly executed by the scalar
processor using the scalar functional pipelines.
• If the instruction is decoded as a vector operation, it will be sent to the vector control
unit.
This control unit will supervise the flow of vector data between the main memory and vector
functional pipelines. The vector data flow is coordinated by the control unit. A number of vector
functional pipelines may be built into a vector processor.
Two pipeline vector supercomputer models are described below.
Vector Processor Models:
register-to-register architecture.
• Vector registers are used to hold the vector operands, intermediate and final vector
results. The vector functional pipelines retrieve operands from and put results into the
vector registers.
• The length of each vector register is usually fixed. In general, there are fixed numbers of
vector registers and functional pipelines in a vector processor. Therefore, both resources
must be reserved in advance to avoid resource conflicts between different vector
operations.
memory-to-memory architecture.
• A memory-to-memory architecture differs from a register-to-register architecture in the
use of a vector stream unit to replace the vector registers.
• Vector operands and results are directly retrieved from the main memory.
SIMD Supercomputers
• An operational model of an SIMD computer is specified by a 5-tuple:
M=(N, C, I, M, R)
• N is the number of processing elements (PEs) in the machine.
• C is the set of instructions directly executed by the control unit (CU), including scalar and
program flow control instructions.
• I is the set of instructions broadcast by the CU to all PEs for parallel execution.
• M is the set of masking schemes, where each mask partitions the set of PEs into enabled
and disabled subsets.
• R is the set of data-routing functions, specifying various patterns to be set up in the
interconnection network for inter-PE communications.
One can describe a particular SIMD machine architecture by specifying the 5-tuple.
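As a small descriptive sketch, the 5-tuple can be recorded in a data structure such as the one below; the field values describe a hypothetical machine and this is not a standard representation.

/* Sketch: a descriptive record of the SIMD operational model M = (N, C, I, M, R).
   The string fields simply name the sets; nothing here is executed in parallel. */
#include <stdio.h>

struct simd_model {
    int num_pes;                 /* N: number of processing elements           */
    const char *cu_instructions; /* C: instructions executed by the CU         */
    const char *pe_instructions; /* I: instructions broadcast to all PEs       */
    const char *masking_schemes; /* M: masking schemes enabling/disabling PEs  */
    const char *routing;         /* R: data-routing functions of the network   */
};

int main(void) {
    /* hypothetical description of a small SIMD machine */
    struct simd_model m = {
        64,
        "scalar and program flow control",
        "vector add, multiply, load, store",
        "per-PE enable masks",
        "shift, broadcast, permutation"
    };
    printf("SIMD machine with %d PEs, routing: %s\n", m.num_pes, m.routing);
    return 0;
}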
Program & Network properties
Conditions of parallelism
• The ability to execute several program segments in parallel requires each segment to be
independent of the other segments.
• The independence comes in various forms as defined below separately.
• we consider the dependence relations among instructions in a program. In general, each
code segment may contain one or more statements. We use a dependence graph to
describe the relations. The nodes of a dependence graph correspond to the program
statements [instructions], and the directed edges with different labels show the ordered
relations among the statements. The analysis of dependence graphs shows where
opportunity exists for parallelization and vectorization.
Data Dependence
• The ordering relationship between statements is indicated by the data dependence.
• Five types of data dependence are defined below:
1. Flow dependence: A statement S2 is flow dependent on S1 if an execution path exists
from S1 to S2 and if at least one output (a variable assigned) of S1 feeds in as input
(an operand to be used) to S2. Flow dependence corresponds to a RAW (read-after-write)
hazard and is denoted S1 -> S2.
2. Antidependence: Statement S2 is antidependent on statement S1 if S2 follows S1 in
program order and if the output of S2 overlaps the input to S1. Antidependence
corresponds to a WAR (write-after-read) hazard.
3. Output dependence : two statements are output dependent if they produce (write) the
same output variable .
4. I/O dependence: Read and write are I/O statements. I/O dependence occurs if the same
file is referenced by both I/O statements.
5. Unknown dependence: The dependence relation between two statements cannot be
determined in the following situations:
•The subscript of a variable is itself subscripted (indirect addressing).
•The subscript does not contain the loop index variable.
• A variable appears more than once with subscripts having different coefficients of the
loop variable.
•The subscript is nonlinear in the loop index variable.
Ex: Consider the following fragment of any program:
S1 Load R1, A
S2 Add R2, R1
S3 Move R1, R3
S4 Store B, R1
Draw the Dependence graph.
Here we have,
• Flow dependence: S1 to S2, S3 to S4, S2 to S2
• Antidependence: S2 to S3
• Output dependence: S1 to S3
Dependence graph:
• The above data dependence relations should not be arbitrarily violated during program
execution. Otherwise, erroneous results may be produced with changed program order. On a
multiprocessor system, the program order may or may not be preserved, depending on the
memory model used.
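The same dependence types can also be seen in ordinary high-level code, as in this small sketch (the variable names are illustrative):

/* Sketch: the three basic data dependences among statements on variables. */
#include <stdio.h>

int main(void) {
    int a, b, c;

    a = 1;          /* S1: writes a                                          */
    b = a + 2;      /* S2: reads a  -> flow dependence (RAW) of S2 on S1     */
    a = 5;          /* S3: writes a -> antidependence (WAR) of S3 on S2,     */
                    /*                 output dependence (WAW) of S3 on S1   */
    c = a * b;      /* S4: reads a  -> flow dependence of S4 on S3           */

    printf("%d %d %d\n", a, b, c);
    return 0;
}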
Control Dependence
• This refers to the situation where the order of the execution of statements cannot be
determined before run time.
• For example, all conditional statements, where the flow of execution depends on the outcome of the condition.
• Paths taken after a conditional branch may introduce or eliminate data dependence
among instructions.
• Dependence may also exist between operations performed in successive iterations of a
looping procedure.
• Loop example with control-dependent iterations.
• Loop example without control-dependent iterations: the successive iterations of such a
loop are control-independent.
• Control dependence often prohibits parallelism from being exploited. Compiler
techniques or hardware branch prediction techniques are needed to get around the control
dependence in order to exploit more parallelism.
Resource Dependence
• Resource dependence is concerned with conflicts in using shared resources, such as
registers, integer and floating point ALUs, etc.
• ALU conflicts are called ALU dependence. Memory (storage) conflicts are called storage
dependence.
The detection of parallelism in programs requires a check of the various dependence
relations.
Bernstein's Conditions
• Bernstein revealed a set of conditions based on which two processes can execute in
parallel.
• We define the input set Ii of a process Pi as the set of all input variables needed to
execute the process, and the output set Oi as the set of all output variables generated after
execution of Pi. Bernstein's conditions, which apply to the input and output sets of
processes, must be satisfied for parallel execution of processes.
• Consider two processes P1 and P2 with input sets I1 and I2 and output sets O1 and
O2, respectively. These two processes can execute in parallel, denoted P1 || P2, if they
are independent and do not create confusing results.
• The conditions are stated as follows:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
These three conditions are known as Bernstein's conditions.
• In terms of data dependences, Bernstein's conditions imply that two processes can
execute in parallel if they are flow-independent, anti-independent, and output-independent.
• In general, a set of processes P1, P2, P3, ..., Pk can execute in parallel if Bernstein's
conditions are satisfied on a pairwise basis; that is, P1 || P2 || P3 || ... || Pk if and only if
Pi || Pj for all i ≠ j.
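A minimal sketch of a pairwise Bernstein check is shown below; the input and output sets are encoded as bitmasks over a small variable set, which is an assumption made only for illustration.

/* Sketch: pairwise Bernstein-condition check. Each process's input set I and
   output set O is a bitmask; bit v set means "variable v is in the set". */
#include <stdio.h>

/* P1 || P2 iff I1 and O2, I2 and O1, and O1 and O2 are all disjoint. */
static int can_run_in_parallel(unsigned i1, unsigned o1,
                               unsigned i2, unsigned o2) {
    return (i1 & o2) == 0 && (i2 & o1) == 0 && (o1 & o2) == 0;
}

int main(void) {
    /* variables encoded as bits: A=1, B=2, C=4, D=8 (hypothetical example) */
    unsigned i1 = 1 | 2, o1 = 4;   /* P1: C = A + B  -> I1 = {A,B}, O1 = {C} */
    unsigned i2 = 8,     o2 = 2;   /* P2: B = D * D  -> I2 = {D},   O2 = {B} */

    if (can_run_in_parallel(i1, o1, i2, o2))
        printf("P1 || P2\n");
    else
        printf("P1 and P2 must be serialized\n");
    return 0;
}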
Ex: Detection of parallelism in a program using Bernstein's conditions
• The dependence graph that shows Data dependence (solid arrows) and resource
dependence (dashed arrows). It demonstrates data dependence as well as resource
dependence.
• Sequential execution and parallel execution of the above program: in sequential execution,
five steps are needed. If two adders are available simultaneously, the parallel execution
requires only three steps as shown.
• There are 10 pairs of statements to check against Bernstein's conditions. Only 5 pairs, P1||
P5, P2|| P3, P2|| P5, P5|| P3, and P4|| P5, can execute in parallel if there are no resource
conflicts. Collectively, only P2|| P3 || P5, is possible because P2||P3, P3|| P5, and P5||P2
are all possible.
Violation of any one or more of the three Bernstein conditions prohibits parallelism
between two processes. In general, violation of any one or more of the 3n(n-1)/2
Bernstein conditions among n processes prohibits parallelism collectively or partially.
Hardware and Software Parallelism
• For implementation of parallelism, we need special hardware and software support.
• Hardware parallelism refers to the type of parallelism defined by the machine architecture
and hardware multiplicity, while software parallelism is defined by the structure of the program.
• One way to characterize the parallelism in a processor is by the number of instruction
issues per machine cycle. If a processor issues k instructions per machine cycle, it is
called a k-issue processor.
Hardware parallelism:
• It refers to the support for parallelism provided through hardware multiplicity. It is
characterized by the number of instruction issues per machine cycle: if a processor issues
k instructions per machine cycle, it is called a k-issue processor.
• For example, the Intel i960CA is a three-issue processor with one arithmetic, one
memory-access, and one branch instruction issued per cycle.
• Software parallelism: This type of parallelism is revealed in the program flow graph.
The program flow graph displays the patterns (and number) of simultaneously executable
operations, i.e., the number of instructions the program could execute per machine cycle.
Example : Mismatch between software parallelism and hardware parallelism
• Consider the example program graph. There are eight instructions (four loads and four
arithmetic operations) to be executed in three consecutive machine cycles. Four load
operations are performed in the first cycle, followed by two multiply operations in the
second cycle and two add/subtract operations in the third cycle. Here,
• The software parallelism varies from 4 to 2 in three cycles. The average software
parallelism is equal to 8/3 = 2.67 instructions per cycle in this example program.
• Consider execution of the same program by a two-issue processor which can execute one
memory access (load or store) and one arithmetic (add, subtract, multiply, etc.) operation
simultaneously. With this hardware restriction, the program must execute in seven
machine cycles.
• Hardware parallelism displays an average value of 8/7 = 1.14 instructions executed per
cycle.
• This demonstrates a mismatch between the software parallelism and the hardware
parallelism.
• To match the software parallelism, we consider a hardware platform of a dual-processor
system in which single-issue processors are used. Here, L/S stands for load/store operations.
Six processor cycles are needed to execute the 12 instructions by the two processors. S1 and
S2 are two inserted store operations, and L5 and L6 are two inserted load operations.
These added instructions are needed for interprocessor communication through the
shared memory.
• Hardware parallelism now displays an average value of 12/6 = 2 instructions executed per cycle.
• To solve the mismatch problem between software parallelism and hardware parallelism,
one approach is to develop compilation support, and the other is through hardware
redesign.
• The Role of Compilers: Compiler techniques are used to exploit hardware features to
improve performance. Loop transformation, software pipelining, and features of
optimizing compilers are used for supporting parallelism.
• One must design the compiler and the hardware jointly at the same time. Interaction
between the two can lead to a better solution to the mismatch problem between software
and hardware parallelism.
Program Partitioning & Scheduling
• Program partitioning is a technique for decomposing a large program into many small
pieces for parallel execution by multiple processors. Program partitioning involves both
programmers and the compiler.
• Grain Sizes and Latency
• Grain size or granularity is a measure of the amount of computation involved in a
software process. The simplest measure is to count the number of instructions in a grain
(program segment). Grain sizes are commonly described as fine, medium, or coarse
depending on the processing levels involved.
• Grain size determines the basic program segment chosen for parallel processing.
• Latency is a time measure of the communication overhead incurred between machine
subsystems. For example, the memory latency is the time required by a processor to
access the memory.
• Parallelism has been exploited at various processing levels. As illustrated in Fig. five
levels of program execution represent different grain sizes and changing communication
and control requirements.
• Level 5 (job level): the grain size can be as big as millions of instructions (coarse grain).
• Level 4 (subprogram level): the grain size may typically contain tens or hundreds of thousands of instructions.
• Level 3 (procedure level): the grain size is less than 2000 instructions (medium grain).
• Level 2 (loop level): the grain size is less than 500 instructions (fine grain).
• Level 1 (instruction level): the grain size is less than 20 instructions (fine grain).
Instruction Level: At the instruction or statement level, a typical grain contains less than 20
instructions, called fine grain. Depending on individual programs, fine-grain parallelism at this
level may range from two to thousands. The exploitation of fine-grain parallelism can be assisted
by an optimizing compiler which should be able to automatically detect parallelism and translate
the source code to a parallel form which can be recognized by the run-time system.
Loop Level: This corresponds to iterative loop operations. A typical loop contains less than
500 instructions. Independent loop operations can be vectorized for pipelined execution. The
loop level is considered a fine grain of computation.
Procedure Level This level corresponds to medium-grain size at the task, procedural,
subroutine, and coroutine levels. A typical grain at this level contains less than 2000 instructions.
Detection of parallelism at this level is much more difficult than at the finer-grain levels. The
communication requirement is often less.
Subprogram Level : This corresponds to the level of job steps and related subprograms.
The grain size may typically contain thousands of instructions. Parallelism at this level has been
exploited by algorithm designers or programmers, rather than by compilers. We do not have
good compilers for exploiting medium- or coarse-grain parallelism at present.
Job (Program) Level : This corresponds to the parallel execution of independent jobs
(programs) on a parallel computer. The grain size can be as high as tens of thousands of
instructions in a single program.
To summarize, fine-grain parallelism is often exploited at instruction or loop levels,
preferably assisted by a compiler. Medium-grain parallelism at the task or job step demands
significant roles for the programmer as well as compilers. Coarse-grain parallelism at the
program level relies heavily on an effective OS and on the efficiency of the algorithm used.
Grain Packing and Scheduling
• Grain packing is to apply fine grain first in order to achieve a higher degree of
parallelism.
• Then combine (pack) multiple fine grain nodes into a coarse grain node if it can eliminate
unnecessary communications delays or reduce the overall scheduling overhead.
• To yield the shortest possible execution time.
• Introduced by Kruatrachue and Lewis (1988) for parallel programming applications.
• Usually, all fine-grain operations within a single coarse-grain node are assigned to the
same processor for execution.
• Fine-grain partition of a program often demands more inter processor communication
than that required in a coarse-grain partition.
The basic concept of Program partitioning, Grain Packing and Scheduling introduced below.
Example : Program graph before and after grain packing
• An example program graph in two different grain sizes. A program graph shows the
structure of a program.Each node in the program graph corresponds to a grain in the
program.
• The grain size is measured by the number of basic machine cycles needed to execute all
the operations within the node.
• We denote each node by a pair (n, s), where n is the node name (id) and s is the grain size
of the node.
• Fine-grain nodes have a smaller grain size, and coarse-grain nodes have a larger grain
size.
• The edge label (v, d) between two end nodes specifies the output variable v from the
source node or the input variable to the destination node, and the communication delay d
between them.
• There are 17 nodes in the fine-grain program graph (Fig. a) and 5 in the coarse-grain
program graph (Fig. b). The coarse-grain node is obtained by combining (grouping)
multiple fine-grain nodes. The fine grain corresponds to the following program.
• The idea of grain packing is to apply fine grain first in order to achieve a higher degree of
parallelism. Then one combines (packs) multiple fine-grain nodes into a coarse grain
node if it can eliminate unnecessary communications delays or reduce the overall
scheduling overhead.
• The fine-grain schedule is longer (42 time units) because more communication delays
were included, as shown by the shaded area.
• The coarse-grain schedule is shorter (38 time units) because communication delays
among nodes 12, 13, and 14 within the same node D (and also the delays among 15,16,
and 17 within the node E) are eliminated after grain packing.
Static Multiprocessor Scheduling
• We consider Basic concepts behind multiprocessor scheduling using static schemes to
Produce a shorter schedule.
• Node duplication In order to eliminate the idle time and to further reduce the
communication delays among processors, one can duplicate some of the nodes in more
than one processor.
• Ex: Figure (a) shows a schedule without duplicating any of the five nodes. This schedule
contains idle time as well as long interprocessor delays (8 units) between P1 and P2. In
Fig. (b), node A is duplicated into A' and assigned to P2 besides retaining the original copy
A in P1. Similarly, a duplicated node C' is copied into P1 besides the original node C in P2.
• The new schedule is almost 50% shorter. The reduction in schedule time is caused by
elimination of the (a, 8) and (c, 8) delays between the two processors.
• Four major steps are involved in the grain determination and the process of scheduling
optimization:
Step l . Construct a fine-grain program graph.
Step 2. Schedule the fine-grain computation.
Step 3. Perform grain packing to produce the coarse grains.
Step 4. Generate a parallel schedule based on the packed graph.
PROGRAM FLOW MECHANISMS
Based on the order of program execution, computers use one of three mechanisms:
1. Control flow mechanism
2. Dataflow mechanism
3. Demand-driven mechanism
Control flow mechanism
• In computers based on a control flow mechanism, the order of program execution is
explicitly stated in the user programs.
• Execution is guided by a program counter (PC), which sequences the execution of
instructions in a program.
• Conventional von Neumann computers use such a program counter.
Dataflow mechanism
• The execution of any instruction depends on data (operand or data tokens) availability.
• The data generated by an instruction will be forwarded directly to all needy instructions.
• This scheme requires no program counter. It requires special mechanisms to detect data
availability, to match data tokens with needy instructions.
A Dataflow Architecture
• The following is a tagged-token architecture for dataflow computers. As shown in Fig., the
global architecture consists of n processing elements (PEs) interconnected by an n-by-n
routing network.
• Each PE has a token-matching mechanism that dispatches only instructions with data
tokens available.
• Each datum is tagged with the address of the instruction to which it belongs. Instructions are
stored in the program memory. Tagged tokens enter the PE through a local path. The
tokens can also be passed to other PEs through the routing network. All internal token
circulation operations are pipelined.
• Another synchronization mechanism, called the I-structure, is provided within each PE. It
is a tagged memory unit. The purpose is to reduce excessive copying of large data
structures in dataflow operations.
Comparison of dataflow and control-flow computers
• A sample program and its dataflow graph
• The dataflow graph shows that 24 instructions are to be executed (8 divides, 8 multiplies,
and 8 adds).
• Assume that each add, multiply, and divide requires 1, 2, and 3 cycles to complete,
respectively. Sequential execution of the 24 instructions on a control flow uniprocessor
takes 48 cycles to complete. On the other hand, a dataflow multiprocessor completes the
execution in 14 cycles.
Demand-Driven Mechanisms
Reduction machines are based on Demand-Driven Mechanisms.
• Reduction machines trigger an instruction's execution based on the demand for its results;
an operation is initiated only when its results are demanded by other computations.
• Consider the evaluation of the nested arithmetic expression a = ((b + 1) × c) - (d / e).
• The data-driven computation chooses a bottom-up approach, starting from the
innermost operations b + 1 and d/e, then proceeding to the × operation, and finally to the
outermost operation, the subtraction.
• A demand-driven computation chooses a top-down approach by first demanding the
value of a, which triggers the demand for evaluating the next-level expressions (b + 1) × c
and d/e, which in turn triggers the demand for evaluating b + 1 at the innermost level.
• The results are then returned to the nested demander in the reverse order before a is
evaluated.
Dataflow and reduction models are still concepts in the research stage, and control-flow machines
dominate the market. Until the data-driven or demand-driven mechanism is proven to be cost-
effective, the control-flow approach will continue to dominate the computer industry.
Comparison of Flow Mechanisms
Module 1- Important questions
1)Explain the elements of a modern computer system
2) Explain the general model of distributed memory multicomputer
3) Explain UMA multiprocessor model.
4) Explain data dependence in computing environment
5) Explain the different levels of parallelism in program execution.
6) Explain the various shared memory multiprocessors.
7) What are the five generations of electronic computers.
8) Explain vector supercomputer
9) Explain the operational model of SIMD supercomputer.
10) What are Bernstein's conditions?
11) Differentiate between control dependence and resource dependence.
12) What are software parallelism and hardware parallelism?
13) Explain different level of parallelism based on granularity?
14) With example, explain grain packing and scheduling?
15) Differentiate between program flow mechanisms?