Parallel Architecture Course Guide
To
Parallel Architecture and Computing
IT (FIFTH SEMESTER)
PTU Jalandher
By :- Er.Umesh Vats
Information Technology
BCET Ludhiana
PTU Syllabus
IT-309
PARALLEL ARCHITECTURE AND COMPUTING
COURSE CONTENTS
Input unit:- accepts the list of instructions for solving a problem, together with the data
relevant to that problem.
Storage unit:-
Output unit:-
Control unit:- interprets the instructions stored in memory and carries them out.
Fig: Block diagram of a sequential computer (input unit, processing unit with ALU and control,
output unit)
Processing element
The processing element (PE) retrieves one instruction at a time, interprets it, and executes it.
The operation of this computer is sequential: at any time a PE can execute only one instruction.
The speed of a sequential computer is therefore limited by the speed at which the PE can process
the retrieved instructions and data.
To increase the processing speed, one may interconnect many sequential computers so that they
work together. Such a computer, consisting of a number of interconnected sequential computers
that cooperatively execute a single program to solve a problem, is called a parallel computer.
Parallel processing:-
Parallel events may occur in multiple resources during the same time interval, and pipelined
events may occur in overlapped time spans.
The highest level of parallel processing is conducted among multiple jobs or programs
through multiprogramming, time sharing, and multiprocessing.
1. Pipeline computer
2. Array processor
3. Multiprocessor system.
Pipeline computer:-
The execution of an instruction involves four major steps: instruction fetch (IF) from memory;
instruction decoding (ID), which identifies the operation to be performed; operand fetch (OF);
and execution (EX) of the decoded arithmetic-logic operation. In a pipelined computer,
successive instructions are executed in an overlapped fashion. The four pipeline stages IF, ID,
OF, and EX are arranged in a linear cascade. A pipeline processor can be represented as:
IF -> ID -> OF -> EX
Fig: Pipeline processor
In a non-pipelined computer, by contrast, these four steps must be completed before the next
instruction can be issued.
Space-time diagrams are used to compare the pipelined and non-pipelined processors. In a
pipeline processor the operation of all stages is synchronized under a common clock. Interface
latches are used between adjacent segments to hold the intermediate results.
A non-pipelined processor takes four pipeline cycles to complete one instruction; ideally, once
the pipeline is full, a pipelined processor completes one instruction per cycle.
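The overlapped execution described above can be sketched as a small space-time diagram
generator. This is only an illustration under ideal assumptions (no hazards, every stage takes
one clock cycle); the function names `space_time` and `show` are hypothetical and not part of
the course text.

```python
STAGES = ["IF", "ID", "OF", "EX"]


def space_time(num_instructions, pipelined=True):
    """Return {stage: [instruction number or None for each clock cycle]}."""
    k = len(STAGES)
    if pipelined:
        total_cycles = k + num_instructions - 1
    else:
        total_cycles = k * num_instructions
    table = {stage: [None] * total_cycles for stage in STAGES}
    for i in range(num_instructions):
        # Pipelined: instruction i enters one cycle after instruction i-1.
        # Non-pipelined: instruction i waits until instruction i-1 has finished.
        start = i if pipelined else i * k
        for offset, stage in enumerate(STAGES):
            table[stage][start + offset] = i + 1
    return table


def show(table):
    """Print one row per stage, one column per clock cycle."""
    for stage in reversed(STAGES):
        cells = " ".join(f"I{c}" if c else " ." for c in table[stage])
        print(f"{stage}: {cells}")


if __name__ == "__main__":
    show(space_time(4, pipelined=True))
    print()
    show(space_time(4, pipelined=False))
```

With four instructions the pipelined diagram spans 7 cycles, while the non-pipelined one
spans 16.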
Some of the main issues in designing a pipeline computer include job sequencing, collision
prevention, congestion control, branch handling, reconfiguration, and hazard resolution.
Pipeline computers are especially attractive for vector processing.
Array processor:-
Fig: Control unit of an array processor (CP = control processor, CM = control memory)
The control unit is a combination of a control processor and a control memory. In an array
processor each processing element has its own private memory. Instructions are fetched (from
local memory or from the control memory) and decoded by the control unit.
Array processors designed with associative memories are called associative processors.
Parallel algorithms on array processors will be given for matrix multiplication, merge sort,
and the fast Fourier transform (FFT).
Multiprocessor system:-
.Crossbar switch
. Multiport memories
Z=(x+y)*2
Flynn’s classification:-
1. The term stream denotes a sequence of items (instructions or data) as executed or operated
upon by a single processor.
2. An instruction stream is a sequence of instructions as executed by the machine.
1. SISD:-
a. Instructions are executed sequentially but may be overlapped in their execution stages.
b. Single instruction: only one instruction stream is acted on by the CPU during
any one clock cycle.
c. Single data: only one data stream is used, to which the instructions are applied.
d. An SISD computer may have more than one functional unit.
e. Deterministic execution.
f. E.g. older-generation mainframes, minicomputers, and workstations.
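Fig: SISD computer (CU = control unit, PU = processor unit, MM = memory module,
IS = instruction stream, DS = data stream)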
2. SIMD:-
a. SIMD stands for the single instruction multiple data stream.
b. There are multiple processing elements supervised by the same control unit.
c. All PEs receive the same instruction broadcast from the control unit but operate
on different data sets from distinct data streams.
d. SIMD machines are further divided into word-slice and bit-slice modes.
e. Deterministic execution.
f. E.g. array processors and vector pipelines.
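Fig: SIMD computer. One CU broadcasts the instruction stream to PU1, PU2, ..., PUn; each PUi
exchanges data stream DSi with its memory module MMi (CU = control unit, PU = processor unit).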
3. MISD:-
a. MISD stands for multiple instruction single data.
b. There are n processor units, each receiving distinct instructions and operating over the
same data stream.
c. This organization exists conceptually, not physically.
Fig: MISD computer
4. MIMD:-
a. MIMD stands for multiple instruction multiple data.
b. This is the most common type of parallel computer.
c. Multiple instruction: every processor may be executing a different instruction stream.
d. Multiple data: every processor may be working with a different data stream.
e. E.g. most current supercomputers and multi-core PCs.
Fig: MIMD computer. Each control unit CUi sends its own instruction stream ISi to processor
unit PUi, which exchanges data stream DSi with memory module MMi (shown for i = 1, 2).
Handler’s classification:
Wolfgang Handler proposed a classification for identifying the degree of parallelism and the
degree of pipelining built into the hardware structure of a computer system. Handler's
taxonomy addresses the computer at three distinct levels:
e.g.
The Texas Instruments Advanced Scientific Computer (TI-ASC) has one controller
controlling four arithmetic pipelines, each with a 64-bit word length and eight stages. Thus
we have:
T(ASC) = <1*1, 4*1, 64*8> = <1, 4, 64*8>
Whenever the second entity K', D', or W' equals 1, we drop it, since pipelining of one
stage or of one unit is meaningless.
Amdahl’s law:
Amdahl's law is based on a fixed problem size, or fixed workload.
We have seen that for a given problem size the speedup does not increase linearly as the
number of processors increases; in fact, the speedup tends to saturate.
Let the time taken to perform the sequential operations, Ts, be a fraction α (0 < α < 1) of
the total execution time of the program on one processor, T(1), so that Ts = α·T(1).
Then the parallel operations take Tp = (1-α)·T(1) time on one processor.
Assume that the parallel operations in the program achieve a linear speedup (i.e.
using n processors, these operations take (1/n)th of the time taken to perform them on
one processor).
Then the speedup with n processors is:
S(n) = T(1) / T(n)
     = T(1) / (α·T(1) + (1-α)·T(1)/n)
     = 1 / (α + (1-α)/n)
As n grows large, S(n) approaches 1/α, which is why the speedup saturates.
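A short numerical check makes the saturation visible. The sketch below simply evaluates the
formula S(n) = 1 / (α + (1-α)/n); the function name `amdahl_speedup` and the sample value
α = 0.1 are illustrative assumptions, not taken from the course text.

```python
def amdahl_speedup(alpha, n):
    """Speedup with n processors when a fraction alpha of the work is sequential."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (alpha + (1.0 - alpha) / n)


if __name__ == "__main__":
    alpha = 0.1  # assumed sequential fraction, for illustration only
    for n in (1, 2, 4, 16, 64, 1024):
        print(f"n = {n:5d}  S(n) = {amdahl_speedup(alpha, n):.3f}")
    print(f"upper bound 1/alpha = {1 / alpha:.1f}")
```

However many processors are added, the printed speedup stays below 1/α.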
Parallel computing has been used to model difficult scientific and engineering problems found
in the real world. Some examples:
2.Processor pipeline:-
A multifunction pipe may perform different functions, either at different times or at the same
time, by interconnecting different subsets of stages in the pipeline.
Pipelining processing:-
Speed :-
Once the pipe is filled up, it outputs one result per clock period, independent of the number
of stages in the pipe. Ideally, a linear pipeline with k stages can process n tasks in
Tk = k + (n-1) clock periods,
where k cycles are used to fill up the pipeline (that is, to complete execution of the first
task) and (n-1) cycles are needed to complete the remaining (n-1) tasks. The same number of
tasks can be executed in a non-pipelined computer with an equivalent function in T1 = n·k
clock periods.
Speedup:- We define the speedup of a k-stage linear pipeline processor over an equivalent
non-pipelined processor as
S = T1/Tk = n·k / (k + (n-1))
Efficiency:-
The efficiency of a linear pipeline is measured by the percentage of busy time-space spans
over the total time-space span, which equals the sum of all busy and idle time-space spans.
Let n, k, and t be the number of tasks, the number of pipeline stages, and the clock period of
a linear pipeline, respectively. The pipeline efficiency is given by:
η = (n·k·t) / (k·(k·t + (n-1)·t)) = n / (k + (n-1))
Throughput:-
The number of results that can be completed by a pipeline per unit time is called
its throughput.
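The three measures can be evaluated together for a given pipeline. The sketch below assumes an
ideal k-stage linear pipeline processing n tasks with clock period t, using Tk = k + (n-1) and
T1 = n·k from the text. Efficiency is taken as busy time-space span over total time-space
span, n / (k + (n-1)), and throughput as n results per Tk·t time units, which follows from the
definition above. The function name `pipeline_metrics` is hypothetical.

```python
def pipeline_metrics(n, k, t=1.0):
    """Ideal performance measures for n tasks on a k-stage linear pipeline."""
    if n < 1 or k < 1 or t <= 0:
        raise ValueError("n and k must be >= 1 and t must be positive")
    cycles_pipelined = k + (n - 1)   # Tk
    cycles_nonpipelined = n * k      # T1
    return {
        "cycles_pipelined": cycles_pipelined,
        "cycles_nonpipelined": cycles_nonpipelined,
        "speedup": cycles_nonpipelined / cycles_pipelined,
        "efficiency": n / cycles_pipelined,
        "throughput": n / (cycles_pipelined * t),
    }


if __name__ == "__main__":
    for n in (1, 4, 100, 10_000):
        m = pipeline_metrics(n, k=4, t=1.0)
        print(n, m["speedup"], m["efficiency"], m["throughput"])
```

As n grows, the speedup approaches k and the efficiency approaches 1, which matches the remark
that a filled pipe delivers one result per clock period.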
Performance :-
The performance of a pipeline can be described by important attributes such as the length of
the pipe. The performance measures of importance are efficiency, speedup, and throughput.
Let n = length of the pipeline, i.e. the number of stages in the pipeline
Reservation table:-
The rows of a reservation table represent the stages (or resources) of the pipeline, and each
column represents one clock time unit. The total number of clock units in the table is called
the evaluation time for the given function.
E.g. if a pipelined system has four resources and five time slices, then the reservation table
has four rows and five columns. All the elements of the table are either 0 or 1. If a resource
(say, resource i) is used in a time slice (say, time slice j), then the (i, j)th element of the
table has the entry 1.
On the other hand, if a resource is not used in a particular time slice, then that entry has
the value 0.
Suppose that we have four resources and 6 time slices, and the usage of resources is as follows:
Different functions may have different evaluation times. For example, the above table has an
evaluation time of 7.
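A reservation table is easy to build as a 0/1 matrix. The usage pattern in the sketch below is
a made-up example with four resources and six time slices (the table from the original notes
is not reproduced here), indices are 0-based, and the function names are hypothetical.

```python
def build_reservation_table(usage, num_resources, num_slices):
    """Return a num_resources x num_slices matrix with 1 where a resource is used."""
    table = [[0] * num_slices for _ in range(num_resources)]
    for resource, time_slice in usage:
        table[resource][time_slice] = 1
    return table


def evaluation_time(table):
    """The evaluation time is the number of clock units (columns) in the table."""
    return len(table[0])


if __name__ == "__main__":
    # Hypothetical usage: (resource i, time slice j) pairs, 0-based.
    usage = [(0, 0), (1, 1), (2, 2), (1, 3), (3, 4), (0, 5)]
    table = build_reservation_table(usage, num_resources=4, num_slices=6)
    for i, row in enumerate(table):
        print(f"resource {i}: {row}")
    print("evaluation time:", evaluation_time(table))
```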
Hazard :-
A hazard is a potential problem that can arise in a pipelined processor. It refers to the
possibility of erroneous computation when the CPU tries to execute simultaneously multiple
instructions that exhibit data dependence.
Or
Delays in the pipelined execution of instructions due to non-ideal conditions are called
pipeline hazards.
Types of hazards:-
Structural hazard:-
A structural hazard takes place when a part of the processor hardware is needed by two or more
instructions at the same time.
Control hazard:-
A control hazard is also known as a branch hazard. Control hazards occur because programs
have branches and loops, so that execution of a program is not in a "straight line": if a
certain condition is true, control jumps from one part of the instruction sequence to another.
Data hazard:-
A delay due to a data dependency between instructions is known as a data hazard.
When successive instructions overlap their fetch, decode, and execution through a pipelined
processor, inter-instruction dependencies may arise that prevent the sequential data flow in
the pipeline.
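1. RAW hazard:-
A RAW (read-after-write) hazard between two instructions I and J, with J following I, may occur
when J attempts to read some data object that has been modified by I. Let D(X) denote the
domain of instruction X (the objects it reads) and R(X) its range (the objects it modifies):
R(I) ∩ D(J) ≠ ∅ for RAW
2. WAR hazard:-
A WAR (write-after-read) hazard may occur when J attempts to modify some data object that is
read by I:
D(I) ∩ R(J) ≠ ∅ for WAR
3. WAW hazard:-
A WAW (write-after-write) hazard may occur if both I and J attempt to modify the same data
object:
R(I) ∩ R(J) ≠ ∅ for WAW
Fig: RAW, WAR, and WAW hazards shown as overlaps between the domains and ranges of I and J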
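The three hazard conditions reduce to set intersections, which can be sketched as follows.
Here the domain of an instruction is the set of objects it reads and the range is the set of
objects it modifies; the class `Instr`, the function `hazards`, and the register names are
hypothetical illustrations, and instruction `i` is assumed to precede `j` in program order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    domain: frozenset   # objects the instruction reads
    range_: frozenset   # objects the instruction modifies


def hazards(i, j):
    """Return the set of possible hazards between i and a later instruction j."""
    found = set()
    if i.range_ & j.domain:
        found.add("RAW")   # j reads what i writes
    if i.domain & j.range_:
        found.add("WAR")   # j writes what i reads
    if i.range_ & j.range_:
        found.add("WAW")   # both write the same object
    return found


if __name__ == "__main__":
    i = Instr("R1 = R2 + R3", frozenset({"R2", "R3"}), frozenset({"R1"}))
    j = Instr("R4 = R1 * R5", frozenset({"R1", "R5"}), frozenset({"R4"}))
    k = Instr("R2 = R6 + R7", frozenset({"R6", "R7"}), frozenset({"R2"}))
    print(hazards(i, j))  # J reads R1, which I writes
    print(hazards(i, k))  # K writes R2, which I reads
```

A hazard detector in hardware would compare register fields rather than Python sets, but the
conditions checked are the same.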
Hazard detection:-
Hazard Resolution:-
Once a hazard is detected, the system should resolve the
interlock situation.
1. Consider the instruction sequence {..., I, I+1, ..., J, J+1, ...} in which a hazard
has been detected between the current instruction J and a previous instruction I.
A straightforward approach is to stop the pipe and suspend the execution of
instructions J, J+1, J+2, ... until instruction I has passed the point of resource
conflict.
2. A more sophisticated approach is to suspend only instruction J and continue the
flow of instructions J+1, J+2, ... down the pipe. Of course, the potential hazards
due to the suspension of J must be continuously checked as instructions J+1,
J+2, ... move ahead of J.
Note:- To avoid RAW hazards, IBM engineers developed a short-circuiting approach,
which gives a copy of the data object to be written directly to the instruction waiting
to read the data.
The technique is also known as data forwarding.
Vector processor:-
Vector processors, or SIMD processors, are processors specialized for operating on vector or
matrix data elements. These processors have specialized hardware for performing vector
operations such as vector addition, vector multiplication, and other operations.
Instruction pipeline:-
An instruction pipeline reads consecutive instructions from memory while previous instructions
are being executed in other segments. This causes the instruction fetch and execute phases to
overlap and perform simultaneous operations. One possible digression associated with such a
scheme is that an instruction may cause a branch out of sequence. In that case the pipeline
must be emptied, and all the instructions that have been read from memory after the branch
instruction must be discarded.
In general the computer needs to process each instruction with the following sequence of
steps:
1. Fetch the instruction from memory .(IF)
2. Decode the instruction. (ID)
3. Calculate the effective address.
4. Fetch the operands from memory. (OF)
5. Execute the instruction. (EX)
6. Store the result in the proper place. (ST)
Chapter 3.
In this sense, array processors are also known as SIMD computers. SIMD machines are
especially designed to perform vector computations over matrices or arrays of data.
SIMD computers appear in two basic architectural organisations: array processors, using
random-access memory; and associative processors, using content-addressable (or associative)
memory.
Fig: SIMD array processor (control unit, PEs, and interconnection network)
Configuration I (Illiac IV)
a. Each PEi is essentially an arithmetic logic unit (ALU) with attached working registers
and local memory PEMi.
b. The CU also has its own memory for the storage of programs. The system and user
programs are executed under the control of the CU. The user programs are loaded
into the CU memory from an external source. The function of the CU is to decode all the
instructions and determine where the decoded instructions should be executed.
c. Vector instructions are broadcast to the PEs for distributed execution to achieve
spatial parallelism. All the PEs perform the same function synchronously, in lock-step
fashion, under the command of the CU.
d. Vector operands are distributed to the PEMs before parallel execution in the array of
PEs. The distributed data can be loaded into the PEMs from an external source via the
system data bus, or via the CU in a broadcast mode using the control bus. Masking
schemes are used to control the status of each PE during the execution of a vector
instruction. Each PE may be either active or disabled during an instruction cycle.
Only enabled PEs perform computation. The interconnection network is under the control
of the control unit.
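Lock-step execution under a mask, as described in items c and d, can be imitated in software.
The sketch below is a sequential simulation, not real SIMD hardware: each list position stands
for one PE and its local memory, and the function name `simd_execute` is hypothetical.

```python
import operator


def simd_execute(op, pem_a, pem_b, mask):
    """Broadcast one operation to all PEs; only enabled PEs compute.

    pem_a and pem_b hold one operand per PE (its local memory PEMi),
    and mask[i] is True when PEi is enabled. A disabled PE keeps its
    old value from pem_a.
    """
    if not (len(pem_a) == len(pem_b) == len(mask)):
        raise ValueError("one operand pair and one mask bit per PE are required")
    return [op(a, b) if enabled else a
            for a, b, enabled in zip(pem_a, pem_b, mask)]


if __name__ == "__main__":
    a = [1, 2, 3, 4]
    b = [10, 20, 30, 40]
    mask = [True, False, True, False]   # PE1 and PE3 are disabled
    print(simd_execute(operator.add, a, b, mask))
```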
Fig: Configuration II (CU with CU memory, I/O and data bus, PEs, alignment network, and
parallel memory modules M0, M1, ..., MP-1)
Configuration II (BSP)
First, the local memories attached to the PEs are now replaced by parallel memory modules
shared by all the PEs through an alignment network.
Second, the inter-PE permutation network is replaced by the inter-PE memory alignment
network, which is again controlled by the CU.
In Configuration II (BSP), there are N PEs and P memory modules. The two numbers are not
necessarily equal; in fact, they have been chosen to be relatively prime. The alignment
network is a path-switching network between the PEs and the parallel memories.
C = <N,F,I,M>
M = the set of masking schemes, where each mask partitions the set of PEs into the
two disjoint subsets of enabled PEs and the disabled PEs.
Inter-PE communications
Network design decisions for inter-PE communications are discussed below. The decisions
are made among operation modes, control strategies, switching methodologies, and network
topologies.
Operation Mode
Two types of communication can be identified: synchronous and asynchronous.
A system may also be designed to facilitate both synchronous and asynchronous processing.
Therefore, the typical operation modes of interconnection networks can be classified into
three categories: synchronous, asynchronous, and combined.
Control Strategy
Interconnection functions are realized by properly setting the control of the switching
elements. The control-setting function can be managed by a centralized controller or by the
individual switching elements.
The latter strategy is called distributed control, and the former corresponds to
centralized control. Most existing SIMD interconnection networks use centralized
control of all switch elements by the control unit.
Switching Methodology
The two major switching methodologies are circuit switching and packet switching.
In circuit switching, a physical path is established between a source and a destination.
In packet switching, data is put in a packet and routed through the interconnection network
without establishing a physical connection path.
In general, circuit switching is more suitable for bulk data transmission, and packet switching
is more efficient for many short data messages. Another option, integrated switching,
includes the capabilities of both circuit switching and packet switching.
Network Topology :-
A network can be depicted by a graph in which nodes represent switching points and edges
represent communication links.
The topologies tend to be regular and can be grouped into two categories:
static and
dynamic.
In a static topology, links between two processors are passive, and dedicated buses cannot
be reconfigured for direct connections to other processors.
On the other hand, links in the dynamic category can be reconfigured by setting the
network's active switching elements.
The space of interconnection networks can be represented by the Cartesian product of the
above four sets of design features: {operation mode} * {control strategy} * {switching
methodology} * {network topology}.
Various interconnection networks have been suggested for SIMD computers. We distinguish
between single stage, recirculating networks and multistage networks. Important network
classes to be presented include the Illiac network, the flip network, the n-cube network, the
omega network, the data manipulator, the barrel shifter, and the shuffle-exchange network.
Formally, such an inter-PE communication network can be specified by a set of data routing
functions. To pass data between PEs that are not directly connected in the network, the data
must be passed through intermediate PEs by executing a sequence of routing functions
through the interconnection network.
The SIMD interconnection networks are classified into the following two categories based on
network topologies: static networks and dynamic networks.
Static networks: Topologies in static networks can be classified according to the
dimensions required for layout.
For illustration: one-dimensional, two-dimensional, three-dimensional, and hypercube.
Examples of one-dimensional topologies include the linear array used for some pipeline
architectures.
Two-dimensional topologies include the ring, star, tree, mesh, and systolic array. Examples of
these structures are shown in the figures.
Three-dimensional topologies include the completely connected, chordal ring, 3-cube, and
3-cube-connected-cycle networks. The mesh and the 3-cube are actually two- and
three-dimensional hypercubes, respectively.
The single-stage network is also called a recirculating network. Data items may have to
recirculate through the single stage several times before reaching their final destinations.
The number of recirculations needed depends on the connectivity in the single-stage network.
In general, the higher the hardware connectivity, the fewer the recirculations.
Multistage networks: Multistage networks are described by three characterizing features:
the switch box, the network topology, and the control structure.
Many switch boxes are used in a multistage network. Each box is essentially an interchange
device with two inputs and two outputs, as depicted in the figure.
There are four states of a switch box: straight, exchange, upper broadcast, and lower broadcast.
1. straight:-
2. exchange:-
3. upper broadcast
4. lower broadcast:-
A two-function switch box can assume either the straight or the exchange states. A four-
function switch box can be in any one of the four legitimate states.
The one-sided networks, sometimes called full switches, have input-output ports on the same
side.
Fig: One-sided connection network with ports 1, 2, 3, ..., N
The two-sided multistage networks, which usually have an input side and an output side,
can be divided into three classes: blocking, rearrangeable, and nonblocking.
Fig: Two-sided connection network with inputs 1, 2, 3, ... and outputs 1, 2, 3, ...
Blocking networks:-
In blocking networks, simultaneous connections of more than one terminal pair may result
in conflicts in the use of network communication links. Examples of blocking networks are
the data manipulator, omega, flip, n-cube, and baseline networks.
rearrangeable network:
nonblocking network:-
A network which can handle all possible connections without blocking is called a
nonblocking network.
A single-stage recirculating network has been implemented in the Illiac-IV array processor
with N = 64 PEs. Each PEi is allowed to send data to any one of PEi+1, PEi-1, PEi+r, and
PEi-r, where r = √N, in one circulation step through the network. Formally, the Illiac
network is characterized by the following four routing functions:
R+1(i) = (i + 1) mod N
R-1(i) = (i - 1) mod N
R+r(i) = (i + r) mod N
R-r(i) = (i - r) mod N
where 0 <= i <= N-1.
Each PEi in the figure is directly connected to its four nearest neighbors in the mesh network.
In terms of permutation cycles, we can express the above routing functions as follows:
Horizontally, all the PEs of all rows form a linear circular list, as governed by the following
two permutations, each with a single cycle of order N. The permutation cycles (a b c) (d e)
stand for the permutation a -> b, b -> c, c -> a and d -> e, e -> d, in a circular fashion
within each pair of parentheses:
R+1 = (0 1 2 ... N-1)
R-1 = (N-1 ... 2 1 0)
For the example network of N = 16 and r = 4, the shift by a distance of four is specified by
the following two permutations, each with four cycles of order four:
R+4 = (0 4 8 12)(1 5 9 13)(2 6 10 14)(3 7 11 15)
R-4 = (12 8 4 0)(13 9 5 1)(14 10 6 2)(15 11 7 3)
The Illiac network is only a partially connected network. The figure shows the connectivity of
the example Illiac network with N = 16. This graph shows that four PEs can be reached from any
PE in one step, seven more PEs in two steps, and the remaining four PEs in three steps.
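The connectivity of the Illiac network can be checked by a breadth-first search over its four
routing functions, shifts by +1, -1, +r, and -r modulo N with r = √N. The sketch below is an
illustration only; the function names are hypothetical, and N is assumed to be a perfect
square.

```python
from math import isqrt


def illiac_neighbors(i, n_pes):
    """PEs reachable from PE i in one step: shifts by +1, -1, +r, -r modulo N."""
    r = isqrt(n_pes)
    if r * r != n_pes:
        raise ValueError("N must be a perfect square")
    return {(i + 1) % n_pes, (i - 1) % n_pes, (i + r) % n_pes, (i - r) % n_pes}


def reachable_by_step(source, n_pes):
    """Entry s-1 of the result counts the PEs first reached in exactly s steps."""
    seen = {source}
    frontier = {source}
    counts = []
    while True:
        next_frontier = set()
        for pe in frontier:
            next_frontier |= illiac_neighbors(pe, n_pes)
        next_frontier -= seen
        if not next_frontier:
            return counts
        counts.append(len(next_frontier))
        seen |= next_frontier
        frontier = next_frontier


if __name__ == "__main__":
    print(sorted(illiac_neighbors(0, 16)))
    print(reachable_by_step(0, 16))
```

For N = 16 the search needs three steps to reach every other PE, that is, √N - 1 steps.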
Vertical lines connect vertices (PEs) whose addresses differ in the most significant bit
position. Vertices at both ends of the diagonal lines differ in the middle bit position.
Horizontal lines connect vertices that differ in the least significant bit position. This
unit-cube concept can be extended to an n-dimensional unit space, called an n-cube, with
n bits per vertex.
We shall use the binary sequence A = (an-1 ... a2 a1 a0) to represent the vertex (PE) address.
The complement of bit ai will be denoted as āi for any 0 <= i <= n-1.
In the n-cube, each PE located at a corner is directly connected to n neighbors. The
neighboring PEs differ in exactly one bit position. Formally, the n routing functions of the
cube network are Ci(an-1 ... ai ... a0) = an-1 ... āi ... a0 for i = 0, 1, ..., n-1.
The implementation of a single-stage cube network is illustrated in the figure for N = 8.
The interconnections of the PEs corresponding to the three routing functions C0, C1, and C2
are shown separately. If one assembles all three connecting patterns together, the 3-cube
shown in the figure results.
The same set of cube-routing functions C0, C1, and C2 can also be implemented by a three-stage
cube network, as modeled in the figure for N = 8.
Two-function (straight and exchange) switch boxes are used in constructing multistage cube
networks. The stages are numbered from 0 at the input end to n-1 at the output end. Stage i
implements the Ci routing function for i = 0, 1, 2, ..., n-1. This means that switch boxes at
stage i connect an input line to an output line that differs from it only at the ith bit
position.
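Since routing function Ci complements bit i of the PE address, it amounts to an exclusive-OR
with 2^i. The sketch below shows this, together with a simple way to route through the
multistage cube network: stage i is set to exchange when the source and destination addresses
differ in bit i, and to straight otherwise. The function names are hypothetical, and the
routing rule is an illustration of the stage description above, not a control algorithm taken
from the text.

```python
def cube(i, address):
    """Cube routing function Ci: complement bit i of the PE address."""
    return address ^ (1 << i)


def route_multistage(source, dest, n):
    """Route through stages 0..n-1; return (stage, switch state, address) triples."""
    path = []
    address = source
    for i in range(n):
        if ((address ^ dest) >> i) & 1:
            address = cube(i, address)   # exchange: move to the line differing in bit i
            state = "exchange"
        else:
            state = "straight"           # bit i already matches the destination
        path.append((i, state, address))
    return path


if __name__ == "__main__":
    for step in route_multistage(0b000, 0b101, n=3):
        print(step)
```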
Barrel shifters are also known as plus-minus-2^i (PM2I) networks. This type of network is
based on the following routing functions:
B+i(j) = (j + 2^i) mod N
B-i(j) = (j - 2^i) mod N
where 0 <= j <= N-1, 0 <= i <= n-1, and n = log2 N. Comparing these with the Illiac routing
functions, we find that
B+0 = R+1
B-0 = R-1
B+n/2 = R+r
B-n/2 = R-r
where r = √N = 2^(n/2). This implies that the Illiac routing functions are a subset of the
barrel-shifting functions. In addition to adjacent (±1) and fixed-distance (±r) shifting,
the barrel-shifting functions allow either forward or backward shifting by distances that are
integer powers of two. Each PE in a barrel shifter is directly connected to 2n - 1 PEs (the
shifts by +2^(n-1) and -2^(n-1) reach the same PE). Therefore, each PE in a barrel shifter has
2n - 5 more direct neighbors than in the Illiac network.
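The barrel-shifting functions are shifts by plus or minus a power of two modulo N, so the
neighbor set of a PE can be enumerated directly. The sketch below assumes N is a power of two
and uses a hypothetical function name; it also confirms that the Illiac neighbors (shifts by
±1 and ±r) form a subset.

```python
def barrel_neighbors(j, n_pes):
    """PEs reachable from PE j in one step of a barrel shifter (PM2I network)."""
    n = n_pes.bit_length() - 1          # n = log2 N
    if n_pes < 2 or (1 << n) != n_pes:
        raise ValueError("N must be a power of two")
    neighbors = set()
    for i in range(n):
        neighbors.add((j + (1 << i)) % n_pes)   # B+i
        neighbors.add((j - (1 << i)) % n_pes)   # B-i
    return neighbors


if __name__ == "__main__":
    print(sorted(barrel_neighbors(0, 8)))
    print(sorted(barrel_neighbors(0, 16)))
    illiac = {1, 15, 4, 12}             # neighbors of PE 0 in the Illiac network, N = 16
    print(illiac <= barrel_neighbors(0, 16))
```

The shifts by +2^(n-1) and -2^(n-1) land on the same PE, which is why the set has 2n - 1
members rather than 2n.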
The barrel shifter can be implemented as either a recirculating single-stage network or a
multistage network. The figure shows the interconnection patterns in a recirculating barrel
shifter for N = 8. The barrel-shifting functions B are executed by the interconnection
patterns shown.
A barrel shifter has been implemented with multiple stages in the form of a data manipulator.
As shown in the figure, the data manipulator consists of N cells. Each cell is essentially a
controlled shifter. This network is designed for implementing data-manipulating functions
such as permuting, replicating, spacing, masking, and complementing.
To implement a data-manipulating function, the control lines of the six groups (u1^(2^i),
u2^(2^i), h1^(2^i), h2^(2^i), d1^(2^i), d2^(2^i)) in each column must be properly set through
the use of the control register and the associated decoder.
The class of shuffle-exchange networks is based on two routing functions, shuffle (S) and
exchange (E). Let A = an-1 ... a1 a0 be a PE address:
S(an-1 ... a1 a0) = an-2 ... a1 a0 an-1
where 0 <= A <= N-1 and n = log2 N. The S function performs a cyclic shift of the bits in A to
the left by one bit position. This action corresponds to perfectly shuffling a deck
of N cards, as demonstrated in the figure.
The perfect shuffle cuts the deck into two halves from the center and intermixes them
evenly. The inverse perfect shuffle does the opposite to restore the original ordering. The
exchange-routing function E is defined by:
E(an-1 ... a1 a0) = an-1 ... a1 ā0
The complementing of the least significant bit means the exchange of data between two
PEs with adjacent addresses.
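Both routing functions are simple bit operations on the n-bit PE address: S rotates the
address left by one bit, and E complements the least significant bit. The sketch below uses
hypothetical function names; `perfect_shuffle` applies S to every position of a deck to show
the interleaving of the two halves.

```python
def shuffle(address, n):
    """S: cyclic left shift of an n-bit address by one bit position."""
    msb = (address >> (n - 1)) & 1
    return ((address << 1) & ((1 << n) - 1)) | msb


def exchange(address):
    """E: complement the least significant bit of the address."""
    return address ^ 1


def perfect_shuffle(deck):
    """Move the card at position a to position S(a); len(deck) must be 2**n."""
    n = len(deck).bit_length() - 1
    if len(deck) < 2 or (1 << n) != len(deck):
        raise ValueError("deck size must be a power of two")
    result = [None] * len(deck)
    for a, card in enumerate(deck):
        result[shuffle(a, n)] = card
    return result


if __name__ == "__main__":
    print([shuffle(a, 3) for a in range(8)])
    print(perfect_shuffle(list(range(8))))
    print([exchange(a) for a in range(8)])
```

Applying `shuffle` n times returns every address to itself, since the n-bit rotation has
period n.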
The use of the recirculating shuffle-exchange network for parallel processing was proposed by
Stone.
The shuffle-exchange functions have been implemented with the multistage omega network
by Lawrie. The omega network for N = 8 is illustrated in the figure. An N * N omega network
consists of n identical stages. Between two adjacent stages is a perfect-shuffle
interconnection. Each stage has N/2 switch boxes under independent box control.
Each box has four functions (straight, exchange, upper broadcast, lower broadcast), as
illustrated in the figure. The switch boxes in the omega network can be repositioned as shown
in the figure without violating the perfect-shuffle interconnections between stages.
Chapter:-4
Parallel computer:-
Loosely coupled:-
Tightly coupled:-
1. The processors in a tightly coupled system communicate through a common memory.
2. Each PE in a loosely coupled system has its own private local memory.
3. The processors are tied together by a switching scheme designed to route
information from one processor to another through a message-passing scheme.
4. A loosely coupled system is efficient when the interaction between tasks is minimal.
5. A tightly coupled system is efficient for real-time or high-speed applications.
Uniform Memory Access (UMA) is a shared memory architecture used in parallel computers.
All the processors in the UMA model share the physical memory uniformly. In a UMA
architecture, access time to a memory location is independent of which processor makes the
request or which memory chip contains the transferred data.
Uniform Memory Access computer architectures are often contrasted with Non-Uniform
Memory Access (NUMA) architectures.
In the UMA architecture, each processor may use a private cache. Peripherals are also shared
in some fashion. The UMA model is suitable for general-purpose and time-sharing
applications by multiple users. It can be used to speed up the execution of a single large
program in time-critical applications.[1]
1. The bus system is totally passive, with no active components such as switches.
2. Conflict-resolution methods such as fixed priorities, FIFO queues, and daisy chaining
are devised for efficient utilization of the resource.
3. The time-shared bus is further divided into different subcategories:
a. Single-bus multiprocessor
b. Unidirectional buses
c. Multibus multiprocessor organization
4. It has the lowest overall system cost for the network and is the least complex.
5. It is very easy to physically modify the system configuration.
Disadvantage :-
1. It is the most complex interconnection system and has the highest data transfer rate.
2. The functional units are the simplest and cheapest.
3. Switches are used for routing requests to memory and other ports.
4. This organisation is usually cost-effective for multiprocessors only.
5. System expansion (addition of functional units) usually improves the overall
performance.
6. The reliability of the switch and the system can be improved by segmentation within the
switches.
Non-Uniform Memory Access
Basic concept
Fig: One possible architecture of a NUMA system. Notice that the processors are connected to
the bus or crossbar by connections of varying thickness/number. This shows that different CPUs
have different priorities for memory access based on their location.
Modern CPUs operate considerably faster than the main memory to which they are attached.
In the early days of high-speed computing and supercomputers the CPU generally ran slower
than its memory, until the performance lines crossed in the 1970s. Since then, CPUs,
increasingly starved for data, have had to stall while they wait for memory accesses to
complete. Many supercomputer designs of the 1980s and 90s focused on providing high-
speed memory access as opposed to faster processors, allowing them to work on large data
sets at speeds other systems could not approach.
Limiting the number of memory accesses provided the key to extracting high performance
from a modern computer. For commodity processors, this means installing an ever-increasing
amount of high-speed cache memory and using increasingly sophisticated algorithms to avoid
"cache misses". But the dramatic increase in size of the operating systems and of the
applications run on them has generally overwhelmed these cache-processing improvements.
Multi-processor systems make the problem considerably worse. Now a system can starve
several processors at the same time, notably because only one processor can access memory
at a time.
NUMA attempts to address this problem by providing separate memory for each processor,
avoiding the performance hit when several processors attempt to address the same memory.
For problems involving spread data (common for servers and similar applications), NUMA
can improve the performance over a single shared memory by a factor of roughly the number
of processors (or separate memory banks).
Of course, not all data ends up confined to a single task, which means that more than one
processor may require the same data. To handle these cases, NUMA systems include
additional hardware or software to move data between banks. This operation has the effect of
slowing down the processors attached to those banks, so the overall speed increase due to
NUMA will depend heavily on the exact nature of the tasks run on the system at any given
time.
Nearly all CPU architectures use a small amount of very fast non-shared memory known as
cache to exploit locality of reference in memory accesses. With NUMA, maintaining cache
coherence across shared memory has a significant overhead.
Typically, cache-coherent NUMA (ccNUMA) uses inter-processor communication between cache
controllers to keep a consistent memory image when more than one cache stores the same memory
location. For this reason, ccNUMA performs poorly when multiple processors attempt to
access the same memory area in rapid succession. Operating-system support for NUMA
attempts to reduce the frequency of this kind of access by allocating processors and memory
in NUMA-friendly ways and by avoiding scheduling and locking algorithms that make
NUMA-unfriendly accesses necessary. Alternatively, cache coherency protocols such as the
MESIF protocol attempt to reduce the communication required to maintain cache coherency.
Current ccNUMA systems are multiprocessor systems based on the AMD Opteron,
which can be implemented without external logic, and Intel Itanium, which requires the
chipset to support NUMA. Examples of ccNUMA enabled chipsets are the SGI Shub (Super
hub), the Intel E8870, the HP sx2000 (used in the Integrity and Superdome servers), and
those found in recent NEC Itanium-based systems. Earlier ccNUMA systems such as those
from Silicon Graphics were based on MIPS processors and the DEC Alpha 21364 (EV7)
processor.
Intel announced the introduction of NUMA to its x86 and Itanium servers in late 2007 with the
Nehalem and Tukwila CPUs. Both CPU families share a common chipset; the
interconnect is called Intel QuickPath Interconnect (QPI).
One can view NUMA as a very tightly coupled form of cluster computing. The addition of
virtual memory paging to a cluster architecture can allow the implementation of NUMA
entirely in software where no NUMA hardware exists. However, the inter-node latency of
software-based NUMA remains several orders of magnitude greater than that of hardware-
based NUMA.
In a typical SMP (Symmetric MultiProcessor) architecture, all memory accesses are posted to
the same shared memory bus. This works fine for a relatively small number of CPUs, but the
problem with the shared bus appears when you have dozens, even hundreds, of CPUs
competing for access to the shared memory bus. This leads to a major performance
bottleneck due to the extremely high contention between the multiple CPUs for the
single memory bus.
The NUMA architecture was designed to surpass these scalability limits of the SMP
architecture. NUMA computers offer the scalability of MPP (Massively Parallel Processing),
in that processors can be added and removed at will without loss of efficiency, and the
programming ease of SMP.
Load balancing:-
Load balancing is used to distribute computation fairly across processors in order to
obtain the highest possible execution speed.
So far it was assumed that processes were simply distributed among the available processors, without
any discussion of the types of processor and their speeds.
However, it may be that some processors complete their tasks before the others and become
idle, because the work is unevenly divided or because some processors operate faster than others.
Load balancing during execution is particularly useful when the amount of work is not known prior to execution.
It is impossible to estimate accurately the execution time of the various parts of a program without
actually executing the parts.
All these factors are taken into account by making the division of load depend upon the
execution of the parts as they are being executed.
1. Centralized LBA
2. Decentralized LBA
3. Semi-centralized LBA
Centralized load balancing:
1. In this scheme a master process holds the collection of tasks to be performed. Tasks are sent to
the slave processes; when a slave process completes one task, it requests another task
from the master process.
Figure: master process with its task queue (send task, receive task).
2. It makes a global decision about the reallocation of work among the processors. Some
centralized algorithms assign the maintenance of the system's global state information
to a particular node.
3. Global state information allows the algorithm to do a good job of balancing the work
among the processes.
Disadvantage:
Figure: mini-master processes sending tasks to, and receiving tasks from, their slave processes.
b. Each processing element holds its own information about its state.
These processing elements are free to interact with each other and to balance
the load.
c. An interchange of data takes place if one processor has more data and another has
less.
Disadvantage:
The work load may not be balanced as well as in the case of the centralized
load balancing algorithm.
A heavily loaded process passes some of its tasks to others that are willing to accept them.
Alternatively, a process requests tasks from other processes that it selects; it does so
when it is free and has no task to perform.
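The centralized (master and slave) scheme described earlier can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `centralized_work_pool`, the task durations, and the rule that the slave that becomes idle first is served first are all hypothetical choices made for illustration, not part of the course text.

```python
import heapq

def centralized_work_pool(task_times, num_slaves):
    """Simulate a master handing out tasks to whichever slave asks first."""
    # Each heap entry is (time at which the slave becomes free, slave id).
    free_at = [(0, s) for s in range(num_slaves)]
    heapq.heapify(free_at)
    assignment = {s: [] for s in range(num_slaves)}
    for task_id, duration in enumerate(task_times):
        # The earliest idle slave requests the next task from the master queue.
        now, slave = heapq.heappop(free_at)
        assignment[slave].append(task_id)
        heapq.heappush(free_at, (now + duration, slave))
    # The run finishes when the last slave completes its last task.
    makespan = max(t for t, _ in free_at)
    return assignment, makespan

# Five tasks with assumed durations, shared between two slaves.
assignment, makespan = centralized_work_pool([4, 2, 2, 1, 3], num_slaves=2)
print(assignment, makespan)
```

Because a slave only asks for work when it is idle, a faster or luckier slave automatically ends up with more tasks, which is the point of balancing the load during execution.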
Process scheduling:-
It is the allocation of tasks to processors. A schedule is illustrated by a Gantt chart indicating
the time each task spends in execution as well as the processor on which it executes. So we can say that
scheduling is the division of work among the processors.
Types of scheduling:-
1. Static scheduling
2. Dynamic scheduling
Static scheduling:-
The technique of separating dependent instructions so as to minimize the number of actual
hazards and resultant stalls is called static scheduling. It became popular with pipelining.
1. Static scheduling sometimes results in lower execution times than dynamic scheduling.
2. Static scheduling can be used to predict the speedup that can be achieved by a
particular parallel algorithm on a target machine, assuming no pre-emption of processes
occurs.
3. Static scheduling can allow the generation of only one process per processor,
reducing process creation, synchronization, and termination overhead.
Dynamic scheduling:
The technique in which the hardware rearranges the instruction execution to reduce stalls.
Dynamic scheduling offers several advantages.
1. It simplifies the compiler.
2. It allows code that was compiled with one pipeline in mind to run efficiently on a
different pipeline.
Scheduling algorithm:
1. Deterministic model
2. Graham’s list scheduling algorithm
3. Coffman-Graham scheduling algorithm
1. Deterministic model:
A parallel algorithm is a collection of tasks, some of which must be completed before others begin.
In the deterministic model the precedence relations between tasks and the execution time of each task are
predefined, that is, known before run time.
For example, consider the task graph illustrated in the figure given below. We are given a set
of seven tasks and their precedence relations (information about which tasks must be completed before
other tasks can be started).
Figure: Task graph with seven tasks T1 to T7.
In the figure each node represents a task to be performed. A directed arc from Ti to Tj indicates that
task Ti must be completed before Tj starts.
The allocation of tasks to processors is given by a schedule. Gantt charts are the best way
to illustrate schedules. A Gantt chart indicates the time each task spends in execution as
well as the processor on which it executes.
Figure: Gantt chart of a schedule for tasks T1 to T7 (time axis 1 to 9).
Figure: Task graph with nine tasks T1 to T9.
Algorithm steps:-
Here S(T) denotes the set of immediate successors of task T, N(T) denotes the decreasing sequence of the labels of the immediate successors of T, and α(T) is the label assigned to T.
1. Choose an arbitrary task Tk from T such that S(Tk) = ∅, and define α(Tk) = 1.
2. For i ← 2 to n do
a. Let R be the set of unlabeled tasks with no unlabeled successors.
b. Let T* be the task in R such that N(T*) is lexicographically smaller than
N(T) for all T in R.
c. Let α(T*) ← i.
3. Construct a list of tasks L = (Un, Un-1, ..., U2, U1) such that α(Ui) = i for all i
where 1 ≤ i ≤ n.
4. Given (T, <, L), use Graham’s list scheduling algorithm to schedule the tasks in T.
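The labelling and list scheduling steps above can be sketched in Python as follows. Several things here are assumptions made for illustration: the five-task graph `succ` is hypothetical (it is not the graph from the figures), every task is assumed to take one time unit, ties between equal N(T) sequences are broken by task name, and the function names are invented for this sketch.

```python
def coffman_graham_labels(succ):
    """Assign labels 1..n; succ maps each task to its immediate successors."""
    label = {}

    def n_of(task):
        # N(T): labels of T's immediate successors, in decreasing order.
        return sorted((label[s] for s in succ[task]), reverse=True)

    for i in range(1, len(succ) + 1):
        # R: unlabeled tasks whose successors have all been labeled already.
        ready = [t for t in succ
                 if t not in label and all(s in label for s in succ[t])]
        # Pick the lexicographically smallest N(T); tie-break by name (assumed).
        best = min(ready, key=lambda t: (n_of(t), t))
        label[best] = i
    return label

def list_schedule(succ, priority, num_procs):
    """Graham list scheduling for unit-time tasks on num_procs processors."""
    preds = {t: set() for t in succ}
    for t, successors in succ.items():
        for s in successors:
            preds[s].add(t)
    done, schedule = set(), []
    while len(done) < len(succ):
        # Scan the list in priority order and take the first ready tasks.
        step = [t for t in priority
                if t not in done and preds[t] <= done][:num_procs]
        schedule.append(step)
        done.update(step)
    return schedule

# Hypothetical task graph: T1 -> T3, T2 -> T3, T2 -> T4, T3 -> T5, T4 -> T5.
succ = {"T1": ["T3"], "T2": ["T3", "T4"], "T3": ["T5"], "T4": ["T5"], "T5": []}
label = coffman_graham_labels(succ)
priority = sorted(succ, key=lambda t: -label[t])   # L = (Un, ..., U1)
schedule = list_schedule(succ, priority, num_procs=2)
print(label, schedule)
```

A task with no successors has an empty N(T), which compares smaller than any non-empty sequence, so the first iteration of the loop plays the role of step 1.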
Figure: graph labelled with T1 to T5 and N1 to N5.
G1 = N1N3N5 + N2N3N5 + N4N5
= [(N1 + N2)N3 + N4]N5
G3 = X1X3 + X1X4 + X2X4
G is said to be simple if its polynomial can be factored so that each variable appears
exactly once.
Chapter 5 &6
Parallel Algorithms:-
Algorithms designed for sequential computers are known as sequential algorithms.
Algorithms used to solve problems on parallel computers are known
as parallel algorithms.
A parallel algorithm defines how the problem is divided into subproblems, how the
processors communicate, and how the partial solutions are combined to produce the
final result.
Parallel algorithms depend upon the type of parallel computer they are designed for.
In order to simplify the design and analysis of parallel algorithms, parallel computers are
represented by various abstract machine models. These machine models try to
capture the important features of parallel computers.
Assumption:-
1. In designing algorithms for these models, we learn about the inherent parallelism in the
problem.
2. These models help us to compare the relative computational power of various
parallel computers.
3. These models help us to determine the kind of parallel architecture that is best
suited for a given problem.
Model of computation:-
RAM (random access machine)
This machine model of computing abstracts the sequential computer.
Figure: RAM model
1. Memory unit:-
A memory unit has M locations, where M can be unbounded.
2. Processor:-
A processor operates under the control of a sequential algorithm.
The processor can read from and write to a memory location and can perform
basic arithmetic and logic operations (ALU).
3. MAU (memory access unit):-
It creates a path from the processor to an arbitrary location
in the memory. The processor provides the memory access unit with the
address of the location that it wishes to access and the operation it wants to
perform. The memory access unit uses that address to establish a direct
connection between the processor and the memory location.
An algorithm for the RAM consists of the following steps:
1. Read:-
The processor reads data from memory and stores it in
its local register.
2. Execute:-
The processor performs a basic arithmetic or logic operation on the
contents of one or two of its local registers.
3. Write:-
The processor writes the contents of one register into an arbitrary memory
location.
The PRAM is one of the most popular models for designing parallel algorithms.
Figure: PRAM model. Processors P1, P2, ..., PN access a shared memory through the memory access unit (MAU).
Read:-
Compute:-
1. EREW
2. CREW
3. ERCW
4. CRCW
EREW:-
EREW stands for exclusive read, exclusive write. In this model every
access to a memory location (read and write) has to be exclusive; that is, concurrent read and
write operations are not allowed. This model provides the least amount of concurrency
and is therefore the weakest model.
2. CREW:-
CREW stands for concurrent read, exclusive write PRAM. In this model
only write operations to a memory location are exclusive. Concurrent reads are
allowed, which means two or more processors can concurrently read from the same memory
location. This is one of the most commonly used models.
3. ERCW:-
ERCW stands for exclusive read, concurrent write PRAM. In
this model read operations are exclusive. This model allows multiple processors to
write concurrently to the same memory location. This model is not frequently used and
is defined here for the sake of completeness.
4. CRCW:-
CRCW stands for concurrent read, concurrent write PRAM. This
model allows multiple processors to read from and write to the same memory location concurrently. It provides the
maximum amount of concurrency.
This is the most powerful of the four models.
Types of CRCW:-
There are several protocols that specify the value that is written to a
memory location in situations where many processors try to write different values to the same
memory location.
3. Arbitrary CRCW:-
4. Combining CRCW:-
In this protocol there is a function that maps the multiple values that the processors
try to write to a single value that is actually written into the memory location.
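The combining rule can be illustrated with a short simulation of one concurrent-write step. This is a minimal sketch: the function name `combining_crcw_write`, the representation of write requests as (address, value) pairs, and the choice of `max` and addition as combining functions are assumptions made for illustration.

```python
def combining_crcw_write(memory, requests, combine):
    """Apply one concurrent-write step; requests are (address, value) pairs."""
    # Group the values that different processors want to write per address.
    pending = {}
    for address, value in requests:
        pending.setdefault(address, []).append(value)
    # Map the competing values for each address to the single value stored.
    for address, values in pending.items():
        result = values[0]
        for v in values[1:]:
            result = combine(result, v)
        memory[address] = result
    return memory

# Two processors write to address 1 at the same time; one writes to address 2.
requests = [(1, 5), (1, 7), (2, 4)]
with_max = combining_crcw_write([0, 0, 0], requests, max)
with_sum = combining_crcw_write([0, 0, 0], requests, lambda a, b: a + b)
print(with_max, with_sum)
```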
Interconnection network:-
Exchange of data between processors takes place either through shared memory, as in the
PRAM model, or by direct links connecting the processors.
A combinational circuit is viewed as a device that has a set of input lines at one end and a set
of output (o/p) lines at the other end.
Figure: stages of an interconnection network.
1. Running time
2. Number of processors
3. Cost.
Relative strengths/power /features of PRAM Model:-
1. EREW PRAM:- The EREW PRAM model has the least concurrency and is therefore the
weakest model.
2. CREW PRAM:- The CREW PRAM model is more commonly used than the
EREW PRAM model.
3. ERCW PRAM:- The ERCW PRAM is rarely used; it is defined mainly for the sake of
completeness.
4. CRCW PRAM:- The CRCW PRAM is the strongest model and it is widely used.
Chapter 6
PRAM ALGORITHMS
Figure: parallel reduction tree.
Array representation of parallel reduction
Index:   0   1   2   3   4   5   6   7   8   9
A:       4   6   1   2   8   3   0   5   7   3
Step 1: 10   3  11   5  10
Step 2: 13  16  10
Step 3: 29  10
Step 4: 39
The complexity of this algorithm is O(log n) using n/2 processors. This is the overall
time complexity.
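The pairwise summation can be simulated sequentially to count how many parallel steps it needs. This is a sketch only: the name `parallel_sum` is hypothetical, the list comprehension stands in for additions that a PRAM would perform simultaneously, and the input is assumed to be non-empty.

```python
def parallel_sum(values):
    """Simulate the tree-structured sum; returns (total, number of steps)."""
    level = list(values)
    steps = 0
    while len(level) > 1:
        # All pairwise additions of one level happen in the same time step.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # an unpaired last element is carried over
        level = nxt
        steps += 1
    return level[0], steps

total, steps = parallel_sum([4, 6, 1, 2, 8, 3, 0, 5, 7, 3])
print(total, steps)
```

For ten values the loop runs four times, which matches the ceiling of log2(10), and at most five additions are needed in any one step, in line with the n/2 processor count.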
Given a set of n values a1, a2, a3, a4, ..., an and an associative operation +, the
prefix sum problem is to compute the n quantities:
a1
a1+a2
a1+a2+a3
.
.
a1+a2+...+an
For example, applying the operation "+" to the array (3, 10, 4, 2), the prefix sums are
1) a1 = 3
2) a1+a2 = 3+10 = 13
3) a1+a2+a3 = 3+10+4 = 17
4) a1+a2+a3+a4 = 3+10+4+2 = 19
So the resulting array is (3, 13, 17, 19).
PRAM Algorithm:-
To find the prefix sums of an n-element list using n-1 processors.
Prefix sums (CREW PRAM)
Initial condition:- list of n ≥ 1 elements stored in A[0 ... (n-1)]
Final condition:- each element A[i] contains A[0]+A[1]+A[2]+...+A[i]
Global variables:- n, A[0 ... (n-1)], j
Begin
Spawn (P1, P2, ..., Pn-1)
For all Pi where 1 ≤ i ≤ n-1 do
For j ← 0 to ⌈log n⌉-1 do
If i-2^j ≥ 0 then
A[i] ← A[i]+A[i-2^j]
End if
End for
End for
End
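The prefix sums pseudocode above can be simulated on a sequential machine. In this sketch the name `pram_prefix_sums` is hypothetical, and a copy of the array is taken at the start of every round to model the PRAM assumption that all processors read before any processor writes.

```python
def pram_prefix_sums(a):
    """Simulate the CREW PRAM prefix sums algorithm on a Python list."""
    a = list(a)
    n = len(a)
    rounds = (n - 1).bit_length() if n > 0 else 0   # ceil(log2 n)
    for j in range(rounds):
        distance = 2 ** j
        old = list(a)            # every processor reads the old values first
        for i in range(n):       # conceptually, all i run in parallel
            if i - distance >= 0:
                a[i] = old[i] + old[i - distance]
    return a

print(pram_prefix_sums([3, 10, 4, 2]))
print(pram_prefix_sums([4, 6, 1, 2, 8, 3, 0, 5, 7, 3]))
```

Without the snapshot `old`, a sequential loop would read values already updated in the same round and produce wrong sums, which is why the synchronous read-then-write behaviour has to be modelled explicitly.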
Let us take another problem: separating the upper case letters from the lower case letters while
maintaining their order, keeping the upper case letters first in the array A.
Let us take an array A of n letters with the following elements.
A b C D e F g h
Array A
We also have another array T of the same size, which is used to obtain the new index of each element in
the array. We put "1" wherever we find an upper case letter and "0" wherever we find a
lower case letter.
1 0 1 1 0 1 0 0
Computing the prefix sums of T gives:
1 1 2 3 3 4 4 4
Where a value is repeated, we take the element at its first occurrence. For example, the value "1" first
occurs at A, so A is the first upper case letter.
Similarly, 2 corresponds to C; 3 is repeated, so we take its first occurrence, D, and so on.
1 2 3 4 ............
A C D F b e g h
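The packing step can be sketched with prefix sums as follows. The function name `pack_upper_first` is hypothetical, and placing the lower case letters with a second prefix sum over the inverted flags is an assumption of this sketch; the text above only describes how the upper case letters are placed.

```python
from itertools import accumulate

def pack_upper_first(letters):
    """Move upper case letters to the front, keeping relative order."""
    flags = [1 if c.isupper() else 0 for c in letters]     # array T
    upper_pos = list(accumulate(flags))                    # prefix sums of T
    lower_pos = list(accumulate(1 - f for f in flags))     # assumed second pass
    n_upper = upper_pos[-1] if letters else 0
    out = [None] * len(letters)
    for i, c in enumerate(letters):   # each i would be handled by one processor
        if flags[i]:
            out[upper_pos[i] - 1] = c
        else:
            out[n_upper + lower_pos[i] - 1] = c
    return out

packed = pack_upper_first(list("AbCDeFgh"))
print("".join(packed))
```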
Index:   0   1   2   3   4   5   6   7   8   9
Step 0:  4   6   1   2   8   3   0   5   7   3
Step 1:  4  10   7   3  10  11   3   5  12  10
Step 2:  4  10  11  13  17  14  13  16  15  15
Step 3:  4  10  11  13  21  24  24  29  32  29
Step 4:  4  10  11  13  21  24  24  29  36  39
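From step 0 to 1:
Each element is added to the element one position to its right, and that neighbour's value is
updated.
From step 1 to 2:
Each element is added to the element two positions to its right, and that element's value is updated.
From step 2 to 3:
Each element is added to the element four positions to its right, and that element's value is updated.
From step 3 to 4:
Each element is added to the element eight positions to its right, and that element's value is updated.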
The suffix sum problem is a variant of the prefix sum problem in which the array is replaced by a
linked list and the sums are computed from the end rather than from the beginning.
Final condition: values in array position contain the original distance of each element from the end of the
list.
Begin
For all Pi where 0 ≤ i ≤ n-1 do
If next[i] = i then
position[i] ← 0
Else
position[i] ← 1
End if
For j ← 1 to ⌈log n⌉ do
position[i] ← position[i]+position[next[i]]
next[i] ← next[next[i]]
End for
End for
End
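The pointer-jumping loop above can be simulated as follows. In this sketch the name `list_ranking` and the four-element example list are hypothetical; the last element of the list is assumed to point to itself, as in the pseudocode, and snapshots of both arrays model the synchronous PRAM step.

```python
def list_ranking(next_):
    """Return each element's distance from the end of the linked list."""
    n = len(next_)
    nxt = list(next_)
    position = [0 if nxt[i] == i else 1 for i in range(n)]
    for _ in range((n - 1).bit_length() if n > 0 else 0):   # ceil(log2 n) rounds
        # Snapshot so that every processor reads old values before writing.
        old_pos, old_nxt = list(position), list(nxt)
        for i in range(n):       # conceptually, all i run in parallel
            position[i] = old_pos[i] + old_pos[old_nxt[i]]
            nxt[i] = old_nxt[old_nxt[i]]
    return position

# Assumed list order: 3 -> 1 -> 0 -> 2, where element 2 is the end of the list.
ranks = list_ranking([2, 0, 2, 1])
print(ranks)
```

Each round doubles the distance covered by every pointer, so after about log n rounds every element points at the end of the list and holds its full distance.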
Parallel merge:
Given two sorted lists of n/2 elements each, stored in A[1] ... A[n/2] and A[(n/2)+1] ... A[n]
Global A[1 ... n]
Begin
Spawn (P1, P2, ..., Pn)
For all Pi where 1 ≤ i ≤ n do
If i ≤ n/2 then
low ← (n/2)+1
high ← n
Else
low ← 1
high ← n/2
End if
x ← A[i]
Repeat
index ← ⌊(low+high)/2⌋
If x < A[index] then
high ← index-1
Else
low ← index+1
End if
Until low > high
A[high+i-n/2] ← x
End for
End
On the PRAM the time complexity of merging reduces to O(log n), compared with O(n) for the RAM algorithm.
One processor is assigned to each element; it determines the position of that element in
the final merged list. The index found by the search gives this position. n processors are required
for n elements. The processors of the upper list access the lower list and vice versa.
Figure: example of merging two sorted lists.
Every processor finds the position of its own element in the other list using binary search.
Because an element's index in its own list is known, its position in the merged list can
be computed once its index in the other list is found: the two indices are added.
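The merging idea described above, where every element finds its rank in the other list by binary search, can be sketched as follows. This version uses 0-based indices instead of the 1-based indices of the pseudocode, and it assumes an even number of elements with all values distinct; the name `parallel_merge` and the sample data are hypothetical.

```python
def parallel_merge(a):
    """a holds two sorted halves with distinct values; returns the merged list."""
    n = len(a)
    half = n // 2
    merged = [None] * n
    for i in range(n):            # one (conceptual) processor per element
        x = a[i]
        # Binary search in the *other* half for the elements smaller than x.
        low, high = (half, n - 1) if i < half else (0, half - 1)
        while low <= high:
            index = (low + high) // 2
            if x < a[index]:
                high = index - 1
            else:
                low = index + 1
        # low is now the first index in the other half whose value exceeds x.
        if i < half:
            own_rank, smaller_in_other = i, low - half
        else:
            own_rank, smaller_in_other = i - half, low
        # Position in the merged list = rank in own list + rank in other list.
        merged[own_rank + smaller_in_other] = x
    return merged

merged = parallel_merge([1, 5, 7, 9, 2, 3, 11, 22])
print(merged)
```

The result is written into a separate list here so that the sequential simulation does not overwrite values that other "processors" still need to read.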
A parallel algorithm whose cost has the same complexity as that of an optimal RAM algorithm
is termed a cost-optimal algorithm.
A cost-optimal parallel reduction algorithm has time complexity O(log n) using n/log n processors.
2. After determining the number of processors, we need to verify whether a cost-optimal parallel
reduction algorithm with O(log n) complexity exists or not. This is done using Brent’s theorem.
The amount of work an algorithm performs is the run time of the algorithm multiplied by the
number of processors it uses. A conventional (sequential) algorithm may be thought of as a parallel
algorithm designed for one processor. An algorithm is said to be cost optimal if the amount of
work it does is the same as that of the best known sequential algorithm.
Brent’s theorem:-
T = t + (m-t)/p
Its typical application is when p is less than the number of processors that gave rise to the
measured time t.
Note that when p = 1, Brent’s bound gives T = m (the single processor executes the operations sequentially, one after
another).
Brent’s theorem states that, for an algorithm with t time steps and a total of m
operations, a run time T is definitely possible on a shared memory machine with p
processors. There may be an algorithm that solves the problem faster, or it may be possible to
implement this algorithm faster (by scheduling instructions differently to minimize idle
processors, for instance), but it is definitely possible to implement this algorithm in this time,
given p processors.
The key to understanding Brent’s theorem is understanding time steps. In a single time step every
instruction that has no pending dependencies is executed, and therefore t is equal to the length of the
longest chain of instructions that depend on the results of other instructions
(as any shorter chains will have finished executing by the time the
longest chain has).
Using this algorithm, each add operation depends on the result of the previous one, forming a
chain of length n; thus t = n. There are n operations, so m = n.
T = n + 0/p = n
since (m-t) = 0. So no matter how many processors are available, this algorithm will take time n.
For an algorithm with t = log n and m = n (for example, a summation arranged as a balanced tree),
Brent’s theorem gives:
1. No matter how many processors are used, there can be no implementation of this
algorithm that is faster than log(n).
2. If we have n processors, the algorithm can be implemented in about log(n) time.
3. If we have n/log(n) processors, the algorithm can be implemented in 2 log(n) time.
4. If we have one processor, the algorithm can be implemented in n time.
If we consider the amount of work done in each case: with one processor we do n
work, with n/log(n) processors we do work of order n, but with n processors we do n log(n)
work.
So the implementations with 1 or n/log(n) processors are cost optimal,
while the implementation with n processors is not cost optimal.
Brent’s theorem does not tell us how to implement a parallel algorithm, but it tells us
what is possible.
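The bound T = t + (m-t)/p is easy to evaluate numerically. The following sketch uses a hypothetical helper `brent_bound` and an assumed problem size n = 65536 (so log2 n = 16) to compare a dependency chain with a tree-shaped summation.

```python
def brent_bound(t, m, p):
    """Run time guaranteed by Brent's theorem: T = t + (m - t) / p."""
    return t + (m - t) / p

n, log_n = 65536, 16          # assumed size; log_n is log2(n)

# Chain of dependent additions: t = m = n, so extra processors do not help.
chain = brent_bound(n, n, 8)

# Tree-shaped summation: t = log n, m = n.
one_proc = brent_bound(log_n, n, 1)             # sequential execution
n_over_log = brent_bound(log_n, n, n // log_n)  # p = n / log n
n_procs = brent_bound(log_n, n, n)              # p = n

print(chain, one_proc, n_over_log, n_procs)
```

With p = n/log n the bound stays just under 2 log n, so the work p times T is of order n, whereas with p = n the work is of order n log n.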
NC Algorithm:-
The class NC is a set of languages decidable in parallel time T.
NC is the class of problems solvable on a PRAM in polylogarithmic time using a
number of processors that is a polynomial function of the problem size.
If an algorithm is in NC, it remains in NC regardless of which PRAM submodel
we assume.
The class NC include:
1. Parallel prefix computation.
2. Parallel sorting and selection.
3. Matrix operation.
4. Parallel tree contraction and expression evaluation.
5. Parallel algorithms for graphs.
6. Parallel algorithms for computational geometry.
7. Parallel graph algorithms for biconnectivity and triconnectivity.
8. Parallel string matching algorithms.
Many NC algorithms are cost optimal; that is, they have
T(n, p(n)) = O(log n)
p(n) = n/log n
cost = p(n) · T(n, p(n)) = O(n)
where O is the complexity (big-O) notation.
Drawbacks of NC theory:-
1. The class NC may include some algorithms which are not efficiently
parallelizable. The most infamous example is parallel binary search.
2. NC theory assumes situations where a huge machine solves a moderately
sized problem very quickly.
However, in practice moderately sized machines are used to solve large problems, so that
the number of processors tends to be subpolynomial.
Chapter:- 7
For i = 1 to n Do
Par for k = 1 to n Do
c(i,k) = 0 (vector load)
For j = 1 to n Do
Par for k = 1 to n Do
c(i,k) = c(i,k) + a(i,j) · b(j,k) (vector multiply)
End of j loop
End of i loop
A vector load operation is performed to initialize the row vectors of matrix C one row at a
time. In the vector multiply operation, the same multiplier aij is broadcast from the CU to all
PEs to multiply all n elements of the jth row vector of B.
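The row-at-a-time scheme described above can be sketched in Python. This is an illustration under assumptions: the name `simd_matmul` is hypothetical, the matrices are square lists of lists, and each list comprehension over k stands in for one operation executed by all n PEs at once.

```python
def simd_matmul(a, b):
    """Multiply square matrices, treating each loop over k as one SIMD step."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        # Vector load: initialise row i of C, one element per PE.
        c[i] = [0 for k in range(n)]
        for j in range(n):
            # The CU broadcasts the multiplier a[i][j] to all PEs; PE k
            # multiplies it by its element of row j of B and accumulates.
            multiplier = a[i][j]
            c[i] = [c[i][k] + multiplier * b[j][k] for k in range(n)]
    return c

product = simd_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)
```

Only the two outer loops are sequential, so with n PEs the multiplication takes on the order of n squared vector steps instead of n cubed scalar steps.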
Two time measures are needed to estimate the time complexity of the parallel-sorting
algorithm. Let tR be the routing time required to move one item from a PE to one of its
neighbours, and tC be the comparison time required for one comparison step. This means that
a comparison-interchange step between the two items in adjacent PEs can be done in 2tR + tC
time units.
The sorting problem depends on the indexing scheme used for the PEs. The PEs may be indexed
by a bijection from {1, 2, ..., n} × {1, 2, ..., n} to {0, 1, ..., N-1}, where N = n^2. Three
indexing patterns are formed after sorting the given array in part a with respect to three different
ways of indexing the PEs. The pattern in part b corresponds to row-major indexing, part
c corresponds to shuffled row-major indexing, and part d is based on snake-like row-major
indexing.
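The three indexing schemes can be written as small functions that map a PE's row and column to its index. This sketch assumes 0-based row and column numbers (the text uses 1 to n), assumes that n is a power of two for the shuffled scheme, and takes shuffled row-major indexing to mean interleaving the bits of the row and column numbers; the function names are hypothetical.

```python
def row_major(i, j, n):
    """Index PEs row by row, left to right."""
    return i * n + j

def snake_row_major(i, j, n):
    """Like row-major, but odd-numbered rows run right to left."""
    return i * n + (j if i % 2 == 0 else n - 1 - j)

def shuffled_row_major(i, j, n):
    """Interleave the bits of i and j (n is assumed to be a power of two)."""
    bits = (n - 1).bit_length()
    index = 0
    for b in reversed(range(bits)):
        index = (index << 1) | ((i >> b) & 1)   # next bit of the row number
        index = (index << 1) | ((j >> b) & 1)   # next bit of the column number
    return index

n = 4
for scheme in (row_major, shuffled_row_major, snake_row_major):
    print(scheme.__name__)
    for i in range(n):
        print([scheme(i, j, n) for j in range(n)])
```

Each function is a bijection from the n by n grid positions onto 0 to N-1, so sorting "with respect to" a scheme means that the smallest item ends up in the PE with index 0, the next in the PE with index 1, and so on.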
Shuffle and unshuffle operations can each be implemented with a sequence of interchange
operations. Both the perfect shuffle and its inverse can be done in K – 1 interchanges or
2(K – 1) routing steps on a linear array of 2K PEs.
A parallel algorithm for multiprocessors is a set of K concurrent processes which may operate
simultaneously and cooperatively to solve a given problem.
Interaction points:- Interaction points are those points where processes communicate with
other processes. These interaction points divide a process into stages.
1. Synchronized algorithm:- Parallel algorithms in which some processes have to wait for other
processes are called synchronized algorithms. In this type of algorithm, there
exists a process such that some stage of the process is not activated until
another process has completed a certain stage of its program. The execution time of a process is a variable
which depends on input data and system interruptions.
2. Asynchronous algorithm:- In an asynchronous algorithm, processes are not required to wait for
each other, and communication is achieved by reading dynamically updated global variables
stored in a shared memory. There is a set of global variables accessible to all the processes.
When a stage of a process is completed, based on the values of the variables read together with
the results just obtained from the last stage, the process modifies a subset of the global variables
and then activates the next stage or terminates itself. In some cases, operations on global
variables are programmed as critical sections.
The main characteristic of an asynchronous parallel algorithm is that its processes never wait
for inputs at any time but continue execution or terminate according to whatever information
is currently contained in the global variables.
Alternative approach
Macro pipelining:- It is applicable if the computation can be divided into parts, called stages,
so that the output of one or several collected parts is the input for another part. In this case, as
each computation part is realized as a separate process, communication cost may be high.
Static decomposition:- In static decomposition, the set of processes and their precedence relations are known before
execution.