Parallel and Distributed Computing
Chapter 3: Models of Parallel Computers and Interconnections
Jun Zhang
Laboratory for High Performance Computing & Computer Simulation
Department of Computer Science
University of Kentucky
Lexington, KY 40506
3.1a: Architecture of a Theoretical Parallel Computer
The Parallel Random Access Machine (PRAM) is a theoretical model of a parallel computer, with:
1.) p identical processors
2.) a global memory of unbounded size
3.) memory is uniformly accessible to all processors
Processors share a common clock
They may execute different instructions in each cycle
There are four subclasses of PRAM, based on the
memory access protocols
3.1b: Illustration of the PRAM Model
3.1c: PRAM Subclasses
Exclusive-read, exclusive-write (EREW) PRAM: Access to a memory location is exclusive; no concurrent read or write operations are allowed
The weakest PRAM model, affording minimum concurrency in memory access
Concurrent-read, exclusive-write (CREW) PRAM: Multiple read accesses to a memory location are allowed; multiple write accesses to a memory location are serialized
Exclusive-read, concurrent-write (ERCW) PRAM: Multiple write accesses to a memory location are allowed; multiple read accesses are serialized
Concurrent-read, concurrent-write (CRCW) PRAM: Both
multiple read and multiple write accesses to a memory location
are allowed
This is the most powerful PRAM model
3.1d: PRAM Semantics
Concurrent read access to a memory location by all processors is well defined
Concurrent write access to a memory location creates a semantic ambiguity and requires arbitration
The most frequently used arbitration protocols are:
Common: Concurrent write is allowed if all the writing
processors have the same value
Arbitrary: An arbitrary processor is allowed to
proceed with the write operation, and the rest fail
Priority: Processors are prioritized a priori, the
processor with the highest priority writes and others
fail
Sum: The sum of all the quantities is written
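The four protocols can be illustrated with a minimal Python sketch (the function and its interface are hypothetical, not from the slides):

def crcw_write(values, protocol, priorities=None):
    # Resolve a set of concurrent writes to one memory location.
    # values: the values the writing processors attempt to store
    # priorities: per-processor priorities (lower number = higher priority)
    if protocol == "common":
        if len(set(values)) != 1:          # writers must all agree on the value
            raise ValueError("common protocol: writers disagree")
        return values[0]
    if protocol == "arbitrary":
        return values[0]                   # any one writer may win; pick the first
    if protocol == "priority":
        winner = min(range(len(values)), key=lambda i: priorities[i])
        return values[winner]
    if protocol == "sum":
        return sum(values)                 # store the sum of all written quantities
    raise ValueError("unknown protocol")

# Example: three processors concurrently write 3, 5, and 2
print(crcw_write([3, 5, 2], "sum"))                             # 10
print(crcw_write([3, 5, 2], "priority", priorities=[2, 0, 1]))  # 5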
3.2: Processor Granularity
Coarse-grained: Few powerful processors
Fine-grained: Many less powerful processors
Medium-grained: between the above two
The granularity definition is relative
Granularity can also be defined by the relative rate of communication to computation:
Fine-grained: short stretches of computation between communication steps
Coarse-grained: long stretches of computation between communication steps
3.3a: Interconnection Networks
Static networks: Processing nodes are
connected by point-to-point communication
links (direct networks)
Mostly used for message-passing computers
Dynamic networks: Communication links
are connected to one another dynamically by
the switches to establish paths among
processing nodes and memory banks
(indirect networks)
Mostly used for shared memory computers
3.3c: Examples of Static Interconnections
Figures: a fully-connected network and a 2D mesh with wraparound
3.3d: Switch Board Connections
3.3e: Dynamic Interconnection
3.3f: Multistage Dynamic Interconnection
3.3g: Switch Functionalities
A single switch has a set of input ports and a set of
output ports
The switch functionalities include:
1) a mapping from input to output ports (basic)
2) additional support for
internal buffering (when the requested output port
is busy)
routing (to alleviate network congestion), and
multicasting (same output on multiple ports)
The degree of a switch is the total number of ports
3.3h: Cost of a Switch
The cost of a switch is influenced by the cost
of mapping hardware, the peripheral
hardware, and packaging costs
The mapping hardware typically grows as the square of the degree of the switch
The peripheral hardware grows linearly with the degree
The packaging costs grow linearly with the number of pins
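As an illustrative calculation (numbers assumed, not from the slides): a degree-8 switch needs on the order of 8^2 = 64 crosspoints of mapping hardware but only 8 units of peripheral hardware; doubling the degree to 16 roughly quadruples the mapping cost (16^2 = 256) while only doubling the peripheral cost.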
3.3i: Network Interface (Static)
The network interface handles the connectivity between the node and the network
Network interface has input and output ports that
pipe data into and out of the network
Its functionalities include:
1) packetizing data
2) computing routing information
3) buffering incoming and outgoing data
4) error checking
3.3j: Approximation of Network Costs
For dynamic interconnection networks, the cost is proportional to the number of switches used in the network
For static interconnection networks, the cost is proportional to the number of links
3.4a: Network Topologies
Multiple processors work together to solve a common task
They must communicate during the course of
solving the task
The communication is provided by the
interconnection networks
How to connect the processors in a parallel system is a trade-off among cost, scalability, and performance
3.4b: Bus-Based Networks
A bus-based network consists of a shared medium
that connects all nodes
The cost of a bus-based network scales linearly with
respect to the number of processors p
The distance between any two nodes in the network
is constant O(1)
Ideal for broadcasting information
Disadvantage: bounded bandwidth & blocking
Performance is not scalable with respect to the
number of processors p
3.6b: Bus-Based Interconnect with Cache
3.6c: Crossbar Network
A crossbar network uses a grid of switches or
switching nodes to connect p processors to b
memory banks
It is a non-blocking network
The total number of switching nodes is Θ(pb)
In many cases b is at least on the order of p, so the complexity of the crossbar network is Θ(p²)
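For a sense of scale (illustrative numbers, not from the slides): with p = b = 1024, a crossbar requires on the order of pb = 2^20, about one million, switching nodes.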
Disadvantage: Switch complexity is difficult to realize
at high data rates
Scalable in terms of performance, but not scalable
in terms of cost
3.6d: Crossbar Network (I)
3.6e: Crossbar Network (II)
3.6f: Multistage Networks
Multistage networks balance the scalability trade-off between performance and cost
Multiple stages of switches are placed between the processors and the memory banks
It is more scalable than bus-based networks in
terms of performance
It is more scalable than the crossbar networks in
terms of costs
A special multistage interconnection network is the
omega network
3.6g: Multistage Interconnection (I)
3.6h: Multistage Interconnection (II)
3.6i: Omega Network
Exactly log p stages
There are p processors in the network
There are p memory banks
Each stage of the omega network consists of
an interconnection pattern that connects p
inputs and p outputs
A link connects input i to output j based on the perfect shuffle: a left rotation of the binary representation of i yields j
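A minimal Python sketch of this mapping (name and interface assumed, not from the slides):

def perfect_shuffle(i, log_p):
    # Left-rotate the log_p-bit binary representation of i by one position.
    msb = (i >> (log_p - 1)) & 1               # the bit that wraps around
    return ((i << 1) | msb) & ((1 << log_p) - 1)

# Example with p = 8 (log p = 3): input 3 = 011 maps to output 6 = 110
for i in range(8):
    print(f"{i:03b} -> {perfect_shuffle(i, 3):03b}")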
3.6j: Illustration of Omega Network
3.6k: Perfect Shuffle
3.6l: Four-Stage Omega Network
3.6m: Two Connection Modes
At each stage, a perfect shuffle
interconnection pattern feeds into a set of p/2
switches
Each switch is in one of the two connection
modes:
Pass-through: the inputs are sent straight
through to the outputs
Cross-over: the inputs to the switching
nodes are crossed over and sent out
3.6n: Switch (Pass-through and Cross-over)
3.6o: Cost and Communication of Omega Network
The cost of an omega network is Θ(p log p)
Data routing scheme in an omega network:
1) Let s be the binary representation of the source processor and t the binary representation of the destination memory bank
2) At each stage, if the corresponding bits of s and t are the same, the data is routed in pass-through mode by the switch; otherwise, in cross-over mode
3) Traversing log p stages uses all log p bits in the binary representations of s and t, most significant bit first
This is a blocking network
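A minimal Python sketch of this routing scheme (name and interface assumed, not from the slides):

def omega_route(s, t, log_p):
    # Switch setting at each of the log p stages, most significant bit first.
    settings = []
    for k in range(log_p - 1, -1, -1):
        same = ((s >> k) & 1) == ((t >> k) & 1)
        settings.append("pass-through" if same else "cross-over")
    return settings

# Example with p = 8: processor 2 (010) to memory bank 6 (110)
print(omega_route(2, 6, 3))   # ['cross-over', 'pass-through', 'pass-through']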
3.7: Completely-Connected Network
Each node has a direct communication link to every other node in the network (non-blocking)
How many communication links are needed? (see the worked count below)
Scalable in terms of performance, not scalable in terms of cost
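A quick count, which follows from the definition: each of the p nodes links to the other p − 1 nodes, and each link is shared by two nodes, so the network needs p(p − 1)/2 links; for example, p = 8 requires 28 links.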
3.8: Star-Connected Network
One processor acts as the central processor; every other processor has a communication link with it
Congestion may happen at the central processor
This is a blocking network
Scalable in terms of cost, not scalable in terms of performance
3.9: Linear Array and Ring Networks
Scalable in terms of cost, not scalable in terms of performance
3.10: 2D Mesh Networks
2D mesh network
2D mesh network with
wraparound
3.11: 3D Mesh Network
Many physical simulations can be mapped naturally to a
3D network topology. 3D mesh interconnection is common
3.12a: Hypercube Networks
3.12b: Labeling Hypercube Networks
The minimum distance between two processors is the number of binary digits that differ in their labels: the Hamming distance
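A minimal Python sketch (name and interface assumed, not from the slides):

def hypercube_distance(a, b):
    # Hamming distance: the number of bit positions where the labels differ.
    return bin(a ^ b).count("1")

# Example in a 4D hypercube: nodes 0101 and 0011 differ in two bits
print(hypercube_distance(0b0101, 0b0011))   # 2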
3.12c: 4D Hypercube
3.12d: Partition Hypercube
A d-dimensional hypercube can be partitioned into two (d−1)-dimensional hypercubes
There are d different partitions
Each partition fixes the value of one bit position in the node labels
Hypercubes have many theoretical properties
that can be utilized to develop communication
algorithms
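A minimal Python sketch of such a partition (name and interface assumed, not from the slides):

def partition(d, bit):
    # Split the 2**d node labels into two (d-1)-cubes by the value of one bit.
    zero = [n for n in range(2 ** d) if not (n >> bit) & 1]
    one = [n for n in range(2 ** d) if (n >> bit) & 1]
    return zero, one

# Example: splitting a 3D cube on its most significant bit gives two 2D cubes
print(partition(3, 2))   # ([0, 1, 2, 3], [4, 5, 6, 7])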
3.13a: Tree-Based Networks
There is only one path between any pair of
nodes
Linear array and star-connected networks are
special cases of tree networks
Tree networks can be static or dynamic
In a dynamic tree network, the intermediate-level nodes are switching nodes and the leaf nodes are processing elements
3.13b: What is This Tree Network?
3.13c: Static and Dynamic Tree Networks
3.13d: Communication in Tree Networks
Messages from one half of the tree to the other half are routed through the top-level nodes
Communication bottleneck forms at higher
levels of the trees
The solution is to increase the number of
communication links and switching nodes at
the higher levels
The fat tree is suitable for dynamic networks
3.13e: Fat Tree Network
3.14a: Evaluating Static Interconnection Networks
There are several criteria to characterize the
cost and performance of static
interconnection networks
Diameter
Connectivity
Bisection Width
Bisection Bandwidth
Cost
3.14b: Diameter of a Network
The diameter of a network is the maximum distance between any two processing nodes in the network
The distance between two processing nodes is defined as the length of the shortest path between them
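A minimal Python sketch that measures the diameter from an adjacency list by breadth-first search (representation assumed, not from the slides):

from collections import deque

def diameter(adj):
    # For each node, find the farthest BFS distance; take the maximum.
    def farthest(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(farthest(u) for u in adj)

# Example: a 4-node ring has diameter floor(4/2) = 2
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(diameter(ring))   # 2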
3.14c: Diameters of Mesh Networks
3.14d: Diameters of Hypercube and Tree
3.15a: Connectivity of Networks
The connectivity of a network is a measure
of the multiplicity of paths between any two
processing nodes
The arc connectivity is the minimum number
of arcs that must be removed from the
network to break it into two disconnected
networks
A network with high connectivity has lower contention for communication resources
3.15b: Connectivity of Mesh Array
3.15c: Connectivity of Hypercube & Tree
3.16a: Bisection Width & Channel Width
The bisection width is the minimum number
of communication links that must be removed
to partition the network into two equal halves
The channel width is the number of bits that
can be communicated simultaneously over a
link connecting two nodes
Channel width is equal to the number of
physical wires in each communication link
3.16b: Channel Rate & Channel Bandwidth
The peak rate at which a single physical wire can deliver bits is called the channel rate
The channel bandwidth is the peak rate at which
data can be communicated between the ends of a
communication link
Channel bandwidth is the product of channel rate
and channel width
The bisection bandwidth is the minimum volume of
communication allowed between any two halves of
the network
It is the product of bisection width and channel
bandwidth
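A worked example with illustrative numbers (assumed, not from the slides): if each wire delivers 1 Gbit/s (channel rate) and a link has 32 wires (channel width), the channel bandwidth is 32 Gbit/s; if the bisection width is 4 links, the bisection bandwidth is 4 × 32 = 128 Gbit/s.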
3.16c: Characteristics of Static Networks
Network           Diameter          Bisection Width   Arc Connectivity   Number of Links
Fully connected   1                 p²/4              p − 1              p(p − 1)/2
Star              2                 1                 1                  p − 1
Binary tree       2 log((p+1)/2)    1                 1                  p − 1
Linear array      p − 1             1                 1                  p − 1
Ring              ⌊p/2⌋             2                 2                  p
2D mesh           2(√p − 1)         √p                2                  2(p − √p)
2D mesh, wrap     2⌊√p/2⌋           2√p               4                  2p
Hypercube         log p             p/2               log p              (p log p)/2
3.17: Cost of Static Interconnection Networks
The cost of a static network can be defined in
proportion to the number of communication links or
the number of wires required by the network
Another measure is the bisection bandwidth of a network, which provides a lower bound on the area of a 2D packaging or the volume of a 3D packaging
These definitions are given in terms of orders of magnitude
Completely connected and hypercube networks are
more expensive than others
3.18: Evaluating Dynamic Interconnection Networks
We need to consider both the processing nodes and the switching units
Criteria similar to those used with the static
interconnection networks can be defined
The diameter is defined as the maximum distance
between any two nodes in the network
The connectivity is the minimum number of nodes (or edges) that must fail to break the network into disconnected parts
The cost of a dynamic network is determined by the
number of switching nodes in the network