KEMBAR78
LECTURE 3 - Parallel Computing Platforms (PART 2) | PDF | Computer Network | Network Topology
0% found this document useful (0 votes)
36 views44 pages

LECTURE 3 - Parallel Computing Platforms (PART 2)

This lecture discusses the physical organization of parallel computing platforms, focusing on the architecture of the Parallel Random Access Machine (PRAM) and its subclasses based on memory access strategies. It also covers interconnection networks, their types, and topologies, including bus-based, crossbar, multistage, and tree-based networks, highlighting their scalability and performance trade-offs. Additionally, the document addresses communication costs in parallel machines, emphasizing the impact of network topology and message passing on performance.

Uploaded by

2024793147
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
36 views44 pages

LECTURE 3 - Parallel Computing Platforms (PART 2)

This lecture discusses the physical organization of parallel computing platforms, focusing on the architecture of the Parallel Random Access Machine (PRAM) and its subclasses based on memory access strategies. It also covers interconnection networks, their types, and topologies, including bus-based, crossbar, multistage, and tree-based networks, highlighting their scalability and performance trade-offs. Additionally, the document addresses communication costs in parallel machines, emphasizing the impact of network topology and message passing on performance.

Uploaded by

2024793147
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 44

CSC580

Parallel Processing
LECTURE 3:
Parallel Computing Platforms (PART 2)

PREPARED BY: SALIZA RAMLY


Topic Overview
This topic introduces the students:
◦ Physical Organization of Parallel Platforms
◦ Communication Costs in Parallel Machines

SALIZA RAMLY - CSC580


Physical
Organization of
Parallel Platforms

SALIZA RAMLY - CSC580


We begin this discussion with an
ideal parallel machine called
Parallel Random Access Machine
or
PRAM

SALIZA RAMLY - CSC580


ARCHITECTURE OF AN IDEAL PARALLEL
COMPUTER
o A natural extension of the Random Access Machine
(RAM) serial architecture is the Parallel Random
Access Machine, or PRAM.
o PRAMs consist of p processors and a global memory
of unbounded size that is uniformly accessible to all
processors.
o Processors share a common clock but may execute
different instructions in each cycle.
o A memory access unit (MAU) connects the processors
with the single shared memory.

SALIZA RAMLY - CSC580


ARCHITECTURE OF AN IDEAL PARALLEL
COMPUTER
Depending on how simultaneous memory accesses are handled, PRAMs can be divided into
four subclasses. (Accessing the same shared memory location simultaneously are resolved by one of
the following strategies)
Exclusive-read, exclusive-write • every memory cell can be read or written to by only one
(EREW) PRAM processor at a time

Concurrent-read, exclusive-write • multiple processors can read a memory cell but only one
(CREW) PRAM can write at a time

Exclusive-read, concurrent-write • multiple processors can write at memory cell but only one
(ERCW) PRAM can read at a time
Concurrent-read, concurrent-
• multiple processors can read and write. A CRCW PRAM is
write
sometimes called a concurrent random-access
(CRCW) PRAM
SALIZA RAMLY - CSC580
ARCHITECTURE OF AN IDEAL PARALLEL
COMPUTER
Allowing concurrent read access does not create any semantic discrepancies in the
program. However, concurrent write access to a memory location requires
arbitration. Several protocols are used to resolve concurrent writes. The most
frequently used protocols are as follows:
Common Arbitrary Priority Sum

Processor rank
All processors Only one arbitrary
indicates who gets
write the same attempt is Write the sum of
to write. Follow a
value; otherwise is successful, others all data items.
predetermined
illegal retire
priority order

SALIZA RAMLY - CSC580


PHYSICAL COMPLEXITY OF AN IDEAL
PARALLEL COMPUTER
o Processors and memories are connected via switches.
o Since these switches must operate in O(1) time at the level of words, for a
system of p processors and m words, the switch complexity is O(mp).
o For reasonable memory size, constructing a switching network of this
complexity is very expensive. Thus, PRAM models of computation are
impossible to realize in practice.

SALIZA RAMLY - CSC580


INTERCONNECTION NETWORKS FOR
PARALLEL COMPUTERS
o Interconnection networks carry data between processors and to memory.
o Interconnects are made of switches and links (wires, fiber).
o Interconnects are classified as static or dynamic:

Static • consist of point-to-point communication links among


networks processing nodes and are also referred to as direct networks.

Dynamic • are built using switches and communication links. Dynamic


networks networks are also referred to as indirect networks.

SALIZA RAMLY - CSC580


INTERCONNECTION NETWORKS
STATIC AND DYNAMIC
Classification of interconnection networks:

a static network a dynamic network

SALIZA RAMLY - CSC580


INTERCONNECTION NETWORKS:
NETWORK INTERFACES
o Processors talk to the network via a network interface.
o The network interface has input and output ports that pipe data into and out
of the network.
o The relative speeds of the I/O and memory buses impact the performance of
the network.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES
A variety of network topologies have been proposed and
implemented.

These topologies try to trade off cost and scalability with


performance.

Commercial machines often implement hybrids of multiple


topologies for reasons of packaging, cost, and available components.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
(1)Bus-Based Networks
o Some of the simplest and earliest parallel machines used buses.
o All processors access a common bus for exchanging data.
o The distance between any two nodes is O(1) in a bus. The bus also provides a
convenient broadcast media.
o However, the bandwidth of the shared bus is a major bottleneck.
o Typical bus based machines are limited to dozens of nodes. Sun Enterprise
servers and Intel Pentium based shared-bus multiprocessors are examples of
such architectures.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
(1)Bus-Based Networks
Bus-based interconnects (a) with no local caches; (b) with local memory/caches.

Since much of the data accessed by processors is local to the processor,


a local memory can improve the performance of bus-based machines.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
(2)Crossbar Networks
o A simple way to connect p processors to b memory banks is to use a crossbar
network.
o A crossbar network employs a grid of switches or switching nodes.
o The crossbar network is a non-blocking network in the sense that the
connection of a processing node to a memory bank does not block the
connection of any other processing nodes to other memory banks.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
(2)Crossbar Networks
o The total number of switching nodes required
to implement such a network is Q(pb).
o It is reasonable to assume that the number of
memory banks b is at least p; otherwise, at any
given time, there will be some processing nodes
that will be unable to access any memory
banks.
o As the number of processing nodes becomes
large, this switch complexity is difficult to
realize at high data rates.
o Consequently, crossbar networks are not very
scalable in terms of cost. A completely non-blocking crossbar network
connecting p processors to b memory banks.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
(2)Crossbar Networks
o The cost of a crossbar of p processors grows as O(p2).
o This is generally difficult to scale for large values of p.
o Examples of machines that employ crossbars include the Sun Ultra HPC 10000
and the Fujitsu VPP500.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
(3)Multistage Networks
o Crossbars have excellent performance scalability but poor cost scalability.
o Buses have excellent cost scalability, but poor performance scalability.
o Multistage interconnects strike a compromise between these extremes.
o It is more scalable than the bus in terms of performance and more scalable than
the crossbar in terms of cost.
o A multistage interconnect network is formed by cascading multiple single stage
switches.
o The switches can then use their own routing algorithm or controlled by a
centralized router, to form a completely interconnected network.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
(3)Multistage Networks
o Multistage Interconnect Network can be classified into three types:
• A non-blocking network can connect any idle input to any idle output,
Non-blocking: regardless of the connections already established across the network. Crossbar
is an example of this type of network.

• This type of network cannot realize all possible connections between inputs
Blocking: and outputs. This is because a connection between one free input to another
free output is blocked by an existing connection in network.

Rearrangeable • This type of network can establish all possible connections between inputs and
non-blocking: outputs by rearranging its existing connections.

o The number of switching elements required to realize a non-blocking network in highest,


followed by rearrangeable non-blocking. Blocking network uses least switching elements.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
(3)Multistage Networks

The schematic of a typical multistage interconnection network


SALIZA RAMLY - CSC580
NETWORK TOPOLOGIES:
Multistage Omega Networks
o One of the most commonly used multistage interconnects is the Omega
network.
o This network consists of log p stages, where p is the number of inputs/outputs.
o At each stage, input i is connected to output j if:

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
Multistage Omega Networks
Each stage of the Omega network implements a perfect shuffle as follows:

A perfect shuffle interconnection for eight inputs and outputs.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
Multistage Omega Networks
o The perfect shuffle patterns are connected using 2×2 switches.
o The switches operate in two modes – crossover or passthrough.

Two switching configurations of the 2 × 2 switch: (a) Pass-through; (b) Cross-over.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
Multistage Omega Networks
o A complete Omega network with the
perfect shuffle interconnects and
switches can now be illustrated.
o An omega network has p/2 × log p
switching nodes, and the cost of such a
network grows as (p log p).

A complete omega network connecting eight


inputs and eight outputs.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
Multistage Omega Networks - Routing
o Let s be the binary representation of the source and d be that of the
destination processor.
o The data traverses the link to the first switching node.
o If the most significant bits of s and d are the same, then the data is routed in pass-
through mode by the switch else, it switches to crossover.
o This process is repeated for each of the log p switching stages.
o Note that this is not a non-blocking switch.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
Multistage Omega Networks - Routing
An example of blocking in omega network: one of the messages (010 to 111 or
110 to 100) is blocked at link AB.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
(4)Completely-Connected Network
o Each processor is connected to every other processor.
o The number of links in the network scales as O(p2).
o While the performance scales very well, the hardware complexity is not
realizable for large values of p.
o In this sense, these networks are static counterparts of crossbars.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
Completely-Connected & Star Connected Network
o A completely-connected network of eight nodes; (b) a star connected network
of nine nodes.

(a) A completely-connected network of eight nodes (b) a star connected network of nine nodes.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
(5)Star Connected Network
o Every node is connected only to a common node at the center.
o Distance between any pair of nodes is O(1). However, the central node
becomes a bottleneck.
o In this sense, star connected networks are static counterparts of buses.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
(6)Linear Arrays, Meshes, and k-d Meshes
o In a linear array, each node has two neighbours, one to its left and one to its
right. If the nodes at either end are connected, we refer to it as a 1-D torus or
a ring.
o A generalization to 2 dimensions has nodes with 4 neighbors, to the north,
south, east, and west.
o A further generalization to d dimensions has nodes with 2d neighbors.
o A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where
p is the total number of nodes.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
Linear Arrays
Linear arrays:
o (a) with no wraparound links
o (b) with wraparound link.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
Two- And Three Dimensional Meshes
Two and three dimensional meshes:
o (a) 2-D mesh with no wraparound;
o (b) 2-D mesh with wraparound link (2-D torus)
o (c) a 3-D mesh with no wraparound.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
Hypercubes and Their Construction
Construction of hypercubes from hypercubes of lower dimension.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
Properties of Hypercubes
o The distance between any two nodes is at most log p.
o Each node has log p neighbors.
o The distance between two nodes is given by the number of bit positions at
which the two nodes differ.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
(7)Tree-Based Networks
Complete binary tree networks:
o (a) a static tree network
o (b) a dynamic tree network.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
Tree Properties
o The distance between any two nodes is no more than 2logp.
o Links higher up the tree potentially carry more traffic than those at the lower
levels.
o For this reason, a variant called a fat-tree, fattens the links as we go up the
tree.
o Trees can be laid out in 2D with no wire crossings. This is an attractive
property of trees.

SALIZA RAMLY - CSC580


NETWORK TOPOLOGIES:
Fat Tree
o A fat tree network of 16 processing nodes

SALIZA RAMLY - CSC580


EVALUATING INTERCONNECTION
NETWORKS
In order to evaluate the interconnection networks, four aspects are measured:
Diameter
• The maximum distance between any two processing nodes in the network.

Bisection Width
• The minimum number of wires you must cut to divide the network into two equal parts.

Connectivity
• A measure of the multiplicity of paths between any two processing nodes. High connectivity is
desirable because it lowers contention for communicating resources.

Cost
• The number of links or switches (whichever is higher) is a meaningful measure of the cost.
However, a number of other factors, such as the ability to layout the network, the length of
wires, etc., also factor in to the cost.
SALIZA RAMLY - CSC580
EVALUATING STATIC
INTERCONNECTION
NETWORKS
A SUMMARY OF THE
C H A R A C T E R I S T I C S O F VA R I O U S
S TAT I C N E T W O R K TO P O L O G I E S
CONNECTING P NODES.

SALIZA RAMLY - CSC580


Communication
Costs in Parallel
Machines

SALIZA RAMLY - CSC580


COMMUNICATION COSTS IN PARALLEL
MACHINES
o Communication is a major overhead in parallel programs.
o The cost of communication is depending on a variety of features including:
o the programming model semantics
o the network topology
o data handling and routing
o associated software protocols.
o Message passing requires little hardware support, other than a network.
o Shared address space platforms can easily follow message passing.

SALIZA RAMLY - CSC580


MESSAGE PASSING COSTS IN PARALLEL
COMPUTERS
The total time to transfer a message over a network comprises of the following:
o Startup time (ts): Time spent at sending and receiving nodes (executing the
routing algorithm, programming routers, etc.).
o Per-hop time (th): This time is a function of number of hops and includes
factors such as switch latencies, network delays, etc.
o Per-word transfer time (tw): This time includes all overheads that are
determined by the length of the message. This includes bandwidth of links,
error checking and correction, etc.

SALIZA RAMLY - CSC580


SHARED MEMORY/ADDRESS SPACE
COSTS IN PARALLEL COMPUTERS
o While the basic messaging cost applies to these machines as well, a number of
other factors make accurate cost modeling more difficult.
o Memory layout is typically determined by the system.
o Finite cache sizes can result in cache thrashing.
o Overheads associated with invalidate and update operations are difficult to
quantify.
o Spatial locality is difficult to model.
o Prefetching can play a role in reducing the overhead associated with data
access.
o False sharing and contention are difficult to model.

SALIZA RAMLY - CSC580


LECTURE4:
NEXT! PARALLEL
ALGORITHM DESIGN

SALIZA RAMLY - CSC580

You might also like