Parallel Computation Models
Lecture 2
Slide 1
Parallel Computation Models
• PRAM (parallel RAM)
• Fixed Interconnection Network
– bus, ring, mesh, hypercube, shuffle-exchange
• Boolean Circuits
• Combinatorial Circuits
• BSP (Bulk Synchronous Parallel)
• LogP (L: upper bound on the latency, O: overhead, G: gap, P: number of processors)
Slide 2
PARALLEL AND
DISTRIBUTED COMPUTATION
• MANY INTERCONNECTED PROCESSORS WORKING CONCURRENTLY
[Figure: processors P1, P2, ..., Pn connected through an INTERCONNECTION NETWORK]
• CONNECTION MACHINE
• INTERNET: connects all the computers of the world
Slide 3
TYPES OF MULTIPROCESSING FRAMEWORKS
• PARALLEL
• DISTRIBUTED
TECHNICAL ASPECTS
• PARALLEL COMPUTERS (USUALLY) WORK IN TIGHT SYNCHRONY, SHARE MEMORY TO A LARGE EXTENT, AND HAVE A VERY FAST AND RELIABLE COMMUNICATION MECHANISM BETWEEN THEM.
• DISTRIBUTED COMPUTERS ARE MORE INDEPENDENT: COMMUNICATION IS LESS FREQUENT AND LESS SYNCHRONOUS, AND COOPERATION IS LIMITED.
PURPOSES
• PARALLEL COMPUTERS COOPERATE TO SOLVE DIFFICULT PROBLEMS MORE EFFICIENTLY (POSSIBLY).
• DISTRIBUTED COMPUTERS HAVE INDIVIDUAL GOALS AND PRIVATE ACTIVITIES; SOMETIMES COMMUNICATION WITH OTHERS IS NEEDED (E.G. DISTRIBUTED DATABASE OPERATIONS).
PARALLEL COMPUTERS: COOPERATION IN A POSITIVE SENSE
DISTRIBUTED COMPUTERS: COOPERATION IN A NEGATIVE SENSE, ONLY WHEN IT IS NECESSARY
Slide 4
FOR PARALLEL SYSTEMS
WE ARE INTERESTED IN SOLVING ANY PROBLEM IN PARALLEL
FOR DISTRIBUTED SYSTEMS
WE ARE INTERESTED IN SOLVING ONLY PARTICULAR PROBLEMS IN PARALLEL; TYPICAL EXAMPLES ARE:
• COMMUNICATION SERVICES
  ROUTING
  BROADCASTING
• MAINTENANCE OF CONTROL STRUCTURES
  SPANNING TREE CONSTRUCTION
  TOPOLOGY UPDATE
  LEADER ELECTION
• RESOURCE CONTROL ACTIVITIES
  LOAD BALANCING
  MANAGING GLOBAL DIRECTORIES
Slide 5
PARALLEL ALGORITHMS
• WHICH MODEL OF COMPUTATION IS THE BEST TO USE?
• HOW MUCH TIME DO WE EXPECT TO SAVE USING A PARALLEL ALGORITHM?
• HOW DO WE CONSTRUCT EFFICIENT ALGORITHMS?
MANY CONCEPTS OF COMPLEXITY THEORY MUST BE REVISITED
• IS PARALLELISM A SOLUTION FOR HARD PROBLEMS?
• ARE THERE PROBLEMS NOT ADMITTING AN EFFICIENT PARALLEL SOLUTION, THAT IS, INHERENTLY SEQUENTIAL PROBLEMS?
Slide 6
We need a model of computation
• NETWORK (VLSI) MODEL
• The processors are connected by a network of bounded degree.
• No shared memory is available.
• Several interconnection topologies.
• Synchronous way of operating.
MESH-CONNECTED ARRAY (N processors): degree = 4, diameter = 2√N
Slide 7
HYPERCUBE
[Figure: 4-dimensional hypercube with N = 2^4 = 16 processors, nodes labeled 0000 through 1111]
degree = 4 (log2 N)
diameter = 4
N = 2^4 PROCESSORS
Slide 8
Other important topologies
• binary trees
• mesh of trees
• cube-connected cycles
In the network model a PARALLEL MACHINE is a very complex ensemble of small interconnected units, performing elementary operations.
- Each processor has its own memory.
- Processors work synchronously.
LIMITS OF THE MODEL
• different topologies require different algorithms to solve the same problem
• it is difficult to describe and analyse algorithms (the migration of data has to be described)
A shared-memory model is more suitable from an algorithmic point of view
Slide 9
Model Equivalence
• given two models M1 and M2, and a problem Π of size n
• if M1 and M2 are equivalent, then solving Π requires:
  – T(n) time and P(n) processors on M1
  – T(n)^O(1) time and P(n)^O(1) processors on M2
Slide 10
PRAM
• Parallel Random Access Machine
• Shared-memory multiprocessor
• unlimited number of processors, each
  – has unlimited local memory
  – knows its ID
  – able to access the shared memory
• unlimited shared memory
Slide 11
MODEL: PRAM
[Figure: processors P1, P2, ..., Pi, ..., Pn connected to a Common Memory of cells 1, 2, 3, ..., m]
PRAM: n RAM processors connected to a common memory of m cells.
ASSUMPTION: at each time unit each Pi can read a memory cell, make an internal computation, and write another memory cell.
CONSEQUENCE: any pair of processors Pi, Pj can communicate in constant time!
Pi writes the message in cell x at time t
Pj reads the message from cell x at time t+1
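A minimal sketch of this constant-time communication, using ordinary Python as a stand-in for the PRAM instruction set (names are illustrative, not from the slides):
```python
# Two "processors" communicating in O(1) through a shared cell x.
shared = [None] * 8            # common memory with m = 8 cells

def Pi_write(x, message):      # time t:   Pi writes cell x
    shared[x] = message

def Pj_read(x):                # time t+1: Pj reads cell x
    return shared[x]

Pi_write(3, "token")
print(Pj_read(3))              # -> "token": communication in constant time
```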
Slide 12
PRAM
• Inputs/Outputs are placed in the shared memory (designated address)
• Memory cell stores an arbitrarily large integer
• Each instruction takes unit time
• Instructions are synchronized across the processors
Slide 13
PRAM Instruction Set
• accumulator architecture
– memory cell R0accumulates results
• multiply/divide instructions take only
constant operands
– prevents generating exponentially large
numbers in polynomial time
Slide 14
PRAM Complexity Measures
• for each individual processor
– time: number of instructions executed
– space: number of memory cells accessed
• PRAM machine
– time: time taken by the longest running processor
– hardware: maximum number of active processors
Slide 15
Two Technical Issues for PRAM
• How processors are activated
• How shared memory is accessed
Slide 16
Processor Activation
• P0 places the number of processors (p) in the
designated shared-memory cell
– each active Pi, where i < p, starts executing
– O(1) time to activate
– all processors halt when P0 halts
• Active processors explicitly activate additional processors via FORK instructions (see the sketch below)
  – tree-like activation
  – O(log p) time to activate
Slide 17
THE PRAM IS A THEORETICAL (UNFEASIBLE) MODEL
• The interconnection network between processors and memory would require a very large amount of area.
• The message-routing on the interconnection network would require time proportional to network size (i.e. the assumption of constant access time to the memory is not realistic).
WHY IS THE PRAM A REFERENCE MODEL?
• Algorithm designers can forget the communication problems and focus their attention on the parallel computation only.
• There exist algorithms simulating any PRAM algorithm on bounded-degree networks.
E.G. A PRAM algorithm requiring time T(n) can be simulated on a mesh of trees in time T(n)·log²n / log log n, that is, each step can be simulated with a slow-down of log²n / log log n.
• Instead of designing ad hoc algorithms for bounded-degree networks, design more general algorithms for the PRAM model and simulate them on a feasible network.
Slide 18
• For the PRAM model there exists a well developed body of techniques and methods to handle different classes of computational problems.
• The discussion on parallel models of computation is still HOT.
The actual trend: COARSE-GRAINED MODELS
• The degree of parallelism allowed is independent of the number of processors.
• The computation is divided into supersteps; each one includes (see the sketch below):
  • local computation
  • communication phase
  • synchronization phase
The study is still at the beginning!
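A sketch of one superstep under these assumptions (illustrative Python; the thread-based simulation and all names are assumptions, not the slides' notation):
```python
# One BSP-style superstep: local computation, then communication,
# then a barrier synchronization before the next superstep may begin.
import threading

P = 4
barrier = threading.Barrier(P)
inbox = [[] for _ in range(P)]           # messages visible next superstep

def superstep(pid):
    result = pid * pid                   # 1. local computation
    inbox[(pid + 1) % P].append(result)  # 2. communication phase
    barrier.wait()                       # 3. synchronization phase

threads = [threading.Thread(target=superstep, args=(i,)) for i in range(P)]
for t in threads: t.start()
for t in threads: t.join()
print(inbox)   # -> [[9], [0], [1], [4]]: every message delivered
```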
Slide 19
Metrics
A measure of relative performance between a multiprocessor system and a single-processor system is the speed-up S(p), defined as follows:
S(p) = (execution time using a single processor) / (execution time using a multiprocessor with p processors)
S(p) = T1 / Tp
Efficiency: Ep = Sp / p
Cost: Cp = p × Tp
Slide 20
Metrics
• A parallel algorithm is cost-optimal when parallel cost = sequential time:
  Cp = T1 (i.e. Ep = 100%)
• Critical when down-scaling: a parallel implementation may become slower than the sequential one.
Example: T1 = n^3; Tp = n^2.5 when p = n^2; hence Cp = p × Tp = n^4.5 > T1, so the algorithm is not cost-optimal.
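A quick numeric check of the example above (n chosen arbitrarily for illustration):
```python
# Metrics for the example T1 = n^3, Tp = n^2.5 with p = n^2 processors.
n = 100
T1 = n**3            # sequential time
p  = n**2            # number of processors
Tp = n**2.5          # parallel time
Sp = T1 / Tp         # speed-up   = sqrt(n)  -> 10.0
Ep = Sp / p          # efficiency = n^-1.5   -> 0.001, far from 100%
Cp = p * Tp          # cost       = n^4.5
print(Sp, Ep, Cp > T1)   # -> 10.0 0.001 True: not cost-optimal
```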
Slide 21
Amdahl’s Law
• f = fraction of the problem that’s inherently sequential
  (1 − f) = fraction that’s parallel
• Parallel time (normalizing T1 = 1):
  Tp = f + (1 − f)/p
• Speedup with p processors:
  Sp = 1 / (f + (1 − f)/p)
Slide 22
Amdahl’s Law
• Upper bound on speedup (p = ∞):
  S∞ = 1/f   (as p → ∞, the term (1 − f)/p converges to 0)
• Example: f = 2%
  S∞ = 1 / 0.02 = 50
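A numeric illustration of the bound: with f = 2% the speed-up approaches, but never exceeds, 1/f = 50 as p grows.
```python
# Amdahl's law: speed-up with sequential fraction f and p processors.
def amdahl_speedup(f, p):
    return 1.0 / (f + (1.0 - f) / p)

for p in (10, 100, 1000, 10**6):
    print(p, round(amdahl_speedup(0.02, p), 1))
# -> 10 8.5 / 100 33.6 / 1000 47.7 / 1000000 50.0
```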
Slide 23
PRAM
• Too many interconnections cause problems with synchronization
• However, it is the best conceptual model for designing efficient parallel algorithms
  – due to its simplicity and the possibility of efficiently simulating PRAM algorithms on more realistic parallel architectures
Slide 24
Shared-Memory Access
Concurrent (C): many processors can perform the operation simultaneously on the same memory location
Exclusive (E): not concurrent
• EREW (Exclusive Read Exclusive Write)
• CREW (Concurrent Read Exclusive Write)
  – Many processors can read the same location simultaneously, but only one may attempt to write to a given location
• ERCW (Exclusive Read Concurrent Write)
• CRCW (Concurrent Read Concurrent Write)
  – Many processors can read from and write to the same memory location
Slide 25
Example CRCW-PRAM
• Initially
– table A contains values 0 and 1
– output contains value 0
• The program computes the “Boolean OR” of
A[1], A[2], A[3], A[4], A[5]
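The program itself is not reproduced on this slide; the sketch below assumes the common CRCW idiom in which every processor holding a 1 concurrently writes 1 into the output cell (all writers store the same value, so the concurrent write is well defined and the OR takes O(1) time):
```python
# Hedged reconstruction of the CRCW "Boolean OR" program.
A = [0, 1, 0, 1, 0]        # A[1..5] from the slides (0-indexed here)
output = 0                 # designated shared cell, initially 0

# One "processor" per input element; the loop is a sequential stand-in
# for a single parallel time step in which all processors act at once.
for i in range(len(A)):
    if A[i] == 1:
        output = 1         # concurrent writes all store the same value 1

print(output)              # -> 1, the OR of A, computed in O(1) steps
```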
Slide 26
Example CREW-PRAM
• Assume initially table A contains [0,0,0,0,0,1] and we have the parallel program (not reproduced here; a sketch follows below)
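A hedged reconstruction, assuming the program computes the same Boolean OR as the CRCW example: without concurrent writes, the standard CREW approach combines pairs in a balanced tree, taking O(log n) steps instead of O(1).
```python
# CREW-style OR reduction: each active "processor" reads two distinct
# cells and writes one distinct cell, so no write conflicts occur.
import math

A = [0, 0, 0, 0, 0, 1, 0, 0]   # the slides' table, padded to a power of two

n = len(A)
for step in range(int(math.log2(n))):
    stride = 2 ** (step + 1)
    # on a real CREW PRAM these iterations run in parallel within one step
    for i in range(0, n, stride):
        A[i] = A[i] | A[i + stride // 2]

print(A[0])   # -> 1, the OR of the original table, after log2(n) steps
```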
Slide 27
Pascal triangle
PRAM CREW (sketch below)
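The slides give no code here; a hedged sketch of the idea: to build each new row, the processor for entry j reads two adjacent entries of the previous row, so neighbouring processors read the same cell concurrently (Concurrent Read) while each writes only its own cell (Exclusive Write).
```python
# Pascal's triangle on a CREW PRAM: each round computes one new row.
def pascal_row(n):
    row = [1]
    for _ in range(n):
        # the comprehension stands in for one parallel step: all entries
        # of the next row are computed simultaneously from shared reads
        row = [1] + [row[j - 1] + row[j] for j in range(1, len(row))] + [1]
    return row

print(pascal_row(4))   # -> [1, 4, 6, 4, 1]
```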
Slide 28