MODULE-6

1
MULTITHREADED AND
DATAFLOW ARCHITECTURE

2
INTRODUCTION
 In computer architecture, multithreading is the
ability of a central processing unit (CPU) (or a
single core in a multi-core processor) to execute
multiple processes or threads concurrently,
supported by the operating system

3
LATENCY HIDING TECHNIQUES
 In distributed shared memory machines, access to
remote memory is likely to be slow
 So architectures must rely on techniques to
reduce/hide remote-memory-access latencies.
 Latency hiding overlaps useful computation with the time spent on communication and remote memory accesses, so that this latency is hidden from the processor
 4 approaches for latency hiding are:-
 1. Using pre-fetching techniques
 2. Using coherent caching techniques
 3. Using multiple contexts
 4. Using relaxed memory consistency
4
1.PRE–FETCHING TECHNIQUES

 This technique reduces latency by bringing the instructions or data closer to the processor before they are actually needed
 It uses knowledge about the expected misses in the program
 Using this knowledge, it moves the corresponding data closer to the processor before it is actually needed

5
TECHNIQUES OF PRE-FETCHING

 Hardware controlled pre-fetching


 Software controlled pre-fetching

6
HARDWARE CONTROLLED PRE-FETCHING
 Hardware-controlled pre-fetching is done using 2
schemes
 using long cache lines
 Using instruction look-ahead

 Long cache lines introduce the problem of "false sharing"
 Instruction look-ahead is limited by
 Branch-prediction accuracy
 Finite look-ahead buffer size

7
HARDWARE CONTROLLED PRE-FETCHING [2]

8
FALSE SHARING : EXAMPLE

9
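The figure for this slide is not reproduced here. As a rough stand-in, the following C sketch (assuming POSIX threads and a 64-byte cache line, both assumptions) shows two counters that would normally fall into one long cache line and therefore falsely share it; the padding field pushes them into separate lines.

#include <pthread.h>

/* Two logically independent counters. Without the padding they would normally
 * sit in the same cache line, so writes by different threads keep invalidating
 * each other's copy even though no data is actually shared: "false sharing".
 * The 64-byte line size is an assumption. */
struct counters {
    long a;                       /* updated only by thread 1 */
    char pad[64 - sizeof(long)];  /* pushes b into its own cache line */
    long b;                       /* updated only by thread 2 */
};

static struct counters c;

static void *bump_a(void *arg) { (void)arg; for (long i = 0; i < 10000000; i++) c.a++; return 0; }
static void *bump_b(void *arg) { (void)arg; for (long i = 0; i < 10000000; i++) c.b++; return 0; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, 0, bump_a, 0);
    pthread_create(&t2, 0, bump_b, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}

Removing the pad field keeps the program correct but typically slows it down noticeably, which is exactly the coherence traffic that long cache lines provoke.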
SOFTWARE CONTROLLED PRE-FETCHING
 In this approach, explicit "pre-fetch" instructions
are issued for data that is "known" to be remote.
 Pre-fetching is done selectively

 Disadvantages
 Extra instruction overhead
 Need of sophisticated software intervention

10
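As an illustration that is not taken from the slides, compilers such as GCC and Clang expose explicit pre-fetch instructions through the __builtin_prefetch intrinsic. A minimal sketch of selective software pre-fetching in a loop (the pre-fetch distance DIST is an arbitrary assumption):

/* While processing element i, request element i + DIST so it is (hopefully)
 * resident in the cache by the time it is needed. The extra pre-fetch
 * instructions inside the loop are the overhead mentioned above. */
#define DIST 16

long sum(const long *data, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&data[i + DIST], 0, 1);  /* read, low temporal locality */
        s += data[i];
    }
    return s;
}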
ISSUE OF COHERENCE IN PRE-FETCHING
 We have to make decisions regarding what should
be done
 if a block is updated after it has been pre-fetched, but
before it has been used
 To address these issues of coherence while
performing pre-fetching, 2 approaches are used
 Binding pre-fetching
 Non binding pre-fetching

11
BINDING PRE-FETCH POLICY
 The fetch is assumed to have happened when the pre-fetch instruction is issued
 It is the responsibility of the "fetching process" to ensure that no other processor updates the pre-fetched value before it is actually used

12
NON - BINDING PRE-FETCH POLICY
 The cache coherence protocol will make sure to
invalidate a pre-fetched value if it is updated
prior to its use.
 The pre-fetched data remains visible to the cache
coherence protocol
 Data is kept consistent until the processor
actually reads the value

13
BENEFITS OF PRE-FETCHING
 Improves the performance

 Pre-fetching the cache block directly with ownership reduces write latencies and network traffic

14
2 . USING COHERENT CACHING TECHNIQUES
 Cache coherence is maintained using snoopy bus protocols in bus-based systems
 Coherence is maintained using directory-based protocols in n/w-based systems

15
BENEFITS OF CACHING
 Improves the performance
 Reduces the no: of cycles wasted due to read
misses
 Cycles wasted due to write misses are also
reduced

16
3 . MULTIPLE CONTEXT PROCESSORS
 A conventional single threaded processor will wait
during a remote reference
 So processor remains idle for a period of time L
 A multithreaded processor will suspend the
current context and switch to another context
 So for a fixed no: of cycles it will be again busy doing
useful work, even though remote reference is
outstanding
 The processor will be idle only if all contexts are
blocked

17
OBJECTIVE OF CONTEXT SWITCHING
 Maximize the fraction of time that the processor
is busy
 Efficiency of the processor is defined as:-

 Efficiency = busy / (busy + switching + idle)

Busy, switching and idle represent the amount of time the processor spends in the corresponding state
18
 The basic idea of a multithreaded machine is to
 Interleave the execution of several contexts in order to reduce the value of idle without increasing the magnitude of switching

19
STATES OF CONTEXT CYCLE
 A context cycles through the following states
 Ready
 Running
 Leaving
 Blocked

 A processor will be busy if there is a context in the running state
 If the processor is switching from one context to another, then that context is in the leaving state
 If the processor is idle, it implies that all contexts are blocked
20
PRINCIPLES OF MULTITHREADING

21
MULTITHREADING ISSUES AND SOLUTIONS
 Multithreading requires a processor to handle multiple contexts simultaneously on a context-switching basis
 Architecture environment
 Consider a multithreaded system modeled using a n/w of
processor & memory nodes
 Processor P
 memory M

 Distributed memory forms a global address space

22
23
PARAMETERS USED TO ANALYZE PERFORMANCE
 Latency (L)
 It involves the communication latency on a remote
memory access
 The value of L includes
 n/w delays
 Cache miss penalty

 Delays caused by contentions in split transactions

 No: of threads (N)


 It is the no: of threads that can be interleaved in each
processor
 A thread is represented by a context
 A context consists of
 PC
 Register set
 Required context status words
24


 Context switching overhead (C)
 This refers to the cycles lost in performing context switching in a processor
 This depends on
 Switching mechanism
 Amount of processor state devoted to maintaining active threads
 Interval between switches (R)
 This refers to the cycles between switches triggered by
remote reference
 The inverse p=1/R is called rate of requests for
remote accesses
 This reflects the combination of
 Program behavior

 Memory s/m design

25
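A minimal sketch of the per-context state implied by the parameters above; the field names, sizes and the choice of N = 4 resident contexts are illustrative assumptions, not figures from the slides.

/* State a multithreaded processor keeps resident per context so that a switch
 * costs only C cycles instead of a full software save/restore. */
typedef struct {
    unsigned long pc;        /* program counter                             */
    unsigned long regs[32];  /* register set                                */
    unsigned long psw;       /* required context status word(s)             */
    int           blocked;   /* waiting on an outstanding remote reference? */
} context_t;

typedef struct {
    context_t ctx[4];        /* N hardware contexts (N = 4 assumed)         */
    int       running;       /* index of the context currently executing    */
} processor_t;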
PROBLEMS OF ASYNCHRONY

 It is a problem found commonly in parallel processors
 Parallel processors operate asynchronously in a
n/w environment
 This lead to 2 latency problems
 Remote loads
 Synchronizing load

26
REMOTE LOADS
 Suppose that variables A and B are located on nodes N2 and N3
 They need to be brought to node N1 in order to compute A-B into variable C
 This computation demands 2 remote loads and a subtraction
 Remote load instruction: rload
 pA and pB are pointers to A and B
 CTXT is the context of the computation on N1
 Remote loads keep node N1 idle until it receives the variables A and B from N2 and N3
 The latency caused by a remote load depends on the architectural properties of the machine (a sketch follows below)
27
28
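The figure for the previous slide is not reproduced. Below is a hedged C-style sketch of the sequence described above; rload is written here as a hypothetical blocking primitive, so the comments show where node N1 sits idle.

/* Hypothetical remote-load primitive: fetch the word that p refers to from
 * its owner node, blocking the issuing context until the value arrives. */
extern long rload(const long *p);

long *pA;  /* points to A, which resides on node N2 */
long *pB;  /* points to B, which resides on node N3 */

long compute_C(void)           /* runs in context CTXT on node N1 */
{
    long a = rload(pA);        /* N1 idles for the round trip to N2 */
    long b = rload(pB);        /* N1 idles again for the round trip to N3 */
    return a - b;              /* two remote loads and a subtraction */
}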
SYNCHRONIZING LOADS
 In this case idling is caused because A and B are computed by concurrent processes
 It is not known when they will be ready for node N1 to read
 The ready signal may reach node N1 asynchronously
 This may result in busy-waiting
 The process repeatedly checks to see whether A and B are ready
 Latency caused by synchronization load depends on
 Scheduling
 Time taken to compute A & B
29
SOLUTIONS TO ASYNCHRONY PROBLEM
 2 solutions
 Multithreading solution
 Distributed caching

30
MULTITHREADING SOLUTION
 The solution is to multiplex among many threads
 When one thread issues a remote load request,
processor begins work on another thread and so on
 Cost of thread switching is smaller than the
latency of remote load
 Concerns of this approach
 Responses of remote loads may not return in proper order
 i.e. after issuing a remote load from thread T1, we switch to T2, which also issues a remote load
 Responses may not return in the same order because
 Requests may be travelling over different distances
 Varying degrees of congestion
 Loads on the destination nodes may differ greatly
31
32
 Solution to this concern
 Make sure that messages carry continuations
 Each remote load and response is associated with an
identifier for the appropriate thread
 Thread identifiers carried on messages are referred to as continuations

33
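A minimal sketch, with assumed field names, of how each remote-load request and its response could carry a continuation, i.e. the identifier of the thread to resume when the reply arrives.

/* Both message types carry the issuing thread's id, so replies can be matched
 * to the right thread even when they return out of order. */
typedef struct {
    int           src_node;     /* node that issued the rload             */
    int           thread_id;    /* continuation: which thread to resume   */
    unsigned long remote_addr;  /* address to read on the owner node      */
} rload_request;

typedef struct {
    int  thread_id;             /* copied back unchanged from the request */
    long value;                 /* the data that was read                 */
} rload_response;

On receipt of a response, the node uses thread_id to wake the thread that issued the request, regardless of arrival order.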
DISTRIBUTED CACHING

 This is a solution for the remote load problem


 Every memory location has an owner node
 N1 owns B
 N2 owns A
 Directories contain
 Import-export lists
 These state whether data is shared or exclusive
 Shared: many caches may hold copies (used for reads)
 Exclusive: one cache holds the current value (used for writes)

34
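A rough sketch of the per-block directory entry such an import-export list implies; the layout is an assumption.

/* Directory entry kept by the owner node for one memory block. */
typedef struct {
    enum { UNCACHED, SHARED, EXCLUSIVE } state;
    unsigned long sharers;   /* bit vector of caches holding a copy (SHARED);
                                one bit per node, up to 64 nodes assumed     */
    int           owner;     /* the single writer's node, valid in EXCLUSIVE */
} dir_entry;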
35
CONTEXT SWITCHING POLICIES
 Switch on cache miss
 Switch on every load

 Switch on every instruction

 Switch on block of instruction

40
SWITCH ON CACHE MISS
 In this case a context is switched when a cache
miss occurs
 R is the average interval b/w the misses

 L is the time required to satisfy the miss

 The processor switches contexts only when it is certain that the current one will be delayed for a significant no: of cycles

41
SWITCH ON EVERY LOAD
 This policy allows switching on every load,
independent of whether it will cause a miss or
not
 R represents average interval between the loads

42
SWITCH ON EVERY INSTRUCTION
 This policy allows switching on every instruction,
independent of whether it is load or not
 It interleaves the instructions from different
threads on a cycle by cycle basis
 If successive instructions are independent, then
it will benefit pipelined execution

43
SWITCH ON BLOCK OF INSTRUCTION
 Blocks of instructions from different threads are
interleaved
 Benefits single context performance

44
PROCESSOR EFFICIENCIES
 A single threaded processor executes a context until a remote reference is issued
 It then remains idle for the L cycles the reference takes to complete
 There is no context switch, hence no switch overhead
 Efficiency of a single threaded machine is:-
 E1 = R / (R + L) = 1 / (1 + L/R)
R is the amount of time during each cycle that the processor is busy
L is the amount of time during each cycle that the processor is idle
45
EFFICIENCY OF MULTITHREADED PROCESSOR
 Memory latency is hidden due to context
switching
 But there is switching overhead of C cycles

 Processor efficiency is analyzed under 2


situations
 Efficiency in saturation region
 Efficiency in linear region

46
SATURATION REGION
 In the saturation region, the processor operates with maximum utilization
 Esat = R / (R + C) = 1 / (1 + C/R)
 Efficiency in saturation is independent of the latency L
 It does not change with a further increase in the no: of contexts

47
48
LINEAR REGION
 When the no: of contexts is below the saturation
point, there may be no ready contexts after a context
switch
 So the processor will experience idle cycles
 The time required to switch to a ready context, execute it until a remote reference is issued and process the reference is equal to:-
 R + C + L
 Elin = NR / (R + C + L)
 N is below the saturation point
 Efficiency increases linearly with the no: of contexts until the saturation point is reached
 Beyond the saturation point it remains constant
49
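A small numeric check of the three efficiency formulas above, using illustrative values R = 16, L = 128 and C = 4 cycles (assumptions, not figures from the slides):

#include <stdio.h>

int main(void) {
    double R = 16.0, L = 128.0, C = 4.0;   /* run length, latency, switch cost */

    double E1   = R / (R + L);             /* single-threaded efficiency       */
    double Esat = R / (R + C);             /* saturated multithreaded limit    */
    printf("E1   = %.3f\n", E1);           /* about 0.111: mostly idle         */
    printf("Esat = %.3f\n", Esat);         /* about 0.800: limited only by C/R */

    /* Below saturation, efficiency grows linearly with the number of
     * contexts N until N*R cycles of useful work cover the R + C + L cycle. */
    for (int N = 1; N <= 10; N++) {
        double Elin = N * R / (R + C + L);
        printf("N = %2d  E = %.3f\n", N, Elin < Esat ? Elin : Esat);
    }
    return 0;
}

With these numbers the saturation point is reached at N = 8 contexts, after which efficiency stays at Esat.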
FINE GRAIN MULTICOMPUTERS

50
INTRODUCTION
 Parallelism can be classified into three categories
 1. Fine-grained
 2. Medium-grained
 3. Coarse-grained parallelism

 The classification is based on the amount of work performed by a parallel task
 Granularity of a task is the amount of
computations or work performed by a task
 Granularity can also be specified in terms of the
execution time of a program
 It is obtained by combining the computation time and
communication time
51
1. FINE GRAINED PARALLELISM
 In fine-grained parallelism, a program is broken down
to a set of small tasks.
 These tasks are assigned individually to many
processors.
 The amount of work associated with a parallel task is
low
 Each task processes less data
 No: of processors required to perform the complete
processing is high.
 This increases the communication and synchronization
overhead.
 Fine grained parallelism is utilized in following
parallel computers
 CM-2
 Message driven J-Machine
 Mosaic C
52
2. MEDIUM GRAINED PARALLELISM
 Medium-grained parallelism is a compromise
between fine-grained and coarse-grained
parallelism
 Task size and communication time are greater than in fine-grained parallelism and lower than in coarse-grained parallelism
 Eg: Intel iPSC/1

53
3. COARSE GRAINED PARALLELISM
 In coarse-grained parallelism, a program is split
into large tasks.
 Large amount of computation takes place in
processors.
 This results in load imbalance
 Certain tasks process bulk of the data while others
might be idle.
 Coarse-grained parallelism fails to exploit the
parallelism in the program
 The advantage of this type of parallelism is low
communication and synchronization overhead.
 Eg: Cray Y-MP
54
CHARACTERISTICS OF PARALLEL MACHINES
 Latency analysis
 Communication latency (Tc)
 This measures the data or message transfer time on the system interconnect
 In the Cray Y-MP, latency implies the shared memory access time
 In the CM-2 machine, latency implies the time required to send a 32-bit value across the hypercube n/w
 In the iPSC/1 and J-Machine, latency implies the latency in the n/w

55
 Synchronization overhead (Ts)
 It is the processing time required on a processor, PE, or
on a processing node of a multicomputer for the purpose
of synchronization
 Total time for IPC (inter-processor communication)
 It is the sum of Tc and Ts
 TIPC = Tc + Ts

56
COMPARISON OF CHARACTERISTICS

 The Cray Y-MP has short Tc but long Ts
 The CM-2 has short Ts but long Tc
 The iPSC/1 has longer delays
 Major improvements were made by the J-Machine
57
GRAIN SIZE IN FINE GRAINED PARALLELISM
Grain size (Tg)
 It is measured by the execution time of a typical
program
 It includes both computing time and
communication time
 Fine grain leads to higher degree of parallelism
and higher communication overhead

58
DATAFLOW AND HYBRID
ARCHITECTURE

68
INTRODUCTION

 Computers follow 2 types of program execution mechanisms
 Control flow mechanism
 Dataflow mechanism

69
CONTROL FLOW MECHANISM
 In control flow mechanism, the order of program
execution is explicitly stated in the user program
 i.e. the programs are executed in sequential order
 Computers use Program counter to sequence the
execution of instructions in the program
 This style of program execution is also called as
control driven
 Because the program flow is explicitly controlled by the programmer
 Eg: Conventional Von Neumann computers follow
this type of mechanism
70
DATAFLOW MECHANISM
 This mechanism allows execution of any
instruction, based on the data (operand)
availability
 Does not follow program order
 Computers that follow the dataflow mechanism are called dataflow computers
 This style of execution is also called data-driven execution
 Does not require PC or a control sequencer

71
DATAFLOW COMPUTERS
 The execution of instruction is driven by data
availability instead of the guidance of program
counter
 Instructions will be ready for execution whenever the
operands are available
 Instructions in a data-driven program are not ordered
 Data values are held directly inside the instructions
 They are not stored in memory
 Dataflow computers provide a high degree of parallelism

72
 Computational results (tokens) are passed directly
between the instructions
 Data generated by one instruction will be duplicated into
many copies & they are forwarded directly to all needy
instructions
 Tokens once consumed by the instruction will not be
available for other instructions

73
REQUIREMENTS FOR DATAFLOW COMPUTERS
 It requires special mechanisms to:-
 Detect data availability
 Match data tokens with needy instructions
 Enable chain reaction of asynchronous instruction
execution

74
DATAFLOW ARCHITECTURE

 The global architecture consists of n processing elements interconnected by an n x n routing n/w
 The entire s/m supports pipelined dataflow operations in all n PE's
 Inter-PE communications are done through the pipelined routing n/w

75
76
INTERIOR DESIGN OF PE
 Within each PE, there are following units
 Token match
 Program memory
 Compute tag
 Local path
 Routing n/w
 ALU
 I-structure

77
78
 Token match unit
 This unit provides a token matching mechanism
 It dispatches only those instructions to the ALU,
whose input data (tokens) are available
 Compute tag
 This unit provides a tagging mechanism
 Each data is tagged with the address of the
instruction to which it belongs
 Program memory
 Instructions are stored in the program memory
 Local path
 Tagged tokens enter into the PE through the local
path
 Routing network
 Tokens are passed to other PE's through the routing network
79
 I – structure
 It is a tagged memory unit for the overlapped usage
of a data structure by both the producer and
consumer processes
 Each word of I-structure uses a 2 bit tag
 This indicates whether the word is:-
 Empty

 Full

 Has pending read requests

80
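A minimal C sketch, with an assumed encoding, of one I-structure word and its 2-bit presence tag; reads that arrive before the producer has written the word are deferred rather than busy-waiting.

enum itag { EMPTY = 0, FULL = 1, PENDING_READS = 2 };  /* the 2-bit tag */

typedef struct deferred_read {
    int thread_id;                  /* continuation to send the value to */
    struct deferred_read *next;
} deferred_read;

typedef struct {
    enum itag      tag;             /* empty, full, or pending reads     */
    long           value;           /* valid only when tag == FULL       */
    deferred_read *waiters;         /* read requests that arrived early  */
} istructure_word;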
DATAFLOW & HYBRID ARCHITECTURE
 Multithreaded architectures can be designed with
pure dataflow approach or with a hybrid approach
 Hybrid approach
 Combining Von Neumann and data driven mechanism
 Dataflow graph
 These are used as the machine language for dataflow
architectures
 They specify only partial order for the execution of
instructions

81
DATAFLOW GRAPH FOR CALCULATION OF COS X
 cos x ≈ 1 - x^2/2! + x^4/4! - x^6/6!
       = 1 - x^2/2 + x^4/24 - x^6/720
 The data flow graph contains 9 operators/nodes
 Edges in the graph interconnect the nodes
 Successive powers of x are obtained by repeated multiplications
82
83
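For reference, a plain C version of the computation the 9-operator graph performs; on a dataflow machine the independent terms would fire as soon as their operands are available instead of in this sequential order.

/* Truncated series: cos x ~ 1 - x^2/2 + x^4/24 - x^6/720.
 * Written to mirror the graph: successive powers of x come from repeated
 * multiplication, and the three terms are mutually independent. */
double cos_approx(double x) {
    double x2 = x * x;       /* x^2 */
    double x4 = x2 * x2;     /* x^4 */
    double x6 = x4 * x2;     /* x^6 */
    return 1.0 - x2 / 2.0 + x4 / 24.0 - x6 / 720.0;
}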
EVOLUTION OF DATAFLOW COMPUTERS
 Pure dataflow computers
 MIT tagged token dataflow architecture
 Manchester dataflow
 ETL sigma 1

 Explicit token store machines


 They are the successors of pure dataflow machines
 They eliminate associative token matching
 The waiting token memory is directly addressed with
the use of full/empty bits
 Eg: Motorola Monsoon
 ETL EM-4

84
85
 Hybrid and unified architecture
 This combines the positive features from Von
Neumann and dataflow architectures
 Eg: MIT/Motorola *T
 MIT P-RISC

86
ETL EM-4 MACHINE
 Each node consisted of 2 elements
 EMC-R processor
 Memory

 Each node played the role of I-structure memory


 Had 1.31 MB of static RAM

 An Omega n/w was used to provide interconnections among the nodes

87
88
NODE ARCHITECTURE
 The processor chip communicated with the n/w through a 3x3 crossbar switch unit
 The processor and the memory are interfaced using a memory control unit
 Memory holds programs as well as tokens
waiting to be fetched
 Processor contains 6 units
 Input buffers
 Fetch match unit
 Execution unit
 Instruction fetch
 Execute & emit tokens
89
 Register file
90
 Input buffer
 It is used as a token store
 It has a capacity of 32 words

 Fetch match unit


 It fetches the token from the memory
 It performed tag matching operations among the
fetched tokens
 Memory controller
 It fetches the instructions directly from the memory

91
 Execution unit
 It is the heart of the processor
 It fetches instructions until the end of the thread
 Stop flag is raised to indicate the end of a thread
 Instructions with matching tokens are executed
 Instructions will emit the tokens
 These emitted tokens are written into register file

92
MIT/MOTOROLA *T PROTOTYPE
 *T project is a hybrid of dynamic dataflow
architecture and Von Neumann architecture
 Prototype architecture
 A brick of 16 nodes was packaged in a 9-in cube
 Local n/w consisted of 8x8 crossbar switches
 Memory was distributed among the nodes
 1 GB RAM was used per brick

93
94
*T NODE DESIGN
 Each node has 4 components
 MC88110 data processor (dP)
 Synchronization coprocessor
 Memory controller
 n/w interface unit

95
 Data processor (dP)
 It was optimized for long threads
 Concurrent integer & floating point operations were performed within each dP
 Synchronization coprocessor (sP)
 It is implemented as a special function unit
 It was optimized for short threads
 Both dP and sP could handle fast loads
 dP handled incoming continuations
 sP handled
 Incoming messages
 Rload/rstore responses

 Synchronization

96
97
 Memory controller
 Handled the requests for remote memory load or
store
 It managed the node memory

 Network interface unit
 It transmitted and received messages to and from the n/w

98
