ECE/CS
757:
Advanced
Computer
Architecture
II
Instructor:Mikko
H
Lipas>
Spring
2009
University
of
Wisconsin-Madison
Lecture
notes
based
on
slides
created
by
John
Shen,
Mark
Hill,
David
Wood,
Guri
Sohi,
and
Jim
Smith,
Natalie
Enright
Jerger,
and
probably
others
Cache
Coherence
Coherence
States
Snoopy
bus-based
Invalidate
Protocols
Invalidate
protocol
op>miza>ons
Update
Protocols
(Dragon/Firey)
Directory
protocols
Implementa>on
issues
Readings
Readings:
Firey
Archibald
Sweazey/Smith
Laudon/Lenoski:
Origin
2000
Opteron
Gigaplane
Power5
Intel
870
3
Cache
Coherence
Problem
Load A
Store A<= 1
P0
P1
01
Load A
Load A
Memory
2005 Mikko Lipasti
Cache
Coherence
Problem
Load A
Store A<= 1
P0
P1
10
Load A
Load A
10
Memory
2005 Mikko Lipasti
Possible
Causes
of
Incoherence
Sharing
of
writeable
data
Cause
most
commonly
considered
Process
migra>on
Can
occur
even
if
independent
jobs
are
execu>ng
I/O
O[en
xed
via
O/S
cache
ushes
02/07
ECE/CS 757; copyright J. E. Smith, 2007
Cache
Coherence
Informally,
with
coherent
caches:
accesses
to
a
memory
loca>on
appear
to
occur
simultaneously
in
all
copies
of
the
memory
loca>on
copies
caches
Cache
coherence
suggests
an
absolute
>me
scale
--
this
is
not
necessary
What
is
required
is
the
"appearance"
of
coherence...
not
absolute
coherence
E.g.
temporary
incoherence
between
memory
and
a
write-back
cache
may
be
OK.
02/07
ECE/CS 757; copyright J. E. Smith, 2007
7
Update
vs.
Invalida>on
Protocols
Coherent
Shared
Memory
All
processors
see
the
eects
of
others
writes
How/when
writes
are
propagated
Determine
by
coherence
protocol
2005 Mikko Lipasti
Global
Coherence
States
A
memory
line
can
be
present
(valid)
in
any
of
the
caches
and/or
memory
Represent
global
state
with
an
N+1
element
vector
First
N
components
=>
cache
states
(valid/invalid)
N+1st
component
=>
memory
state
(valid/invalid)
Example:
Line
A:
<1,1,0,1>
Line
B:
<1,0,0,0>
Line
C:
<0,0,0,1>
Cache 0
line A: V
line B: I
line C: V
Cache 1
line A
line B
02/07
Memory
Cache 2
line A
ECE/CS 757; copyright J. E. Smith, 2007
Local
Coherence
States
Individual
caches
can
maintain
a
summary
of
the
state
of
memory
lines,
from
a
local
perspec>ve
Reduces
storage
for
maintaining
state
May
have
only
par>al
informa>on
Invalid
(I):
<0,X,X,X....X>
--
local
cache
does
not
have
a
valid
copy;
(cache
miss)
Dont
confuse
invalid
state
with
empty
frame
Shared
(S):
<1,X,X,X,,1>
--
local
cache
has
a
valid
copy,
main
memory
has
a
valid
copy,
other
caches
??
Modied(M):
<1,0,0,..0,0>
--
local
cache
has
only
valid
copy.
Exclusive(E):
<1,0,0,..0,1>
--
local
cache
has
a
valid
copy,
no
other
caches
do,
main
memory
has
a
valid
copy.
Owned(O):
<1,X,X,X,.X>
--
local
cache
has
a
valid
copy,
all
other
caches
and
memory
may
have
a
valid
copy.
Only
one
cache
can
be
in
O
state
<1,X,1,X,
0>
is
included
in
O,
but
not
included
in
any
of
the
others.
02/07
ECE/CS 757; copyright J. E. Smith, 2007
Example
Memory
line A: V
line B: I
line C: V
Cache 0
line A
line B
Cache 1
Cache 2
line A
Memory
line A: V
line B: I
line C: V
Cache 0
line A: S
line B: M
line C: I
02/07
Cache 1
line A: S
line B: I
line C: I
Cache 2
line A: I
line B: I
line C: I
ECE/CS 757; copyright J. E. Smith, 2007
11
Snoopy
Cache
Coherence
All
requests
broadcast
on
bus
All
processors
and
memory
snoop
and
respond
Cache
blocks
writeable
at
one
processor
or
read-
only
at
several
Single-writer
protocol
Snoops
that
hit
dirty
lines?
Flush
modied
data
out
of
cache
Either
write
back
to
memory,
then
sa>sfy
remote
miss
from
memory,
or
Provide
dirty
data
directly
to
requestor
Big
problem
in
MP
systems
Dirty/coherence/sharing
misses
2005 Mikko Lipasti
12
Bus-Based
Protocols
Protocol
consists
of
states
and
ac>ons
(state
transi>ons)
Ac>ons
can
be
invoked
from
processor
or
bus
Bus
Bus Actions
Cache
Controller
Processor Actions
Processor
02/07
ECE/CS 757; copyright J. E. Smith, 2007
State
Tags
Cache Data
Minimal
Coherence
Protocol
FSM
[Source:
Pamerson/Hennessy,
Comp.
Org.
&
Design]
2005 Mikko Lipasti
14
MSI
Protocol
Action and Next State
Current
State
02/07
Processor
Read
Processor
Write
Eviction
Cache Read
Acquire
Copy
S
Cache Read&M
Acquire Copy
M
No Action
S
Cache Upgrade
M
No Action
M
No Action
M
Cache
Read
Cache
Read&M
Cache
Upgrade
No Action
I
No Action
I
No Action
I
No Action
I
No Action
S
Invalidate
Frame
I
Invalidate
Frame
I
Cache
Write
back
I
Memory
inhibit;
Supply
data;
S
Invalidate
Frame;
Memory
inhibit;
Supply data;
I
ECE/CS 757; copyright J. E. Smith, 2007
MSI
Example
Thread Event
Bus Action
Data From
0. Initially:
Memory
Global State
Local States:
C0 C1 C2
<0,0,0,1>
<1,0,0,1>
<1,0,0,0>
1. T0 read
CR
2. T0 write
CU
3. T2 read
CR
C0
<1,0,1,1>
4. T1 write
CRM
Memory
<0,1,0,0>
If
line
is
in
no
cache
Read,
modify,
Write
requires
2
bus
transac>ons
Op>miza>on:
add
Exclusive
state
02/07
ECE/CS 757; copyright J. E. Smith, 2007
Invalidate
Protocol
Op>miza>ons
Observa>on:
data
can
be
read
shared
Add
S
(shared)
state
to
protocol:
MSI
State
transi>ons:
Local
read:
I->S,
fetch
shared
Local
write:
I->M,
fetch
modied;
S->M,
invalidate
other
copies
Remote
read:
M->S,
supply
data
Remote
write:
M->I,
supply
data;
S->I,
invalidate
local
copy
Observa>on:
data
can
be
write-private
(e.g.
stack
frame)
Avoid
invalidate
messages
in
that
case
Add
E
(exclusive)
state
to
protocol:
MESI
State
transi>ons:
Local
read:
I->E
if
only
copy,
I->S
if
other
copies
exist
Local
write:
E->M
silently,
S->M,
invalidate
other
copies
2005 Mikko Lipasti
17
MESI
Protocol
Varia>on
used
in
Pen>um
Pro/2/3
(cache
to
cache
transfer)
4-State
Protocol
Modied:
<1,0,00>
Exclusive:
<1,0,0,,1>
Shared:
<1,X,X,,1>
Invalid:
<0,X,X,X>
Bus/Processor
Ac>ons
Same
as
MSI
Adds
shared
signal
to
indicate
if
other
caches
have
a
copy
02/07
ECE/CS 757; copyright J. E. Smith, 2007
18
MESI
Protocol
Action and Next State
Current
State
02/07
Processor
Read
Processor
Write
Eviction
Cache
Read
If no
sharers:
E
If sharers:
S
Cache Read&M
M
No Action
S
Cache Upgrade
M
No Action
E
No Action
M
No Action
M
No Action
M
Cache
Read
Cache
Read&M
Cache
Upgrade
No Action
I
No Action
I
No Action
I
No Action
I
Respond
Shared:
S
No Action
I
No Action
I
No Action
I
Respond
Shared;
S
No Action
I
Cache
Write-back
I
Respond
dirty;
Write back
data;
S
Respond
dirty;
Write back
data;
I
ECE/CS 757; copyright J. E. Smith, 2007
MESI
Example
Op>miza>on
cache-to-cache
transfer
for
shared
data
(but
only
one
sharer
can
respond)
If
modied
in
some
cache
and
another
reads
then
writeback
to
memory
(w/
snarf)
Op>miza>on:
add
Owned
state
(and
perform
cache-to-cache
transfer)
Thread Event
Bus
Action
Data From
0. Initially:
1. T0 read
CR
2. T0 write
none
02/07
Memory
Global State
Local States:
C0 C1 C2
<0,0,0,1>
<1,0,0,1>
<1,0,0,0>
ECE/CS 757; copyright J. E. Smith, 2007
MOESI
Op>miza>on
Observa>on:
shared
ownership
complicates/delays
sourcing
data
Owner
is
responsible
for
sourcing
data
to
any
requestor
Add
O
(owner)
state
to
protocol:
MOSI/MOESI
Last
requestor
becomes
the
owner
Ownership
can
be
on
per-node
basis
in
hierarchically
structured
system
Avoid
writeback
(to
memory)
of
dirty
data
Also
called
shared-dirty
state,
since
memory
is
stale
2005 Mikko Lipasti
21
MOESI
Protocol
Used
in
AMD
Opteron
5-State
Protocol
Modied:
<1,0,00>
Exclusive:
<1,0,0,,1>
Shared:
<1,X,X,,X>
Invalid:
<0,X,X,X>
Owned:
<1,X,X,X,X>
;
only
one
owner
Owner
can
supply
data,
so
memory
does
not
have
to
Avoids
lengthy
memory
access
02/07
ECE/CS 757; copyright J. E. Smith, 2007
22
MOESI
Protocol
Action and Next State
Current
State
02/07
Processor
Read
Processor
Write
Eviction
Cache Read
Cache Read&M
Cache
Upgrade
Cache Read
If no sharers:
E
If sharers:
S
Cache Read&M
M
No Action
I
No Action
I
No
Action
I
No Action
S
Cache Upgrade
M
No Action
I
Respond
shared;
S
No Action
I
No
Action
I
No Action
E
No Action
M
No Action
I
Respond
shared;
Supply data;
S
Respond
shared;
Supply data;
I
No Action
O
Cache
Upgrade
M
Cache
Write-back
I
Respond
shared;
Supply data;
O
Respond
shared;
Supply data;
I
No Action
M
No Action
M
Cache
Write-back
I
Respond
shared;
Supply data;
O
Respond
shared;
Supply data;
I
ECE/CS 757; copyright J. E. Smith, 2007
MOESI
Example
Thread Event
Bus Action
Data From
0. Initially:
local states
C0 C1 C2
<0,0,0,1>
<1,0,0,1>
<1,0,0,0>
1. T0 read
CR
2. T0 write
none
3. T2 read
CR
C0
<1,0,1,0>
4. T1 write
CRM
C0
<0,1,0,0>
02/07
Memory
Global State
ECE/CS 757; copyright J. E. Smith, 2007
Update
Protocols
Basic
idea:
All
writes
(updates)
are
made
visible
to
all
caches:
(address,value)
tuples
sent
everywhere
Similar
to
write-through
protocol
for
uniprocessor
caches
Obviously
not
scalable
beyond
a
few
processors
No
one
actually
builds
machines
this
way
Simple
op>miza>on
Send
updates
to
memory/directory
Directory
propagates
updates
to
all
known
copies:
less
bandwidth
Further
op>miza>ons:
combine
&
delay
Write-combining
of
adjacent
updates
(if
consistency
model
allows)
Send
write-combined
data
Delay
sending
write-combined
data
un>l
requested
Logical
end
result
Writes
are
combined
into
larger
units,
updates
are
delayed
un>l
needed
Eec>vely
the
same
as
invalidate
protocol
2005 Mikko Lipasti
25
Update
Protocol:
Dragon
Dragon
(developed
at
Xerox
PARC)
5-State
Protocol
Invalid:<0,X,X,X>
Some
say
no
invalid
state
due
to
confusion
regarding
empty
frame
versus
invalid
line
state
Exclusive:
<1,0,0,,1>
Shared-Clean
(Sc):
<1,X,X,X>
memory
may
not
be
up-to-date
Shared-Modied
(Sm):
<1,X,X,X0>
memory
not
up-to-date;
only
one
copy
in
Sm
Modied:
<1,0,0,0>
Includes
Cache
Update
ac>on
Includes
Cache
Writeback
ac>on
Bus
includes
Shared
ag
Appears
to
also
require
memory
inhibit
signal
Dis>nguish
shared
case
where
cache
(not
memory)
supplies
data
02/07
ECE/CS 757; copyright J. E. Smith, 2007
26
Dragon
State
Diagram
Action and Next State
Current
State
Processor
Read
Processor
Write
Eviction
Cache Read
If no sharers:
E
If sharers:
Sc
Cache Read
If no sharers:
M
If sharers:
Cache Update
Sm
Sc
No Action
Sc
Cache Update
If no sharers:
M
If sharers:
Sm
No Action
I
No Action
E
No Action
No Action
I
Respond shared;
Supply data
Sc
Respond shared;
Supply data;
Sm
02/07
Cache Read
Sm
No Action
Sm
Cache Update
If no sharers:
M
If sharers:
Sm
Cache Write-back
I
No Action
M
No Action
M
Cache Write-back
I
Respond Shared;
Sc
Respond shared;
Supply data;
Sm
ECE/CS 757; copyright J. E. Smith, 2007
Cache Update
I
Respond shared;
Update copy;
Sc
Respond shared;
Update copy;
Sc
Example
Thread Event
Bus Action
Data From
0. Initially:
Memory
Global State
local states
C0 C1 C2
<0,0,0,1>
<1,0,0,1>
<1,0,0,0>
1. T0 read
CR
2. T0 write
none
3. T2 read
CR
C0
<1,0,1,0>
Sm
Sc
4. T1 write
CR,CU
C0
<1,1,1,0>
Sc
Sm
Sc
5. T0 read
none (hit)
C0
<1,1,1,0>
Sc
Sm
Sc
Appears
to
require
atomic
bus
cycles
CR,CU
on
write
to
invalid
line
02/07
ECE/CS 757; copyright J. E. Smith, 2007
Update
Protocol:
Firey
Develped
at
DEC
by
ex-Xerox
people
5-State
Protocol
Similar
to
Dragon
dierent
state
naming
based
on
shared/exclusive
and
clean/dirty
Invalid:<0,X,X,X>
EC:
<1,0,0,,1>
SC:
<1,X,X,X>
memory
may
not
be
up-to-date
EM:
<1,0,0,0>
SM:
<1,X,X,X0>
memory
not
up-to-date;
only
one
copy
in
Sm
Performs
write-through
updates
(dierent
from
Dragon)
02/07
ECE/CS 757; copyright J. E. Smith, 2007
29
Firey
State
Diagram
Action and Next State
Current
State
Processor
Read
Processor
Write
Eviction
Cache Read
No Action
I
Cache Read&M
Cache Read
If no sharers:
Ec
If sharers:
Sc
Cache Read
If no sharers:
Em
If sharers:
Cache Update
Sm
Sc
No Action
Sc
Cache Read&M
If no sharers:
Ec
If sharers:
Sc
No Action
I
Ec
No Action
Ec
No Action
No Action
I
Respond shared;
Sc
Respond Shared
Sc
Respond Shared;
Sc
No Action
I
Respond Shared
Sc
Em
02/07
Sm
No Action
Sm
Cache Read&M
If no sharers:
Ec
If sharers:
Sc
Cache Write-back
I
Respond shared;
Supply data;
Sm
Respond shared;
Supply data;
Sc
Em
No Action
Em
No Action
Em
Cache Write-back
I
Respond shared;
Supply data;
Sm
Respond shared;
Supply data;
Sc
ECE/CS 757; copyright J. E. Smith, 2007
Update
vs
Invalidate
[Weber
&
Gupta,
ASPLOS3]
Consider
sharing
pamerns
No
Sharing
Independent
threads
Coherence
due
to
thread
migra>on
Update
protocol
performs
many
wasteful
updates
Read-Only
No
signicant
coherence
issues;
most
protocols
work
well
Migratory
Objects
Manipulated
by
one
processor
at
a
>me
O[en
protected
by
a
lock
Usually
a
write
causes
only
a
single
invalida>on
E
state
useful
for
Read-modify-Write
pamerns
Update
protocol
could
proliferate
copies
02/07
ECE/CS 757; copyright J. E. Smith, 2007
31
Update
vs
Invalidate,
contd.
Synchroniza>on
Objects
Locks
Update
could
reduce
spin
trac
invalida>ons
Test&Test&Set
w/
invalidate
protocol
would
work
well
Many
Readers,
One
Writer
Update
protocol
may
work
well,
but
writes
are
rela>vely
rare
Many
Writers/Readers
Invalidate
probably
works
bemer
Update
will
proliferate
copies
What
is
used
today?
Invalidate
is
dominant
CMP
may
change
this
assessment
more
on-chip
bandwidth
02/07
ECE/CS 757; copyright J. E. Smith, 2007
32
Nasty
Reali>es
State
diagram
is
for
(ideal)
protocol
assuming
instantaneous
and
ac>ons
In
reality
controller
implements
more
complex
diagrams
A
protocol
state
transi>on
may
be
started
by
controller
when
bus
ac>vity
changes
local
state
Example:
an
upgrade
pending
(for
bus)
when
an
invalidate
for
same
line
arrives
02/07
ECE/CS 757; copyright J. E. Smith, 2007
Par>al
Implementa>on
State
Table
Action and Next State
Current
State
Processor
Read
Processor
Write
Bus
Grant
Bus
Response
Cache Read
Cache Read&M
Cache
Upgrade
Request Bus
IR
Request
Bus
IW
No Action
I
No Action
I
No Action
I
No Action
S
Request
Bus
SW
Respond Shared:
S
No Action
I
No Action
I
No Action
E
No Action
M
Respond Shared;
S
No Action
I
No Action
M
No Action
M
Respond dirty;
Write back data;
S
Respond dirty;
Write back data;
I
Respond Shared:
No Action
IW
IR
Cache Read
IRR
IW
Cache Read&M
IWR
IRR
If no
sharers:
E
If sharers:
S
Load line
IWR
M
Load line
SW
02/07
Cache Upgrade
ECE/CS
M
757; copyright J. E. Smith,
2007
SW
No Action
IW
Further
Op>miza>ons
Observa>on:
Shared
blocks
should
only
be
fetched
from
memory
once
If
I
nd
a
shared
block
on
chip,
forward
the
block
Problem:
mul>ple
shared
blocks
possible,
who
forwards?
Everyone?
Power/bandwidth
wasted
Single
forwarder,
but
who?
Last
one
to
receive
block:
F
state
I->F
for
requestor,
F->S
for
forwarder
What
if
F
block
is
evicted?
Favor
F
blocks
in
replacement?
Dont
allow
silent
evic>on
(force
some
other
node
to
be
F)
Fall
back
on
memory
copy
if
cant
nd
F
copy
Very
old
idea
(IBM
machines
have
done
this
for
a
long
>me),
but
recent
Intel
patent
issued
anyway
[Hum/Goodman]
2005 Mikko Lipasti
35
Further
Op>miza>ons
Observa>on:
migratory
data
o[en
ies
by
Add
T
(transi>on)
state
to
protocol
Tag
is
s>ll
valid,
data
isnt
Data
can
be
snarfed
as
it
ies
by
Only
works
with
certain
kinds
of
interconnect
networks
Replacement
policy
issues
Many
other
op>miza>ons
are
possible
Literature
extends
25
years
back
Many
unpublished
(but
implemented)
techniques
as
well
2005 Mikko Lipasti
36
Implemen>ng
Cache
Coherence
Snooping
implementa>on
Origins
in
shared-memory-bus
systems
All
CPUs
could
observe
all
other
CPUs
requests
on
the
bus;
hence
snooping
Bus
Read,
Bus
Write,
Bus
Upgrade
React
appropriately
to
snooped
commands
Invalidate
shared
copies
Provide
up-to-date
copies
of
dirty
lines
Flush
(writeback)
to
memory,
or
Direct
interven>on
(modied
interven2on
or
dirty
miss)
Snooping
suers
from:
Scalability:
shared
busses
not
prac>cal
Ordering
of
requests
without
a
shared
bus
Lots
of
recent
and
on-going
work
on
scaling
snoop-based
systems
2005 Mikko Lipasti
37
Snooping
Cache
Coherence
Basic
idea:
broadcast
snoop
to
all
caches
to
nd
owner
Not
scalable?
Address
trac
roughly
propor>onal
to
square
of
number
of
processors
Current
implementa>ons
scale
to
64/128-way
(Sun/IBM)
with
mul>ple
address-interleaved
broadcast
networks
Inbound
snoop
bandwidth:
big
problem
OutboundSnoopRate = s o =
CacheMissRate + BusUpgradeRate
InboundSnoopRate = si = n
2005 Mikko Lipasti
so
38
Snoop
Bandwidth
l
l
Snoop
ltering
of
various
kinds
is
possible
Filter
snoops
at
sink:
Jemy
lter
[Moshovos
et
al.,
HPCA
2001]
Filter
snoops
at
source:
Mul>cast
snooping
[Bilir
et
al.,
ISCA
1999]
Filter
snoops
at
source:
Region
coherence
Check
small
lter
cache
that
summarizes
contents
of
local
cache
Avoid
power-hungry
lookups
in
each
tag
array
Predict
likely
sharing
set,
snoop
only
predicted
sharers
Double-check
at
directory
to
make
sure
Concurrent
work:
[Can>n/Smith/Lipas>,
ISCA
2005;
Moshovos,
ISCA
2005]
Check
larger
region
of
memory
on
every
snoop;
remember
when
no
sharers
Snoop
only
on
rst
reference
to
region,
or
when
region
is
shared
Eliminate
60%+
of
all
snoops
2005 Mikko Lipasti
39
Snoop
Latency
l
Snoop latency:
Must reach all nodes, return and combine responses
Probably the bigger scalability issue in the future
Topology matters: ring, mesh, torus, hypercube
No obvious solutions
Parallelism: fundamental advantage of snooping
Broadcast exposes parallelism, enables speculative latency
reduction
LDir XSnp RDir XRsp CRsp XRd RDat XDat UDat
RDat XDat
UDat
RDat XDat UDat
RDat
XDat UDat
2005 Mikko Lipasti
40
Scaleable
Cache
Coherence
Eschew
physical
bus
but
s>ll
snoop
Point-to-point
tree
structure
(indirect)
or
ring
Root
of
tree
or
ring
provide
ordering
point
Use
some
scalable
network
for
data
(ordering
less
important)
Or,
use
level
of
indirec>on
through
directory
Directory
at
memory
remembers:
Which
processor
is
single
writer
Which
processors
are
shared
readers
Level
of
indirec>on
has
a
price
Dirty
misses
require
3
hops
instead
of
two
Snoop:
Requestor->Owner
Directory:
Requestor->Directory->Owner
2005 Mikko Lipasti
41
Implemen>ng
Cache
Coherence
Directory
implementa>on
Extra
bits
stored
in
memory
(directory)
record
state
of
line
Memory
controller
maintains
coherence
based
on
the
current
state
Other
CPUs
commands
are
not
snooped,
instead:
Directory
forwards
relevant
commands
Powerful
ltering
eect:
only
observe
commands
that
you
need
to
observe
Meanwhile,
bandwidth
at
directory
scales
by
adding
memory
controllers
as
you
increase
size
of
the
system
Leads
to
very
scalable
designs
(100s
to
1000s
of
CPUs)
Directory
shortcomings
Indirec>on
through
directory
has
latency
penalty
If
shared
line
is
dirty
in
other
CPUs
cache,
directory
must
forward
request,
adding
latency
This
can
severely
impact
performance
of
applica>ons
with
heavy
sharing
(e.g.
rela>onal
databases)
2005 Mikko Lipasti
42
Directory
Protocol
Implementa>on
Basic
idea:
Centralized
directory
keeps
track
of
data
loca>on(s)
Scalable
Address
trac
roughly
propor>onal
to
number
of
processors
Directory
&
trac
can
be
distributed
with
memory
banks
(interleaved)
Directory
cost
(SRAM)
or
latency
(DRAM)
can
be
prohibi>ve
Full
map
(N
processors,
N
bits):
cost/scalability
Limited
map
(limits
number
of
sharers)
Coarse
map
(iden>es
board/node/cluster;
must
use
broadcast)
Point
to
shared
copies
Fixed
number,
linked
lists
(SCI),
caches
chained
together
Latency
vs.
cost
vs.
scalability
Presence
bits
track
sharers
Vectors
track
sharers
2005 Mikko Lipasti
43
Directory
Protocol
Latency
LDir XSnp RDir XRd RDat XDat UDat
Access
to
non-shared
data
Overlap
directory
read
with
data
read
Best
possible
latency
given
NUMA
arrangement
Access
to
shared
data
Dirty
miss,
modied
interven>on
Shared
interven>on?
If
DRAM
directory,
no
gain
If
directory
cache,
possible
gain
(use
F
state)
No
inherent
parallelism
Indirec>on
adds
latency
Minimum
3
hops,
o[en
4
hops
2005 Mikko Lipasti
44
Directory-based
Cache
Coherence
An
alterna>ve
for
large,
scalable
MPs
Can
be
based
on
any
of
the
protocols
discussed
thus
far
We
will
use
MSI
Memory
Controller
becomes
an
ac>ve
par>cipant
Sharing
info
held
in
memory
directory
Memory
Module
Memory
Module
Directory
Directory
...
Memory
Module
Directory
Interconnection Network
Directory
may
be
distributed
Use
point-to-point
messages
Network
is
not
totally
ordered
02/07
Cache
Cache
Processor
Processor
ECE/CS 757; copyright J. E. Smith, 2007
...
Cache
Processor
Example:
Simple
Directory
Protocol
Local
cache
controller
states
M,
S,
I
as
before
Local
directory
states
Shared:
<1,X,X,1>;
one
or
more
proc.
has
copy;
memory
is
up-to-date
Modied:
<0,1,0,.,0>
one
processor
has
copy;
memory
does
not
have
a
valid
copy
Uncached:
<0,0,0,1>
none
of
the
processors
has
a
valid
copy
Directory
also
keeps
track
of
sharers
Can
keep
global
state
vector
in
full
e.g.
via
a
bit
vector
02/07
ECE/CS 757; copyright J. E. Smith, 2007
46
Example
Local
cache
suers
load
miss
Line
in
remote
cache
in
M
state
It
is
the
owner
Four
messages
send
over
network
Cache
read
from
local
controller
to
home
memory
controller
Memory
read
to
remote
cache
controller
Owner
data
back
to
memory
controller;
change
state
to
S
Memory
data
back
to
local
cache;
change
state
to
S
02/07
Processor
Processor
processor
read
Cache
Local
Controller
Cache
Owner
Controller
memory
data
response
Processor
...
owner
data
response
Remote
Controller
Cache
Interconnect
memory
read
cache
read
Memory
Controller
Memory
Controller
Memory
Controller
Directory
Directory
Directory
Memory
Banks
Memory
Banks
ECE/CS 757; copyright J. E. Smith, 2007
...
Memory
Banks
Cache
Controller
State
Diagram
Intermediate
States
for
clarity
Cache Controller
Actions and Next States
from Processor Side
Current
State
Processor
Read
Processor
Write
from Memory Side
Eviction
Cache
Read
I'
Cache
Read&M
I''
No
Action
S
Cache
Upgrade
S'
No
Action*
I
No
Action
M
No Action
M
Cache
Write-back
I
Memory
Read
Memory
Read&M
Memory
Invalidate
Memory
Upgrade
No Action
I
Invalidate
Frame;
Cache ACK;
I
Owner
Data;
S
Owner
Data;
I
Invalidate
Frame;
Cache ACK;
I
I'
Fill Cache
S
I''
Fill Cache
M
S'
02/07
Memory Data
No Action
M
ECE/CS 757; copyright J. E. Smith, 2007
48
Memory
Controller
State
Diagram
Memory Controller
Actions and Next States
command from Local Cache Controller
Current
Directory
State
Cache
Read
Cache Read&M
Cache
Upgrade
Memory Data;
Add Requestor to
Sharers;
S
Memory Data;
Add Requestor to
Sharers;
M
Memory Data;
Add Requestor to
Sharers;
S
Memory
Invalidate All
Sharers;
M'
Memory Read
from Owner;
S'
Memory Read&M;
to Owner
M'
Memory
Upgrade
All Sharers;
M''
response from Remote Cache Controller
Data
Write-back
Cache ACK
No Action
I
Make Sharers
Empty;
U
S'
Memory Data
to Requestor;
Write memory;
Add Requestor to
Sharers;
S
M'
When all ACKS
Memory Data;
M
M''
When all ACKS
then
M
02/07
Owner
Data
ECE/CS 757; copyright J. E. Smith, 2007
Memory Data
to Requestor;
M
49
Another
Example
Local
write
(miss)
to
shared
line
Requires
invalida>ons
and
acks
Processor
Processor
processor
write
Cache
Local
Controller
cache
Read&M
02/07
Cache
Remote
Controller
memory
data
response
memory
invalidate
Processor
...
Remote
Controller
Cache
cache
ack
cache
ack
Interconnect
Memory
Controller
Home Memory
Controller
Memory
Controller
Directory
Directory
Directory
Memory
Banks
Memory
Banks
ECE/CS 757; copyright J. E. Smith, 2007
...
Memory
Banks
Example
Sequence
Similar
to
earlier
sequences
Thread Event
Controller
Actions
Data From
0. Initially:
local states:
C0 C1 C2
<0,0,0,1>
<1,0,0,1>
<1,0,0,0>
1. T0 read
CR,MD
2. T0 write
CU, MU*,MD
3. T2 read
CR,MR,MD
C0
<1,0,1,1>
4. T1 write
CRM,MI,CA,MD
Memory
<0,1,0,0>
02/07
Memory
global state
ECE/CS 757; copyright J. E. Smith, 2007
Varia>on:
Three
Hop
Protocol
Have
owner
send
data
directly
to
local
controller
Owner
Acks
to
Memory
Controller
in
parallel
Local
Controller
cache
read
Owner
Controller
memory
data
memory
read
owner
data
owner
data
Local
Controller
cache
read
3
memory
read
Memory
Controller
Memory
Controller
a)
02/07
Owner
Controller
b)
ECE/CS 757; copyright J. E. Smith, 2007
owner
ack
Directory
Protocol
Op>miza>ons
Remove
dead
blocks
from
cache:
Eliminate
3-
or
4-hop
latency
Dynamic
Self-Invalida>on
[Lebeck/Wood,
ISCA
1995]
Last
touch
predic>on
[Lai/Falsa,
ISCA
2000]
Dead
block
predic>on
[Lai/Fide/Falsa,
ISCA
2001]
Predic>on
in
coherence
protocols
[Mukherjee/Hill,
ISCA
1998]
Instruc>on-based
predic>on
[Kaxiras/Goodman,
ISCA
1999]
Sharing
predic>on
[Lai/Falsa,
ISCA
1999]
Improve
latency
by
snooping,
conserve
bandwidth
with
directory
Mul>cast
snooping
[Bilir
et
al.,
ISCA
1999;
Mar>n
et
al.,
ISCA
2003]
Bandwidth-adap>ve
hybrid
[Mar>n
et
al.,
HPCA
2002]
Token
Coherence
[Mar>n
et
al.,
ISCA
2003]
VCT
Mul>cas>ng
[Enright
Jerger
thesis
2008]
Predict
sharers
Hybrid
snooping/directory
protocols
2005 Mikko Lipasti
53
Atomic
bus
Protocol
Races
Only
stable
states
in
protocol
(e.g.
M,
S,
I)
All
state
transi>ons
are
atomic
(I->M)
No
conic>ng
requests
can
interfere
since
bus
is
held
>ll
transac>on
completes
Dis>nguish
coherence
transac>on
from
data
transfer
Data
transfer
can
s>ll
occur
much
later;
easier
to
handle
this
case
Atomic
buses
dont
scale
At
minimum,
separate
bus
request/response
Large
systems
have
broadly
variable
delays
Req/resp
separated
by
dozens
of
cycles
Conic>ng
requests
can
and
do
get
issued
Messages
may
get
reordered
in
the
interconnect
How
do
we
resolve
them?
54
Resolving
Protocol
Races
Req/resp
decoupling
introduces
transient
states
E.g.
I->S
is
now
I->ItoX->ItoS_nodata->S
Conic>ng
requests
to
blocks
in
transient
states
NAK
ugly;
livelock,
starva>on
poten>al
Keep
adding
more
transient
states
Directory
protocol
makes
this
a
bit
easier
Can
order
at
directory,
which
has
full
state
info
Even
so,
messages
may
get
reordered
55
Common
Protocol
Races
Read
strings:
P0
read,
P1
read,
P2
read
Easy,
since
read
is
nondestruc>ve
Can
rely
on
F
state
to
reduce
DRAM
accesses
Forward
reads
to
previous
requestor
(F)
Write
strings:
P0
write,
P1
write,
P2
write
Forward
P1
write
req
to
P0
(M)
P0
completes
write
then
forwards
M
block
to
P1
Build
string
of
writes
(write
string
forwarding)
Read
a[er
write
(similar
to
prev.
WAW)
Writeback
race:
P0
evicts
dirty
block,
P1
reads
Dirty
block
is
in
the
network
(no
copy
at
P0
or
at
dir)
NAK
P1,
or
force
P0
to
keep
copy
>ll
dir
ACKs
WB
Many
others
crop
up,
esp.
with
op>miza>ons
56
Summary
Coherence
States
Snoopy
bus-based
Invalidate
Protocols
Invalidate
protocol
op>miza>ons
Update
Protocols
(Dragon/Firey)
Directory
protocols
Implementa>on
issues
57