Cache Coherence - MESI MOESI

Cache Coherency Notes

Uploaded by

Mohd Imran


ECE/CS 757: Advanced Computer Architecture II
Instructor: Mikko H. Lipasti

Spring 2009
University of Wisconsin-Madison

Lecture notes based on slides created by John Shen, Mark Hill, David Wood, Guri Sohi, Jim Smith, Natalie Enright Jerger, and probably others

Cache Coherence

Coherence States
Snoopy bus-based Invalidate Protocols
Invalidate protocol optimizations
Update Protocols (Dragon/Firefly)
Directory protocols
Implementation issues

Readings:
Firefly
Archibald
Sweazey/Smith
Laudon/Lenoski: Origin 2000
Opteron
Gigaplane
Power5
Intel 870

Cache Coherence Problem

[Figure: two processors with caches above a shared memory. One issues Load A, then Store A <= 1; the other issues Load A twice.]

2005 Mikko Lipasti

Possible Causes of Incoherence

Sharing of writeable data
Cause most commonly considered

Process migration
Can occur even if independent jobs are executing

I/O
Often fixed via O/S cache flushes

02/07

ECE/CS 757; copyright J. E. Smith, 2007

Cache Coherence

Informally, with coherent caches: accesses to a memory location appear to occur simultaneously in all copies of the memory location
"copies" = caches

Cache coherence suggests an absolute time scale -- this is not necessary
What is required is the "appearance" of coherence... not absolute coherence
E.g. temporary incoherence between memory and a write-back cache may be OK.

Update vs. Invalidation Protocols

Coherent Shared Memory
All processors see the effects of others' writes

How/when writes are propagated
Determined by coherence protocol

Global Coherence States

A memory line can be present (valid) in any of the caches and/or
memory
Represent global state with an N+1 element vector
First N components => cache states (valid/invalid)
N+1st component => memory state (valid/invalid)

Example (three caches plus memory):
Line A: <1,1,0,1>
Line B: <1,0,0,0>
Line C: <0,0,0,1>

[Figure: Cache 0 holds lines A and B; Cache 1 holds line A; Cache 2 holds none of these lines. Memory: line A: V, line B: I, line C: V.]

Local Coherence States

Individual caches can maintain a summary of the state of memory lines, from a local perspective
Reduces storage for maintaining state
May have only partial information
Invalid (I): <0,X,X,X,...,X> -- local cache does not have a valid copy (cache miss)
Don't confuse invalid state with empty frame
Shared (S): <1,X,X,X,...,1> -- local cache has a valid copy, main memory has a valid copy, other caches ??
Modified (M): <1,0,0,...,0,0> -- local cache has the only valid copy.
Exclusive (E): <1,0,0,...,0,1> -- local cache has a valid copy, no other caches do, main memory has a valid copy.
Owned (O): <1,X,X,X,...,X> -- local cache has a valid copy, all other caches and memory may have a valid copy.
Only one cache can be in O state
<1,X,1,X,...,0> is included in O, but not included in any of the others.
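The local state definitions can be read as patterns over the global state vector. The following is a minimal sketch of that reading, not part of the original notes; the helper name `possible_local_states` is hypothetical. It lists every local state whose pattern matches a given global vector for one cache. Which of the matching states a real cache would actually be in depends on the protocol.

```python
def possible_local_states(global_state, cache):
    """Return the local states of `cache` whose pattern matches `global_state`.

    global_state: sequence of 0/1 flags, one per cache, followed by one
    flag for memory (the N+1 element vector from the notes).
    """
    *caches, memory = global_state
    if not caches[cache]:
        return {"I"}                  # <0,X,...,X>: no valid local copy
    others = sum(caches) - 1          # valid copies held by other caches
    states = {"O"}                    # <1,X,...,X>: matches any vector with a local copy
    if others == 0 and memory == 0:
        states.add("M")               # <1,0,...,0,0>
    if others == 0 and memory == 1:
        states.add("E")               # <1,0,...,0,1>
    if memory == 1:
        states.add("S")               # <1,X,...,X,1>
    return states


# A vector of the form <1,X,1,X,...,0> matches only the O pattern.
print(possible_local_states((1, 0, 1, 0, 0), cache=0))
```

The sketch treats O as the catch-all pattern, which is why it appears in every result for a cache that holds a valid copy.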

Example

Memory
line A: V
line B: I
line C: V

Cache 0
line A: S
line B: M
line C: I

Cache 1
line A: S
line B: I
line C: I

Cache 2
line A: I
line B: I
line C: I

Snoopy Cache Coherence

All requests broadcast on bus
All processors and memory snoop and respond
Cache blocks writeable at one processor or read-only at several
Single-writer protocol

Snoops that hit dirty lines?

Flush modified data out of cache
Either write back to memory, then satisfy remote miss from memory, or
Provide dirty data directly to requestor
Big problem in MP systems
Dirty/coherence/sharing misses

Bus-Based Protocols
Protocol consists of states and actions (state transitions)
Actions can be invoked from processor or bus

[Figure: the cache controller, which holds the line state, sits between the processor and the bus; processor actions arrive from one side and bus actions from the other.]

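MSI Protocol

Action and next state, by current state:

Current state I
  Processor Read: Cache Read; acquire copy -> S
  Processor Write: Cache Read&M; acquire copy -> M
  Cache Read / Cache Read&M / Cache Upgrade (from bus): No Action -> I

Current state S
  Processor Read: No Action -> S
  Processor Write: Cache Upgrade -> M
  Cache Read (from bus): No Action -> S
  Cache Read&M / Cache Upgrade (from bus): Invalidate Frame -> I

Current state M
  Processor Read / Processor Write: No Action -> M
  Eviction: Cache Write-back -> I
  Cache Read (from bus): Memory inhibit; supply data -> S
  Cache Read&M (from bus): Invalidate Frame; memory inhibit; supply data -> I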

MSI Example

Columns: Thread Event | Bus Action | Data From | Global State | Local States (C0 C1 C2)

0. Initially: global state <0,0,0,1>
1. T0 read: data from Memory; global state <1,0,0,1>
2. T0 write: global state <1,0,0,0>
3. T2 read: global state <1,0,1,1>
4. T1 write: bus action CRM; data from Memory; global state <0,1,0,0>

If line is in no cache:
Read, modify, write requires 2 bus transactions
Optimization: add Exclusive state
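The MSI event sequence can be replayed with a small simulator. This is a sketch under stated assumptions rather than the course's reference model: the function names `msi_event` and `global_state` are hypothetical, CR/CRM/CU abbreviate Cache Read, Cache Read&M, and Cache Upgrade, and memory is assumed to pick up the data when a Modified line is supplied to a reader (which is what the global state <1,0,1,1> after step 3 suggests).

```python
def msi_event(states, mem_valid, thread, op):
    """Apply one processor event to a single line under MSI.

    states: list of 'M'/'S'/'I', one entry per cache.
    Returns (bus_action, data_from, new_states, new_mem_valid).
    """
    states = list(states)
    cur = states[thread]
    others = [i for i in range(len(states)) if i != thread]
    if op == "read":
        if cur != "I":
            return "none", "local cache", states, mem_valid
        data_from = "memory"
        for i in others:
            if states[i] == "M":      # owner inhibits memory and supplies data
                states[i] = "S"
                data_from = f"C{i}"
                mem_valid = True      # assumption: memory captures the supplied data
        states[thread] = "S"
        return "CR", data_from, states, mem_valid
    # op == "write"
    if cur == "M":
        return "none", "local cache", states, mem_valid
    if cur == "S":
        bus, data_from = "CU", "local cache"   # upgrade: no data transfer needed
    else:
        bus, data_from = "CRM", "memory"
        for i in others:
            if states[i] == "M":
                data_from = f"C{i}"
    for i in others:
        states[i] = "I"               # invalidate every other copy
    states[thread] = "M"
    return bus, data_from, states, False       # memory is stale after the write


def global_state(states, mem_valid):
    """N+1 element vector: one valid flag per cache, then memory."""
    return tuple(int(s != "I") for s in states) + (int(mem_valid),)


states, mem_valid = ["I", "I", "I"], True
trace = [("initially", "", "", global_state(states, mem_valid), tuple(states))]
for thread, op in [(0, "read"), (0, "write"), (2, "read"), (1, "write")]:
    bus, src, states, mem_valid = msi_event(states, mem_valid, thread, op)
    trace.append((f"T{thread} {op}", bus, src,
                  global_state(states, mem_valid), tuple(states)))

for row in trace:
    print(row)
```

Each printed row holds the event, bus action, data source, global state, and local states, in the same column order as the example.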

Invalidate Protocol Optimizations

Observation: data can be read shared
Add S (shared) state to protocol: MSI

State transitions:

Local read: I->S, fetch shared
Local write: I->M, fetch modified; S->M, invalidate other copies
Remote read: M->S, supply data
Remote write: M->I, supply data; S->I, invalidate local copy

Observation: data can be write-private (e.g. stack frame)

Avoid invalidate messages in that case
Add E (exclusive) state to protocol: MESI

State transitions:

Local read: I->E if only copy, I->S if other copies exist
Local write: E->M silently, S->M, invalidate other copies

MESI Protocol
Variation used in Pentium Pro/2/3
(cache to cache transfer)
4-State Protocol
Modified: <1,0,0,...,0>
Exclusive: <1,0,0,...,1>
Shared: <1,X,X,...,1>
Invalid: <0,X,X,...,X>

Bus/Processor Actions
Same as MSI

Adds shared signal to indicate if other caches have a copy

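MESI Protocol

Action and next state, by current state:

Current state I
  Processor Read: Cache Read -> E if no sharers, S if sharers
  Processor Write: Cache Read&M -> M
  Cache Read / Cache Read&M / Cache Upgrade (from bus): No Action -> I

Current state S
  Processor Read: No Action -> S
  Processor Write: Cache Upgrade -> M
  Cache Read (from bus): Respond shared -> S
  Cache Read&M / Cache Upgrade (from bus): No Action -> I

Current state E
  Processor Read: No Action -> E
  Processor Write: No Action -> M
  Cache Read (from bus): Respond shared -> S
  Cache Read&M (from bus): No Action -> I

Current state M
  Processor Read: No Action -> M
  Processor Write: No Action -> M
  Eviction: Cache Write-back -> I
  Cache Read (from bus): Respond dirty; write back data -> S
  Cache Read&M (from bus): Respond dirty; write back data -> I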

MESI Example

Optimization: cache-to-cache transfer for shared data (but only one sharer can respond)
If modified in some cache and another reads, then write back to memory (w/ snarf)
Optimization: add Owned state (and perform cache-to-cache transfer)

Columns: Thread Event | Bus Action | Data From | Global State | Local States (C0 C1 C2)

0. Initially: global state <0,0,0,1>
1. T0 read: data from Memory; global state <1,0,0,1>
2. T0 write: bus action none; global state <1,0,0,0>
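The benefit of the Exclusive state on a private read-modify-write can be checked by counting bus transactions. The sketch below is illustrative only; the function `count_bus_transactions` and its `exclusive_state` flag are hypothetical names, and it tracks a single line with write-backs ignored.

```python
def count_bus_transactions(events, n_caches=3, exclusive_state=True):
    """Count bus transactions for one line under MESI (or MSI if
    exclusive_state is False). events: list of (thread, 'read'|'write')."""
    states = ["I"] * n_caches
    count = 0
    for thread, op in events:
        cur = states[thread]
        others = [i for i in range(n_caches)
                  if i != thread and states[i] != "I"]
        if op == "read":
            if cur == "I":
                count += 1                  # Cache Read
                for i in others:
                    states[i] = "S"         # holders respond shared and drop to S
                alone = exclusive_state and not others
                states[thread] = "E" if alone else "S"
        else:
            if cur == "I":
                count += 1                  # Cache Read&M
            elif cur == "S":
                count += 1                  # Cache Upgrade
            # E -> M and M -> M need no bus transaction
            for i in others:
                states[i] = "I"
            states[thread] = "M"
    return count


private_rmw = [(0, "read"), (0, "write")]
print(count_bus_transactions(private_rmw, exclusive_state=False))  # MSI
print(count_bus_transactions(private_rmw, exclusive_state=True))   # MESI
```

For the private read-then-write sequence the MSI variant needs a Cache Read plus a Cache Upgrade, while the MESI variant needs only the Cache Read because E->M is silent.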

MOESI Optimization
Observation: shared ownership complicates/delays sourcing data
Owner is responsible for sourcing data to any requestor
Add O (owner) state to protocol: MOSI/MOESI
Last requestor becomes the owner
Ownership can be on per-node basis in hierarchically structured system
Avoid writeback (to memory) of dirty data
Also called shared-dirty state, since memory is stale

MOESI Protocol

Used in AMD Opteron

5-State Protocol

Modified: <1,0,0,...,0>
Exclusive: <1,0,0,...,1>
Shared: <1,X,X,...,X>
Invalid: <0,X,X,...,X>
Owned: <1,X,X,...,X>; only one owner

Owner can supply data, so memory does not have to

Avoids lengthy memory access

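MOESI Protocol

Action and next state, by current state:

Current state I
  Processor Read: Cache Read -> E if no sharers, S if sharers
  Processor Write: Cache Read&M -> M
  Cache Read / Cache Read&M / Cache Upgrade (from bus): No Action -> I

Current state S
  Processor Read: No Action -> S
  Processor Write: Cache Upgrade -> M
  Eviction: No Action -> I
  Cache Read (from bus): Respond shared -> S
  Cache Read&M (from bus): No Action -> I
  Cache Upgrade (from bus): No Action -> I

Current state E
  Processor Read: No Action -> E
  Processor Write: No Action -> M
  Eviction: No Action -> I
  Cache Read (from bus): Respond shared; supply data -> S
  Cache Read&M (from bus): Respond shared; supply data -> I

Current state O
  Processor Read: No Action -> O
  Processor Write: Cache Upgrade -> M
  Eviction: Cache Write-back -> I
  Cache Read (from bus): Respond shared; supply data -> O
  Cache Read&M (from bus): Respond shared; supply data -> I

Current state M
  Processor Read / Processor Write: No Action -> M
  Eviction: Cache Write-back -> I
  Cache Read (from bus): Respond shared; supply data -> O
  Cache Read&M (from bus): Respond shared; supply data -> I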

MOESI Example

Columns: Thread Event | Bus Action | Data From | Global State | Local States (C0 C1 C2)

0. Initially: global state <0,0,0,1>
1. T0 read: data from Memory; global state <1,0,0,1>
2. T0 write: bus action none; global state <1,0,0,0>
3. T2 read: global state <1,0,1,0>
4. T1 write: bus action CRM; global state <0,1,0,0>
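The same event sequence can be replayed under MOESI to see the Owned state keep memory stale after a remote read. This is a hedged sketch: `moesi_event` is a hypothetical helper, only one line is tracked, and the choice that a cache in M, O, or E supplies data on a snooped request follows the "Supply data" entries of the MOESI state table.

```python
def moesi_event(states, mem_valid, thread, op):
    """Apply one processor event to a single line under MOESI.

    Returns (bus_action, data_from, new_states, new_mem_valid).
    """
    states = list(states)
    cur = states[thread]
    others = [i for i in range(len(states)) if i != thread]
    holders = [i for i in others if states[i] != "I"]
    # A cache in M, O, or E supplies data on a snooped request.
    supplier = next((i for i in holders if states[i] in ("M", "O", "E")), None)
    data_from = f"C{supplier}" if supplier is not None else "memory"
    if op == "read":
        if cur != "I":
            return "none", "local cache", states, mem_valid
        for i in holders:
            if states[i] == "M":
                states[i] = "O"       # keep dirty data; memory is not updated
            elif states[i] == "E":
                states[i] = "S"
        states[thread] = "S" if holders else "E"
        return "CR", data_from, states, mem_valid
    # op == "write"
    if cur in ("M", "E"):
        states[thread] = "M"          # silent: no other copies exist
        return "none", "local cache", states, False
    if cur in ("S", "O"):
        bus, data_from = "CU", "local cache"
    else:
        bus = "CRM"
    for i in others:
        states[i] = "I"
    states[thread] = "M"
    return bus, data_from, states, False


def global_state(states, mem_valid):
    return tuple(int(s != "I") for s in states) + (int(mem_valid),)


states, mem_valid = ["I", "I", "I"], True
trace = [("initially", "", "", global_state(states, mem_valid), tuple(states))]
for thread, op in [(0, "read"), (0, "write"), (2, "read"), (1, "write")]:
    bus, src, states, mem_valid = moesi_event(states, mem_valid, thread, op)
    trace.append((f"T{thread} {op}", bus, src,
                  global_state(states, mem_valid), tuple(states)))

for row in trace:
    print(row)
```

After the T2 read, C0 sits in O and the memory flag stays 0, which is the difference from the MSI and MESI sequences where the dirty line is written back.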

Update Protocols
Basic idea:
All writes (updates) are made visible to all caches:
(address,value) tuples sent everywhere
Similar to write-through protocol for uniprocessor caches
Obviously not scalable beyond a few processors
No one actually builds machines this way

Simple optimization
Send updates to memory/directory
Directory propagates updates to all known copies: less bandwidth

Further optimizations: combine & delay
Write-combining of adjacent updates (if consistency model allows)
Send write-combined data
Delay sending write-combined data until requested

Logical end result
Writes are combined into larger units, updates are delayed until needed
Effectively the same as invalidate protocol

Update Protocol: Dragon

Dragon (developed at Xerox PARC)
5-State Protocol
Invalid: <0,X,X,...,X>
Some say no invalid state due to confusion regarding empty frame versus invalid line state
Exclusive: <1,0,0,...,1>
Shared-Clean (Sc): <1,X,X,...,X> memory may not be up-to-date
Shared-Modified (Sm): <1,X,X,...,0> memory not up-to-date; only one copy in Sm
Modified: <1,0,0,...,0>

Includes Cache Update action
Includes Cache Writeback action
Bus includes Shared flag
Appears to also require memory inhibit signal
Distinguish shared case where cache (not memory) supplies data

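Dragon State Diagram

Action and next state, by current state:

Current state I
  Processor Read: Cache Read -> E if no sharers, Sc if sharers
  Processor Write: Cache Read -> M if no sharers; if sharers, Cache Update -> Sm

Current state Sc
  Processor Read: No Action -> Sc
  Processor Write: Cache Update -> M if no sharers, Sm if sharers
  Eviction: No Action -> I
  Cache Read (from bus): Respond shared -> Sc
  Cache Update (from bus): Respond shared; update copy -> Sc

Current state E
  Processor Read: No Action -> E
  Processor Write: No Action -> M
  Eviction: No Action -> I
  Cache Read (from bus): Respond shared; supply data -> Sc

Current state Sm
  Processor Read: No Action -> Sm
  Processor Write: Cache Update -> M if no sharers, Sm if sharers
  Eviction: Cache Write-back -> I
  Cache Read (from bus): Respond shared; supply data -> Sm

Current state M
  Processor Read / Processor Write: No Action -> M
  Eviction: Cache Write-back -> I
  Cache Read (from bus): Respond shared; supply data -> Sm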

Example

Columns: Thread Event | Bus Action | Data From | Global State | Local States (C0 C1 C2)

0. Initially: global state <0,0,0,1>
1. T0 read: data from Memory; global state <1,0,0,1>
2. T0 write: bus action none; global state <1,0,0,0>
3. T2 read: global state <1,0,1,0>
4. T1 write: bus action CR,CU; global state <1,1,1,0>
5. T0 read: bus action none (hit); global state <1,1,1,0>

Appears to require atomic bus cycles CR,CU on write to invalid line
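The update-protocol sequence can be replayed in the same style. The sketch below is an approximation of Dragon for one line, assuming the five states I, E, Sc, Sm, M described in these notes; `dragon_event` is a hypothetical helper, CR and CU abbreviate Cache Read and Cache Update, and evictions are not modeled.

```python
def dragon_event(states, mem_valid, thread, op):
    """Apply one processor event to a single line under a Dragon-style
    update protocol. Returns (bus_actions, new_states, new_mem_valid)."""
    states = list(states)
    cur = states[thread]
    others = [i for i in range(len(states)) if i != thread]
    sharers = [i for i in others if states[i] != "I"]
    bus = []
    if cur == "I":
        bus.append("CR")              # any miss starts with a Cache Read
        for i in sharers:             # holders respond shared
            states[i] = {"E": "Sc", "M": "Sm"}.get(states[i], states[i])
        cur = "Sc" if sharers else "E"
    if op == "write":
        if cur in ("Sc", "Sm"):
            bus.append("CU")          # Cache Update pushes the new value out
            for i in sharers:
                states[i] = "Sc"      # updated copies stay valid
            cur = "Sm" if sharers else "M"
        else:
            cur = "M"                 # E or M: no other copies, no bus action
        mem_valid = False             # the update is not written to memory
    states[thread] = cur
    return ",".join(bus) or "none", states, mem_valid


def global_state(states, mem_valid):
    return tuple(int(s != "I") for s in states) + (int(mem_valid),)


states, mem_valid = ["I", "I", "I"], True
trace = [("initially", "", global_state(states, mem_valid), tuple(states))]
events = [(0, "read"), (0, "write"), (2, "read"), (1, "write"), (0, "read")]
for thread, op in events:
    bus, states, mem_valid = dragon_event(states, mem_valid, thread, op)
    trace.append((f"T{thread} {op}", bus,
                  global_state(states, mem_valid), tuple(states)))

for row in trace:
    print(row)
```

The final T0 read hits locally because the T1 write updated C0's copy instead of invalidating it, which is the behavior that separates update protocols from the invalidate protocols earlier in the notes.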

Update Protocol: Firefly

Developed at DEC by ex-Xerox people
5-State Protocol
Similar to Dragon; different state naming based on shared/exclusive and clean/dirty
Invalid: <0,X,X,...,X>
EC: <1,0,0,...,1>
SC: <1,X,X,...,X> memory may not be up-to-date
EM: <1,0,0,...,0>
SM: <1,X,X,...,0> memory not up-to-date; only one copy in SM

Performs write-through updates (different from Dragon)

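Firefly State Diagram

Action and next state, by current state:

Current state I
  Processor Read: Cache Read -> Ec if no sharers, Sc if sharers
  Processor Write: Cache Read -> Em if no sharers; if sharers, Cache Update -> Sm
  Cache Read / Cache Read&M (from bus): No Action -> I

Current state Sc
  Processor Read: No Action -> Sc
  Processor Write: Cache Read&M -> Ec if no sharers, Sc if sharers
  Eviction: No Action -> I
  Cache Read (from bus): Respond shared -> Sc
  Cache Read&M (from bus): Respond shared -> Sc

Current state Ec
  Processor Read: No Action -> Ec
  Processor Write: No Action -> Em
  Eviction: No Action -> I
  Cache Read (from bus): Respond shared -> Sc
  Cache Read&M (from bus): Respond shared -> Sc

Current state Sm
  Processor Read: No Action -> Sm
  Processor Write: Cache Read&M -> Ec if no sharers, Sc if sharers
  Eviction: Cache Write-back -> I
  Cache Read (from bus): Respond shared; supply data -> Sm
  Cache Read&M (from bus): Respond shared; supply data -> Sc

Current state Em
  Processor Read / Processor Write: No Action -> Em
  Eviction: Cache Write-back -> I
  Cache Read (from bus): Respond shared; supply data -> Sm
  Cache Read&M (from bus): Respond shared; supply data -> Sc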

Update vs Invalidate
[Weber & Gupta, ASPLOS3]
Consider sharing patterns

No Sharing
Independent threads
Coherence due to thread migration
Update protocol performs many wasteful updates

Read-Only
No significant coherence issues; most protocols work well

Migratory Objects
Manipulated by one processor at a time
Often protected by a lock
Usually a write causes only a single invalidation
E state useful for Read-modify-Write patterns
Update protocol could proliferate copies

Update vs Invalidate, contd.

Synchronization Objects
Locks
Update could reduce spin traffic invalidations
Test&Test&Set w/ invalidate protocol would work well

Many Readers, One Writer
Update protocol may work well, but writes are relatively rare

Many Writers/Readers
Invalidate probably works better
Update will proliferate copies

What is used today?
Invalidate is dominant
CMP may change this assessment
more on-chip bandwidth

Nasty Realities
State diagram is for (ideal) protocol assuming instantaneous actions
In reality controller implements more complex diagrams
A protocol state transition may be started by controller when bus activity changes local state
Example: an upgrade pending (for bus) when an invalidate for same line arrives

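Partial Implementation State Table

Action and next state, by current state:

Current state I
  Processor Read: Request Bus -> IR
  Processor Write: Request Bus -> IW
  Cache Read / Cache Read&M / Cache Upgrade (from bus): No Action -> I

Current state S
  Processor Read: No Action -> S
  Processor Write: Request Bus -> SW
  Cache Read (from bus): Respond shared -> S
  Cache Read&M / Cache Upgrade (from bus): No Action -> I

Current state E
  Processor Read: No Action -> E
  Processor Write: No Action -> M
  Cache Read (from bus): Respond shared -> S
  Cache Read&M (from bus): No Action -> I

Current state M
  Processor Read / Processor Write: No Action -> M
  Cache Read (from bus): Respond dirty; write back data -> S
  Cache Read&M (from bus): Respond dirty; write back data -> I

Current state IR
  Bus Grant: Cache Read -> IRR

Current state IW
  Bus Grant: Cache Read&M -> IWR

Current state IRR
  Bus Response: load line -> E if no sharers, S if sharers

Current state IWR
  Bus Response: load line -> M

Current state SW
  Bus Grant: Cache Upgrade -> M
  Cache Read (from bus): Respond shared
  Cache Read&M / Cache Upgrade (from bus): No Action -> IW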

Further Optimizations
Observation: Shared blocks should only be fetched from memory once
If I find a shared block on chip, forward the block
Problem: multiple shared blocks possible, who forwards?
Everyone? Power/bandwidth wasted
Single forwarder, but who?
Last one to receive block: F state
I->F for requestor, F->S for forwarder

What if F block is evicted?
Favor F blocks in replacement?
Don't allow silent eviction (force some other node to be F)
Fall back on memory copy if can't find F copy

Very old idea (IBM machines have done this for a long time), but recent Intel patent issued anyway [Hum/Goodman]

Further Optimizations
Observation: migratory data often flies by

Add T (transition) state to protocol
Tag is still valid, data isn't
Data can be snarfed as it flies by
Only works with certain kinds of interconnect networks
Replacement policy issues

Many other optimizations are possible
Literature extends 25 years back
Many unpublished (but implemented) techniques as well

Implementing Cache Coherence

Snooping implementation
Origins in shared-memory-bus systems
All CPUs could observe all other CPUs' requests on the bus; hence "snooping"
Bus Read, Bus Write, Bus Upgrade

React appropriately to snooped commands
Invalidate shared copies
Provide up-to-date copies of dirty lines
Flush (writeback) to memory, or
Direct intervention (modified intervention or dirty miss)

Snooping suffers from:
Scalability: shared busses not practical
Ordering of requests without a shared bus
Lots of recent and on-going work on scaling snoop-based systems

Snooping Cache Coherence

Basic idea: broadcast snoop to all caches to find owner
Not scalable?
Address traffic roughly proportional to square of number of processors
Current implementations scale to 64/128-way (Sun/IBM) with multiple address-interleaved broadcast networks

Inbound snoop bandwidth: big problem

OutboundSnoopRate: $s_o = \mathrm{CacheMissRate} + \mathrm{BusUpgradeRate}$
InboundSnoopRate: $s_i = n \times s_o$
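A short calculation shows why inbound snoop bandwidth grows so quickly. This sketch assumes that every cache sees the outbound snoops of all n processors, so the inbound rate is n times the outbound rate; the function name `snoop_rates` and the example numbers are illustrative, not taken from the notes.

```python
def snoop_rates(n, cache_miss_rate, bus_upgrade_rate):
    """Return (outbound, inbound, system_total) snoop rates.

    Rates are in events per unit of work (e.g. per 1000 instructions).
    Assumes each cache observes every processor's outbound snoops.
    """
    s_o = cache_miss_rate + bus_upgrade_rate   # outbound rate per processor
    s_i = n * s_o                              # inbound rate per cache
    total = n * s_i                            # tag lookups across all caches
    return s_o, s_i, total


# Hypothetical rates: 20 misses and 5 upgrades per 1000 instructions.
for n in (4, 8, 16):
    print(n, snoop_rates(n, 20, 5))
```

Doubling the processor count doubles the inbound rate at each cache and quadruples the system-wide lookup count, which matches the remark that address traffic is roughly proportional to the square of the number of processors.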

Snoop Bandwidth

Snoop filtering of various kinds is possible

Filter snoops at sink: Jetty filter [Moshovos et al., HPCA 2001]
Check small filter cache that summarizes contents of local cache
Avoid power-hungry lookups in each tag array

Filter snoops at source: Multicast snooping [Bilir et al., ISCA 1999]
Predict likely sharing set, snoop only predicted sharers
Double-check at directory to make sure

Filter snoops at source: Region coherence
Concurrent work: [Cantin/Smith/Lipasti, ISCA 2005; Moshovos, ISCA 2005]
Check larger region of memory on every snoop; remember when no sharers
Snoop only on first reference to region, or when region is shared
Eliminate 60%+ of all snoops

Snoop Latency

Snoop latency:
Must reach all nodes, return and combine responses
Probably the bigger scalability issue in the future
Topology matters: ring, mesh, torus, hypercube
No obvious solutions

Parallelism: fundamental advantage of snooping
Broadcast exposes parallelism, enables speculative latency reduction

[Figure: snoop latency timelines built from the stages LDir, XSnp, RDir, XRsp, CRsp, XRd, RDat, XDat, UDat, with the data stages (RDat, XDat, UDat) overlapped at several different points.]

Scalable Cache Coherence

Eschew physical bus but still snoop
Point-to-point tree structure (indirect) or ring
Root of tree or ring provides ordering point
Use some scalable network for data (ordering less important)

Or, use level of indirection through directory
Directory at memory remembers:
Which processor is "single writer"
Which processors are "shared readers"

Level of indirection has a price
Dirty misses require 3 hops instead of two
Snoop: Requestor->Owner->Requestor
Directory: Requestor->Directory->Owner->Requestor

Implementing Cache Coherence

Directory implementation
Extra bits stored in memory (directory) record state of line
Memory controller maintains coherence based on the current state
Other CPUs' commands are not snooped, instead:
Directory forwards relevant commands
Powerful filtering effect: only observe commands that you need to observe
Meanwhile, bandwidth at directory scales by adding memory controllers as you increase size of the system
Leads to very scalable designs (100s to 1000s of CPUs)

Directory shortcomings
Indirection through directory has latency penalty
If shared line is dirty in other CPU's cache, directory must forward request, adding latency
This can severely impact performance of applications with heavy sharing (e.g. relational databases)

Directory Protocol Implementation

Basic idea: Centralized directory keeps track of data location(s)

Scalable
Address traffic roughly proportional to number of processors
Directory & traffic can be distributed with memory banks (interleaved)
Directory cost (SRAM) or latency (DRAM) can be prohibitive

Presence bits track sharers
Full map (N processors, N bits): cost/scalability
Limited map (limits number of sharers)
Coarse map (identifies board/node/cluster; must use broadcast)

Vectors track sharers
Point to shared copies
Fixed number, linked lists (SCI), caches chained together
Latency vs. cost vs. scalability

Directory Protocol Latency

[Figure: directory access timeline with the stages LDir, XSnp, RDir, XRd, RDat, XDat, UDat.]

Access to non-shared data
Overlap directory read with data read
Best possible latency given NUMA arrangement

Access to shared data
Dirty miss, modified intervention
No inherent parallelism
Indirection adds latency
Minimum 3 hops, often 4 hops

Shared intervention?
If DRAM directory, no gain
If directory cache, possible gain (use F state)

Directory-based Cache Coherence

An alternative for large, scalable MPs
Can be based on any of the protocols discussed thus far
We will use MSI

Memory Controller becomes an active participant
Sharing info held in memory directory
Directory may be distributed
Use point-to-point messages
Network is not totally ordered

[Figure: several processors, each with its own cache, and a memory module that holds the directory.]

Example: Simple Directory Protocol

Local cache controller states
M, S, I as before

Local directory states
Shared: <1,X,X,...,1>; one or more proc. has copy; memory is up-to-date
Modified: <0,1,0,...,0>; one processor has copy; memory does not have a valid copy
Uncached: <0,0,0,...,1>; none of the processors has a valid copy

Directory also keeps track of sharers
Can keep global state vector in full
e.g. via a bit vector

Example
Local cache suffers load miss
Line in remote cache in M state
It is the owner

[Figure: message flow across the interconnect among the local controller, the owner controller, another remote controller, and the memory controller with its memory banks. Messages shown: processor read, cache read, memory read, owner data response, memory data response.]

Cache Controller State Diagram

Intermediate states (I', I'', S') for clarity

Cache controller actions and next states, by current state:

Current state I
  Processor Read: Cache Read -> I'
  Processor Write: Cache Read&M -> I''
  Commands from memory side: No Action -> I

Current state S
  Processor Read: No Action -> S
  Processor Write: Cache Upgrade -> S'
  Eviction: No Action* -> I
  Memory Invalidate / Memory Upgrade: Invalidate Frame; Cache ACK -> I

Current state M
  Processor Read: No Action -> M
  Processor Write: No Action -> M
  Eviction: Cache Write-back -> I
  Memory Read: Owner Data -> S
  Memory Read&M: Owner Data -> I

Current state I'
  Memory Data: Fill Cache -> S

Current state I''
  Memory Data: Fill Cache -> M

Current state S'
  Response from memory: No Action -> M

Memory Controller State Diagram

Memory controller actions and next states, by current directory state
(commands come from the local cache controller; responses come from a remote cache controller):

Current directory state U
  Cache Read: Memory Data; add requestor to sharers -> S
  Cache Read&M: Memory Data; add requestor to sharers -> M

Current directory state S
  Cache Read: Memory Data; add requestor to sharers -> S
  Cache Read&M: Memory Invalidate all sharers -> M'
  Cache Upgrade: Memory Upgrade all sharers -> M''

Current directory state M
  Cache Read: Memory Read from owner -> S'
  Cache Read&M: Memory Read&M to owner -> M'
  Data Write-back: make sharers empty -> U

Current directory state S'
  Owner Data: Memory Data to requestor; write memory; add requestor to sharers -> S

Current directory state M'
  Cache ACK: when all ACKs received, Memory Data -> M
  Owner Data: Memory Data to requestor -> M

Current directory state M''
  Cache ACK: when all ACKs received -> M

Another Example
Local write (miss) to shared line
Requires invalidations and acks

[Figure: message flow across the interconnect among the local controller, two remote controllers, and the home memory controller (other memory controllers and memory banks also shown). Messages shown: processor write, cache Read&M, memory invalidate, cache ack, memory data response.]

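Example Sequence
Similar to earlier sequences

Columns: Thread Event | Controller Actions | Data From | Global State | Local States (C0 C1 C2)

0. Initially: global state <0,0,0,1>
1. T0 read: controller actions CR,MD; data from Memory; global state <1,0,0,1>
2. T0 write: controller actions CU,MU*,MD; global state <1,0,0,0>
3. T2 read: controller actions CR,MR,MD; global state <1,0,1,1>
4. T1 write: controller actions CRM,MI,CA,MD; data from Memory; global state <0,1,0,0>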
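A full-map directory entry for this kind of MSI protocol can be sketched with a sharer bit vector. This is a simplified illustration, not the protocol from the controller tables: the class `DirectoryEntry` is a hypothetical name, transient states (S', M', M'') are collapsed into single calls, and the message lists are a plausible ordering rather than the exact controller action strings.

```python
class DirectoryEntry:
    """Directory state for one line: U (uncached), S (shared), M (modified)."""

    def __init__(self, n_procs):
        self.n_procs = n_procs
        self.state = "U"
        self.sharers = 0              # bit i set => processor i holds a copy

    def read(self, p):
        """Handle a read miss from processor p; return the messages exchanged."""
        msgs = ["Cache Read"]
        if self.state == "M":
            # Fetch the line from its owner; memory is written with the data.
            msgs += ["Memory Read", "Owner Data"]
        msgs.append("Memory Data")
        self.sharers |= 1 << p
        self.state = "S"
        return msgs

    def write(self, p):
        """Handle a write from processor p; return the messages exchanged."""
        bit = 1 << p
        if self.state == "M" and self.sharers == bit:
            return []                 # p is already the single writer
        upgrade = self.state == "S" and bool(self.sharers & bit)
        msgs = ["Cache Upgrade" if upgrade else "Cache Read&M"]
        if self.state == "M":
            msgs += ["Memory Read&M", "Owner Data"]
        else:
            for q in range(self.n_procs):
                if q != p and (self.sharers >> q) & 1:
                    msgs += ["Memory Invalidate", "Cache ACK"]
        if not upgrade:
            msgs.append("Memory Data")
        self.sharers = bit            # requestor becomes the only holder
        self.state = "M"
        return msgs

    def global_state(self):
        caches = tuple((self.sharers >> q) & 1 for q in range(self.n_procs))
        return caches + (int(self.state != "M"),)


entry = DirectoryEntry(3)
history = [entry.global_state()]
for p, op in [(0, "read"), (0, "write"), (2, "read"), (1, "write")]:
    msgs = entry.read(p) if op == "read" else entry.write(p)
    history.append(entry.global_state())
    print(f"T{p} {op}: {msgs} -> {entry.global_state()}")
```

The final write from T1 produces one invalidate/ack pair per existing sharer before the data is returned, which is the cost the directory pays for not broadcasting.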

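Variation: Three Hop Protocol

Have owner send data directly to local controller
Owner Acks to Memory Controller in parallel

[Figure: two message diagrams among the local controller, memory controller, and owner controller. a) Base protocol: cache read to the memory controller, memory read to the owner, owner data back to the memory controller, memory data to the local controller. Three-hop variation: cache read, memory read, then owner data sent directly to the local controller while an owner ack goes to the memory controller.]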

Directory Protocol Optimizations

Remove dead blocks from cache:
Eliminate 3- or 4-hop latency
Dynamic Self-Invalidation [Lebeck/Wood, ISCA 1995]
Last touch prediction [Lai/Falsafi, ISCA 2000]
Dead block prediction [Lai/Fide/Falsafi, ISCA 2001]

Predict sharers
Prediction in coherence protocols [Mukherjee/Hill, ISCA 1998]
Instruction-based prediction [Kaxiras/Goodman, ISCA 1999]
Sharing prediction [Lai/Falsafi, ISCA 1999]

Hybrid snooping/directory protocols
Improve latency by snooping, conserve bandwidth with directory
Multicast snooping [Bilir et al., ISCA 1999; Martin et al., ISCA 2003]
Bandwidth-adaptive hybrid [Martin et al., HPCA 2002]
Token Coherence [Martin et al., ISCA 2003]
VCT Multicasting [Enright Jerger thesis 2008]

Protocol Races

Atomic bus
Only stable states in protocol (e.g. M, S, I)
All state transitions are atomic (I->M)
No conflicting requests can interfere since bus is held till transaction completes
Distinguish coherence transaction from data transfer
Data transfer can still occur much later; easier to handle this case

Atomic buses don't scale
At minimum, separate bus request/response

Large systems have broadly variable delays
Req/resp separated by dozens of cycles
Conflicting requests can and do get issued
Messages may get reordered in the interconnect

How do we resolve them?

Resolving Protocol Races

Req/resp decoupling introduces transient states
E.g. I->S is now I->ItoX->ItoS_nodata->S

Conflicting requests to blocks in transient states
NAK ugly; livelock, starvation potential
Keep adding more transient states

Directory protocol makes this a bit easier
Can order at directory, which has full state info
Even so, messages may get reordered

Common Protocol Races

Read strings: P0 read, P1 read, P2 read
Easy, since read is nondestructive
Can rely on F state to reduce DRAM accesses
Forward reads to previous requestor (F)

Write strings: P0 write, P1 write, P2 write
Forward P1 write req to P0 (M)
P0 completes write then forwards M block to P1
Build string of writes (write string forwarding)

Read after write (similar to prev. WAW)

Writeback race: P0 evicts dirty block, P1 reads
Dirty block is in the network (no copy at P0 or at dir)
NAK P1, or force P0 to keep copy till dir ACKs WB

Many others crop up, esp. with optimizations

Summary

Coherence States
Snoopy bus-based Invalidate Protocols
Invalidate protocol optimizations
Update Protocols (Dragon/Firefly)
Directory protocols
Implementation issues
