Lecture 19: Memory Cache
Lotzi Bölöni
EEL 5708
Acknowledgements
• All the lecture slides were adapted from the slides of David Patterson (1998, 2001) and David E. Culler (2001), Copyright 1998-2002, University of California, Berkeley
Question: Who Cares About
the Memory Hierarchy?
[Chart: processor vs. DRAM performance, 1980-2000, log scale from 1 to 1000. CPU performance ("Moore's Law") grows at roughly 60%/year, DRAM at roughly 7%/year; the resulting processor-memory performance gap grows about 50% per year.]
• 1980: no cache in µproc; 1995: 2-level cache on chip
(1989: first Intel µproc with a cache on chip)
Generations of
Microprocessors
• Time of a full cache miss in instructions executed:
– 1st Alpha: 340 ns / 5.0 ns = 68 clks × 2 instr/clk, or 136 instructions
– 2nd Alpha: 266 ns / 3.3 ns = 80 clks × 4 instr/clk, or 320 instructions
– 3rd Alpha: 180 ns / 1.7 ns = 108 clks × 6 instr/clk, or 648 instructions
• 1/2X latency × 3X clock rate × 3X instr/clock ⇒ ≈5X more instructions lost per miss (0.5 × 3 × 3 = 4.5)
Processor-Memory
Performance Gap “Tax”
Fraction of each chip devoted to caches:

Processor          % Area (cost)   % Transistors (power)
Alpha 21164             37%               77%
StrongArm SA110         61%               94%
Pentium Pro             64%               88%
  (2 dies per package: Proc/I$/D$ + L2$)

• Caches have no inherent value; they only try to close the performance gap
What is a cache?
• Small, fast storage used to improve average
access time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
– Registers: a cache on variables
– First-level cache: a cache on the second-level cache
– Second-level cache: a cache on memory
– Memory: a cache on disk (virtual memory)
– TLB: a cache on the page table
– Branch prediction: a cache on prediction information?
[Figure: the memory hierarchy — Proc/Regs at the top, then L1 cache, L2 cache, and memory; levels get bigger going down and faster going up. Inset: a 1 KB direct-mapped cache with 32-byte blocks, e.g. tag 0x50 selecting bytes 32-63, with bytes 992-1023 in block 31.]
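To make the tag/index/offset split in that figure concrete, here is a minimal direct-mapped lookup sketch in C. The 1 KB size and 32-byte blocks mirror the figure; the structure and names are otherwise assumptions, not anything from the original slides.

#include <stdint.h>
#include <stdbool.h>

#define BLOCK_SIZE 32                     /* bytes per block, as in the figure */
#define NUM_BLOCKS (1024 / BLOCK_SIZE)    /* 1 KB cache => 32 blocks */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} CacheLine;

static CacheLine cache[NUM_BLOCKS];

/* The address splits into offset (within the block), index (which line),
 * and tag (everything above); a hit needs a valid line with a matching tag. */
bool dm_lookup(uint32_t addr)
{
    uint32_t index = (addr / BLOCK_SIZE) % NUM_BLOCKS;
    uint32_t tag   = addr / (BLOCK_SIZE * NUM_BLOCKS);
    return cache[index].valid && cache[index].tag == tag;
}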
Set Associative
Cache
• N-way set associative: N entries for each Cache Index
– N direct-mapped caches operate in parallel
• Example: Two-way set associative cache
– Cache Index selects a “set” from the cache
– The two tags in the set are compared to the input in
parallel
– Data is selected based on the tag result
[Figure: two-way set-associative cache — the Cache Index selects one block from each of the two ways; both tags are compared against the address tag in parallel, a mux (Sel1/Sel0) picks the data from the way that matched, and the two compare results are ORed to form the Hit signal.]
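The same lookup for the two-way organization above, as a C sketch (set count and names are assumptions; in hardware the two tag compares happen in parallel, which the loop below only emulates):

#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS   64
#define WAYS        2
#define BLOCK_SIZE 32

typedef struct {
    bool     valid;
    uint32_t tag;
} TagEntry;

static TagEntry tags[NUM_SETS][WAYS];

/* Returns the way that hit (the data mux select), or -1 on a miss. */
int sa_lookup(uint32_t addr)
{
    uint32_t set = (addr / BLOCK_SIZE) % NUM_SETS;
    uint32_t tag = addr / (BLOCK_SIZE * NUM_SETS);
    for (int way = 0; way < WAYS; way++)
        if (tags[set][way].valid && tags[set][way].tag == tag)
            return way;        /* the OR of the per-way compares = Hit */
    return -1;
}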
Disadvantage of Set Associative
Cache
• N-way Set Associative Cache versus Direct
Mapped Cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss decision and set selection
• In a direct mapped cache, the Cache Block is available BEFORE the Hit/Miss decision:
– Possible to assume a hit and continue; recover later if it was a miss.
[Figure: the same two-way set-associative datapath as on the previous slide.]
Review: Cache
performance
• Miss-oriented approach to memory access:

CPUtime = IC × (CPI_Execution + MemAccess/Inst × MissRate × MissPenalty) × CycleTime

CPUtime = IC × (CPI_Execution + MemMisses/Inst × MissPenalty) × CycleTime
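Both forms plug in directly; a minimal numeric check in C (all the input values below are illustrative assumptions, not numbers from the slides):

#include <stdio.h>

/* CPUtime = IC * (CPI_Execution + MemAccess/Inst * MissRate * MissPenalty) * CycleTime */
static double cpu_time(double ic, double cpi_exec, double acc_per_inst,
                       double miss_rate, double miss_penalty, double cycle_time)
{
    return ic * (cpi_exec + acc_per_inst * miss_rate * miss_penalty) * cycle_time;
}

int main(void)
{
    /* Assumed: 1e9 instructions, base CPI 1.0, 1.3 accesses/instruction,
     * 2% miss rate, 100-cycle miss penalty, 1 ns cycle time. */
    double t = cpu_time(1e9, 1.0, 1.3, 0.02, 100.0, 1e-9);
    printf("CPU time = %.2f s\n", t);   /* 3.60 s: memory stalls add 2.6 CPI to the base 1.0 */
    return 0;
}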
[Figure: two cache organizations — a unified hierarchy (Proc → Unified Cache-1 → Unified Cache-2) vs. split first-level caches (Proc → I-Cache-1 and D-Cache-1 → Unified Cache-2).]
Review: Four Questions
for Memory Hierarchy
Designers
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)
Review: Improving Cache
Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
Reducing Misses
• Classifying Misses: 3 Cs
– Compulsory—The first access to a block is not in the cache,
so the block must be brought into the cache. Also called cold
start misses or first reference misses.
(Misses in even an Infinite Cache)
– Capacity—If the cache cannot contain all the blocks needed
during execution of a program, capacity misses will occur due
to blocks being discarded and later retrieved.
(Misses in Fully Associative Size X Cache)
– Conflict—If block-placement strategy is set associative or
direct mapped, conflict misses (in addition to compulsory &
capacity misses) will occur because a block can be discarded
and later retrieved if too many blocks map to its set. Also
called collision misses or interference misses.
(Misses in N-way Associative, Size X Cache)
• A more recent, 4th "C":
– Coherence: misses caused by cache-coherence invalidations in multiprocessors.
3Cs Absolute Miss Rate
(SPEC92)
[Chart: miss rate per type (0-0.14) vs. cache size (1-128 KB) for 1-, 2-, 4-, and 8-way associativity, with the miss rate broken into the 3 Cs. Conflict misses shrink with associativity, capacity misses dominate at small sizes, and compulsory misses are vanishingly small.]
2:1 Cache Rule
miss rate 1-way associative cache of size X
= miss rate 2-way associative cache of size X/2

[Chart: the same 3Cs miss-rate-vs-cache-size breakdown (1-128 KB), illustrating the rule.]
3Cs Relative Miss Rate
[Chart: the same data normalized — each point shows the percentage (0%-100%) of misses that are conflict, capacity, or compulsory for 1-, 2-, 4-, and 8-way caches of 1-128 KB.]
• Flaws: holds block size fixed
• Good: insight => invention
1. Reduce Misses via
Larger Block Size
[Chart: miss rate (0%-25%) vs. block size (16-256 bytes) for cache sizes from 1K to 256K; larger blocks help until they occupy too large a fraction of a small cache.]
2. Reduce Misses via
Higher Associativity
• 2:1 Cache Rule:
– Miss rate of a DM cache of size N = miss rate of a 2-way cache of size N/2
• Beware: execution time is the only final measure!
– Will clock cycle time increase?
– Hill [1988] suggested hit time for 2-way vs. 1-way: +10% for an external cache, +2% for an internal cache
Example: Avg. Memory
Access Time vs. Miss Rate
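The worked example for this slide reduces to AMAT = Hit time + Miss rate × Miss penalty; a minimal sketch with assumed numbers (the hit-time stretch follows Hill [1988] above; the miss rates and penalty are illustrative):

#include <stdio.h>

static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;   /* in clock cycles */
}

int main(void)
{
    /* Assumed: 2-way stretches hit time by 10% but lowers the miss rate. */
    printf("1-way: %.2f cycles\n", amat(1.00, 0.05, 50.0));  /* 3.50 */
    printf("2-way: %.2f cycles\n", amat(1.10, 0.04, 50.0));  /* 3.10 */
    return 0;
}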
3. Reducing Misses via a
“Victim Cache”
• How to combine the fast hit time of direct mapped yet still avoid conflict misses?
• Add a small buffer to hold data discarded from the cache
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
• Used in Alpha, HP machines
[Figure: a fully associative victim cache of four entries (each a tag + comparator + one cache line of data) sitting between the cache and the next lower level in the hierarchy.]
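A rough C model of the victim-cache lookup path (the 4-entry size follows Jouppi's figure above; the helper functions are stubs standing in for the real cache):

#include <stdint.h>
#include <stdbool.h>

#define VICTIM_ENTRIES 4

typedef struct { bool valid; uint32_t block_addr; } Victim;
static Victim victims[VICTIM_ENTRIES];

/* Stubs standing in for the direct-mapped cache and the swap logic. */
static bool main_cache_lookup(uint32_t b) { (void)b; return false; }
static void swap_with_victim(uint32_t b, int slot) { (void)b; victims[slot].valid = false; }

/* On a main-cache miss, all victim entries are probed (in parallel, in HW);
 * a victim hit swaps the block back into the main cache, turning a
 * conflict miss into a slightly slower hit. */
bool access_block(uint32_t block_addr)
{
    if (main_cache_lookup(block_addr))
        return true;                                /* fast direct-mapped hit */
    for (int i = 0; i < VICTIM_ENTRIES; i++)
        if (victims[i].valid && victims[i].block_addr == block_addr) {
            swap_with_victim(block_addr, i);
            return true;                            /* hit in the victim cache */
        }
    return false;                                   /* true miss: go to next level */
}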
4. Reducing Misses via
“Pseudo-Associativity”
• How to combine the fast hit time of Direct Mapped with the lower conflict misses of a 2-way SA cache?
• Divide the cache: on a miss, check the other half of the cache to see if the block is there; if so, we have a pseudo-hit (slow hit)
[Figure: access time split into a fast "hit time" for the first probe and a longer "pseudo hit time" for the second probe, both well below the miss penalty.]
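One common way to pick the "other half" is to flip the most significant index bit; a sketch in C (the parameters and the probe helper are assumptions):

#include <stdint.h>
#include <stdbool.h>

#define BLOCK_SIZE 32
#define NUM_SETS   64                 /* a power of two, so ^ flips the bit cleanly */

/* Stub standing in for a direct-mapped probe of one set. */
static bool probe_set(uint32_t set, uint32_t tag) { (void)set; (void)tag; return false; }

/* 0 = fast hit, 1 = pseudo-hit (slow hit in the other half), -1 = miss. */
int pa_lookup(uint32_t addr)
{
    uint32_t set = (addr / BLOCK_SIZE) % NUM_SETS;
    uint32_t tag = addr / (BLOCK_SIZE * NUM_SETS);
    if (probe_set(set, tag))
        return 0;
    if (probe_set(set ^ (NUM_SETS >> 1), tag))   /* flip the MSB of the index */
        return 1;
    return -1;
}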
5. Reducing Misses by
Hardware Prefetching of
Instructions & Data
• E.g., Instruction Prefetching
– Alpha 21064 fetches 2 blocks on a miss
– Extra block placed in “stream buffer”
– On miss check stream buffer
• Works with data blocks too:
– Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4KB cache; 4 streams caught 43%
– Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set-associative caches
• Prefetching relies on having extra memory bandwidth that can be used without penalty
6. Reducing Misses by
Software Prefetching Data
• Data prefetch
– Load data into register (HP PA-RISC loads)
– Cache prefetch: load into cache (MIPS IV, PowerPC, SPARC v9)
– Special prefetching instructions cannot cause faults; a form of speculative execution
• Prefetching comes in two flavors:
– Binding prefetch: requests load directly into a register.
» Must be the correct address and register!
– Non-binding prefetch: load into cache.
» Can be incorrect. Frees HW/SW to guess!
• Issuing prefetch instructions takes time
– Is the cost of issuing prefetches < the savings in reduced misses? (See the sketch below.)
– Wider superscalar issue reduces the difficulty of finding issue bandwidth
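As a concrete non-binding example, GCC and Clang expose a prefetch hint intrinsic; a sketch (the prefetch distance of 16 elements is an assumption, and the profitable distance is machine-dependent):

#include <stddef.h>

/* __builtin_prefetch(addr, rw, locality) is only a hint: it cannot fault,
 * so a bad guess wastes bandwidth but never breaks the program. */
double sum_with_prefetch(const double *a, size_t n)
{
    const size_t dist = 16;    /* assumed prefetch distance, in elements */
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0 /* read */, 1 /* low reuse */);
        s += a[i];
    }
    return s;
}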
7. Reducing Misses by
Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, in software
• Instructions
– Reorder procedures in memory so as to reduce conflict misses
– Profiling to look at conflicts (using tools they developed)
• Data
– Merging Arrays: improve spatial locality by single array of
compound elements vs. 2 arrays
– Loop Interchange: change nesting of loops to access data in order
stored in memory
– Loop Fusion: Combine 2 independent loops that have same looping
and some variables overlap
– Blocking: Improve temporal locality by accessing “blocks” of data
repeatedly vs. going down whole columns or rows
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];
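The "After" half of the example follows the pattern described above (a single array of compound elements); a sketch in the same style, with illustrative names:

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

/* val[i] and key[i] now sit next to each other in memory,
 * so one cache block serves both: better spatial locality. */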
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  { a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j]; }

The fused loop reuses a[i][j] and c[i][j] while they are still in the cache: one miss per access instead of two.
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  { r = 0;
    for (k = 0; k < N; k = k+1) {
      r = r + y[i][k]*z[k][j];
    };
    x[i][j] = r;
  };
• Two Inner Loops:
– Read all NxN elements of z[]
– Read N elements of 1 row of y[] repeatedly
– Write N elements of 1 row of x[]
• Capacity misses are a function of N & cache size:
– 2N^3 + N^2 words accessed => (assuming no conflict; otherwise …)
• Idea: compute on a BxB submatrix that fits
Blocking Example
/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B,N); j = j+1)
      { r = 0;
        for (k = kk; k < min(kk+B,N); k = k+1) {
          r = r + y[i][k]*z[k][j];
        };
        /* x[][] must start zeroed: it now accumulates across the kk blocks */
        x[i][j] = x[i][j] + r;
      };

• Capacity misses fall from 2N^3 + N^2 to roughly 2N^3/B + N^2; B is called the blocking factor.
Reducing Conflict Misses by
Blocking
[Chart: miss rate (0-0.1) vs. blocking factor (0-150), with the corresponding performance improvement running from about 1x to 3x.]
Write Policy:
Write-Through vs Write-
Back
• Write-through: all writes update the cache and the underlying memory/cache
– Can always discard cached data; the most up-to-date copy is in memory
– Cache control bit: only a valid bit
• Write-back: all writes simply update the cache
– Can't just discard cached data; may have to write it back to memory
– Cache control bits: both valid and dirty bits
• Other advantages:
– Write-through:
» Memory (or other processors) always has the latest data
» Simpler management of the cache
– Write-back:
» Much lower bandwidth, since data is often overwritten multiple times
» Better tolerance of long-latency memory?
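A minimal C model of the per-line state each policy needs (the memory-side helper is a stub; everything here is a sketch, not a full cache):

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     valid;
    bool     dirty;       /* needed only by write-back */
    uint32_t tag;
} Line;

static void write_to_memory(uint32_t addr) { (void)addr; }   /* stub */

/* Write-through: every write also goes to memory; no dirty bit needed. */
void write_hit_through(Line *l, uint32_t addr)
{
    (void)l;                  /* cached copy updated here ... */
    write_to_memory(addr);    /* ... and memory stays up to date */
}

/* Write-back: a write hit only marks the line dirty. */
void write_hit_back(Line *l)
{
    l->dirty = true;          /* memory is stale until eviction */
}

/* The write-back cost is paid once per eviction, no matter how many
 * times the line was overwritten in between. */
void evict(Line *l, uint32_t addr)
{
    if (l->valid && l->dirty)
        write_to_memory(addr);
    l->valid = l->dirty = false;
}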
Write Policy 2:
Write Allocate vs Non-Allocate
(What happens on a write miss)
• Write allocate: the block is fetched into the cache on a write miss, then written (usually paired with write-back)
• No-write allocate (write-around): the write goes only to the lower level; the block is not brought into the cache (usually paired with write-through)
1. Reducing Miss Penalty:
Read Priority over Write on Miss
[Figure: the CPU's reads and writes go through the cache; a write buffer sits between the cache and DRAM (or the lower memory level), absorbing writes so reads need not wait behind them.]
1. Reducing Miss Penalty:
Read Priority over Write on
Miss
• Write-through with write buffers can create RAW conflicts between main-memory reads on cache misses and buffered writes
– Simply waiting for the write buffer to empty might increase the read miss penalty (by 50% on the old MIPS 1000)
– Check the write buffer contents before the read; if there are no conflicts, let the memory access continue
• Write-back also wants a buffer, to hold displaced blocks
– Read miss replacing a dirty block
– Normal: write the dirty block to memory, and then do the read
– Instead: copy the dirty block to a write buffer, then do the read, and then do the write
– The CPU stalls less since it restarts as soon as the read is done (a sketch of the buffer check follows)
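A minimal model of the "check write buffer before read" step (the entry count and structure are assumptions):

#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4

typedef struct { bool valid; uint32_t block_addr; } WBEntry;
static WBEntry write_buffer[WB_ENTRIES];

/* Before a read miss goes to memory, scan the buffer. A match means
 * memory is stale for that block (a RAW hazard), so the read must
 * either be served from the buffer or wait for that entry to drain. */
bool read_must_wait(uint32_t block_addr)
{
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].block_addr == block_addr)
            return true;
    return false;    /* no conflict: the read may bypass all pending writes */
}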
2. Reduce Miss Penalty:
Early Restart and Critical
Word First
• Don’t wait for full block to be loaded before
restarting CPU
– Early restart—As soon as the requested word of the block
arrives, send it to the CPU and let the CPU continue
execution
– Critical Word First—Request the missed word first from
memory and send it to the CPU as soon as it arrives; let the
CPU continue execution while filling the rest of the words in
the block. Also called wrapped fetch and requested word
first
• Generally useful only with large blocks
• Spatial locality is a problem: the program tends to want the next sequential word, so it is not clear there is a benefit from early restart
3. Reduce Miss Penalty: Non-
blocking Caches to reduce stalls
on misses
• Non-blocking cache or lockup-free cache allow
data cache to continue to supply cache hits during
a miss
– requires F/E bits on registers or out-of-order execution
– requires multi-bank memories
• “hit under miss” reduces the effective miss
penalty by working during miss vs. ignoring CPU
requests
• “hit under multiple miss” or “miss under miss”
may further lower the effective miss penalty by
overlapping multiple misses
– Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
– Requires multiple memory banks (otherwise it cannot be supported)
– The Pentium Pro allows 4 outstanding memory misses (a sketch of the bookkeeping follows)
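The bookkeeping behind "miss under miss" is a set of miss status holding registers (MSHRs); a minimal sketch (the 4-entry count echoes the Pentium Pro figure above; the rest is assumed):

#include <stdint.h>
#include <stdbool.h>

#define MSHRS 4

typedef struct {
    bool     busy;          /* an outstanding miss occupies this register */
    uint32_t block_addr;    /* which block is in flight */
} MSHR;

static MSHR mshr[MSHRS];

/* Returns 0 if the miss merges with one already in flight,
 * 1 if a new outstanding miss was accepted,
 * -1 if all MSHRs are busy and the cache must finally stall. */
int note_miss(uint32_t block_addr)
{
    int free_slot = -1;
    for (int i = 0; i < MSHRS; i++) {
        if (mshr[i].busy && mshr[i].block_addr == block_addr)
            return 0;
        if (!mshr[i].busy)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;
    mshr[free_slot].busy = true;
    mshr[free_slot].block_addr = block_addr;
    return 1;
}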
Value of Hit Under Miss for SPEC
[Chart: average memory access time, normalized to the blocking base (roughly 0.2-1.8), for each SPEC92 benchmark (eqntott, espresso, xlisp, compress, mdljsp2, ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora) under "hit under i misses" for i going 0→1, 1→2, and 2→64.]
Reducing Miss Penalty: Second-Level Cache
• Definitions:
– Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate(L2))
– Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate(L1) × Miss Rate(L2))
– The global miss rate is what matters
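A quick numeric illustration of the local/global distinction (both rates below are assumptions):

#include <stdio.h>

int main(void)
{
    double mr_l1 = 0.04;   /* assumed: L1 misses on 4% of CPU accesses */
    double mr_l2 = 0.50;   /* assumed: L2 misses on 50% of the accesses it sees */

    /* Global L2 miss rate = Miss Rate(L1) x Miss Rate(L2). */
    printf("L2 local miss rate:  %.0f%%\n", mr_l2 * 100);          /* 50% */
    printf("L2 global miss rate: %.0f%%\n", mr_l1 * mr_l2 * 100);  /*  2% */
    return 0;
}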
Comparing Local and
Global Miss Rates
• 32 KByte 1st-level cache; increasing 2nd-level cache size
• Global miss rate is close to the single-level cache rate, provided L2 >> L1
• Don't use the local miss rate
• L2 is not tied to the CPU clock cycle!
• Target: cost & A.M.A.T.
• Generally want fast hit times and fewer misses
• Since hits in L2 are few, target miss reduction
[Figure: two plots of miss rate vs. cache size, one on a linear scale and one on a log scale, comparing local and global miss rates.]
Reducing Misses:
Which apply to L2 Cache?
• The same miss-rate techniques as for L1: larger block size, higher associativity, victim cache, pseudo-associativity, HW and SW prefetching, compiler optimizations
L2 cache block size &
A.M.A.T.
[Chart: relative CPU time vs. L2 block size (16-512 bytes): 1.95 at 16-byte blocks, falling through 1.54, 1.36, and 1.28 to a minimum of about 1.27 in the 128-256 byte range, then rising to 1.34 at 512-byte blocks.]
• 32KB L1, 8-byte path to memory
Reducing Miss Penalty
Summary
CPUtime = IC × (CPI_Execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time
• Four techniques
– Read priority over write on miss
– Early Restart and Critical Word First on miss
– Non-blocking Caches (Hit under Miss, Miss under
Miss)
– Second Level Cache
• Can be applied recursively to Multilevel
Caches
– Danger is that time to DRAM will grow with multiple
levels in between
– First attempts at L2 caches can make things worse,
since increased worst case is worse
Review: Improving Cache
Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
1. Fast Hit times
via Small and Simple
Caches
• Why does the Alpha 21164 have 8KB instruction and 8KB data caches + a 96KB second-level cache on chip?
– A small data cache => a fast clock rate
• Direct mapped, on chip
2. Fast hits by Avoiding
Address Translation
[Figure: three organizations. (1) Conventional: the TLB translates VA to PA before the cache (tags in PA) is accessed. (2) Virtually addressed cache: the cache is indexed and tagged with the VA, with an L2 below; translation happens only on a miss, but this raises the synonym problem. (3) Overlapped: cache access proceeds in parallel with VA translation, which requires the cache index to remain invariant across translation.]
2. Fast hits by Avoiding Address
Translation
• Send the virtual address to the cache? Called a Virtually Addressed Cache or just Virtual Cache, vs. a Physical Cache
– Every time a process is switched, the cache logically must be flushed; otherwise we get false hits
» Cost is time to flush + "compulsory" misses from an empty cache
– Must deal with aliases (sometimes called synonyms): two different virtual addresses that map to the same physical address
– I/O must interact with the cache, so it needs virtual addresses
• Solution to aliases
– Guarantee that the address bits covering the index field are identical for all aliases; with a direct-mapped cache they then map to a unique location. The OS enforces this via page coloring
• Solution to cache flush
– Add a process-identifier tag that identifies the process as well as the address within the process: can't get a hit if the wrong process
2. Fast Cache Hits by
Avoiding Translation:
Process ID impact
• Black is uniprocess
• Light gray is multiprocess when the cache is flushed on a process switch
• Dark gray is multiprocess when a Process ID tag is used
• Y axis: miss rates up to 20%
• X axis: cache size from 2 KB to 1024 KB
2. Fast Cache Hits by Avoiding
Translation: Index with Physical
Portion of Address
• If the index comes from the physical part of the address (the untranslated page-offset bits), tag access can start in parallel with translation, and the result is compared against the physical tag
• Consequence: the index must fit in the page offset, so cache size ≤ page size × associativity (a quick check follows)
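A quick check of that bound (the 8 KB page size is an assumed example):

#include <stdio.h>

/* Index + block-offset bits must fit in the untranslated page offset,
 * so: max cache size = page size x associativity. */
int main(void)
{
    long page_size = 8192;   /* assumed 8 KB pages */
    for (int ways = 1; ways <= 4; ways *= 2)
        printf("%d-way: max cache size = %ld KB\n", ways, page_size * ways / 1024);
    return 0;
}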
[Pipeline diagram: the R4000's eight-stage pipeline (IF IS RF EX DF DS TC WB). Loads have a TWO-cycle latency; branches have a THREE-cycle latency (conditions evaluated during the EX phase), i.e., the delay slot plus two stalls. Branch-likely cancels the delay slot if the branch is not taken.]
R4000 Performance
• Not ideal CPI of 1:
– Load stalls (1 or 2 clock cycles)
– Branch stalls (2 cycles + unfilled slots)
– FP result stalls: RAW data hazard (latency)
– FP structural stalls: not enough FP hardware (parallelism)
[Chart: pipeline CPI (0-4.5) for SPEC92 benchmarks (eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, tomcatv), stacked as base instruction issue plus the four stall categories above.]
What is the Impact of What You've Learned About Caches?
[Chart: the 1980-2000 processor-DRAM performance gap chart from the start of the lecture.]
• 1998: Speed = ƒ(non-cached memory accesses)
• What does this mean for
– Compilers? Operating Systems? Algorithms? Data Structures?
Alpha 21064
• Separate Instr & Data TLBs & caches
• TLBs fully associative
• TLB updates in SW ("Priv Arch Libr")
• Caches 8KB direct mapped, write through
• Critical 8 bytes first
• Prefetch instr. stream buffer
• 2 MB L2 cache, direct mapped, WB (off-chip)
• 256-bit path to main memory, 4 x 64-bit modules
• Victim buffer: to give read priority over write
• 4-entry write buffer between D$ & L2$
[Chart: miss rates on a log scale (0.01%-10.00%) for the 8K I$, 8K D$, and 2M L2 across benchmarks (AlphaSort, TPC-B (db1), Compress, Doduc, Ear, Li, Mdljp2, Ora, Sc, Spice, Su2cor, Tomcatv). Depending on the benchmark, I$ misses range from about 1% to 6%, D$ misses from 13% to 32%, and L2 misses from 0.3% to 10%.]
Alpha CPI Components
• Components: instruction stall from branch mispredict; data-cache, instruction-cache, and L2$ misses; "other" = compute + register conflicts and structural conflicts
[Stacked-bar chart: CPI (0.00-5.00) for each benchmark (AlphaSort, TPC-B (db1), Compress, Doduc, Ear, Li, Mdljp2, Ora, Sc, Tomcatv), split into Other, I-stall, D$, I$, and L2 components.]
Pitfall: Predicting Cache
Performance from Different Programs
(ISA, compiler, ...)
• 4KB data cache: miss rate 8%, 12%, or 28%?
• 1KB instruction cache: miss rate 0%, 3%, or 10%?
• Alpha vs. MIPS for an 8KB data cache: 17% vs. 10%
• Why 2X Alpha v. MIPS?
[Chart: miss rate (0%-35%) vs. cache size (1-128 KB) for the instruction and data caches of three programs: tomcatv, gcc, and espresso. The answer depends on which program is measured.]
Cache Optimization Summary

Technique               MR   MP   HT   Complexity
Non-Blocking Caches          +          3
Second-Level Caches          +          2
Better memory system         +          3
Small & Simple Caches   –         +     0

(MR = miss rate, MP = miss penalty, HT = hit time; + helps, – hurts)