Lecture 19: Memory Cache

The document discusses the importance of cache memory in computer architecture, highlighting its role in bridging the performance gap between processors and memory. It covers various cache types, their configurations, and strategies for improving cache performance, including reducing miss rates and penalties. Additionally, it examines the impact of cache design choices on overall system performance, emphasizing the significance of locality and cache coherence.

Caches

Lotzi Bölöni

EEL 5708
Acknowledgements
• All the lecture slides were adapted from the slides of David Patterson (1998, 2001) and David E. Culler (2001), Copyright 1998-2002, University of California, Berkeley

EEL 5708
Question: Who Cares About
the Memory Hierarchy?

[Figure: processor vs. DRAM performance, 1980-2000 (log scale). µProc performance ("Moore's Law") grows about 60%/yr while DRAM grows about 7%/yr, so the processor-memory performance gap grows about 50%/yr.]
• 1980: no cache in µproc; 1995: 2-level cache on chip
  (1989: first Intel µproc with a cache on chip)
EEL 5708
Generations of
Microprocessors
• Time of a full cache miss in instructions
executed:
– 1st Alpha: 340 ns / 5.0 ns = 68 clks x 2 instructions/clk, or 136 instructions
– 2nd Alpha: 266 ns / 3.3 ns = 80 clks x 4 instructions/clk, or 320 instructions
– 3rd Alpha: 180 ns / 1.7 ns = 108 clks x 6 instructions/clk, or 648 instructions
• 1/2X latency x 3X clock rate x 3X instr/clock => about 5X more instructions per miss

EEL 5708
Processor-Memory
Performance Gap “Tax”
Processor               % Area (~cost)    % Transistors (~power)
• Alpha 21164                37%                 77%
• StrongArm SA110            61%                 94%
• Pentium Pro                64%                 88%
  – 2 dies per package: Proc/I$/D$ + L2$
• Caches have no inherent value; they only try to close the performance gap

EEL 5708
What is a cache?
• Small, fast storage used to improve average
access time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a
cache!
– Registers: a cache on variables
– First-level cache: a cache on the second-level cache
– Second-level cache: a cache on memory
– Memory: a cache on disk (virtual memory)
– TLB: a cache on the page table
– Branch predictor: a cache on prediction information?
[Figure: the memory hierarchy - Proc/Regs, L1 cache, L2 cache, Memory, then Disk/Tape; levels get bigger going down and faster going up.]
EEL 5708


Example: 1 KB Direct Mapped
Cache
• For a 2^N byte direct mapped cache with 2^M byte blocks:
  – The uppermost (32 - N) bits are always the Cache Tag
  – The lowest M bits are the Byte Select (Block Size = 2^M)
  – The (N - M) bits in between are the Cache Index
[Figure: a 1 KB direct mapped cache with 32-byte blocks. Address bits 31-10 form the Cache Tag (example: 0x50), bits 9-5 the Cache Index (example: 0x01), and bits 4-0 the Byte Select (example: 0x00). Each of the 32 entries stores a Valid bit, the Cache Tag (stored as part of the cache "state"), and a 32-byte block of Cache Data (Byte 0 through Byte 1023 across entries 0-31).]
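As a sketch of how an address splits into these fields in software (the address 0x14020 is chosen so it decodes to the slide's example of tag 0x50, index 0x01, byte 0x00; the constants match this 1 KB, 32-byte-block configuration):

/* Sketch (not from the slides): field extraction for the cache above. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 5   /* 32-byte blocks  -> 5 byte-select bits */
#define INDEX_BITS 5   /* 1 KB / 32 B = 32 entries -> 5 index bits */

int main(void) {
    uint32_t addr  = 0x00014020u;  /* hypothetical address */
    uint32_t byte  = addr & ((1u << BLOCK_BITS) - 1);
    uint32_t index = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag   = addr >> (BLOCK_BITS + INDEX_BITS);
    printf("tag=0x%x index=0x%x byte=0x%x\n",
           (unsigned)tag, (unsigned)index, (unsigned)byte);  /* 0x50, 0x1, 0x0 */
    return 0;
}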
EEL 5708
Set Associative
Cache
• N-way set associative: N entries for each
Cache Index
– N direct mapped caches operate in parallel
• Example: Two-way set associative cache
– Cache Index selects a “set” from the cache
– The two tags in the set are compared to the input in
parallel
– Data is selected based on the tag result
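A minimal software sketch of that lookup path is shown below, before the datapath figure (structure names and the 64-set size are illustrative, not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 64                      /* illustrative size */

struct way { bool valid; uint32_t tag; uint8_t data[32]; };
struct set { struct way w[2]; };         /* two ways per set */
static struct set cache[NUM_SETS];

/* Returns true on a hit and points *block at the matching way's data. */
bool lookup(uint32_t addr, uint8_t **block) {
    uint32_t index = (addr >> 5) & (NUM_SETS - 1);   /* 32-byte blocks, 6 index bits */
    uint32_t tag   = addr >> 11;                     /* skip 5 offset + 6 index bits */
    struct set *s = &cache[index];
    for (int i = 0; i < 2; i++) {                    /* hardware compares both tags in parallel */
        if (s->w[i].valid && s->w[i].tag == tag) {
            *block = s->w[i].data;                   /* mux selects the hitting way */
            return true;                             /* Hit */
        }
    }
    return false;                                    /* Miss */
}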
[Figure: a two-way set associative cache. The Cache Index selects one set; each of the two ways holds a Valid bit, Cache Tag, and Cache Data block. Both tags are compared with the address tag in parallel, the compare results are ORed to produce Hit, and a mux (Sel0/Sel1) selects the Cache Block from the matching way.]
EEL 5708
Disadvantage of Set Associative
Cache
• N-way Set Associative Cache versus Direct
Mapped Cache:
– N comparators vs. 1
– Extra MUX delay for the data
– Data comes AFTER Hit/Miss decision and set selection
• In a direct mapped cache, Cache Block is
available BEFORE Hit/Miss:
– Possible to assume a hit and continue; recover later if it was a miss.

[Figure: the same two-way set associative datapath as on the previous slide, showing the tag compares and data mux that sit before the Hit signal and the selected Cache Block.]
EEL 5708
Review: Cache
performance
• Miss-oriented approach to memory access:

  CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime

  CPUtime = IC x (CPI_Execution + MemMisses/Inst x MissPenalty) x CycleTime

  – CPI_Execution includes ALU and memory instructions

• Separating out the memory component entirely:
  – AMAT = Average Memory Access Time
  – CPI_AluOps does not include memory instructions

  CPUtime = IC x (AluOps/Inst x CPI_AluOps + MemAccess/Inst x AMAT) x CycleTime

  AMAT = HitTime + MissRate x MissPenalty
       = (HitTime_Inst + MissRate_Inst x MissPenalty_Inst)
       + (HitTime_Data + MissRate_Data x MissPenalty_Data)

EEL 5708
Impact on
Performance
• Suppose a processor executes at
– Clock Rate = 200 MHz (5 ns per cycle), Ideal (no misses) CPI =
1.1
– 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory operations get 50
cycle miss penalty
• Suppose that 1% of instructions get same miss
penalty
• CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles/ins)
        + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)]
        + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)]
      = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
• 58% of the time the processor is stalled waiting for memory
EEL 5708
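A quick check of the same arithmetic (a sketch; the rates and the 50-cycle penalty are the slide's assumptions):

#include <stdio.h>

int main(void) {
    double ideal_cpi    = 1.1;
    double data_stalls  = 0.30 * 0.10 * 50;  /* ld/st per instr x miss rate x penalty = 1.5 */
    double instr_stalls = 1.00 * 0.01 * 50;  /* fetches per instr x miss rate x penalty = 0.5 */
    printf("CPI = %.1f\n", ideal_cpi + data_stalls + instr_stalls);  /* 3.1 */
    return 0;
}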
Example: Harvard Architecture
• Unified vs Separate I&D (Harvard)

• Table on page 384:


– 16KB I&D: Inst miss rate=0.64%, Data miss rate=6.47%
– 32KB unified: Aggregate miss rate=1.99%
• Which is better (ignore L2 cache)?
– Assume 33% data ops => 75% of accesses come from instruction fetches (1.0/1.33)
– hit time = 1, miss penalty = 50
– Note that a data hit has 1 extra stall cycle in the unified cache (only one port)

AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24

[Figure: split organization (Proc with separate I-Cache-1 and D-Cache-1, backed by a Unified Cache-2) vs. unified organization (Proc with a Unified Cache-1, backed by a Unified Cache-2).]
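A sketch of the same AMAT calculation in code (function and variable names are mine; the miss rates and penalty are the slide's, and the results reproduce its ~2.05 vs. ~2.24):

#include <stdio.h>

/* AMAT = fraction x (hit time + miss rate x miss penalty),
 * summed over instruction and data accesses. */
static double amat(double i_frac, double i_hit, double i_miss,
                   double d_frac, double d_hit, double d_miss,
                   double penalty) {
    return i_frac * (i_hit + i_miss * penalty)
         + d_frac * (d_hit + d_miss * penalty);
}

int main(void) {
    /* Split (Harvard): separate 16 KB I and D caches, hit time 1 each. */
    double split   = amat(0.75, 1.0, 0.0064, 0.25, 1.0, 0.0647, 50.0);
    /* Unified 32 KB: data hits pay one extra stall (single port). */
    double unified = amat(0.75, 1.0, 0.0199, 0.25, 2.0, 0.0199, 50.0);
    printf("split = %.3f  unified = %.3f\n", split, unified);  /* ~2.049 vs ~2.245 */
    return 0;
}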

EEL 5708
Review: Four Questions
for Memory Hierarchy
Designers

• Q1: Where can a block be placed in the upper


level? (Block placement)
– Fully Associative, Set Associative, Direct Mapped
• Q2: How is a block found if it is in the upper
level?
(Block identification)
– Tag/Block
• Q3: Which block should be replaced on a miss?
(Block replacement)
– Random, LRU
• Q4: What happens on a write?
(Write strategy)
– Write Back or Write Through (with Write Buffer)

EEL 5708
Review: Improving Cache
Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

EEL 5708
Reducing Misses
• Classifying Misses: 3 Cs
– Compulsory—The first access to a block is not in the cache,
so the block must be brought into the cache. Also called cold
start misses or first reference misses.
(Misses in even an Infinite Cache)
– Capacity—If the cache cannot contain all the blocks needed
during execution of a program, capacity misses will occur due
to blocks being discarded and later retrieved.
(Misses in Fully Associative Size X Cache)
– Conflict—If block-placement strategy is set associative or
direct mapped, conflict misses (in addition to compulsory &
capacity misses) will occur because a block can be discarded
and later retrieved if too many blocks map to its set. Also
called collision misses or interference misses.
(Misses in N-way Associative, Size X Cache)
• A more recent, 4th “C”:
– Coherence: misses caused by cache coherence invalidations in multiprocessors.

EEL 5708
3Cs Absolute Miss Rate
(SPEC92)

[Figure: miss rate per type (Compulsory, Capacity, Conflict) vs. cache size, 1 KB to 128 KB, for 1-way through 8-way associativity (SPEC92). Miss rates range from 0 to 0.14; conflict misses shrink with associativity, capacity misses shrink with cache size, and compulsory misses are vanishingly small.]
EEL 5708
2:1 Cache Rule
miss rate of a 1-way associative cache of size X
= miss rate of a 2-way associative cache of size X/2

[Figure: the same 3Cs miss-rate-vs-cache-size plot as on the previous slide, illustrating the 2:1 rule.]
EEL 5708
3Cs Relative Miss Rate

[Figure: the 3Cs data replotted as relative miss rate (each type as a percentage of all misses) vs. cache size, 1 KB to 128 KB, for 1-way through 8-way associativity.]
• Flaws: plotted for a fixed block size
• Good: insight => invention
EEL 5708
1. Reduce Misses via
Larger Block Size

[Figure: miss rate (0% to 25%) vs. block size (16 to 256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K. Larger blocks reduce compulsory misses, but in the smallest caches very large blocks drive the miss rate back up.]

EEL 5708
2. Reduce Misses via
Higher Associativity
• 2:1 Cache Rule:
– Miss Rate of a DM cache of size N = Miss Rate of a 2-way cache of size N/2
• Beware: Execution time is only final
measure!
– Will Clock Cycle time increase?
– Hill [1988] suggested hit time for 2-way vs. 1-way
external cache +10%,
internal + 2%

EEL 5708
Example: Avg. Memory
Access Time vs. Miss Rate

• Example: assume clock cycle time (CCT) = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way, relative to the CCT of a direct mapped cache

Cache Size (KB)    1-way   2-way   4-way   8-way
  1                2.33    2.15    2.07    2.01
  2                1.98    1.86    1.76    1.68
  4                1.72    1.67    1.61    1.53
  8                1.46    1.48    1.47    1.43
 16                1.29    1.32    1.32    1.32
 32                1.20    1.24    1.25    1.27
 64                1.14    1.20    1.21    1.23
128                1.10    1.17    1.18    1.20

(Red in the original chart marks entries where A.M.A.T. is not improved by more associativity)

EEL 5708
3. Reducing Misses via a
“Victim Cache”
• How to combine the fast hit time of direct mapped and still avoid conflict misses?
• Add a small buffer to hold data discarded from the cache
• Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of the conflicts for a 4 KB direct mapped data cache
• Used in Alpha, HP machines

[Figure: a victim cache of four fully associative entries (each a tag, comparator, and one cache line of data) sitting between the cache and the next lower level in the hierarchy.]
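A toy sketch of the victim-cache lookup (sizes and names are illustrative; for simplicity each line stores its full block number rather than just the tag bits):

#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4

struct line { bool valid; uint32_t block; uint8_t data[32]; };
static struct line l1[128];             /* 4 KB direct mapped, 32-byte lines */
static struct line victim[VC_ENTRIES];  /* small, fully associative */

bool access_block(uint32_t block, uint8_t **data) {
    struct line *slot = &l1[block % 128];
    if (slot->valid && slot->block == block) {     /* normal direct mapped hit */
        *data = slot->data;
        return true;
    }
    for (int i = 0; i < VC_ENTRIES; i++) {         /* on a miss, probe the victim cache */
        if (victim[i].valid && victim[i].block == block) {
            struct line evicted = *slot;           /* swap: the victim line moves into L1,   */
            *slot = victim[i];                     /* the displaced L1 line becomes the new victim */
            victim[i] = evicted;
            *data = slot->data;
            return true;                           /* slower than a direct hit, but no refill */
        }
    }
    return false;                                  /* true miss: fetch from the next level */
}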

EEL 5708
4. Reducing Misses via
“Pseudo-Associativity”
• How to combine fast hit time of Direct Mapped and have the lower
conflict misses of 2-way SA cache?
• Divide cache: on a miss, check other half of cache to see if there,
if so have a pseudo-hit (slow hit)

[Figure: timeline showing Hit Time < Pseudo Hit Time < Miss Penalty.]

• Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
– Better for caches not tied directly to processor (L2)
– Used in the MIPS R10000 L2 cache; similar approach in UltraSPARC

EEL 5708
5. Reducing Misses by
Hardware Prefetching of
Instructions & Data
• E.g., Instruction Prefetching
– Alpha 21064 fetches 2 blocks on a miss
– Extra block placed in “stream buffer”
– On miss check stream buffer
• Works with data blocks too:
– Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4KB cache; 4 streams caught 43%
– Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set associative caches
• Prefetching relies on having extra memory
bandwidth that can be used without
penalty
EEL 5708
6. Reducing Misses by
Software Prefetching Data
• Data prefetch
– Load data into register (HP PA-RISC loads)
– Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC
v. 9)
– Special prefetching instructions cannot cause faults; a form
of speculative execution
• Prefetching comes in two flavors:
– Binding prefetch: Requests load directly into register.
» Must be correct address and register!
– Non-Binding prefetch: Load into cache.
» Can be incorrect. Frees HW/SW to guess!
• Issuing prefetch instructions takes time
– Is cost of prefetch issues < savings in reduced misses?
– Higher superscalar reduces difficulty of issue bandwidth
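A sketch of non-binding (cache) prefetching from software, using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 16 is an illustrative choice, not from the slides:

/* Sum an array while prefetching ahead. The prefetch is only a hint and
 * cannot fault, matching the non-binding flavor described above. */
double sum(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);  /* read access, low temporal locality */
        s += a[i];
    }
    return s;
}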

EEL 5708
7. Reducing Misses by
Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4 byte blocks, in software
• Instructions
– Reorder procedures in memory so as to reduce conflict misses
– Profiling to look at conflicts (using tools they developed)
• Data
– Merging Arrays: improve spatial locality by single array of
compound elements vs. 2 arrays
– Loop Interchange: change nesting of loops to access data in order
stored in memory
– Loop Fusion: Combine 2 independent loops that have same looping
and some variables overlap
– Blocking: Improve temporal locality by accessing “blocks” of data
repeatedly vs. going down whole columns or rows

EEL 5708
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */


struct merge {
int val;
int key;
};
struct merge merged_array[SIZE];

Reducing conflicts between val & key;


improve spatial locality

EEL 5708
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding


through memory every 100 words;
improved spatial locality

EEL 5708
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{ a[i][j] = 1/b[i][j] * c[i][j];
d[i][j] = a[i][j] + c[i][j];}

Before: 2 misses per access to a & c vs. one miss per access after fusion;
improves temporal locality (a[i][j] and c[i][j] are reused while still in the cache)

EEL 5708
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{r = 0;
for (k = 0; k < N; k = k+1){
r = r + y[i][k]*z[k][j];};
x[i][j] = r;
};
• Two Inner Loops:
– Read all NxN elements of z[]
– Read N elements of 1 row of y[] repeatedly
– Write N elements of 1 row of x[]
• Capacity misses are a function of N & cache size:
  – 2N^3 + N^2 words accessed (assuming no conflicts; otherwise ...)
• Idea: compute on a BxB submatrix that fits in the cache
EEL 5708
Blocking Example
/* After */
for (jj = 0; jj < N; jj = jj+B)
 for (kk = 0; kk < N; kk = kk+B)
  for (i = 0; i < N; i = i+1)
   for (j = jj; j < min(jj+B,N); j = j+1)
    {r = 0;
     for (k = kk; k < min(kk+B,N); k = k+1)
       r = r + y[i][k]*z[k][j];
     x[i][j] = x[i][j] + r;
    };

• B is called the Blocking Factor
• Capacity misses drop from 2N^3 + N^2 words accessed to 2N^3/B + N^2
• Conflict Misses Too?

EEL 5708
Reducing Conflict Misses by
Blocking
[Figure: miss rate (0 to 0.1) vs. blocking factor (0 to 150) for a direct mapped cache and a fully associative cache; the fully associative cache stays low while the direct mapped cache picks up conflict misses at larger blocking factors.]

• Conflict misses in caches that are not fully associative, as a function of blocking factor:
  – Lam et al [1991]: a blocking factor of 24 had one fifth the misses of a factor of 48, despite both fitting in the cache
EEL 5708
Summary of Compiler
Optimizations to Reduce Cache
Misses (by hand)

[Figure: bar chart of performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking, applied by hand to vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress.]
EEL 5708
Summary: Miss Rate
Reduction
CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time

• 3 Cs: Compulsory, Capacity, Conflict


1. Reduce Misses via Larger Block Size
2. Reduce Misses via Higher Associativity
3. Reducing Misses via Victim Cache
4. Reducing Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Misses by Compiler Optimizations
• Prefetching comes in two flavors:
– Binding prefetch: Requests load directly into
register.
» Must be correct address and register!
– Non-Binding prefetch: Load into cache.
» Can be incorrect. Frees HW/SW to guess!
EEL 5708
Review: Improving Cache
Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

EEL 5708
Write Policy:
Write-Through vs Write-
Back
• Write-through: all writes update cache and underlying
memory/cache
– Can always discard cached data - most up-to-date data is in memory
– Cache control bit: only a valid bit
• Write-back: all writes simply update cache
– Can’t just discard cached data - may have to write it back to memory
– Cache control bits: both valid and dirty bits
• Other Advantages:
– Write-through:
» memory (or other processors) always have latest data
» Simpler management of cache
– Write-back:
» much lower bandwidth, since data often overwritten multiple
times
» Better tolerance to long-latency memory?
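A toy model contrasting the two policies (cache geometry and names are illustrative; addresses are assumed to fit the small arrays):

#include <stdbool.h>
#include <stdint.h>

struct line { bool valid, dirty; uint32_t tag; uint32_t data; };
static struct line cache[256];
static uint32_t memory[1 << 16];   /* toy memory; assume addr < (1 << 16) */

/* Write-through: a write hit updates the cache and memory; only a valid bit is needed. */
void write_through(uint32_t addr, uint32_t value) {
    struct line *l = &cache[addr % 256];
    if (l->valid && l->tag == addr / 256)
        l->data = value;
    memory[addr] = value;           /* memory always holds the latest data */
}

/* Write-back: a write hit only updates the cache and sets the dirty bit;
 * memory is updated later, when the dirty line is evicted. */
void write_back(uint32_t addr, uint32_t value) {
    struct line *l = &cache[addr % 256];
    if (l->valid && l->tag == addr / 256) {
        l->data = value;
        l->dirty = true;            /* must be written back on eviction */
    } else {
        memory[addr] = value;       /* write-miss handling is the next slide's topic */
    }
}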

EEL 5708
Write Policy 2:
Write Allocate vs Non-Allocate
(What happens on write-miss)

• Write allocate: allocate new cache line in


cache
– Usually means that you have to do a “read
miss” to fill in rest of the cache-line!
– Alternative: per/word valid bits
• Write non-allocate (or “write-around”):
– Simply send write data through to underlying
memory/cache - don’t allocate new cache line!

EEL 5708
1. Reducing Miss Penalty:
Read Priority over Write on Miss

[Figure: the CPU connected to DRAM (or lower memory); writes are queued in a write buffer sitting between the CPU and DRAM.]

EEL 5708
1. Reducing Miss Penalty:
Read Priority over Write on
Miss
• Write-through with write buffers can create RAW conflicts between main-memory reads on cache misses and writes still sitting in the buffer
– If simply wait for write buffer to empty, might increase read miss
penalty (old MIPS 1000 by 50% )
– Check write buffer contents before read;
if no conflicts, let the memory access continue
• With write-back we also want a buffer, to hold displaced (dirty) blocks
– Read miss replacing dirty block
– Normal: Write dirty block to memory, and then do the read
– Instead copy the dirty block to a write buffer, then do the read,
and then do the write
– The CPU stalls less, since it restarts as soon as the read is done
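A sketch of the read-miss check described above; resolving a conflict by forwarding the buffered value is one possible policy (the structure and names are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4

struct wb_entry { bool valid; uint32_t addr; uint32_t data; };
static struct wb_entry write_buffer[WB_ENTRIES];

/* On a read miss, scan the write buffer before going to memory. */
bool read_miss_check(uint32_t addr, uint32_t *data) {
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *data = write_buffer[i].data;   /* RAW conflict: forward the pending write */
            return true;
        }
    }
    return false;   /* no conflict: let the memory read continue immediately */
}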

EEL 5708
2. Reduce Miss Penalty:
Early Restart and Critical
Word First
• Don’t wait for full block to be loaded before
restarting CPU
– Early restart—As soon as the requested word of the block
arrives, send it to the CPU and let the CPU continue
execution
– Critical Word First—Request the missed word first from
memory and send it to the CPU as soon as it arrives; let the
CPU continue execution while filling the rest of the words in
the block. Also called wrapped fetch and requested word
first
• Generally useful only with large blocks
• Spatial locality can be a problem: programs tend to want the next sequential word of the block, so it is not clear how much early restart helps
EEL 5708
3. Reduce Miss Penalty: Non-
blocking Caches to reduce stalls
on misses
• Non-blocking cache or lockup-free cache allow
data cache to continue to supply cache hits during
a miss
– requires F/E bits on registers or out-of-order execution
– requires multi-bank memories
• “hit under miss” reduces the effective miss
penalty by working during miss vs. ignoring CPU
requests
• “hit under multiple miss” or “miss under miss”
may further lower the effective miss penalty by
overlapping multiple misses
– Significantly increases the complexity of the cache controller as
there can be multiple outstanding memory accesses
– Requires multiple memory banks (otherwise it cannot be supported)
– Pentium Pro allows 4 outstanding memory misses

EEL 5708
Value of Hit Under Miss
for SPEC

[Figure: "Hit under i misses" - average memory access time for SPEC92 integer benchmarks (eqntott, espresso, xlisp, compress) and floating point benchmarks (mdljsp2, ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora) as the cache is allowed 0, 1, 2, or up to 64 outstanding misses (series: Base, 0->1, 1->2, 2->64).]


• FP programs on average: AMAT= 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: AMAT= 0.24 -> 0.20 -> 0.19 -> 0.19
• 8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss
EEL 5708
4: Add a second-level
cache
• L2 Equations
AMAT = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1

Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2

AMAT = Hit TimeL1 +


Miss RateL1 x (Hit TimeL2 + Miss RateL2 x Miss PenaltyL2)

• Definitions:
– Local miss rate— misses in this cache divided by the total number of
memory accesses to this cache (Miss rateL2)
– Global miss rate—misses in this cache divided by the total number of
memory accesses generated by the CPU
(Miss RateL1 x Miss RateL2)
– Global Miss Rate is what matters
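A sketch of the L2 equations above in code (the example rates at the bottom are made-up numbers, not from the slides):

#include <stdio.h>

/* AMAT with a second-level cache, per the equations above. */
double amat_two_level(double hit_l1, double miss_rate_l1,
                      double hit_l2, double local_miss_rate_l2,
                      double penalty_l2) {
    double miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * penalty_l2;
    return hit_l1 + miss_rate_l1 * miss_penalty_l1;
}

int main(void) {
    /* Made-up example: 1-cycle L1, 4% L1 misses, 10-cycle L2,
     * 25% local L2 miss rate, 100-cycle memory. */
    double amat = amat_two_level(1.0, 0.04, 10.0, 0.25, 100.0);
    double global_l2 = 0.04 * 0.25;   /* global miss rate = MissRateL1 x MissRateL2 */
    printf("AMAT = %.2f cycles, global L2 miss rate = %.1f%%\n", amat, 100 * global_l2);
    return 0;
}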

EEL 5708
Comparing Local and
Global Miss Rates
• 32 KByte 1st level cache; increasing the 2nd level cache size
• Global miss rate is close to the single-level cache miss rate, provided L2 >> L1
• Don't use the local miss rate
• L2 is not tied to the CPU clock cycle!
• Consider cost & A.M.A.T.
• Generally: fast hit times and fewer misses
• Since hits are few, target miss reduction

[Figure: local and global L2 miss rates vs. L2 cache size, shown on linear and log scales.]

EEL 5708
Reducing Misses:
Which apply to L2 Cache?

• Reducing Miss Rate


1. Reduce Misses via Larger Block Size
2. Reduce Conflict Misses via Higher Associativity
3. Reducing Conflict Misses via Victim Cache
4. Reducing Conflict Misses via Pseudo-Associativity
5. Reducing Misses by HW Prefetching Instr, Data
6. Reducing Misses by SW Prefetching Data
7. Reducing Capacity/Conf. Misses by Compiler
Optimizations

EEL 5708
L2 cache block size &
A.M.A.T.
Block Size (bytes):   16     32     64     128    256    512
Relative CPU Time:    1.95   1.54   1.36   1.28   1.27   1.34
• 32KB L1, 8 byte path to memory
EEL 5708
Reducing Miss Penalty

Summary
CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time
• Four techniques
– Read priority over write on miss
– Early Restart and Critical Word First on miss
– Non-blocking Caches (Hit under Miss, Miss under
Miss)
– Second Level Cache
• Can be applied recursively to Multilevel
Caches
– Danger is that time to DRAM will grow with multiple
levels in between
– First attempts at L2 caches can make things worse,
since increased worst case is worse

EEL 5708
Review: Improving Cache
Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

AMAT = HitTime + MissRate x MissPenalty

EEL 5708
1. Fast Hit times
via Small and Simple
Caches
• Why does the Alpha 21164 have an 8KB instruction cache, an 8KB data cache, and a 96KB second level cache, all on chip?
  – Small caches keep the hit time within a fast clock cycle
• Direct mapped, on chip

EEL 5708
2. Fast hits by Avoiding
Address Translation
[Figure: three organizations. Conventional: CPU -> TLB (VA to PA) -> physically addressed cache -> memory. Virtually addressed cache: CPU -> cache indexed and tagged by VA, with translation only on a miss; suffers the synonym problem. Overlapped: cache access proceeds in parallel with VA translation, which requires the cache index to remain invariant across translation (e.g. a physically addressed L2 behind it).]
EEL 5708
2. Fast hits by Avoiding Address
Translation
• Send virtual address to cache? Called Virtually
Addressed Cache or just Virtual Cache vs. Physical
Cache
– Every time process is switched logically must flush the cache;
otherwise get false hits
» Cost is time to flush + “compulsory” misses from empty cache
– Dealing with aliases (sometimes called synonyms);
Two different virtual addresses map to same physical address
– I/O must interact with cache, so need virtual address
• Solution to aliases
– HW/SW guarantee that aliased addresses agree in the bits that cover the index field; with a direct mapped cache the blocks must then be unique. Called page coloring.
• Solution to cache flush
– Add process identifier tag that identifies process as well as
address within process: can’t get a hit if wrong process

EEL 5708
2. Fast Cache Hits by
Avoiding Translation:
Process ID impact
• Black is uniprocess
• Light Gray is multiprocess
when flush cache
• Dark Gray is multiprocess
when use Process ID tag
• Y axis: Miss Rates up to 20%
• X axis: Cache size from 2 KB
to 1024 KB

EEL 5708
2. Fast Cache Hits by Avoiding
Translation: Index with Physical
Portion of Address
• If index is physical part of address, can start
tag access in parallel with translation so
that can compare to physical tag

[Figure: the address splits as Page Address | Page Offset for translation and as Address Tag | Index | Block Offset for the cache, with Index and Block Offset contained within the Page Offset.]

• Limits the cache size to the page size: what if we want bigger caches and the same trick?
  – Higher associativity moves the barrier to the right
  – Page coloring
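A sketch of the underlying constraint (the 4 KB page size and the example configurations are typical values I assume here, not taken from the slides):

#include <stdio.h>

/* A virtually indexed, physically tagged cache can be accessed in parallel
 * with translation only if the index and block-offset bits fit inside the
 * page offset, i.e. cache_size <= page_size * associativity. */
int index_fits_in_page_offset(long cache_size, long page_size, int associativity) {
    return cache_size <= page_size * (long)associativity;
}

int main(void) {
    long page = 4096;   /* assumed 4 KB pages */
    printf("8 KB direct mapped: %s\n",
           index_fits_in_page_offset(8192, page, 1) ? "ok" : "needs page coloring or more ways");
    printf("32 KB 8-way:        %s\n",
           index_fits_in_page_offset(32768, page, 8) ? "ok" : "needs page coloring or more ways");
    return 0;
}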
EEL 5708
3: Fast Hits by pipelining
Cache
Case Study: MIPS R4000
• 8 Stage Pipeline:
– IF–first half of fetching of instruction; PC selection
happens here as well as initiation of instruction cache
access.
– IS–second half of access to instruction cache.
– RF–instruction decode and register fetch, hazard checking
and also instruction cache hit detection.
– EX–execution, which includes effective address
calculation, ALU operation, and branch target
computation and condition evaluation.
– DF–data fetch, first half of access to data cache.
– DS–second half of access to data cache.
– TC–tag check, determine whether the data cache access
hit.
– WB–write back for loads and register-register operations.
• What is impact on Load delay?
– Need 2 instructions between a load and its use!
EEL 5708
Case Study: MIPS R4000

[Figure: pipeline diagrams for the 8-stage R4000. TWO cycle load latency: a load's data is not available until after DS/TC, so two instructions must separate a load from its use. THREE cycle branch latency: conditions are evaluated during the EX phase, giving a delay slot plus two stalls; a branch-likely cancels the delay slot if the branch is not taken.]
EEL 5708
R4000 Performance
• Not ideal CPI of 1:
– Load stalls (1 or 2 clock cycles)
– Branch stalls (2 cycles + unfilled slots)
– FP result stalls: RAW data hazard (latency)
– FP structural stalls: not enough FP hardware (parallelism)

[Figure: pipeline CPI (from about 1 to 4.5) broken into Base, Load stalls, Branch stalls, FP result stalls, and FP structural stalls for eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, and tomcatv.]
EEL 5708
What is the Impact of
What You’ve Learned
About Caches?
• 1960-1985: Speed = f(no. operations)
• 1990:
  – Pipelined execution & fast clock rate
  – Out-of-order execution
  – Superscalar instruction issue
• 1998: Speed = f(non-cached memory accesses)

[Figure: the CPU vs. DRAM performance-gap plot again, 1980-2000.]

• What does this mean for
  – Compilers? Operating Systems? Algorithms? Data Structures?
EEL 5708
Alpha
21064
• Separate Instr & Data TLBs & caches
• TLBs fully associative
• TLB updates in SW ("Priv Arch Libr")
• Caches 8KB, direct mapped, write through
• Critical 8 bytes first
• Prefetch instruction stream buffer
• 2 MB L2 cache, direct mapped, write back (off-chip)
• 256 bit path to main memory, 4 x 64-bit modules
• Victim buffer: to give reads priority over writes
• 4 entry write buffer between D$ & L2$

[Figure: Alpha 21064 memory hierarchy block diagram showing the instruction and data caches, stream buffer, write buffer, and victim buffer.]
EEL 5708


Alpha Memory
Performance: Miss Rates
of SPEC92
[Figure: SPEC92 miss rates (log scale, 0.01% to 100%) for the Alpha's 8 KB I-cache, 8 KB D-cache, and 2 MB L2 cache across AlphaSort, TPC-B (db1), Tomcatv, Mdljp2, Compress, Ora, Spice, Su2cor, Ear, Li, Doduc, and Sc. Callouts mark extremes such as I$ miss = 6%, D$ miss = 32%, L2 miss = 10% at one end and I$ miss = 1-2%, D$ miss = 13-21%, L2 miss = 0.3-0.6% at the other.]
EEL 5708
Alpha CPI Components
• Instruction stall: branch mispredict; data cache; instruction cache; L2$
• Other: compute + register conflicts, structural conflicts

[Figure: stacked CPI breakdown (Other, I Stall, D$, I$, L2), from 0 to about 5.0, for AlphaSort, TPC-B (db1), Tomcatv, Mdljp2, Compress, Ora, Ear, Li, Doduc, and Sc.]

EEL 5708
Pitfall: Predicting Cache
Performance from Different Prog.
(ISA, compiler, ...)
• 4KB data cache: miss rate 8%, 12%, or 28%?
• 1KB instruction cache: miss rate 0%, 3%, or 10%?
• Alpha vs. MIPS for an 8KB data cache: 17% vs. 10%
• Why 2X Alpha vs. MIPS?

[Figure: instruction- and data-cache miss rates (0% to 35%) vs. cache size (1 KB to 128 KB) for gcc, espresso, and tomcatv; the answers differ widely by program.]
EEL 5708
Cache Optimization
Summary

Technique                            MR   MP   HT   Complexity
Miss rate:
  Larger Block Size                  +    -         0
  Higher Associativity               +         -    1
  Victim Caches                      +              2
  Pseudo-Associative Caches          +              2
  HW Prefetching of Instr/Data       +              2
  Compiler Controlled Prefetching    +              3
  Compiler Reduce Misses             +              0
Miss penalty:
  Priority to Read Misses                 +         1
  Early Restart & Critical Word 1st       +         2
  Non-Blocking Caches                     +         3
  Second Level Caches                     +         2
  Better memory system                    +         3
Hit time:
  Small & Simple Caches              -         +    0
  Avoiding Address Translation                 +    2
  Pipelining Caches                            +    2

EEL 5708
