Cache 2
• Cache is the name given to the first level of the memory hierarchy encountered once the address leaves the CPU
• Caches are faster (hence more expensive and limited in size) than DRAM

CPU time = (CPU execution clock cycles + Memory stall clock cycles) × Clock cycle time
Memory stall clock cycles = Reads × Read miss rate × Read miss penalty + Writes × Write miss rate × Write miss penalty
[Table: summary of cache optimization techniques: larger cache size, higher associativity, way prediction, pipelined cache access, trace caches, critical word first, merging write buffers, giving read misses priority over writes, compiler optimizations, hardware prefetching, compiler prefetching, and avoiding address translation.]
Average memory access time = Hit time + Miss rate × Miss penalty
Memory stall clock cycles = Reads × Read miss rate × Read miss penalty + Writes × Write miss rate × Write miss penalty
CPU time = (CPU execution clock cycles + Memory stall clock cycles) × Clock cycle time

Adding a second cache level:
Average memory access time = Hit time (L1) + Miss rate (L1) × Miss penalty (L1)
Miss penalty (L1) = Hit time (L2) + Miss rate (L2) × Miss penalty (L2)
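As a worked illustration of how these formulas compose across two cache levels (a minimal sketch; all latencies and miss rates below are assumed example values, not figures from the lecture):

#include <stdio.h>

/* Illustrative two-level AMAT calculation; every number below is an
 * assumed example value, expressed in clock cycles. */
int main(void) {
    double hit_time_l1  = 1.0;    /* assumed L1 hit time          */
    double miss_rate_l1 = 0.04;   /* assumed L1 miss rate         */
    double hit_time_l2  = 10.0;   /* assumed L2 hit time          */
    double miss_rate_l2 = 0.25;   /* assumed L2 local miss rate   */
    double miss_pen_l2  = 100.0;  /* assumed main-memory penalty  */

    /* Miss penalty (L1) = Hit time (L2) + Miss rate (L2) x Miss penalty (L2) */
    double miss_pen_l1 = hit_time_l2 + miss_rate_l2 * miss_pen_l2;

    /* Average memory access time = Hit time (L1) + Miss rate (L1) x Miss penalty (L1) */
    double amat = hit_time_l1 + miss_rate_l1 * miss_pen_l1;

    printf("Miss penalty (L1) = %.1f cycles\n", miss_pen_l1); /* 35.0 */
    printf("AMAT              = %.1f cycles\n", amat);        /* 2.4  */
    return 0;
}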
[Figure: single cache miss rate, global miss rate, and local miss rate versus cache size; the local miss rate of the second-level cache (51% to 99% at small sizes) is far higher than the global miss rate, which approaches the single-cache miss rate as the L2 grows.]
Early Restart and Critical Word First
• Idea: Don't wait for the full block to be loaded before restarting the CPU
- Early Restart: As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- Critical Word First: Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the block is filled
• Drawbacks: generally useful only with large blocks; programs with spatial locality often want the next sequential word anyway, which limits the benefit
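The critical-word-first ordering amounts to a wrap-around fill: memory returns the missed word first, then the remaining words of the block in wrapped order. A minimal sketch, assuming an 8-word block:

#include <stdio.h>

/* Sketch of critical-word-first (wrap-around) fill order: given the
 * word offset that missed, memory returns that word first and then
 * wraps around the rest of the block. Block geometry is assumed. */
#define WORDS_PER_BLOCK 8

static void fill_order(int critical_word, int order[WORDS_PER_BLOCK]) {
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        order[i] = (critical_word + i) % WORDS_PER_BLOCK;
}

int main(void) {
    int order[WORDS_PER_BLOCK];
    fill_order(5, order);                 /* CPU missed on word 5 */
    for (int i = 0; i < WORDS_PER_BLOCK; i++)
        printf("%d ", order[i]);          /* prints: 5 6 7 0 1 2 3 4 */
    printf("\n");
    return 0;
}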
Victim Caches
• How to combine the fast hit time of direct-mapped caches, yet still avoid conflict misses? Add a small, fully associative buffer (the victim cache) that holds blocks recently discarded from the cache
- Used in Alpha, HP machines
[Figure: the CPU address indexes the tag and data arrays of the direct-mapped cache and, in parallel, the victim cache; a victim-cache hit swaps the block back into the cache, and a write buffer sits between the cache and the next level.]
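A minimal sketch of that lookup path, assuming a direct-mapped cache of 64 sets, a 4-entry FIFO victim cache, and modeling tags only (all of these parameters are illustrative, not from the slide):

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define SETS        64      /* direct-mapped sets (assumed)   */
#define VICTIM_WAYS 4       /* victim-cache entries (assumed) */

typedef struct { uint32_t tag;  bool valid; } Line;
typedef struct { uint32_t addr; bool valid; } VictimLine; /* full block address */

static Line       cache[SETS];
static VictimLine victim[VICTIM_WAYS];
static int        victim_next;          /* FIFO replacement pointer */

static bool access_block(uint32_t block_addr) {
    uint32_t idx = block_addr % SETS;
    uint32_t tag = block_addr / SETS;

    if (cache[idx].valid && cache[idx].tag == tag)
        return true;                    /* fast direct-mapped hit */

    /* Probe the victim cache; on a hit, swap with the conflicting line. */
    for (int i = 0; i < VICTIM_WAYS; i++) {
        if (victim[i].valid && victim[i].addr == block_addr) {
            if (cache[idx].valid)
                victim[i].addr = cache[idx].tag * SETS + idx; /* old line out */
            else
                victim[i].valid = false;
            cache[idx].tag = tag;
            cache[idx].valid = true;
            return true;                /* a conflict miss was avoided */
        }
    }

    /* True miss: the displaced line moves into the victim cache. */
    if (cache[idx].valid) {
        victim[victim_next].addr  = cache[idx].tag * SETS + idx;
        victim[victim_next].valid = true;
        victim_next = (victim_next + 1) % VICTIM_WAYS;
    }
    cache[idx].tag = tag;
    cache[idx].valid = true;
    return false;
}

int main(void) {
    /* Blocks 3 and 3+SETS conflict in the direct-mapped cache. */
    printf("%d\n", access_block(3));        /* 0: compulsory miss         */
    printf("%d\n", access_block(3 + SETS)); /* 0: conflict; 3 -> victim   */
    printf("%d\n", access_block(3));        /* 1: hit in the victim cache */
    return 0;
}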
Classifying Misses: 3 Cs
• Compulsory: The first access to a block is not in the cache, so the block must be brought into the cache (also called cold-start misses)
• Capacity: The cache may not contain all blocks needed during program execution, so misses will occur due to blocks being discarded and later retrieved
• Conflict: In direct-mapped or set-associative caches, too many blocks may map to the same set, so a block can be discarded and later retrieved (also called collision misses)
Three ways to reduce misses:
1. Change block size
- Large blocks: not all of the data is useful, and it displaces useful data
2. Change associativity
3. Change compiler
• 2:1 Cache Rule: the miss rate of a direct-mapped cache of size N is about the same as that of a 2-way set-associative cache of size N/2 (for example, a 32 KB direct-mapped cache misses about as often as a 16 KB 2-way cache)
[Figure: miss rate (0% to 25%) versus cache size (1K, 4K, 16K, 64K, 256K) and associativity, illustrating the 2:1 rule; miss rates fall sharply as cache size grows.]
B.3. Reducing Miss Rate via Way Prediction and Pseudoassociativity
• How to combine the fast hit time of direct-mapped caches with the lower conflict misses of set-associative caches?
• Way prediction: extra bits kept per set predict the way (block within the set) of the next cache access
- Tag comparison only with this block (cheaper as opposed to with all)
Implementation of a Set-Associative Cache
[Figure: a 4-way set-associative cache with 256 sets; the address (bits 31 to 0) splits into a 22-bit tag, an 8-bit set index, and a block offset; the index selects a set, all four tags are compared in parallel, and a 4-to-1 multiplexor selects the 32-bit hit data.]
[Figure: way-prediction flow: a correct prediction is a fast HIT; a prediction miss that hits in another way is a SLOW HIT (change entry in prediction table); a true MISS reads the block of data from the next level of cache.]
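The HIT / SLOW HIT / MISS flow above can be sketched for a 2-way set-associative cache with a one-entry-per-set prediction table; the geometry and the simple fill policy are assumptions for illustration:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define SETS 64
#define WAYS 2

typedef enum { FAST_HIT, SLOW_HIT, MISS } Outcome;

static uint32_t tags[SETS][WAYS];
static bool     valid[SETS][WAYS];
static int      predicted_way[SETS];   /* the prediction table */

static Outcome access_block(uint32_t block_addr) {
    int idx = block_addr % SETS;
    uint32_t tag = block_addr / SETS;
    int p = predicted_way[idx];

    if (valid[idx][p] && tags[idx][p] == tag)
        return FAST_HIT;               /* single tag comparison */

    int other = 1 - p;
    if (valid[idx][other] && tags[idx][other] == tag) {
        predicted_way[idx] = other;    /* change entry in prediction table */
        return SLOW_HIT;               /* extra cycle(s) for the second probe */
    }

    /* Miss: fill an invalid way if one exists, else replace the
     * predicted way, and predict the newly filled way next time. */
    int fill = !valid[idx][p] ? p : (!valid[idx][other] ? other : p);
    tags[idx][fill] = tag;
    valid[idx][fill] = true;
    predicted_way[idx] = fill;
    return MISS;
}

int main(void) {
    const char *name[] = { "FAST_HIT", "SLOW_HIT", "MISS" };
    printf("%s\n", name[access_block(7)]);        /* MISS     */
    printf("%s\n", name[access_block(7)]);        /* FAST_HIT */
    printf("%s\n", name[access_block(7 + SETS)]); /* MISS     */
    printf("%s\n", name[access_block(7)]);        /* SLOW_HIT */
    return 0;
}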
Reducing Miss Rates by Compiler Optimizations
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improves spatial locality.
Loop Fusion: with split loops, every access to a and c misses; with fused loops, only the first access misses. Improves temporal locality.
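A sketch of the fusion transformation described above; the array names, sizes, and computations are assumptions for illustration:

#include <stdio.h>

#define N 1000
double a[N], b[N], c[N], d[N];

void split(void) {
    /* Split loops: a[] and c[] are each traversed twice, so the second
     * loop misses again on data the first loop already touched. */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];
    for (int i = 0; i < N; i++)
        d[i] = a[i] * c[i];
}

void fused(void) {
    /* Fused loop: a[i] and c[i] are reused while still in the cache,
     * so only the first access to each block misses. */
    for (int i = 0; i < N; i++) {
        a[i] = b[i] + c[i];
        d[i] = a[i] * c[i];
    }
}

int main(void) {
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }
    split();
    fused();
    printf("%f %f\n", a[N-1], d[N-1]);
    return 0;
}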
Summary of Compiler Optimizations to Reduce Cache Misses
[Figure: performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking on compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7).]
C. Using Parallelism to Reduce Miss Penalty/Rate
Idea: Permit multiple "outstanding" memory operations
• Non-blocking caches
• Hardware prefetching
• Software prefetching
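For software prefetching, the compiler (or programmer) inserts non-faulting prefetch instructions ahead of use. A minimal sketch using the GCC/Clang __builtin_prefetch intrinsic; the array size and the prefetch distance of 16 elements are assumed tuning values:

#include <stdio.h>

#define N    100000
#define DIST 16               /* prefetch this many elements ahead (assumed) */

static double a[N];

double sum_with_prefetch(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++) {
        if (i + DIST < N)
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 1);
        s += a[i];            /* by now a[i] is hopefully resident */
    }
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++) a[i] = 1.0;
    printf("%f\n", sum_with_prefetch());   /* 100000.0 */
    return 0;
}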
C.l. Non-blocking Caches to Reduce Stalls on Misses
• "hit under miss" reduces the effective miss penalty by worldng during
• "hit under multiple miss" or "miss under miss" may further lower the
[Figure: ratio of the average memory stall time when allowing hits under 1, 2, and 64 outstanding misses, for integer and floating-point benchmarks; non-blocking caches remove a large fraction of the stall time.]
Reducing Misses by Hardware Prefetching of Instructions and Data
Hardware Instruction Prefetching
• Instruction prefetch in Alpha AXP 21064
– Fetch two blocks on a miss; the requested block (i) and the
next consecutive block (i+1)
– Requested block placed in cache, and next block in
instruction stream buffer
– If miss in cache but hit in stream buffer, move stream buffer
block into cache and prefetch next block (i+2)
[Figure: on a miss, the requested block is placed in the L1 cache while the prefetched (i+1) block goes into a stream buffer between the CPU and the unified L2.]
• Strided prefetch
– If the observed sequence of accesses to blocks is b, b+N, b+2N, then prefetch b+3N, etc. (see the sketch below)
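A toy model of that stride-detection heuristic; a real prefetcher keys a table of (last address, stride, confidence) entries by load PC, so this single-stream version is a simplifying assumption:

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    int      confirmed;   /* same stride seen twice in a row? */
} StridePredictor;

/* Returns the predicted next address, or 0 when no stable stride yet. */
static uint64_t observe(StridePredictor *p, uint64_t addr) {
    int64_t stride = (int64_t)(addr - p->last_addr);
    p->confirmed = (stride == p->stride && stride != 0);
    p->stride = stride;
    p->last_addr = addr;
    return p->confirmed ? addr + stride : 0;
}

int main(void) {
    StridePredictor p = {0};
    uint64_t b = 0x1000, N = 0x40;
    observe(&p, b);                        /* first access: no stride yet   */
    observe(&p, b + N);                    /* stride learned, not confirmed */
    uint64_t pf = observe(&p, b + 2 * N);  /* b, b+N, b+2N seen: predict    */
    printf("prefetch 0x%llx\n", (unsigned long long)pf); /* b+3N = 0x10c0 */
    return 0;
}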
Performance impact of prefetching
[Figure: performance improvement from hardware prefetching on SPECint2000 and SPECfp2000 benchmarks, ranging from 1.16x to 1.97x.]
D. Reducing Cache Hit Time
Other techniques
• Avoiding address translation during cache indexing (virtually addressed caches)
- Needs to cope with several problems:
  - Reuse of virtual addresses across processes (flushing the cache after a context switch)
  - Aliasing/synonyms: two processes refer to the same physical address (results in two copies of the same block in the cache)
• Implication of indexing with the page offset (Tag | index | offset): the index and block offset must fit within the page offset, so cache size is limited to page size × associativity
• Set-associative caches allow a larger cache for a given page size:
- Pentium III (8KB pages): 2-way set-associative 16 KB cache
- IBM 3033 (4KB pages): 16-way set-associative 64 KB cache
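Both machines satisfy the constraint that the bytes per way (cache size divided by associativity) must not exceed the page size if the cache is to be indexed with untranslated bits; a quick check (the helper name is illustrative):

#include <stdio.h>

/* A cache can be indexed before translation completes only if its
 * index and block offset fit within the page offset, i.e. if
 *   cache size / associativity <= page size. */
static int virtually_indexable(long cache_bytes, int ways, long page_bytes) {
    return cache_bytes / ways <= page_bytes;
}

int main(void) {
    printf("Pentium III: %d\n", virtually_indexable(16 * 1024, 2, 8 * 1024));  /* 1 */
    printf("IBM 3033:    %d\n", virtually_indexable(64 * 1024, 16, 4 * 1024)); /* 1 */
    return 0;
}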
Translation Lookaside Buffer
TLB Operation
[Figure: TLB operation: the virtual address (page # | offset) probes the TLB; a TLB hit supplies the real address (tag | remainder) directly, while a TLB miss walks the page table in main memory; the real address then probes the cache, and a cache miss fetches the value from main memory.]
[Figure: Virtual page number <51> | Page offset <13>; TLB tag compare address <43> | TLB index <8>; L1 cache index <7> | Block offset <6>; TLB: tag <43>, data <28>; L1 cache (8 KB): tag <28>, data <512> to CPU; L2 tag compare address <19> | L2 cache index <16> | Block offset <6>; L2 cache (4 MB), data to L1 cache or CPU.]
Figure 5.3 The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access. The page size is 8 KB. The TLB is direct-mapped with 256 entries. The L1 cache is a direct-mapped 8 KB, and the L2 cache is a direct-mapped 4 MB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address is 41 bits. The primary difference between this figure and a real memory hierarchy, as in Figure 5.18 on page 327, is higher associativity for caches and TLBs and a smaller virtual address than 64 bits.
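The field widths in the figure follow directly from its parameters. The sketch below re-derives them, inferring the 41-bit physical address from the 28-bit TLB data plus the 13-bit page offset (the helper name is illustrative):

#include <stdio.h>

/* Re-deriving the Figure 5.3 field widths from its parameters:
 * 64-bit virtual address, 8 KB pages, 256-entry direct-mapped TLB,
 * 8 KB direct-mapped L1, 4 MB direct-mapped L2, 64-byte blocks. */
static int log2i(long x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void) {
    int page_offset = log2i(8 * 1024);              /* 13 */
    int vpn         = 64 - page_offset;             /* 51 */
    int tlb_index   = log2i(256);                   /*  8 */
    int tlb_tag     = vpn - tlb_index;              /* 43 */
    int block_off   = log2i(64);                    /*  6 */
    int l1_index    = log2i(8 * 1024) - block_off;  /*  7 */
    int l2_index    = log2i(4L * 1024 * 1024) - block_off; /* 16 */
    int phys_addr   = 41;                           /* physical address bits */
    int ppn         = phys_addr - page_offset;      /* 28 = TLB data width   */
    int l1_tag      = phys_addr - l1_index - block_off;    /* 28 */
    int l2_tag      = phys_addr - l2_index - block_off;    /* 19 */

    printf("VPN<%d> offset<%d> TLBtag<%d> TLBidx<%d>\n",
           vpn, page_offset, tlb_tag, tlb_index);
    printf("L1idx<%d> blk<%d> L1tag<%d> PPN<%d> L2idx<%d> L2tag<%d>\n",
           l1_index, block_off, l1_tag, ppn, l2_index, l2_tag);
    return 0;
}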
Inclusive Cache
[Figure: inclusive cache: panels showing blocks installed in both L1 and L2; evicting a block from L2 forces a back-invalidation of its copy in L1.]
Consider a CPU with two levels of cache memory. Now, suppose a block X is requested. If the block is found in
the L1 cache, then the data is read from the L1 cache and consumed by the CPU core. However, if the block is
not found in the L1 cache, but is present in L2, then it’s fetched from the L2 cache and placed in L1.
If the L1 cache is also full, a block is evicted from L1 to make room for the newer block while the L2 cache is
unchanged. However, if the data block is found neither in L1 nor in L2, then it's fetched from memory and placed
in both the cache levels. In this case, if the L2 cache is full and a block is evicted to make room for the new data,
the L2 cache sends an invalidation request to the L1 cache, so the evicted block is removed from there as well.
Due to this invalidation procedure, an inclusive cache is slightly slower than a non-inclusive or exclusive cache.
Exclusive Cache
[Figure: exclusive cache: panels (a) to (d): initial state; install X in L1 only; install Y in L1 only; a block evicted from L1 is placed in L2.]
Now, let’s consider the same example with non-inclusive or exclusive cache. Let’s suppose that the CPU core
sends a request for block X. If block X is found in L1, then it’s read and consumed by the core from that location.
However, if block X is not found in L1, but present in L2, then it’s moved from L2 to L1. If there’s no room in L1,
one block is evicted from L1 and stored in L2. This is the only way the L2 cache is populated; as such, it acts as a
victim cache. If block X isn’t found in L1 or L2, then it’s fetched from the memory and placed in just L1.
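The two fill policies can be contrasted in a few lines. The sketch below reduces each level to a single block so that every fill forces an eviction; this degenerate geometry and the helper names are assumptions made for brevity:

#include <stdbool.h>
#include <stdio.h>

typedef struct { int block; bool valid; } Level;   /* one-block cache level */

static void access_inclusive(Level *l1, Level *l2, int x) {
    if (l1->valid && l1->block == x) return;        /* L1 hit              */
    if (l2->valid && l2->block == x) {              /* L2 hit: copy up,    */
        *l1 = *l2;                                  /* L2 stays unchanged  */
        return;
    }
    /* Miss in both: evicting L2's old block must back-invalidate any
     * copy in L1 to preserve inclusion. */
    if (l2->valid && l1->valid && l1->block == l2->block)
        l1->valid = false;                          /* back-invalidation   */
    l2->block = x; l2->valid = true;                /* place in both levels */
    l1->block = x; l1->valid = true;
}

static void access_exclusive(Level *l1, Level *l2, int x) {
    if (l1->valid && l1->block == x) return;        /* L1 hit              */
    if (l2->valid && l2->block == x) l2->valid = false; /* move, not copy  */
    if (l1->valid) *l2 = *l1;                       /* L1 victim -> L2     */
    l1->block = x; l1->valid = true;                /* new block: L1 only  */
}

int main(void) {
    Level a1 = {0}, a2 = {0}, b1 = {0}, b2 = {0};
    access_inclusive(&a1, &a2, 1); access_inclusive(&a1, &a2, 2);
    access_exclusive(&b1, &b2, 1); access_exclusive(&b1, &b2, 2);
    printf("inclusive: L1=%d L2=%d (duplicated)\n", a1.block, a2.block);
    printf("exclusive: L1=%d L2=%d (disjoint)\n",   b1.block, b2.block);
    return 0;
}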
PSEUDO ASSOCIATIVE / COLUMN ASSOCIATIVE CACHE
[Figure: column-associative cache lookup: the index selects a primary block; on a miss there, the cache is probed again at the index with its high-order bit flipped; a hit at the alternate location swaps the two blocks, while a full miss clobbers an entry.]
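A minimal sketch of that column-associative lookup, matching the figure's 8-set cache; storing full block addresses per line and the simple clobber policy are illustrative assumptions:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define SETS 8                          /* 3-bit index, as in the figure */

typedef struct { uint32_t block; bool valid; } Line;
static Line cache[SETS];

typedef enum { FAST_HIT, SLOW_HIT, MISS } Outcome;

static Outcome access_block(uint32_t block_addr) {
    uint32_t idx = block_addr % SETS;
    uint32_t alt = idx ^ (SETS >> 1);   /* index with high-order bit flipped */

    if (cache[idx].valid && cache[idx].block == block_addr)
        return FAST_HIT;

    if (cache[alt].valid && cache[alt].block == block_addr) {
        Line tmp = cache[idx];          /* swap so the next reference   */
        cache[idx] = cache[alt];        /* to this block is a fast hit  */
        cache[alt] = tmp;
        return SLOW_HIT;
    }

    /* True miss: the new block takes the primary slot; the displaced
     * block is rehashed into the alternate slot (clobbering it). */
    if (cache[idx].valid)
        cache[alt] = cache[idx];
    cache[idx].block = block_addr;
    cache[idx].valid = true;
    return MISS;
}

int main(void) {
    const char *name[] = { "FAST_HIT", "SLOW_HIT", "MISS" };
    printf("%s\n", name[access_block(2)]);        /* MISS                */
    printf("%s\n", name[access_block(2 + SETS)]); /* MISS, 2 rehashed    */
    printf("%s\n", name[access_block(2)]);        /* SLOW_HIT, swap      */
    printf("%s\n", name[access_block(2)]);        /* FAST_HIT            */
    return 0;
}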
Technique                     Miss penalty   Miss rate   Hardware complexity
Multilevel caches                  +                              2
Victim caches                      +             +                2
Higher associativity                             +                1
Pseudoassociative caches                         +                2
Non-blocking caches                +                              3