
(Review) Cache Organization

• Cache is the name given to the first level of the memory hierarchy encountered once the address leaves the CPU
  - It serves as a temporary place where frequently-used values can be stored
  - A value retains the same name (address) as in memory (different from registers)
  - Goal: avoid having to go to memory every time the value is needed
• Caches are faster (hence more expensive, limited in size) than DRAM
• Caches store values at the granularity of cache blocks (lines)
  - Larger than a single word: efficiency and spatial locality concerns
• Cache hit if the value is in the cache, else cache miss
• Effect of caches on CPU execution time
  CPU time = (CPU execution clock cycles + Memory stall clock cycles) × Clock cycle time
  Memory stall clock cycles = (Reads × Read miss rate × Read miss penalty +
                               Writes × Write miss rate × Write miss penalty)
                            = Memory accesses × Miss rate × Miss penalty


[Figure: overview of cache optimization techniques: small/simple caches, larger block size, multilevel caches, non-blocking caches, avoiding address translation, larger cache size, critical word first, hardware prefetching, higher associativity, read miss before write, compiler prefetching, way prediction, pipelined cache access, trace caches, merging write buffers, victim caches, compiler optimizations.]

Improving Cache Performance

• CPU time = (CPU execution clock cycles + Memory stall clock cycles) × Clock cycle time
• Memory stall clock cycles = (Reads × Read miss rate × Read miss penalty +
                               Writes × Write miss rate × Write miss penalty)
                            = Memory accesses × Miss rate × Miss penalty
• The above assumes 1 cycle to hit in the cache
  - Hard to achieve in current-day processors (faster clocks, larger caches)
  - More reasonable to also include hit time in the performance equation
• Average memory access time = Hit time + Miss rate × Miss penalty
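As a concrete illustration of these equations, the short C sketch below plugs in assumed numbers (base CPI, references per instruction, miss rate, miss penalty); the specific values are made up for the example, not taken from the slides.

#include <stdio.h>

int main(void) {
    /* Assumed workload parameters (illustrative only) */
    double insts             = 1e9;   /* instructions executed           */
    double base_cpi          = 1.0;   /* CPI with a perfect cache        */
    double mem_refs_per_inst = 1.3;   /* fetches + loads + stores        */
    double miss_rate         = 0.04;  /* 4% of accesses miss             */
    double miss_penalty      = 100.0; /* cycles to service a miss        */
    double cycle_time_ns     = 0.5;   /* clock cycle time in nanoseconds */

    /* Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty */
    double stall_cycles = insts * mem_refs_per_inst * miss_rate * miss_penalty;

    /* CPU time = (CPU execution cycles + Memory stall cycles) x Clock cycle time */
    double cpu_time_ns = (insts * base_cpi + stall_cycles) * cycle_time_ns;

    printf("Memory stall cycles: %.3g\n", stall_cycles);
    printf("CPU time: %.3f s\n", cpu_time_ns * 1e-9);
    return 0;
}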


A.1. Reducing Miss Penalty via Multilevel Caches

• Idea: Have multiple levels of caches
  - Tradeoff between size (cache effectiveness) and cost (access time)
• For a 2-level cache
  Average memory access time = Hit time (L1) + Miss rate (L1) × Miss penalty (L1)
  Miss penalty (L1) = Hit time (L2) + Miss rate (L2) × Miss penalty (L2)
• Distinguish between two kinds of miss rates
  - Local miss rate = Miss rate (L1) or Miss rate (L2)
  - Global miss rate = Number of misses / total number of memory accesses
    = Miss rate (L1) for L1, but Miss rate (L1) × Miss rate (L2) for L2
• Example: 1000 references, 40 misses in the L1 cache and 20 in L2
  - Local miss rates: 4% (L1), 50% (L2) = 20/40
  - Global miss rates: 4% (L1), 2% (L2)
  - Avg. memory access time = 1 + 4% × (10 + 50% × 100) = 3.4 cycles
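The example above can be checked with a few lines of C; the 1-cycle L1 hit time, 10-cycle L2 hit time, and 100-cycle memory penalty are the values implied by the 3.4-cycle result.

#include <stdio.h>

int main(void) {
    double refs      = 1000.0;   /* memory references        */
    double l1_misses = 40.0;
    double l2_misses = 20.0;
    double l2_hit    = 10.0;     /* L2 hit time (cycles)     */
    double mem_pen   = 100.0;    /* L2 miss penalty (cycles) */

    double l1_local  = l1_misses / refs;        /* 0.04 */
    double l2_local  = l2_misses / l1_misses;   /* 0.50 */
    double l2_global = l2_misses / refs;        /* 0.02 */

    /* AMAT = Hit time(L1) + Miss rate(L1) x (Hit time(L2) + Miss rate(L2) x Miss penalty(L2)) */
    double amat = 1.0 + l1_local * (l2_hit + l2_local * mem_pen);

    printf("local L2 miss rate  = %.2f\n", l2_local);
    printf("global L2 miss rate = %.2f\n", l2_global);
    printf("AMAT = %.1f cycles\n", amat);       /* 3.4 */
    return 0;
}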


Multilevel Caches (cont'd)

• Doesn't make much sense to have L2 caches smaller than L1 caches
• L2 needs to be significantly bigger to have reasonable miss rates
  - The cost of a big L2 is smaller than that of a big L1
• Exclusive and cooperative caches

[Figure: miss rate versus cache size from 4 KB to 4096 KB, plotting the single-cache miss rate, the global miss rate, and the local miss rate of the second-level cache. For small L2 sizes the local miss rate is very high (up to roughly 99%), while the global miss rate closely tracks the single-cache miss rate.]


A.2. Reducing Miss Penalty via Critical Word First and Early Restart

• Idea: Don't wait for the full block to be loaded before restarting the CPU
  - Early restart: Request the words in a block in order. As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  - Critical word first: Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch or requested word first
• Drawbacks
  - Generally useful only with large blocks
  - Programs exhibiting spatial locality are a problem; they tend to want the next sequential word, so there is limited benefit from early restart
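A minimal sketch of the fill order a critical-word-first controller might use, assuming an 8-word block and a wrap-around (modulo) walk starting at the missed word; the function name and parameters are illustrative.

#include <stdio.h>

#define WORDS_PER_BLOCK 8

/* Print the order in which the words of a block are fetched when the
 * miss was to word `critical` (wrapped fetch). */
static void critical_word_first_order(int critical) {
    for (int i = 0; i < WORDS_PER_BLOCK; i++) {
        int word = (critical + i) % WORDS_PER_BLOCK;
        printf("%d ", word);            /* word `critical` arrives first */
    }
    printf("\n");
}

int main(void) {
    critical_word_first_order(5);       /* prints: 5 6 7 0 1 2 3 4 */
    return 0;
}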


A.5. Reducing Miss Penalty via a "Victim Cache"

• How to combine the fast hit time of direct-mapped caches, yet still avoid conflict misses?
  - Remember what was recently discarded, just in case it is needed again
  - Jouppi [1990]: a 4-entry victim cache reduced conflict misses by 20% to 95% for a 4 KB direct-mapped data cache
  - Used in Alpha, HP machines

[Figure: victim cache organization. The CPU sends an address and exchanges data with the tag and data arrays of the cache; blocks evicted from the cache go into a small victim cache, which is checked on a miss before the write buffer and lower-level memory.]
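The sketch below simulates the lookup path described above: a direct-mapped cache backed by a small victim cache that holds recently evicted tags. The sizes, names, and tag-only bookkeeping are assumptions for illustration, not the Alpha/HP design.

#include <stdint.h>
#include <stdbool.h>

#define SETS       64    /* direct-mapped cache: 64 blocks (illustrative) */
#define VICTIMS     4    /* 4-entry victim cache, as in Jouppi [1990]     */
#define BLOCK_BITS  5    /* 32-byte blocks                                */

typedef struct { bool valid; uint32_t tag; } Line;

static Line cache[SETS];
static Line victim[VICTIMS];
static int  victim_next;                 /* simple FIFO replacement */

/* Returns true on a hit (in the main cache or in the victim cache). */
bool vc_access(uint32_t addr) {
    uint32_t block = addr >> BLOCK_BITS;
    uint32_t set   = block % SETS;
    uint32_t tag   = block / SETS;

    if (cache[set].valid && cache[set].tag == tag)
        return true;                     /* fast direct-mapped hit */

    /* Check the victim cache (associative search over its few entries). */
    for (int i = 0; i < VICTIMS; i++) {
        if (victim[i].valid && victim[i].tag == block) {
            /* Swap the victim entry with the conflicting cache block. */
            Line evicted = cache[set];
            cache[set].valid = true;
            cache[set].tag   = tag;
            victim[i].valid  = evicted.valid;
            victim[i].tag    = evicted.valid ? (evicted.tag * SETS + set) : 0;
            return true;                 /* slower hit, but no memory access */
        }
    }

    /* Miss: the evicted block goes to the victim cache, the new block fills the set. */
    if (cache[set].valid) {
        victim[victim_next].valid = true;
        victim[victim_next].tag   = cache[set].tag * SETS + set;
        victim_next = (victim_next + 1) % VICTIMS;
    }
    cache[set].valid = true;
    cache[set].tag   = tag;
    return false;
}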


B. Reducing Cache Misses

Classifying Misses: the 3 Cs

• Compulsory (also called cold start or first reference misses)
  - The first access to a block is not in the cache, so the block must be brought into the cache.
  - (Misses in even an infinite cache)
• Capacity
  - The cache may not contain all blocks needed during program execution, so misses will occur due to blocks being discarded and later retrieved.
  - (Misses in a fully associative cache of size X)
• Conflict (also called collision or interference misses)
  - Additional misses that occur because another block is occupying the cache location (the rest of the cache might be unused).
  - (Misses in an N-way associative cache of size X)


How Can We Reduce Misses?

• 3 Cs: Compulsory, Capacity, Conflict
• If we assume that total cache size is not changed, what happens if we:
  1. Change block size: which of the 3 Cs is obviously affected?
  2. Change associativity: which of the 3 Cs is obviously affected?
  3. Change the compiler: which of the 3 Cs is obviously affected?


B.1. Reducing Miss Rate via Larger Block Sizes

• Small blocks: data accesses are spread over multiple blocks
• Large blocks: not all of the data is useful, and it displaces useful data
• Also note that larger blocks mean a higher miss penalty

[Figure: miss rate versus block size (16 to 256 bytes) for cache sizes of 1 KB, 4 KB, 16 KB, 64 KB, and 256 KB. Miss rate first drops as blocks grow, then rises again for the smaller caches.]


B.2. Reducing Miss Rate via Higher Associativity

• 2:1 cache rule
  - Miss rate of a direct-mapped cache of size N ≈ miss rate of a 2-way set-associative cache of size N/2
• Is this actually the case?
• Issue: the increase in clock cycle time (CCT) may diminish the benefits
  - Higher associativity leads to higher hit time and can outweigh the benefit


B.3. Reducing Miss Rate via Way Prediction and Pseudoassociativity

• How to combine the fast hit time of direct-mapped caches with the lower conflict misses of set-associative caches?
  - Previously looked at victim caches
• Way prediction: predict which block in a set is likely to be accessed by the next memory access hitting this set
  - Tag comparison only with this block (cheaper as opposed to comparing with all)
  - Higher cost to check non-predicted blocks
  - Simplest prediction: remember the last way accessed
  - Used in the Alpha 21264 (1 cycle if the prediction is correct (85%), 3 cycles otherwise)
• Pseudoassociative or column associative
  - Access proceeds as in a direct-mapped cache
  - On a miss, check another location (the "pseudoset") before going to memory
  - Counts as a "slower hit"
  - If most hits become slow hits, degrading performance is possible
  - Used in the MIPS R10000 L2 cache; similar in UltraSPARC
Implementation of a Set-Associative Cache
[Figure: 4-way set-associative cache with 4 comparators and one 4-to-1 multiplexor. The cache holds 1K blocks = 256 sets × 4 blocks per set; the 32-bit address is split into a 22-bit tag, an 8-bit set index, and a block/byte offset. The index selects a set, the four tags are compared in parallel with the address tag, and the multiplexor selects the data from the matching way.]
Way Predicting Caches

• Use the processor address to index into the way prediction table
• Look in the predicted way at the given index, then:
  - HIT: return a copy of the data from the cache
  - MISS: look in the other way
    - HIT: a slow hit (change the entry in the prediction table)
    - MISS: read the block of data from the next level of cache
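A minimal sketch of the lookup flow above for a 2-way set-associative cache with a per-set way predictor; the data structures, block size, and return codes are assumptions for illustration.

#include <stdint.h>
#include <stdbool.h>

#define SETS 256
#define WAYS 2

typedef struct { bool valid; uint32_t tag; } Line;

static Line    cache[SETS][WAYS];
static uint8_t predicted_way[SETS];     /* way prediction table, one entry per set */

typedef enum { FAST_HIT, SLOW_HIT, MISS } Outcome;

Outcome wp_access(uint32_t addr) {
    uint32_t block = addr >> 6;         /* 64-byte blocks (assumed) */
    uint32_t set   = block % SETS;
    uint32_t tag   = block / SETS;

    /* 1. Probe only the predicted way: one tag comparison, fast hit. */
    uint8_t w = predicted_way[set];
    if (cache[set][w].valid && cache[set][w].tag == tag)
        return FAST_HIT;

    /* 2. Probe the other way: slow hit, and update the predictor. */
    uint8_t other = (uint8_t)(1 - w);
    if (cache[set][other].valid && cache[set][other].tag == tag) {
        predicted_way[set] = other;
        return SLOW_HIT;
    }

    /* 3. Miss: fill the predicted way from the next level (fill omitted). */
    cache[set][w].valid = true;
    cache[set][w].tag   = tag;
    return MISS;
}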
Reducing Miss Rates by Compiler Optimizations

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

• The "After" version accesses memory sequentially instead of in strides of 100 words
• Improved spatial locality: use all of the words in the fetched blocks


Merging Arrays

/* Before: two parallel arrays */
int val[SIZE];
int key[SIZE];

for (i = 0; i < SIZE; i++) {
  key[i] = newkey;
  val[i]++;
}

/* After: one array of structures */
struct record {
  int val;
  int key;
};
struct record records[SIZE];

for (i = 0; i < SIZE; i++) {
  records[i].key = newkey;
  records[i].val++;
}

• Reduces conflicts between val and key and improves spatial locality
Loop Fusion

/* Before: split loops */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    d[i][j] = a[i][j] + c[i][j];

/* After: fused loops; the second reference to a[i][j] can come directly from a register */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

• Split loops: every access to a and c misses. Fused loops: only the first access misses. Improves temporal locality.
Summary of Compiler Optimizations to Reduce Cache Misses

[Figure: performance improvement (roughly 1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking on vpenta (nasa7), gmty (nasa7), tomcatv, btrix (nasa7), mxm (nasa7), spice, cholesky (nasa7), and compress.]
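Blocking appears in the summary above but is not shown earlier; the sketch below is the usual matrix-multiply formulation of the idea (computing on B×B sub-blocks so that the data touched in the inner loops fits in the cache). The block size B and the loop structure are illustrative, not taken from the slides.

/* Blocked (tiled) matrix multiply: y = y + z * x, all N x N */
#define N 512
#define B 32      /* block size chosen so a few B x B tiles fit in the cache */

void matmul_blocked(double y[N][N], const double z[N][N], const double x[N][N]) {
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += z[i][k] * x[k][j];   /* reuse the B x B tiles of z and x */
                    y[i][j] += r;
                }
}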
C. Using Parallelism to Reduce Miss Penalty/Rate

• Idea: permit multiple "outstanding" memory operations
  - Can overlap memory access latencies
  - Can benefit from activity done on behalf of other operations
• Three commonly employed schemes
  - Non-blocking caches
  - Hardware prefetching
  - Software prefetching (see the sketch after this list)
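Software (compiler-controlled) prefetching is typically exposed through an intrinsic; the sketch below uses GCC/Clang's __builtin_prefetch to request array elements a few iterations ahead. The prefetch distance of 16 is an assumption for illustration.

/* Software prefetching with a compiler intrinsic (GCC/Clang __builtin_prefetch) */
double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);   /* read access, low temporal locality */
        s += a[i];
    }
    return s;
}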
C.1. Non-blocking Caches to Reduce Stalls on Misses

• Decoupled instruction and data caches allow the CPU to continue fetching instructions while waiting on a data cache miss
• L1 cache misses can be tolerated by superscalar out-of-order machines
• Non-blocking or lockup-free caches allow the data cache to continue to supply cache hits during a miss
  - Requires an out-of-order execution CPU
• "Hit under miss" reduces the effective miss penalty by working during the miss instead of ignoring CPU requests
• "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  - Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  - Typically also requires multiple memory banks
  - The Pentium Pro allows 4 outstanding memory misses
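One way to picture a non-blocking cache is a small table of miss status holding registers (MSHRs) that tracks outstanding misses; while entries are available, the cache keeps servicing hits and new misses instead of stalling. The structure below is a simplified sketch under that assumption, not a description of any particular controller.

#include <stdint.h>
#include <stdbool.h>

#define MSHRS 4     /* e.g. the Pentium Pro tolerates 4 outstanding misses */

typedef struct {
    bool     busy;
    uint32_t block;            /* block address being fetched from memory */
} Mshr;

static Mshr mshr[MSHRS];

typedef enum { MISS_ACCEPTED, STALL } MissResult;

/* Called on a cache miss for `block`; hits are serviced without reaching this code. */
MissResult handle_miss(uint32_t block) {
    for (int i = 0; i < MSHRS; i++)
        if (mshr[i].busy && mshr[i].block == block)
            return MISS_ACCEPTED;           /* merge with an already outstanding miss */

    for (int i = 0; i < MSHRS; i++) {
        if (!mshr[i].busy) {
            mshr[i].busy  = true;           /* allocate: the miss proceeds in the background */
            mshr[i].block = block;
            return MISS_ACCEPTED;
        }
    }
    return STALL;                           /* all MSHRs busy: the cache must block */
}

/* Called when memory returns the data for `block`. */
void fill_complete(uint32_t block) {
    for (int i = 0; i < MSHRS; i++)
        if (mshr[i].busy && mshr[i].block == block)
            mshr[i].busy = false;
}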


Value of Hit-Under-Miss for SPEC92

8 KB direct-mapped cache, 32-byte blocks, 16-cycle miss penalty

[Figure: percentage of the average memory stall time that remains under "hit under 1 miss", "hit under 2 misses", and "hit under 64 misses" for the SPEC92 integer and floating-point benchmarks. Allowing one outstanding miss already removes a substantial fraction of the stall time.]
Reducing misses by hardware
prefetching of Instruction and Data
Hardware Instruction Prefetching
• Instruction prefetch in Alpha AXP 21064
– Fetch two blocks on a miss; the requested block (i) and the
next consecutive block (i+1)
– Requested block placed in cache, and next block in
instruction stream buffer
– If miss in cache but hit in stream buffer, move stream buffer
block into cache and prefetch next block (i+2)
[Figure: on an instruction miss, the requested block is placed in the L1 instruction cache while the prefetched block (i+1) is placed in the stream buffer; both are filled from the unified L2 cache.]
Hardware Data Prefetching
• Prefetch-on-miss:
– Prefetch b + 1 upon miss on b

• One Block Lookahead (OBL) scheme


– Initiate prefetch for block b + 1 when block b is accessed
– Can extend to N block lookahead

• Strided prefetch
– If observed sequence of accesses to block: b, b+N, b+2N, then prefetch
b+3N etc.
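A minimal sketch of the strided scheme above: track the last block address and the last observed stride, and issue a prefetch for the next block once the same stride is seen twice in a row. The single-stream table and the issue_prefetch() stub are assumptions for illustration; real prefetchers track many streams.

#include <stdio.h>
#include <stdint.h>

/* Stand-in for handing a prefetch request to the memory system. */
static void issue_prefetch(uint64_t block) {
    printf("prefetch block %llu\n", (unsigned long long)block);
}

static uint64_t last_block;
static int64_t  last_stride;

/* Called on every demand access to `block` (e.g. b, b+N, b+2N, ...). */
void train_prefetcher(uint64_t block) {
    int64_t stride = (int64_t)block - (int64_t)last_block;

    /* Once the same non-zero stride is seen twice in a row, prefetch the
     * next block in the sequence (after b, b+N, b+2N: prefetch b+3N). */
    if (stride != 0 && stride == last_stride)
        issue_prefetch((uint64_t)((int64_t)block + stride));

    last_stride = stride;
    last_block  = block;
}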
Performance Impact of Prefetching

[Figure: performance improvement from prefetching across the SPECint2000 and SPECfp2000 benchmarks shown; speedups range from 1.16 to 1.97.]
D. Reducing Cache Hit Time

• Obvious approach: smaller and simpler (low-associativity) caches
  - Notable that L1 cache sizes have not increased
  - Alpha 21264/21364; UltraSPARC II/III; AMD K6/Athlon
• Other techniques
  - Avoiding address translation during cache lookup
• Alternative 1: Index caches using virtual addresses
  - Needs to cope with several problems:
    - Protection (performed during address translation)
    - Reuse of virtual addresses across processes (flushing the cache after a context switch)
    - Aliasing/synonyms: two processes refer to the same physical address (results in multiple copies of the same data)
    - I/O (typically uses physical addresses)
• Alternative 2: Use the part of the address that does not change between virtual and physical addresses (the page offset) to index the cache


D.1. Virtually Indexed, Physically Tagged Caches

• Overlap indexing of the cache with translation of the virtual address
  - Tag comparison is done with physical addresses
• Address layout: block address = | Tag | Index |, followed by the block offset; the index and block offset must come from the page offset
• Implications
  - Direct-mapped caches can be no bigger than the page size
  - Set-associative caches: the page offset can be viewed as (index + block offset), so
    Cache size = 2^(page offset bits) × Set associativity
  - So increased associativity allows larger cache sizes
• Pentium III (8 KB pages): 2-way set-associative 16 KB cache
• IBM 3033 (4 KB pages): 16-way set-associative 64 KB cache
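The constraint above can be checked numerically; the little C program below reproduces the two examples from the slide (8 KB pages with 2-way associativity giving 16 KB, 4 KB pages with 16-way associativity giving 64 KB).

#include <stdio.h>

/* Maximum virtually indexed, physically tagged cache size:
 * the index and block offset must fit inside the page offset,
 * so cache size = page size x set associativity. */
static long max_vipt_cache(long page_size, int associativity) {
    return page_size * associativity;
}

int main(void) {
    printf("Pentium III: %ld KB\n", max_vipt_cache(8 * 1024, 2)  / 1024);   /* 16 KB */
    printf("IBM 3033:    %ld KB\n", max_vipt_cache(4 * 1024, 16) / 1024);   /* 64 KB */
    return 0;
}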
Translation Lookaside Buffer

TLB Operation

[Figure 8.10: Translation Lookaside Buffer and Cache Operation. The virtual address is split into a page number and an offset. The page number is looked up in the TLB: on a TLB hit, the real (physical) address is formed directly; on a TLB miss, the page table in main memory is consulted. The real address is then split into a tag and a remainder for the cache lookup: on a cache hit the value is returned, and on a cache miss the block is fetched from main memory.]
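A minimal sketch of the TLB lookup path in the figure, using a direct-mapped TLB indexed by the low bits of the virtual page number; the sizes and the page-table stub are illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS   13            /* 8 KB pages, as in the figure that follows */
#define TLB_ENTRIES 256

typedef struct {
    bool     valid;
    uint64_t vpn;                 /* virtual page number  */
    uint64_t ppn;                 /* physical page number */
} TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];

/* Stub for the page-table walk performed on a TLB miss. */
static uint64_t page_table_lookup(uint64_t vpn) {
    return vpn;                   /* identity mapping stands in for a real page table */
}

/* Translate a virtual address to a real (physical) address. */
uint64_t translate(uint64_t vaddr) {
    uint64_t offset = vaddr & ((1ULL << PAGE_BITS) - 1);
    uint64_t vpn    = vaddr >> PAGE_BITS;
    uint64_t index  = vpn % TLB_ENTRIES;        /* direct-mapped TLB */

    if (!(tlb[index].valid && tlb[index].vpn == vpn)) {
        /* TLB miss: walk the page table and refill the TLB entry. */
        tlb[index].valid = true;
        tlb[index].vpn   = vpn;
        tlb[index].ppn   = page_table_lookup(vpn);
    }
    return (tlb[index].ppn << PAGE_BITS) | offset;
}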


[Figure 5.3: The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access. The page size is 8 KB. The TLB is direct-mapped with 256 entries. The L1 cache is a direct-mapped 8 KB and the L2 cache is a direct-mapped 4 MB; both use 64-byte blocks. The virtual address is 64 bits (51-bit virtual page number, 13-bit page offset) and the physical address is 41 bits. Field widths: TLB tag compare address <43>, TLB index <8>, L1 cache index <7>, block offset <6>, L1 tag compare address <28>, L2 tag compare address <19>, L2 cache index <16>. The primary difference between this figure and a real memory hierarchy is higher associativity for caches and TLBs and a smaller virtual address than 64 bits.]
Inclusive Cache

[Figure: snapshots of a two-level hierarchy with an inclusive L2: (a) initial state; (b) read X misses, X is installed in both L1 and L2; (c) read Y misses, Y is installed in both L1 and L2; (d) X is evicted from L1 only; (e) Y is evicted from L2, and a back-invalidation removes it from L1 as well.]

Consider a CPU with two levels of cache memory. Now suppose a block X is requested. If the block is found in the L1 cache, the data is read from L1 and consumed by the CPU core. However, if the block is not found in the L1 cache but is present in L2, it is fetched from the L2 cache and placed in L1.

If the L1 cache is full, a block is evicted from L1 to make room for the newer block while the L2 cache is unchanged. However, if the data block is found in neither L1 nor L2, it is fetched from memory and placed in both cache levels. In this case, if the L2 cache is full and a block is evicted to make room for the new data, the L2 cache sends an invalidation request to the L1 cache, so the evicted block is removed from there as well. Due to this invalidation procedure, an inclusive cache is slightly slower than a non-inclusive or exclusive cache.
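A compact sketch of the inclusive fill/evict rules described above; the tiny single-set "caches" and the fixed victim slot are assumptions made purely to keep the example short.

#include <stdint.h>
#include <stdbool.h>

#define L1_WAYS 2
#define L2_WAYS 4                 /* toy fully associative caches of a few blocks */

static uint64_t l1[L1_WAYS]; static bool l1_valid[L1_WAYS];
static uint64_t l2[L2_WAYS]; static bool l2_valid[L2_WAYS];

static int find(uint64_t *arr, bool *valid, int n, uint64_t block) {
    for (int i = 0; i < n; i++)
        if (valid[i] && arr[i] == block) return i;
    return -1;
}

/* Inclusive policy: every block in L1 is also in L2, so evicting a block
 * from L2 must back-invalidate it in L1. */
void access_block(uint64_t block) {
    if (find(l1, l1_valid, L1_WAYS, block) >= 0) return;        /* L1 hit */

    if (find(l2, l2_valid, L2_WAYS, block) < 0) {
        /* Miss in both levels: fetch from memory and install in L2 (slot 0 is the victim). */
        if (l2_valid[0]) {
            int v = find(l1, l1_valid, L1_WAYS, l2[0]);
            if (v >= 0) l1_valid[v] = false;                    /* back-invalidation */
        }
        l2[0] = block; l2_valid[0] = true;
    }
    /* Install in L1 (slot 0 is the victim); an L1 eviction leaves L2 unchanged. */
    l1[0] = block; l1_valid[0] = true;
}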
Exclusive Cache

[Figure: snapshots of a two-level hierarchy with an exclusive L2: (a) initial state; (b) read X misses, X is installed in L1 only; (c) read Y misses, Y is installed in L1 only; (d) X is evicted from L1 and placed in L2.]

Now consider the same example with a non-inclusive or exclusive cache. Suppose the CPU core sends a request for block X. If block X is found in L1, it is read and consumed by the core from that location. However, if block X is not found in L1 but is present in L2, it is moved from L2 to L1. If there is no room in L1, one block is evicted from L1 and stored in L2. This is the only way the L2 cache is populated, and as such it acts as a victim cache. If block X is found in neither L1 nor L2, it is fetched from memory and placed in just L1.
PSEUDO-ASSOCIATIVE / COLUMN-ASSOCIATIVE CACHE

[Figure 1: Indexing into a direct-mapped cache using bit-selection hashing; the tag and index fields of the address select a single cache location.]

[Figure 2: Comparison of column-associative and two-way set-associative caches of equal size. The conflict b[a1] = b[a2] is resolved by both schemes.]

[Figure 3: Indexing into a cache by bit selection and by bit flipping. The conflict b[a1] = b[a2] is resolved by the bit-flipping rehash, which places the second block at the index with its high-order bit inverted.]

[Figure 4: Appending the high-order bit of the index to the tag. This technique is necessary when bit flipping is implemented, so that a hit in a rehashed location is not mistaken for a hit in the primary location.]

[Figure 5: Decision tree for the hash-rehash algorithm. The first access either hits (access complete) or misses; on a miss the rehashed location is probed. A hit there swaps the two blocks, while a second miss triggers a long memory reference, a replacement (clobber), and a swap before the access completes.]
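A sketch of the hash-rehash lookup from Figure 5, assuming bit selection for the first probe and flipping of the high-order index bit for the rehash probe, with a swap on a rehash hit; the array size and field widths are illustrative.

#include <stdint.h>
#include <stdbool.h>

#define INDEX_BITS 8
#define SETS       (1 << INDEX_BITS)

typedef struct { bool valid; uint64_t tag; } Line;
static Line ccache[SETS];

typedef enum { FIRST_HIT, REHASH_HIT, CA_MISS } CaOutcome;

/* Column-associative (hash-rehash) lookup: probe the bit-selected index
 * first, then the index with its high-order bit flipped.  Per Figure 4,
 * the stored tag includes the high-order index bit so that a block living
 * in its rehash location is not mistaken for a primary-location hit. */
CaOutcome ca_lookup(uint64_t block) {
    uint32_t index  = (uint32_t)(block & (SETS - 1));
    uint32_t rehash = index ^ (1u << (INDEX_BITS - 1));
    uint64_t tag    = block >> (INDEX_BITS - 1);     /* tag plus high index bit */

    if (ccache[index].valid && ccache[index].tag == tag)
        return FIRST_HIT;                            /* fast, direct-mapped hit */

    if (ccache[rehash].valid && ccache[rehash].tag == tag) {
        Line tmp = ccache[index];                    /* swap so the next access  */
        ccache[index]  = ccache[rehash];             /* to this block is fast    */
        ccache[rehash] = tmp;
        return REHASH_HIT;                           /* "slower hit" */
    }

    /* Double miss: long memory reference, clobber the rehash slot with the
     * old primary block, and install the new block in the primary slot. */
    ccache[rehash] = ccache[index];
    ccache[index].valid = true;
    ccache[index].tag   = tag;
    return CA_MISS;
}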


Cache Optimization Summary

Technique                              MP   MR   HT   Complexity
Multilevel caches                      +               2
Early restart & critical word first    +               2
Priority to read misses                +               1
Merging write buffer                   +               1
Victim caches                          +    +          2
Larger block size                           +          0
Higher associativity                        +          1
Pseudo-associative caches                   +          2
Compiler techniques to reduce misses        +          0
Non-blocking caches                    +               3
HW prefetching of instructions/data    +    +          2/3
Compiler-controlled prefetching        +    +          3
Avoiding address translation                     +     2
Trace cache                                      +     3

(MP = miss penalty, MR = miss rate, HT = hit time; "+" means the technique improves that factor. Complexity is a rough hardware-effort rating from 0 to 3.)
