CPU Logic Design
Li Yamin
Department of Computer Science
Faculty of Computer and Information Sciences
Hosei University, Tokyo 184-8584 Japan
http://cis.k.hosei.ac.jp/~yamin/
LYM, Dept. of CS, Hosei University Cache – p.1/27
Performance
[Figure: the processor-memory performance gap, 1980-2002. On a logarithmic performance scale from 1 to 100,000, CPU performance grows far faster than memory performance, so the gap between the two widens every year.]
Memory hierarchy
Fastest and smallest at the top, slowest and biggest at the bottom:
  Registers
  On-chip cache
  Off-chip cache
  Main memory
  Disk
  Tape
Level            Capacity        Access time
Registers        ~ 1KB           0.5 ~ 1ns
On-chip cache    16 ~ 128KB      1 ~ 10ns
Off-chip cache   128KB ~ 4MB     10 ~ 20ns
Main memory      256MB ~ 4GB     40 ~ 80ns
Hard disk        20 ~ 500GB      1 ~ 10ms
Tape             ~ 10TB          10ms ~ 10s
[Figure: a processor with split caches. The PC addresses the instruction cache, whose data output (DO) feeds the instruction register (IR); the data address (DA) addresses the data cache, which exchanges data with the processor through its data-in (DI) and data-out (DO) ports.]
Cache design parameters:
  Size: cache size, block size
  Cache placement: physical address cache, virtual address cache
  Mapping algorithms: direct mapping, set-associative mapping, fully associative mapping
  Memory update mechanism: write back, write through
  Cache write miss: write allocate, no-write allocate
  Replacement algorithms: random replacement, LRU replacement
Q1: Block placement – Where can a block be placed
in the upper level?
Q2: Block identification – How is a block found if it is in
the upper level?
Q3: Block replacement – Which block should be
replaced on a miss?
Q4: Write strategy – What happens on a write?
Q1: Where can a block be placed in the upper level?
If each block has only one place it can appear in the
cache, the cache is said to be direct mapped.
If a block can be placed anywhere in the cache, the
cache is said to be fully associative.
If a block can be placed in a restricted set of places in
the cache, the cache is said to be set associative. A
set is a group of blocks in the cache. A block is first
mapped onto a set, and then the block can be placed
anywhere within that set. If there are n blocks in a set,
the cache placement is called n-way set associative.
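The three policies can be sketched as a toy model (hypothetical helper name; 8-block cache with 4 sets of 2 blocks, matching the example on the next slide):

```python
def candidate_blocks(block, cache_blocks=8, sets=4):
    """Cache blocks where a memory block may be placed, per policy.
    Toy model: 8-block cache, 4 sets of 2 blocks for set-associative."""
    ways = cache_blocks // sets
    s = block % sets                     # set chosen by the block number
    return {
        "fully associative": list(range(cache_blocks)),  # anywhere
        "direct mapped": [block % cache_blocks],         # exactly one place
        "2-way set associative": list(range(s * ways, s * ways + ways)),
    }

# Block 12: direct mapped -> block 4 (12 mod 8); set associative -> set 0.
print(candidate_blocks(12))
```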
[Figure: where block 12 of a 32-block memory can be placed in an 8-block cache. Fully associative: block 12 can go anywhere. Direct mapped: block 12 can go only into block 4 (12 mod 8). 2-way set associative with 4 sets: block 12 can go anywhere in set 0 (12 mod 4).]
Q2: How is a block found if it is in the upper level?
Caches have an address tag on each block frame that
gives the block address.
The tag is checked to see if it matches the block
address from the CPU.
There must be a way to know that a cache block does
not have valid information. The most common
procedure is to add a valid bit to the tag to say
whether or not this entry contains a valid address.
The figure on the next page shows how a CPU address is divided.
The three portions of an address in a set-associative or direct-mapped cache:

  |<----- Block address ----->|
  |    Tag    |     Index     | Block offset |

The first division is between the block address and the block offset. The block frame address can be further divided into the tag field and the index field. The block offset field selects the desired data from the block, the index field selects the set, and the tag field is compared against the stored tag for a hit.
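A sketch of this division in code, assuming the 21-bit tag, 8-bit index, and 5-bit offset widths used in the direct-mapped example that follows (the field widths are the only assumption):

```python
OFFSET_BITS, INDEX_BITS = 5, 8   # 32-byte blocks, 256 sets

def split_address(addr):
    """Split a CPU address into (tag, index, block offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)                 # data within block
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)  # selects the set
    tag = addr >> (OFFSET_BITS + INDEX_BITS)                 # compared for a hit
    return tag, index, offset
```

Reassembling the fields must give back the original address: `(tag << 13) | (index << 5) | offset`.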
[Figure: a direct-mapped cache with 256 blocks. The address divides into a 21-bit tag, an 8-bit index, and a 5-bit block offset. Each block frame holds a valid bit (<1>), a tag (<21>), and 256 bits of data (one 32-byte block). The index selects a block frame, the stored tag is compared (=?) with the address tag, and the valid bit qualifies the hit; a mux selects the desired 64-bit word, and writes pass through a write buffer to the lower-level memory.]
[Figure: a two-way set-associative cache with 128 sets, each way holding 128 blocks. The address divides into a 22-bit tag, a 7-bit index, and a 5-bit block offset. The index selects a set; the two stored tags are compared (=?) with the address tag in parallel, a mux selects the word within each way, and a final mux forwards the 64-bit data of the hitting way. Writes pass through a write buffer to the lower-level memory.]
Q3: Which block should be replaced on a miss?
When a miss occurs, the cache controller must select
a block to be replaced with the desired data.
A benefit of direct-mapped placement is that hardware decisions are simplified; in fact, the decision is so simple that there is no choice: only one block frame is checked for a hit, and only that block can be replaced.
With fully associative or set associative placement,
there are many blocks to choose from on a miss.
There are two primary strategies employed for selecting which block to replace: random and least-recently used (LRU).
Random Strategy:
To spread allocation uniformly, candidate blocks are
randomly selected.
Some systems generate pseudo-random block numbers to get reproducible behavior, which is particularly useful when debugging hardware.
A virtue of random replacement is that it is simple to
build in hardware.
Least-recently used (LRU) Strategy:
To reduce the chance of throwing out information that will be needed soon, accesses to blocks are recorded.
The block replaced is the one that has been unused
for the longest time.
LRU makes use of a corollary of locality: If recently
used blocks are likely to be used again, then the best
candidate for disposal is the least-recently used block.
When a miss occurs in a cache, the incoming data
has to replace something already in the cache.
The least-recently used (LRU) replacement algorithm tries to find which block was used least recently. We must keep track of the most recent time that each block was used.
Consider a set size of 8 (an 8-way set-associative cache). LRU could be realized by assigning a 3-bit downcounter to each block. When the cache is flushed, all counters are reset. The rules are:
Replace at count 0 (any 0 will do).
Upon a hit (including replacement), set the hit block's counter to 7 and decrement all other counters; a counter cannot go below zero.
On cache reset, reset all counters as well as the valid bits.
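The rules above can be sketched as a behavioral model (a software sketch, not the hardware itself):

```python
class LRUSet:
    """3-bit downcounter LRU for one set of an 8-way set-associative cache.
    Rules: all counters reset on a flush; replace any block whose counter
    is 0; on a hit (including replacement) set that block's counter to 7
    and decrement all the others, never below zero."""

    def __init__(self, ways=8):
        self.count = [0] * ways          # cache reset clears all counters

    def touch(self, way):
        """Record a hit on, or replacement of, the block in `way`."""
        self.count = [max(c - 1, 0) for c in self.count]
        self.count[way] = 7

    def victim(self):
        """Choose a block to replace: any block at count 0 will do."""
        return self.count.index(0)
```

For a 4-way cache the same model applies with 2-bit counters and a maximum count of 3.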
If the set size is 4 (a 4-way set-associative cache), LRU could be realized by assigning a 2-bit downcounter to each block. The counter value 3 means that block was used most recently; 2 means that block was used more recently; ...; and 0 means that block was used least recently.
If the set size is 2 (a 2-way set-associative cache), the block that was not most recently used must be the least recently used. So it is not necessary to use two 1-bit counters for the two blocks; a single 1-bit counter per set is enough. If the counter value is 1, block 1 was used most recently, so block 0 must be the least recently used.
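For the 2-way case the scheme degenerates to a single bit per set, e.g.:

```python
def record_hit(hit_way):
    """The LRU bit simply remembers the most recently used way (0 or 1)."""
    return hit_way

def victim(lru_bit):
    """Replace the way that was NOT most recently used."""
    return 1 - lru_bit
```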
Assigning a 4-bit downcounter to each block would realize LRU for a 16-way set-associative cache, but 16-way set-associative caches are not found in real designs: assigning a 4-bit downcounter to each block would greatly increase the hardware burden.
LRU counter example (2-bit counters, 4-way; the cache has 8 sets, set 0 is shown, and all counters in the other sets remain 0):

                S3 S2 S1 S0
(1) Replace S3:  3  0  0  0
(2) Replace S1:  2  0  3  0
(3) Replace S0:  1  0  2  3
(4) Replace S2:  0  3  1  2
[Figure: a 2-way set-associative cache with 8 sets. The word address splits into a tag, a 3-bit set index, and a 2-bit word select. Each entry holds a valid bit (V), a tag, and four data words d3-d0, and each set holds one LRU bit. In the example, the address has tag 2AE1, set index 4, and word 1; set 4 holds valid tags 2AE1 (way S1) and 15D3 (way S0). Both ways' tags are compared (COMP) with the address tag in parallel; on a hit, a mux selects the addressed word and the output is enabled onto the data bus.]
Q4: What happens on a write?
There are two basic options when writing to the
cache:
1. Write-through (or store through) — The information
is written to both the block in the cache and the
block in the lower-level memory.
2. Write back (or copy back or store in) — The
information is written only to the block in the cache.
The modified cache block is written to main
memory only when it is replaced.
There are two options on a write miss:
1. Write allocate — The block is loaded on a write
miss. This is similar to a read miss.
2. No-write allocate — The block is modified in the
lower level and not loaded into the cache.
Although either write-miss policy could be used with write through or write back, write-back caches generally use write allocate (hoping that subsequent writes to that block will be captured by the cache) and write-through caches often use no-write allocate (since subsequent writes to that block will still have to go to memory).
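A toy model of the two common pairings (a hypothetical single-block "cache", counting only the write traffic reaching the lower level):

```python
class OneBlockCache:
    """Toy one-block cache contrasting the policies above.
    write_back=True pairs with write allocate; write_back=False
    (write through) pairs with no-write allocate."""

    def __init__(self, write_back):
        self.write_back = write_back
        self.block = None      # address of the cached block
        self.dirty = False
        self.mem_writes = 0    # writes reaching lower-level memory

    def write(self, block_addr):
        if self.block == block_addr:           # write hit
            if self.write_back:
                self.dirty = True              # defer the memory update
            else:
                self.mem_writes += 1           # write through to memory
        elif self.write_back:                  # miss, write allocate
            if self.dirty:
                self.mem_writes += 1           # write back the victim
            self.block, self.dirty = block_addr, True
        else:                                  # miss, no-write allocate
            self.mem_writes += 1               # modify lower level only

wb, wt = OneBlockCache(True), OneBlockCache(False)
for addr in [0, 0, 0, 0]:          # repeated writes to the same block
    wb.write(addr); wt.write(addr)
print(wb.mem_writes, wt.mem_writes)   # write back: 0, write through: 4
```

Repeated writes to one block are captured entirely by the write-back cache, while the write-through cache sends every write to memory, illustrating why the pairings above are the common choices.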
A valid bit is added to units smaller than the full block, called sub-blocks. Only a single sub-block need be read on a miss.

Tag    V  V  V  V
100    1  1  1  1
300    1  1  0  0
200    0  1  0  1
204    0  0  0  0
(one valid bit per sub-block)
Tc : cache access time
Tm : main memory access time
h: hit ratio; 1 − h: miss ratio
Average memory access time
T = hTc + (1 − h)(Tc + Tm ) = Tc + (1 − h)Tm
Example:
Tc = 1ns, Tm = 50ns, h = 95%
T = Tc + (1 − h)Tm = 1 + 0.05 × 50 = 1 + 2.5 = 3.5ns
Speedup over memory alone = Tm /T = 50/3.5 ≈ 14.3
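The same calculation in code (nothing beyond the formula above):

```python
def amat(t_cache, t_mem, hit_ratio):
    """Average memory access time: T = Tc + (1 - h) * Tm."""
    return t_cache + (1 - hit_ratio) * t_mem

t = amat(1.0, 50.0, 0.95)   # about 3.5 ns
print(t, 50.0 / t)          # access time and speedup over memory alone
```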
First-level caches in the Pentium Pro and PowerPC 604

Characteristic        Intel Pentium Pro                   PowerPC 604
Cache organization    Split instruction and data caches   Split instruction and data caches
Cache size            8KB each for instructions/data      16KB each for instructions/data
Cache associativity   Four-way set associative            Four-way set associative
Replacement           Approximated LRU                    LRU
Block size            32 bytes                            32 bytes
Write policy          Write-back                          Write-back or write-through