Computer Architecture:
Memory Hierarchy design
Dr. Ashok Kumar Turuk
1
Memory Hierarchy Design
Cache Performance
Techniques to Improve Memory Performance
Memory Organization Technology, case study
2
What is a cache?
Small, fast storage used to improve average access time to slow memory.
Exploits spatial and temporal locality
In computer architecture, almost everything is a cache!
Registers “a cache” on variables – software managed
First-level cache a cache on second-level cache
Second-level cache a cache on memory
Memory a cache on disk (virtual memory)
TLB a cache on page table
Branch-prediction a cache on prediction information?
Proc/Regs
L1-Cache
Bigger L2- Faster
Cache
Memory
3
Disk, Tape,
etc.
Some terms
Cache hit: when the CPU finds a requested data item in the cache.
Cache miss: when the CPU does not find the requested data item in the cache.
Block: a fixed-size collection of data, which is retrieved from the main memory
and placed into cache. (cache unit)
Temporal locality: recently accessed data items are likely to be accessed in the
near future.
Spatial locality: items whose addresses are near one another tend to be
referenced closed together in time.
The time required for the cache miss depends on both latency and bandwidth of
the memory. Latency determines the time to retrieve the first word of the block,
and bandwidth determines the time to retrieve the rest of this block.
Virtual memory: the address space is usually broken into fixed number of
blocks (pages). At any time, each page resides either in main memory or on
disk. When the CPU references an item within a page that is not in the cache or
main memory, a page fault occurs, and the entire page is then moved from the
disk to main memory.
The cache and main memory have the same relationship as the main memory
and disk.
4
Review: Cache performance
Memory stall cycles: the number of cycles during
which CPU is stalled waiting for memory access.
CPUtime=(CPUclock cycles+Memstall cycles) x Cycle time
Memstall cycles=# of Misses x Miss Penalty
Misses
= IC x Miss Penalty
Instruction
= IC MemAccess
Instruction MissRate MissPenalty
Miss Rate: the fraction of cache accesses that result in a
miss.
5
Impact on Performance
Example: Assume we have a computer where CPI is 1.0 when all memory
accesses hit the cache. The only data accesses are loads and stores, and these
total 50% of the instructions. If the miss penalty is 25 clock cycles and the
miss rate is 2%, how much fast the computer would be, and what is that if all
instructions were cache hits?
CPI = 1, Miss Penalty = 25, Miss Rate = 0.02,
Memory References Per Instruction = 1 + 0.5
1. Computer that always hits:
CPU execution time
= (CPU clock cycles + memory
stall cycles) * CCT
= (IC*CPI+0) * CCT = IC*CCT
2. Computer with real cache
Memory stall cycles = IC * MRPI * MR* MP
= IC * (1+0.5) * 0.02 * 25 = 0.75 * IC
CPU execution time
= (CPU clock cycles + memory stall cycles) * CCT
= (IC + 0.75 * IC)*CCT
CPU execution time with cache / CPU execution time = 1.75 6
Traditional Four Questions for Memory
Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block
placement)
Fully Associative, Set Associative, Direct Mapped
Q2: How is a block found if it is in the upper level?
(Block identification)
Tag/Block
Q3: Which block should be replaced on a miss?
(Block replacement)
Random, LRU
Q4: What happens on a write?
(Write strategy)
Write Back or Write Through (with Write Buffer)
7
Q1: Where can a block be placed in the upper
level? (Block placement)
Fully associative: any block in the main memory can be placed
in any block frame.
It is flexible but expensive due to associativity
Direct mapping: each block in memory is placed in a fixed block
frame with the following mapping function:
block frame # = (Block address ) MOD (#of block frames in cache)
It is inflexible but simple and economical.
8
Q1: Where can a block be placed in the upper
level? (Block placement)
Set associative: The cache is divided into sets of block frames,
and each block from the memory is first mapped to a fixed set
wherein the block can be placed in any block frame. Mapping to a
set follows the function, called a bit selection:
set # = (Block address) MOD (# of sets in cache)
n-way set associative: there are n blocks in a set;
fully associative is an m-way set associative if there are m block frames in
the cache; whereas, direct mapping is one-way set associative
one-way, two-way, and four-way are the most frequently used methods.
9
Figure B.2 This example cache has eight block frames and memory has 32 blocks. The three options for caches are shown left
to right. In fully associative, block 12 from the lower level can go into any of the eight block frames of the cache. With direct
mapped, block 12 can only be placed into block frame 4 (12 modulo 8). Set associative, which has some of both features, allows
the block to be placed anywhere in set 0 (12 modulo 4). With two blocks per set, this means block 12 can be placed either in block
0 or in block 1 of the cache. Real caches contain thousands of block frames, and real memories contain millions of blocks. The set
associative organization has four sets with two blocks per set, called two-way set associative. Assume that there is nothing in the
cache and that the block address in question identifies lower-level block 12.
Copyright © 2011, Elsevier Inc. All rights Reserved. 10
Q2: How is a block found if it is in the upper
level?(Block identification)
Each block frame in the cache has an address tag indicating the
block's address in the memory
All possible tags are searched in parallel
A valid bit is attached to the tag to indicate whether the block contains valid
information or not
An address for a datum from CPU, A, is divided into
block address field and block offset field:
block address = A / block size
block offset = (A) MOD (block size)
block address is further divided into tag and index
index indicates the set in which the block may reside, while tag is compared
to indicate a hit or a miss
11
Q3: Which block should be replaced on a miss?
(Block replacement)
On a miss the cache controller must select a block be
replaced with the desired data.
The more choices for replacement, the more expensive for hardware
Direct mapping is the simplest
Random, LRU, FIFO
Random vs. Least-recently used (LRU): the former has uniform
allocation and is simple to build while the latter can take
advantage of temporal locality but can be expensive to implement
12
13
Q4: What happens on a write?
(Write strategy)
Most cache accesses are reads:
All instruction accesses are reads,
Most instructions don’t write to memory.
Optimize reads to make the common case fast
Processor wait for the read to complete, but doesn't wait for writes
Read is easy: reading and tag comparison can be done in parallel
Write is hard: cannot overlap tag reading and block writing (destructive)
CPU specifies write size: only 1 - 8 bytes
Write strategies often distinguish cache design:
Write through (or store through): write information to blocks
in both levels ensuring consistency at the cost of memory and bus
bandwidth
Write stalls may be alleviated by using write buffers by
overlapping processor execution with memory updating
14
Q4: What happens on a write?
(Write strategy)
Write back (store in): write information to blocks only in cache
level
Minimizing memory and bus traffic at the cost of weak consistency
Use dirty bit to indicate modification, reduce frequency of write-back
on replacement
Read misses may result in writes
On a write miss: the data are not needed
Write allocate (The block is allocated on a write miss): used in write-back
No-write allocate (write miss does not affect cache, modified in lower-
level cache): used in write-through
15
• Example: Assume a fully associative write–back
cache with many cache entries that start empty.
Below is a sequence of five memory operations (the
address is in square brackets):
Write Mem[100];
Write
Mem[100]; Read
Mem[200];
Write Mem[200];
Write
Mem[100];
What are the number of hits and misses when using no-write
allocate versus write allocate?
16
Example: Alpha 21264 Data Cache
For 2 set associative,
use round-robin
(FIFO) to choose
where to go
16 bytes Cache miss
Figure 5.7
17
Example: Fig. 5.7 Alpha AXP 21064 data cache (64-bit
machine). Cache size = 65,536 bytes (64K), block size = 64 bytes,
with two- way set-associative placement, write back, write allocate
on a write miss, 44 bit physical address. What is the index size?
Index size = f(cache size, block size, associativity)
2index size = number of sets in cache
2index size = cache size / (block size * set associativity)
= 216 / (26 * 2) = 29
block offset size = 6 64 byte block
tag length = 44 – 6 – 9 =29 bit.
18
Unified vs Split Caches
• Unified vs Separate I&D
Proc
Unified I-Cache-1 Proc D-Cache-1
Cache- Unified
1 Cache-2
Unified
Cache-
2
• Example:
– 16KB I&D: Inst miss rate=0.64%, Data miss rate=6.47%
– 32KB unified: Aggregate miss rate=1.99%
• Which is better (ignore L2 cache)?
– Assume 33% data ops 75% accesses from instructions
(1.0/1.33)
– hit time=1, miss time=50
19
– Note that data hit has 1 stall for unified cache (only one port)
Discussion: single (unified) cache vs. separate cache different miss rates
(74% instruction vs. 26% data), see figure 5.8.
(May have structural hazards from Load/Store with a single/unified
cache)
Size Instructi Data Unifie
on Cach d
cache e Cache
8 KB 8.16 44.0 63.0
16 KB 3.82 40.9 51.0
32 KB 1.36 38.4 43.3
64 KB 0.61 36.9 39.4
128 KB 0.30 35.3 36.2
256 KB 0.02 32.6 32.9
Figure 5.8 Miss per 1000 instructions
for inst, data and unified caches of diff
sizes 20