13 - Large and Fast: Exploiting Memory Hierarchy
Exploiting Memory Hierarchy
CS353 – Computer Architecture
Najeeb-Ur-Rehman
Assistant Professor
Department of Computer Science
Faculty of Computing & IT
University of Gujrat
§5.1 Introduction
Memory Technology
• Static RAM (SRAM)
• 0.5ns – 2.5ns, $2000 – $5000 per GB
• Dynamic RAM (DRAM)
• 50ns – 70ns, $20 – $75 per GB
• Magnetic disk
• 5ms – 20ms, $0.20 – $2 per GB
• Ideal memory
• Access time of SRAM
• Capacity and cost/GB of disk
15
Principle of Locality
• Programs access only a small portion of their address space at any moment in time
• Temporal locality: items accessed recently are likely to be accessed again soon (e.g., instructions in a loop, induction variables)
• Spatial locality: items near those accessed recently are likely to be accessed soon (e.g., sequential instruction access, array data)
16
Principle of Locality of Reference
17
Taking Advantage of Locality
• Memory hierarchy
• Store everything on disk
• Copy recently accessed (and nearby) items from disk to a smaller DRAM memory
  • Main memory
• Copy more recently accessed (and nearby) items from DRAM to a smaller SRAM memory
  • Cache memory attached to the CPU
21
Memory Hierarchy Levels
• Block (aka line): unit of copying
  • May be multiple words
• If accessed data is present in the upper level
  • Hit: access satisfied by the upper level
  • Hit ratio: hits/accesses
• If accessed data is absent
  • Miss: block copied from the lower level
  • Time taken: miss penalty
  • Miss ratio: misses/accesses = 1 – hit ratio
  • Then the accessed data is supplied from the upper level
Basic Terminology
• Cache -- the name given to the first level of memory seen from the CPU
• Miss rate -- fraction of memory accesses that are not found in the cache
• Miss penalty -- additional clock cycles needed to service a cache miss
• Hit time -- time to hit in the cache
• Block -- smallest unit of data transferred between main memory and the cache
23
§5.2 The Basics of Caches
Cache Memory
• Cache memory
  • The level of the memory hierarchy closest to the CPU
• Given accesses X1, …, Xn–1, Xn
  • How do we know if the data is present?
  • Where do we look?
24
The Need for Cache Memory
• Widening speed gap between CPU and main memory
• Processor operation takes less than 1 ns
• Main memory requires more than 50 ns to access
25
Typical Memory Hierarchy
• Registers are at the top of the hierarchy
  • Typical size < 1 KB
  • Access time < 0.5 ns
• Level 1 Cache (8 – 64 KB)
  • Access time: 1 ns
• L2 Cache (512 KB – 8 MB), on the microprocessor
• Main memory, reached over the memory bus
• Disk storage (> 200 GB): magnetic or flash disk, reached over the I/O bus
• Going down the hierarchy, levels get bigger; going up, they get faster
26
What is a Cache Memory ?
• Small and fast (SRAM) memory technology
• Stores the subset of instructions & data currently being accessed
• Used to reduce average access time to memory
• Caches exploit temporal locality by …
• Keeping recently accessed data closer to the processor
• Caches exploit spatial locality by …
• Moving blocks consisting of multiple contiguous words
• Goals
  • Achieve the fast access speed of cache memory
  • While balancing the overall cost of the memory system
27
Almost Everything is a Cache !
28
Four Basic Questions on Caches
• Q1: Where can a block be placed in a cache?
• Block placement
• Direct Mapped, Set Associative, Fully Associative
• Q2: How is a block found in a cache?
• Block identification
• Block address, tag, index
• Q3: Which block should be replaced on a miss?
• Block replacement
• FIFO, Random, LRU
• Q4: What happens on a write?
• Write strategy
• Write Back or Write Through (with Write Buffer)
29
Block Placement: Direct Mapped
• Block: unit of data transfer between cache and memory
• Direct Mapped Cache:
• A block can be placed in exactly one location in the cache
• In this example: the cache index is the least significant 3 bits of the memory block address, so an 8-block cache (indices 000 … 111) holds blocks from a 32-block main memory (addresses 00000 … 11111), and every memory block maps to exactly one cache block
30
Direct-Mapped Cache
• A memory address is divided into
  • Block address: identifies the block in memory
  • Block offset: to access bytes within a block
• The block address is further divided into
  • Index: used for direct cache access
  • Tag: most-significant bits of the block address
  • Index = (Block address) mod (Number of cache blocks)
• The tag must also be stored inside the cache
  • For block identification: a hit requires the stored tag to equal the tag field of the address
• A valid bit is also required to indicate whether a cache block is valid or not
• Each cache entry therefore holds a valid bit, a tag, and the block data
31
Direct Mapped Cache – cont’d
• Cache hit: the block is stored inside the cache
  • The index is used to access the cache block
  • The address tag is compared against the stored tag
32
Direct Mapping with a 6-bit Address
• Address format: Tag (2 bits) | Index (2 bits) | Word offset (2 bits, always zero for word-aligned accesses)
• Cache contents:
  Index  Valid  Tag  Data
  00     Y      00   5600
  01     Y      11   775
  10     Y      01   845
  11     N      00   33234
• Main memory contents (address : data):
  000000 : 5600    000100 : 3223    001000 : 23      001100 : 1122
  010000 : 0       010100 : 32324   011000 : 845     011100 : 43
  100000 : 976     100100 : 77554   101000 : 433     101100 : 7785
  110000 : 2447    110100 : 775     111000 : 433     111100 : 3649
• In a direct-mapped cache, each memory address corresponds to exactly one location in the cache, and many different memory locations (four in this case) share each cache entry
33
Direct Mapped Cache
• Location determined by address
• Direct mapped: only one choice
  • (Block address) modulo (#Blocks in cache)
• Because #Blocks is a power of 2, the index is simply the low-order address bits
34
Direct Mapping - Address Structure
• Address fields: Tag s-r (8 bits) | Line or Slot r (14 bits) | Word offset w (2 bits)
• Find the cache size
  • 24-bit address
  • 2-bit word identifier (4-byte block)
  • 22-bit block identifier
  • 14-bit slot or line
  • 8-bit tag (= 22 - 14)
  • Cache data size = 2^14 lines × 4 bytes/line = 64 KB
• No two blocks mapping to the same line have the same Tag field
• Check the contents of the cache by finding the line and comparing the Tag
35
Direct Mapping - Cache Line Table
37
Tags and Valid Bits
38
Cache Example
• 8-blocks, 1 word/block, direct mapped
• Initial state
39
Cache Example
Word addr Binary addr Hit/miss Cache block
22 10 110 Miss 110
40
Cache Example
Word addr Binary addr Hit/miss Cache block
26 11 010 Miss 010
41
Cache Example
Word addr Binary addr Hit/miss Cache block
22 10 110 Hit 110
26 11 010 Hit 010
42
Cache Example
Word addr Binary addr Hit/miss Cache block
16 10 000 Miss 000
3 00 011 Miss 011
16 10 000 Hit 000
43
Cache Example
Word addr Binary addr Hit/miss Cache block
18 10 010 Miss 010
44
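The whole access sequence of these example slides (22, 26, 22, 26, 16, 3, 16, 18) can be replayed with a small C sketch of the 8-block, 1-word-per-block direct-mapped cache; this is illustrative code, not part of the original slides.

    #include <stdio.h>

    #define NUM_BLOCKS 8u   /* 8 blocks, 1 word per block, direct mapped */

    int main(void)
    {
        unsigned tag[NUM_BLOCKS]   = {0};
        int      valid[NUM_BLOCKS] = {0};

        /* Word-address trace from the slides. */
        const unsigned trace[] = {22, 26, 22, 26, 16, 3, 16, 18};
        const int n = (int)(sizeof trace / sizeof trace[0]);

        for (int i = 0; i < n; i++) {
            unsigned index = trace[i] % NUM_BLOCKS;   /* low-order 3 bits */
            unsigned t     = trace[i] / NUM_BLOCKS;   /* remaining bits   */
            int hit = valid[index] && tag[index] == t;
            printf("word addr %2u -> cache block %u : %s\n",
                   trace[i], index, hit ? "hit" : "miss");
            if (!hit) {                               /* allocate on a miss */
                valid[index] = 1;
                tag[index]   = t;
            }
        }
        return 0;
    }

Running it reproduces the hit/miss pattern shown above, including the final miss on word address 18, which evicts word 26 from block 010.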
Mapping an Address to a Cache Block
• Example
• Consider a direct-mapped cache with 256 blocks
• Block size = 16 bytes
• Compute tag, index, and block offset of address: 0x01FFF8AC
• Solution
  • Field widths: Tag = 20 bits, Index = 8 bits, Block offset = 4 bits
  • For 0x01FFF8AC: Block offset = 0xC, Index = 0x8A, Tag = 0x01FFF
45
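As a cross-check of this solution, here is a small C sketch (not part of the original slides) that extracts the three fields with shifts and masks for this 256-block, 16-byte-block geometry.

    #include <stdio.h>

    /* Geometry from this example: 256 blocks, 16-byte blocks, direct mapped, */
    /* 32-bit byte addresses. Field widths: offset 4 bits, index 8, tag 20.   */
    #define OFFSET_BITS 4u
    #define INDEX_BITS  8u

    int main(void)
    {
        unsigned addr   = 0x01FFF8ACu;
        unsigned offset = addr & ((1u << OFFSET_BITS) - 1);
        unsigned index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        unsigned tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        /* Prints: tag = 0x1FFF, index = 0x8A, offset = 0xC */
        printf("tag = 0x%X, index = 0x%X, offset = 0x%X\n", tag, index, offset);
        return 0;
    }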
Example: Larger Block Size
• 64 blocks, 16 bytes/block
• Tag? Index? Block offset?
• To what block number does address 1200 map?
  • Block address = 1200 / 16 = 75
  • Block number = 75 modulo 64 = 11
• Address fields: Tag = bits [31:10] (22 bits), Index = bits [9:4] (6 bits), Offset = bits [3:0] (4 bits)
47
Example: Derive the Index (Block) and Offset (Word) bits.
Given
51
Write-Back
• On a data-write hit, just update the block in the cache and mark it dirty
• When a dirty block is replaced, write it back to the lower level (a write buffer can hide this latency)
52
Write Allocation
• What should happen on a write miss?
• Alternatives for write-through
• Allocate on miss: fetch the block
• Write around: don’t fetch the block
• Since programs often write a whole block before reading it (e.g., initialization)
• For write-back
• Usually fetch the block
53
Example: Intrinsity FastMATH
54
Main Memory Supporting Caches
• Use DRAMs for main memory
  • Fixed width (e.g., 1 word)
  • Connected by a fixed-width clocked bus
  • Bus clock is typically slower than the CPU clock
• Example cache block read
  • 1 bus cycle for address transfer
  • 15 bus cycles per DRAM access
  • 1 bus cycle per data transfer
• For a 4-word block and 1-word-wide DRAM
  • Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles
  • Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle
56
Increasing Memory Bandwidth
• 4-bank interleaved memory
  • Miss penalty = 1 + 15 + 4×1 = 20 bus cycles
  • Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle
57
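The two miss-penalty and bandwidth figures above can be reproduced with a few lines of C (illustrative only; the cycle counts are the ones assumed on these slides).

    #include <stdio.h>

    int main(void)
    {
        /* Parameters from the slides: 1 bus cycle for the address, 15 bus   */
        /* cycles per DRAM access, 1 bus cycle per 1-word transfer, and a    */
        /* 4-word (16-byte) cache block.                                     */
        const int words = 4, addr_c = 1, dram_c = 15, xfer_c = 1;
        const double block_bytes = 16.0;

        int narrow      = addr_c + words * dram_c + words * xfer_c;  /* 65 cycles */
        int interleaved = addr_c + dram_c + words * xfer_c;          /* 20 cycles */

        printf("1-word-wide DRAM  : %2d cycles, %.2f B/cycle\n", narrow,      block_bytes / narrow);
        printf("4-bank interleaved: %2d cycles, %.2f B/cycle\n", interleaved, block_bytes / interleaved);
        return 0;
    }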
Advanced DRAM Organization
• Bits in a DRAM are organized as a rectangular array
  • DRAM accesses an entire row
  • Burst mode: supply successive words from a row with reduced latency
• Double data rate (DDR) DRAM
  • Transfer on rising and falling clock edges
• Quad data rate (QDR) DRAM
  • Separate DDR inputs and outputs
58
DRAM Generations
(Table: Year, Capacity, $/GB)
59
Measuring Cache Performance
• Components of CPU time
• Program execution cycles
• Includes cache hit time
• Memory stall cycles
• Mainly from cache misses
• With simplifying assumptions:
  Memory stall cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                      = (Instructions / Program) × (Misses / Instruction) × Miss penalty
60
Cache Performance Example
• Given
• I-cache miss rate = 2%
• D-cache miss rate = 4%
• Miss penalty = 100 cycles
• Base CPI (ideal cache) = 2
• Loads & stores are 36% of instructions
• Miss cycles per instruction
• I-cache: 0.02 × 100 = 2
• D-cache: 0.36 × 0.04 × 100 = 1.44
• Actual CPI = 2 + 2 + 1.44 = 5.44
• A CPU with an ideal cache would be 5.44/2 = 2.72 times faster
61
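A short C sketch (not from the slides) that reproduces the CPI arithmetic of this example:

    #include <stdio.h>

    int main(void)
    {
        /* Numbers from the example above. */
        const double base_cpi     = 2.0;
        const double icache_miss  = 0.02;   /* misses per instruction       */
        const double dcache_miss  = 0.04;   /* misses per data access       */
        const double ls_fraction  = 0.36;   /* loads/stores per instruction */
        const double miss_penalty = 100.0;  /* cycles                       */

        double i_stall = icache_miss * miss_penalty;                /* 2.00 */
        double d_stall = ls_fraction * dcache_miss * miss_penalty;  /* 1.44 */
        double cpi     = base_cpi + i_stall + d_stall;              /* 5.44 */

        printf("actual CPI = %.2f; ideal-cache CPU is %.2fx faster\n",
               cpi, cpi / base_cpi);
        return 0;
    }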
Average Access Time
• Hit time is also important for performance
• Average Memory Access Time (AMAT)
• AMAT = Hit time + Miss rate × Miss penalty
• Example
• CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20
cycles, I-cache miss rate = 5%
• AMAT = ?
• AMAT = 1 + 0.05 × 20 = 2ns
• 2 cycles per instruction
62
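For reference, a minimal C helper (illustrative, not from the slides) that evaluates the AMAT formula with the numbers of this example:

    #include <stdio.h>

    /* AMAT = hit time + miss rate * miss penalty (everything in cycles). */
    static double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        /* Example on this slide: 1-cycle hit, 5% I-cache miss rate, 20-cycle penalty. */
        double cycles = amat(1.0, 0.05, 20.0);   /* = 2 cycles = 2 ns at a 1 ns clock */
        printf("AMAT = %.1f cycles\n", cycles);
        return 0;
    }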
Performance Summary
63
Associative Caches
• Fully associative
• Allow a given block to go in any cache entry
• Requires all entries to be searched at once
• Comparator per entry (expensive)
• n-way set associative
• Each set contains n entries
• Block number determines which set
• (Block number) modulo (#Sets in cache)
• Search all entries in a given set at once
• n comparators (less expensive)
64
Associative Cache Example
65
Spectrum of Associativity
• For a cache with 8 entries
66
Associativity Example
• Compare 4-block caches
  • Direct mapped, 2-way set associative, fully associative
  • Block access sequence: 0, 8, 0, 6, 8
• Direct mapped
  Block addr | Cache index | Hit/miss | Cache content after access (blocks 0–3)
       0     |      0      |   miss   | Mem[0]
       8     |      0      |   miss   | Mem[8]
       0     |      0      |   miss   | Mem[0]
       6     |      2      |   miss   | Mem[0], Mem[6]
       8     |      0      |   miss   | Mem[8], Mem[6]
67
Associativity Example
• 2-way set associative
  Block addr | Cache index | Hit/miss | Set 0 contents   | Set 1
       0     |      0      |   miss   | Mem[0]           |
       8     |      0      |   miss   | Mem[0], Mem[8]   |
       0     |      0      |   hit    | Mem[0], Mem[8]   |
       6     |      0      |   miss   | Mem[0], Mem[6]   |
       8     |      0      |   miss   | Mem[8], Mem[6]   |
• Fully associative
  Block addr | Hit/miss | Cache content after access
       0     |   miss   | Mem[0]
       8     |   miss   | Mem[0], Mem[8]
       0     |   hit    | Mem[0], Mem[8]
       6     |   miss   | Mem[0], Mem[8], Mem[6]
       8     |   hit    | Mem[0], Mem[8], Mem[6]
68
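The three tables above can be reproduced by a compact C simulation (illustrative only, not part of the slides); LRU is modelled here with a simple age counter, which is one possible software implementation rather than the hardware mechanism.

    #include <stdio.h>

    #define BLOCKS 4            /* total cache blocks in every configuration */

    /* Simulate a cache of `sets` sets x `ways` ways with LRU replacement on  */
    /* a trace of block addresses; prints one line per access.                */
    static void simulate(int sets, int ways, const int *trace, int n, const char *name)
    {
        int tag[BLOCKS]   = {0};
        int valid[BLOCKS] = {0};
        int age[BLOCKS]   = {0};  /* higher age = used longer ago (for LRU)   */

        for (int i = 0; i < n; i++) {
            int set  = trace[i] % sets;
            int base = set * ways;      /* entries [base, base+ways) form the set */
            int slot = -1, hit = 0;

            for (int w = 0; w < ways; w++)   /* look for the block in its set */
                if (valid[base + w] && tag[base + w] == trace[i]) { slot = base + w; hit = 1; }

            if (!hit) {                      /* miss: pick an invalid or the LRU entry */
                slot = base;
                for (int w = 0; w < ways; w++) {
                    if (!valid[base + w]) { slot = base + w; break; }
                    if (age[base + w] > age[slot]) slot = base + w;
                }
                valid[slot] = 1;
                tag[slot]   = trace[i];
            }
            for (int w = 0; w < ways; w++) age[base + w]++;  /* every entry ages */
            age[slot] = 0;                                   /* except the MRU   */
            printf("%-16s block %d -> set %d : %s\n", name, trace[i], set, hit ? "hit" : "miss");
        }
    }

    int main(void)
    {
        const int trace[] = {0, 8, 0, 6, 8};
        simulate(4, 1, trace, 5, "direct mapped");   /* 4 sets x 1 way */
        simulate(2, 2, trace, 5, "2-way set assoc"); /* 2 sets x 2 ways */
        simulate(1, 4, trace, 5, "fully assoc");     /* 1 set  x 4 ways */
        return 0;
    }

The output matches the slides: 5 misses for direct mapped, 1 hit for 2-way set associative, and 2 hits for fully associative.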
How Much Associativity?
• Increased associativity decreases the miss rate
  • But with diminishing returns
• Simulation of a system with a 64 KB D-cache, 16-word blocks, SPEC2000
  • 1-way: 10.3%
  • 2-way: 8.6%
  • 4-way: 8.3%
  • 8-way: 8.1%
69
Associative Mapping
70
Example: Given main memory = 16 MB, a cache of 16K lines, and a block size of 4 words, devise the main-memory address format (associative mapping).
(a) Main memory capacity = 16 MB = 2^4 × 2^20 = 2^24, therefore the main-memory address is 24 bits
(b) The cache has 16K lines = 2^4 × 2^10 = 2^14 locations; the line number is not important here, since with associative mapping any block can be loaded into any cache location
(c) Each line stores a block of words; block size = 4 words (4 × 8 bits = 32 bits) = 2^2 words, therefore 2 bits select a word within the block
(d) The memory address format consists of a tag field and a word field:
    Tag bits + word bits = 24 bits
    Tag bits + 2 = 24 bits, therefore Tag = 24 - 2 = 22 bits
Main-memory address format: Tag (22 bits) | Word (2 bits)
78
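A tiny C sketch (illustrative, not part of the example) that derives the same field widths from the memory size and block size:

    #include <stdio.h>

    /* Integer log2 of a power of two. */
    static unsigned ilog2(unsigned long long x)
    {
        unsigned n = 0;
        while (x > 1) { x >>= 1; n++; }
        return n;
    }

    int main(void)
    {
        /* Parameters from this example: 16 M addressable words (one word is */
        /* 8 bits here), blocks of 4 words, fully associative mapping.       */
        unsigned long long mem_words   = 16ull << 20;   /* 2^24 words */
        unsigned           block_words = 4;

        unsigned addr_bits = ilog2(mem_words);          /* 24 */
        unsigned word_bits = ilog2(block_words);        /*  2 */
        unsigned tag_bits  = addr_bits - word_bits;     /* 22 */

        printf("address = %u bits : tag %u | word %u\n", addr_bits, tag_bits, word_bits);
        return 0;
    }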
Replacement Policy
• Direct mapped: no choice
• Set associative
• Prefer non-valid entry, if there is one
• Otherwise, choose among entries in the set
• Least-recently used (LRU)
• Choose the one unused for the longest time
• Simple for 2-way, manageable for 4-way, too hard beyond that
• Random
• Gives approximately the same performance as LRU for high
associativity
80
Multilevel Caches
• Primary cache attached to the CPU
  • Small, but fast
• Level-2 cache services misses from the primary cache
  • Larger, slower, but still faster than main memory
• Main memory services L-2 cache misses
• Some high-end systems include an L-3 cache
81
Multilevel Cache Example
• Given
  • CPU base CPI = 1, clock rate = 4 GHz
  • Miss rate/instruction = 2%
  • Main memory access time = 100 ns
• With just primary cache
  • Miss penalty = 100 ns / 0.25 ns = 400 cycles
  • Effective CPI = 1 + 0.02 × 400 = 9
82
Example (cont.)
• Now add L-2 cache
• Access time = 5ns
• Global miss rate to main memory = 0.5%
• Primary miss with L-2 hit
• Penalty = 5ns/0.25ns = 20 cycles
• Primary miss with L-2 miss
• Extra penalty = 400 cycles (a full main-memory access)
• CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4
• Performance ratio = 9/3.4 = 2.6
83
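Both CPI figures, and the resulting 2.6× speedup, follow from a few lines of C (illustrative only, using the clock period and penalties given above):

    #include <stdio.h>

    int main(void)
    {
        /* Numbers from the example: 4 GHz clock (0.25 ns/cycle), base CPI 1,   */
        /* 2% primary miss rate, 100 ns main memory, 5 ns L2, 0.5% global miss. */
        const double cycle_ns    = 0.25;
        const double base_cpi    = 1.0;
        const double l1_miss     = 0.02;
        const double l2_global   = 0.005;
        const double mem_penalty = 100.0 / cycle_ns;   /* 400 cycles */
        const double l2_penalty  = 5.0   / cycle_ns;   /*  20 cycles */

        double cpi_l1_only = base_cpi + l1_miss * mem_penalty;                       /* 9.0 */
        double cpi_with_l2 = base_cpi + l1_miss * l2_penalty + l2_global * mem_penalty; /* 3.4 */

        printf("L1 only : CPI = %.1f\n", cpi_l1_only);
        printf("L1 + L2 : CPI = %.1f (%.1fx faster)\n", cpi_with_l2, cpi_l1_only / cpi_with_l2);
        return 0;
    }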
Multilevel Cache Considerations
• Primary cache
• Focus on minimal hit time
• L-2 cache
• Focus on low miss rate to avoid main memory
access
• Hit time has less overall impact
• Results
• L-1 cache usually smaller than a single cache
• L-1 block size smaller than L-2 block size
84
Interactions with Advanced CPUs
85
Cache Design Trade-offs
  Design change          | Effect on miss rate          | Negative performance effect
  Increase cache size    | Decreases capacity misses    | May increase access time
  Increase associativity | Decreases conflict misses    | May increase access time
  Increase block size    | Decreases compulsory misses  | Increases miss penalty; for very large block sizes, may increase miss rate due to pollution
86
§5.8 Parallelism and Memory Hierarchies: Cache Coherence
Cache Coherence Problem
• Suppose two CPU cores share a physical address
space
• Write-through caches
  Time step | Event               | CPU A’s cache | CPU B’s cache | Memory
      0     |                     |               |               |   0
      1     | CPU A reads X       |       0       |               |   0
      2     | CPU B reads X       |       0       |       0       |   0
      3     | CPU A writes 1 to X |       1       |       0       |   1
87
Coherence Defined
• Informally: reads return the most recently written value
• Formally:
  • P writes X; P reads X (no intervening writes) ⇒ the read returns the written value
  • P1 writes X; P2 reads X (sufficiently later) ⇒ the read returns the written value
    • c.f. CPU B reading X after step 3 in the example
  • P1 writes X, P2 writes X ⇒ all processors see the writes in the same order
    • They end up with the same final value for X
88
Cache Coherence Protocols
• Operations performed by caches in multiprocessors to ensure coherence
• Migration of data to local caches
  • Reduces bandwidth demand on shared memory
• Replication of read-shared data
  • Reduces contention for access
• Snooping protocols
  • Each cache monitors bus reads/writes
• Directory-based protocols
  • Caches and memory record the sharing status of blocks in a directory
89
Invalidating Snooping Protocols
• A cache gets exclusive access to a block when it is to be written
  • Broadcasts an invalidate message on the bus
  • A subsequent read in another cache misses
    • The owning cache supplies the updated value
90
Memory Consistency
• When are writes seen by other processors?
  • “Seen” means a read returns the written value
  • It cannot be instantaneous
• Assumptions
  • A write completes only when all processors have seen it
  • A processor does not reorder writes with other accesses
• Consequence
  • P writes X then writes Y ⇒ all processors that see the new Y also see the new X
  • Processors can reorder reads, but not writes
91
§5.10 Real Stuff: The AMD Opteron X4 and Intel Nehalem
Multilevel On-Chip Caches
Intel Nehalem 4-core processor
92
Examples
Mapping
93
Memory System Performance
• The hit time is how long it takes data to be sent from the cache to the processor. This is usually fast, on the order of 1-3 clock cycles
• The miss penalty is the time to copy data from main memory to the cache. This often requires dozens of clock cycles (at least)
• The miss rate is the percentage of misses
94
Average Memory Access Time
• The average memory access time, or AMAT, can then be computed:
  AMAT = Hit time + (Miss rate × Miss penalty)
• This is just averaging the amount of time for cache hits and the amount of time for cache misses
• Obviously, a lower AMAT is better
• Miss penalties are usually much greater than hit times, so the best way to lower AMAT is to reduce the miss penalty or the miss rate
95
Performance Example 1
• Assume that 33% of the instructions in a program are data accesses. The cache hit ratio is 97% and the hit time is one cycle, but the miss penalty is 20 cycles.
  AMAT = Hit time + (Miss rate × Miss penalty)
       = 1 cycle + (3% × 20 cycles)
       = 1.6 cycles
• If the cache were perfect and never missed, the AMAT would be one cycle. But even with just a 3% miss rate, the AMAT here is 1.6 times longer!
96
Performance Example 2
A CPU has access to 2 levels of memory. Level 1 contains 1000 words and has an access time of 0.01 ms; level 2 contains 100,000 words and has an access time of 0.1 ms. Assume that if a word to be accessed is in level 1, the CPU accesses it directly. If it is in level 2, the word is first transferred to level 1 and then accessed by the CPU. For simplicity, we ignore the time required for the CPU to determine whether the word is in level 1 or level 2. Suppose 95% of the memory accesses are found in level 1. Then the average time to access a word can be expressed as:
  (0.95)(0.01 ms) + (0.05)(0.01 ms + 0.1 ms) = 0.0095 + 0.0055 = 0.015 ms
97
Performance Example 3
98
§5.4 Virtual Memory
Virtual Memory
99
Address Translation
• Fixed-size pages (e.g., 4K)
100
Interactions with Software
• Misses depend on memory access patterns
  • Algorithm behavior
  • Compiler optimization for memory accesses
101
Page Fault Penalty
102
Page Tables
103
Translation Using a Page Table
104
Mapping Pages to Storage
105
Replacement and Writes
106
Fast Translation Using a TLB
107
Fast Translation Using a TLB
108
TLB Misses
• If page is in memory
• Load the PTE from memory and retry
• Could be handled in hardware
• Can get complex for more complicated page table structures
• Or in software
• Raise a special exception, with optimized handler
• If page is not in memory (page fault)
• OS handles fetching the page and updating the page table
• Then restart the faulting instruction
109
TLB Miss Handler
110
Page Fault Handler
111
TLB and Cache Interaction
• If the cache tag uses the physical address
  • Need to translate before the cache lookup
• Alternative: use a virtual address tag
  • Complications due to aliasing
    • Different virtual addresses for the same shared physical address
Memory Protection
113
§5.5 A Common Framework for Memory Hierarchies
The Memory Hierarchy
The BIG Picture
• Common principles apply at all levels of the memory
hierarchy
• Based on notions of caching
• At each level in the hierarchy
• Block placement
• Finding a block
• Replacement on a miss
• Write policy
114
Block Placement
• Determined by associativity
• Direct mapped (1-way associative)
• One choice for placement
• n-way set associative
• n choices within a set
• Fully associative
• Any location
• Higher associativity reduces miss rate
• Increases complexity, cost, and access time
115
Finding a Block
  Associativity          | Location method                              | Tag comparisons
  Direct mapped          | Index                                        | 1
  n-way set associative  | Set index, then search entries within set    | n
  Fully associative      | Search all entries                           | #entries
                         | Full lookup table                            | 0
• Hardware caches
• Reduce comparisons to reduce cost
• Virtual memory
• Full table lookup makes full associativity feasible
• Benefit in reduced miss rate
116
Replacement
117
Write Policy
• Write-through
• Update both upper and lower levels
• Simplifies replacement, but may require write buffer
• Write-back
• Update upper level only
• Update lower level when block is replaced
• Need to keep more state
• Virtual memory
• Only write-back is feasible, given disk write latency
118
Sources of Misses
• Compulsory misses (cold-start misses): the first access to a block
• Capacity misses: due to finite cache size; a replaced block is later accessed again
• Conflict misses (collision misses): due to competition for entries in a set in a non-fully-associative cache
119
§5.6 Virtual Machines
Virtual Machines
120
Virtual Machine Monitor
121
Example: Timer Virtualization
122
Instruction Set Support
123
§5.7 Using a Finite State Machine to Control A Simple Cache
Cache Control
• Example cache characteristics
• Direct-mapped, write-back, write allocate
• Block size: 4 words (16 bytes)
• Cache size: 16 KB (1024 blocks)
• 32-bit byte addresses
• Valid bit and dirty bit per block
• Blocking cache
• CPU waits until access is complete
• Address fields: Tag (18 bits) = bits [31:14], Index (10 bits) = bits [13:4], Offset (4 bits) = bits [3:0]
124
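One way to picture this organization as a data structure is the following C sketch (illustrative; the field layout and names are assumptions, not the hardware implementation):

    #include <stdio.h>
    #include <stdint.h>

    /* One entry of the cache described on this slide: direct mapped, 1024   */
    /* blocks of 4 words (16 bytes), a valid bit and a dirty bit per block.  */
    #define CACHE_BLOCKS    1024
    #define WORDS_PER_BLOCK 4

    typedef struct {
        unsigned valid : 1;
        unsigned dirty : 1;
        unsigned tag   : 18;                /* 32 - 10 index bits - 4 offset bits */
        uint32_t data[WORDS_PER_BLOCK];     /* 16 bytes of block data             */
    } cache_line;

    static cache_line cache[CACHE_BLOCKS];  /* indexed by address bits [13:4]     */

    int main(void)
    {
        /* Data storage alone: 1024 blocks x 16 bytes = 16 KB, as stated above. */
        printf("data bytes = %zu\n", sizeof cache[0].data * (size_t)CACHE_BLOCKS);
        return 0;
    }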
Interface Signals
• CPU ↔ Cache: Read/Write, Valid, Address (32 bits), Write Data (32 bits), Read Data (32 bits), Ready
• Cache ↔ Memory: Read/Write, Valid, Address (32 bits), Write Data (128 bits), Read Data (128 bits), Ready
• Memory takes multiple cycles per access
125
Finite State Machines
• Use an FSM to sequence control steps
• Set of states, transition on each clock edge
  • State values are binary encoded
  • Current state stored in a register
• Next state = fn(current state, current inputs)
• Control output signals = fo(current state)
126
Cache Controller FSM
• Four states: Idle, Compare Tag (determine hit or miss), Write-Back (write the dirty block to memory), Allocate (fetch the new block from memory)
• Could partition states further to reduce the clock cycle time
127
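A behavioural C sketch of such a controller is shown below; the four states follow the standard textbook FSM for this write-back, write-allocate cache, and the boolean inputs are simplifications (not the real interface signals).

    #include <stdbool.h>
    #include <stdio.h>

    /* Four controller states of the textbook FSM (sketch only). */
    typedef enum { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE } cache_state;

    /* One state transition; the inputs are simplified to booleans. */
    static cache_state next_state(cache_state s, bool cpu_request, bool hit,
                                  bool dirty, bool mem_ready)
    {
        switch (s) {
        case IDLE:        /* wait for a valid CPU request                      */
            return cpu_request ? COMPARE_TAG : IDLE;
        case COMPARE_TAG: /* hit: done; miss: write back if dirty, else fetch  */
            if (hit) return IDLE;
            return dirty ? WRITE_BACK : ALLOCATE;
        case WRITE_BACK:  /* copy the dirty victim block to memory first       */
            return mem_ready ? ALLOCATE : WRITE_BACK;
        case ALLOCATE:    /* fetch the new block, then re-check the tag        */
            return mem_ready ? COMPARE_TAG : ALLOCATE;
        }
        return IDLE;
    }

    int main(void)
    {
        /* A read miss to a clean block: Idle -> Compare Tag -> Allocate ...   */
        cache_state s = next_state(IDLE, true, false, false, false);
        s = next_state(s, true, false, false, false);   /* miss on clean block */
        printf("state after miss detection = %d (ALLOCATE)\n", (int)s);
        return 0;
    }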
2-Level TLB Organization
• Virtual address: Intel Nehalem 48 bits; AMD Opteron X4 48 bits
• Physical address: Nehalem 44 bits; Opteron X4 48 bits
• Page size: Nehalem 4KB, 2/4MB; Opteron X4 4KB, 2/4MB
• L1 TLB (per core)
  • Nehalem: L1 I-TLB 128 entries for small pages, 7 per thread (2×) for large pages; L1 D-TLB 64 entries for small pages, 32 for large pages; both 4-way, LRU replacement
  • Opteron X4: L1 I-TLB 48 entries; L1 D-TLB 48 entries; both fully associative, LRU replacement
• L2 TLB (per core)
  • Nehalem: single L2 TLB, 512 entries, 4-way, LRU replacement
  • Opteron X4: L2 I-TLB 512 entries; L2 D-TLB 512 entries; both 4-way, round-robin LRU
• TLB misses: handled in hardware on both
128
3-Level Cache Organization
• L1 caches (per core)
  • Nehalem: L1 I-cache 32KB, 64-byte blocks, 4-way, approx LRU replacement, hit time n/a; L1 D-cache 32KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a
  • Opteron X4: L1 I-cache 32KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles; L1 D-cache 32KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles
• L2 unified cache (per core)
  • Nehalem: 256KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a
  • Opteron X4: 512KB, 64-byte blocks, 16-way, approx LRU replacement, write-back/allocate, hit time n/a
• L3 unified cache (shared)
  • Nehalem: 8MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a
  • Opteron X4: 2MB, 64-byte blocks, 32-way, replace the block shared by the fewest cores, write-back/allocate, hit time 32 cycles
n/a: data not available
129
Miss Penalty Reduction
130
§5.11 Fallacies and Pitfalls
Pitfalls
131
Pitfalls
132
Pitfalls
133
§5.12 Concluding Remarks
Concluding Remarks
134