Chapter V - Large and Fast - Exploiting Memory Hierarchy
Contents
1 Introduction to Memory Hierarchy
2 Memory Technologies
6 Virtual Machines
6.1 Virtual Machines (VMs)
6.2 Virtual Machine Monitor (VMM)
6.3 Instruction Set Support
7 Virtual Memory
7.1 Virtual Memory
7.2 Address Translation
7.3 Speeding Up Translation: TLB
7.4 Page Faults and Their Cost
7.5 Handling TLB Misses and Page Faults
7.6 Page Replacement and Writes
7.7 Memory Protection
10 Cache Coherence
10.1 Cache Coherence
10.2 Coherence Rules
10.3 Cache Coherence Protocols
10.4 Invalidating Snooping Protocols
10.5 Memory Consistency
1 Introduction to Memory Hierarchy
The principle of locality states that programs access a relatively small portion of their address space
at any instant of time, just as you accessed a very small portion of the library’s collection. There are two
different types of locality:
• Temporal locality (locality in time): If an item is accessed, it will tend to be referenced again soon
• Spatial locality (locality in space): If an item is referenced, items whose addresses are close by will
tend to be referenced soon
We take advantage of the principle of locality by implementing the memory of a computer as a memory
hierarchy. A memory hierarchy:
• Is a structure that uses multiple levels of memories (of different speeds and sizes); everything is ultimately
stored on disk
• Copies recently accessed (and nearby) items from disk to smaller DRAM memory (main memory)
• Copies the most recently accessed (and nearby) items from DRAM to smaller SRAM memory (cache
memory attached to the CPU)
• As the distance from the processor increases, the size of the memories and the access time both increase
• The faster memories are more expensive per bit than the slower memories and thus are smaller
• The goal is to present the user with as much memory as is available in the cheapest technology, while
providing access at the speed offered by the fastest memory
Memory Hierarchy Levels
Within each level, the unit of information that is present or not is called a block or a line. Usually we
transfer an entire block when we copy something between levels.
Block (or line) The minimum unit of information that can be either present or not present in a cache
Hit: The data requested by the processor is found in some block in the upper level. The hit rate, or hit ratio
(hits/accesses), is often used as a measure of the performance of the memory hierarchy.
Miss: The data requested is not found in the upper level; the lower level in the hierarchy is then accessed to
retrieve the block containing the requested data. The miss rate (= 1 − hit rate) is the fraction of memory
accesses not found in the upper level.
2 Memory Technologies
Static RAM (SRAM) Technology: Integrated circuits that are memory arrays with a single access port that
can provide either a read or a write; SRAMs do not need to be refreshed and use 6-8 transistors per bit to
prevent the information from being disturbed when read.
Dynamic RAM (DRAM) Technology: The data kept in a cell is stored as a charge in a capacitor; because
DRAMs use only a single transistor per bit of storage, they are much denser and cheaper per bit than SRAM.
As DRAMs store the charge on a capacitor, it cannot be kept indefinitely and must periodically be refreshed.
To refresh a cell, its contents are read and written back; the charge can be kept for several milliseconds.
DRAMs use a two-level decoding structure and refresh an entire row at a time.
An ideal memory would combine the access time of SRAM with the capacity and cost/GB of disk.
Modern DRAMs (DDR3) are organized in banks; each bank consists of a series of rows. A DRAM can
access an entire row, and in burst mode it supplies successive words from a row with reduced latency.
Double data rate (DDR) DRAM: Data transfers on both the rising and falling edges of the clock, thereby
getting twice as much bandwidth as expected based on the clock rate and the data width. The latest version
of this technology is called DDR4.
Quad data rate (QDR) DRAM: Separate DDR inputs and outputs.
DRAM Performance Factors
• Row buffer: Allow several words to be read and refreshed in parallel (repeated access)
• Synchronous DRAM(SDRAMs): Allow for consecutive accesses in bursts without needing to send each
address, improve the memory bandwidth of the chip
• DRAM banking: Allow simultaneous access to multiple DRAMs and improve the memory bandwidth
of the chip
Flash Memory: A nonvolatile semiconductor storage which is 100x - 1000x faster than disk, but smaller, lower
power, more robust, and more expensive per GB.
Types of flash:
• NOR flash: bit cell like a NOR gate
– Random read/write access
– Used for instruction memory in embedded systems
• NAND flash: bit cell like a NAND gate
– Denser (bits/area), but block-at-a-time access
– Cheaper per GB
– Used for USB keys, media storage, etc.
• Flash bits wear out after 1000s of erase/write cycles
– Not suitable for direct RAM or disk replacement
– Wear leveling: remap data to less used blocks
Disk memory: Nonvolatile, rotating magnetic storage. The time to access a sector consists of:
• Seek: move the heads
• Rotational latency
• Data transfer
• Controller overhead
Some disk performance issues:
• Manufacturers quote average seek time
– Based on all possible seeks
– Locality and OS scheduling lead to smaller actual average seek times
• Smart disk controller allocates physical sectors on disk
– Present logical sector interface to host
– SCSI, ATA, SATA
• Before the request, the cache contains a collection of recent references X1, X2, ..., Xn−1, and the processor
requests a word Xn that is not in the cache; this request results in a miss, and the word Xn is brought
from memory into the cache
Figure 5: The cache before and after a reference to a word Xn that is not initially in the cache
Figure 6: A direct-mapped cache with eight blocks showing the addresses of memory words between 0 and
31 that map to the same cache locations
• The tag needs only to contain the upper portion of the address, corresponding to the bits that are not
used as an index into the cache; this field is called the tag
What if there is no data in a location? A valid bit is added to each cache entry to indicate whether the
entry contains valid data (it is initially 0, meaning the entry is empty).
Example: Given eight blocks in the cache, the low-order three bits of a block address give the cache block number
For the following situation:
• 32-bit addresses
• A direct-mapped cache
• The cache size is 2^n blocks, so n bits are used for the index
• The block size is 2^m words (2^(m+2) bytes), so m bits are used for the word within the block, and two
bits are used for the byte part of the address
⇒ The size of the tag field is:
32 − (n + m + 2)
⇒ The total number of bits in a direct-mapped cache is:
2^n × (block size + tag size + valid field size) = 2^n × (2^m × 32 + (32 − n − m − 2) + 1)
Figure 7: 4 KiB cache
Example: How many total bits are required for a direct-mapped cache with 16 KiB of data and 4-word blocks,
assuming a 32-bit address?
We know that 16 KiB is 4096 (2^12) words. With a block size of 4 words (2^2), there are 1024 (2^10) blocks.
Each block has 4 × 32 = 128 bits of data plus a tag, which is 32 − 10 − 2 − 2 = 18 bits, plus a valid bit. Thus, the
total cache size is
2^10 × (128 + 18 + 1) = 2^10 × 147 = 147 Kibibits,
or 18.4 KiB for a 16 KiB cache.
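As a sanity check, here is a minimal C sketch (not from the text) that evaluates this formula for a direct-mapped cache with 2^n blocks and 2^m words per block:

#include <stdio.h>

/* Total storage of a direct-mapped cache with 2^n blocks and 2^m words
 * per block, for a 32-bit address. */
long cache_bits(int n, int m) {
    long blocks    = 1L << n;
    long data_bits = (1L << m) * 32;             /* 32 bits per word */
    long tag_bits  = 32 - n - m - 2;             /* index, word offset, byte offset removed */
    return blocks * (data_bits + tag_bits + 1);  /* +1 for the valid bit */
}

int main(void) {
    /* The 16 KiB / 4-word-block example: n = 10, m = 2. */
    printf("%ld bits (= %ld Kibibits)\n", cache_bits(10, 2), cache_bits(10, 2) / 1024);
    return 0;
}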
Example: Consider a cache with 64 blocks and a block size of 16 bytes. To what block number does byte
address 1200 map?
The block address is ⌊1200/16⌋ = 75, which maps to cache block number 75 modulo 64 = 11. This block holds
all bytes with addresses between:
Start address = 75 × 16 = 1200
End address = 75 × 16 + 16 − 1 = 1215
⇒ This block maps all addresses between 1200 and 1215.
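The same decomposition can be written as a few lines of C (parameters taken from this example):

#include <stdio.h>

/* The mapping used in the example above: 64 blocks of 16 bytes each, direct-mapped. */
#define BLOCK_BYTES 16
#define NUM_BLOCKS  64

int main(void) {
    unsigned byte_addr  = 1200;
    unsigned block_addr = byte_addr / BLOCK_BYTES;   /* 75 */
    unsigned index      = block_addr % NUM_BLOCKS;   /* cache block 11 */
    unsigned start      = block_addr * BLOCK_BYTES;  /* 1200 */
    unsigned end        = start + BLOCK_BYTES - 1;   /* 1215 */
    printf("block address %u -> cache block %u (bytes %u..%u)\n",
           block_addr, index, start, end);
    return 0;
}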
Write-back
In a write-back scheme:
• On a data-write hit, just update the cache, not the memory
• The cache keeps track of which blocks are "dirty" (have been modified in cache but not yet written to
memory)
When a dirty block needs to be replaced
• If the cache needs to replace a dirty block (for example, to make room for new data), then it writes the
dirty block back to main memory
• Saves many memory writes compared to write-through, since only modified blocks are written back,
and only when necessary
• A write buffer can be used so the CPU doesn’t have to wait for the write to finish, especially if the
cache needs to read new data into the same location
Write allocation
A write miss occurs when the CPU wants to write data to a memory location, but that location is not
currently in the cache. There are two alternatives:
• For Write-Through Caches:
– Allocate on Miss (Fetch the Block)
– Write Around (Don’t Fetch the Block)
• For Write-Back Caches:
– Usually Fetch the Block (Write Allocate)
Intrinsity FastMATH
• 12-stage pipeline
• Instruction and data access on each cycle
Split cache: separate I-cache and D-cache. Miss rates for the Intrinsity FastMATH:
• I-cache: 0.4%
• D-cache: 11.4%
• Weighted average: 3.2%
Main memory is typically DRAM, connected to the cache as follows:
• Fixed width: Each data transfer on the bus moves a fixed number of bytes (often 1 word, where a word
is typically 4 bytes for a 32-bit system)
• Connected by a fixed-width clocked bus, which runs at its own clock speed, usually slower than the
CPU clock
Example: Cache Block Read
• 1 bus cycle for address transfer: The CPU sends the address of the desired data to the memory-this
takes one bus cycle
• 15 bus cycles per DRAM access: For each word read from DRAM, there’s a delay of 15 cycles (this is
the DRAM access time)
• 1 bus cycle per data transfer: After the DRAM supplies a word, it takes 1 bus cycle to transfer that
word from memory to the cache
Example: Reading a 4-Word Block with 1-word-wide DRAM. To fetch a 4-word block:
• Address transfer: 1 bus cycle (to tell memory what to fetch)
• DRAM access: 4 × 15 = 60 bus cycles (one access per word)
• Data transfer: 4 × 1 = 4 bus cycles (one transfer per word)
⇒ Miss penalty = 1 + 4 × 15 + 4 × 1 = 65 bus cycles, for a bandwidth of 16 bytes / 65 cycles ≈ 0.25 bytes
per bus cycle.
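The same arithmetic as a tiny C sketch, using the bus parameters stated above:

#include <stdio.h>

/* Miss penalty in bus cycles for fetching a 4-word block from 1-word-wide DRAM. */
int main(void) {
    int words_per_block = 4;
    int addr_cycles = 1, dram_cycles = 15, xfer_cycles = 1;
    int penalty = addr_cycles + words_per_block * (dram_cycles + xfer_cycles);
    printf("miss penalty = %d bus cycles\n", penalty);            /* 65 */
    printf("bandwidth    = %.2f bytes/cycle\n", 16.0 / penalty);  /* about 0.25 */
    return 0;
}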
CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time
The CPU execution cycles include the cache hit time, and the memory-stall clock cycles come primarily from
cache misses. Memory-stall clock cycles can be defined as the sum of the stall cycles coming from reads plus
those coming from writes:
Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
If we assume that the write buffer stalls are negligible, we can combine the reads and writes by using a single
miss rate and the miss penalty:
Memory-stall clock cycles = (Memory accesses / Program) × Miss rate × Miss penalty
                          = (Instructions / Program) × (Misses / Instruction) × Miss penalty
The read-stall cycles can be defined in terms of the number of read accesses per program, the miss penalty
in clock cycles for a read, and the read miss rate:
Read-stall cycles = (Reads / Program) × Read miss rate × Read miss penalty
To capture the fact that the time to access data for both hits and misses affects performance, designers
sometimes use average memory access time (AMAT) as a way to examine alternative cache designs:
AMAT = Time for a hit + Miss rate × Miss penalty
Example: Assume the miss rate of an instruction cache is 2% and the miss rate of the data cache is 4%. If a
processor has a CPI of 2 without any memory stalls and the miss penalty is 100 cycles for all misses, determine
how much faster a processor would run with a perfect cache that never missed. Assume the frequency of all
loads and stores is 36%.
The number of memory miss cycles for instructions in terms of the instruction count (I) is
Instruction miss cycles = I × 2% × 100 = 2.00 × I
Since the frequency of all loads and stores is 36%, the number of memory miss cycles for data references is
Data miss cycles = I × 36% × 4% × 100 = 1.44 × I
The total number of memory-stall cycles is 2.00 I + 1.44 I = 3.44 I, which is more than three cycles of memory
stall per instruction. Accordingly, the total CPI including memory stalls is
2 + 3.44 = 5.44.
Since there is no change in instruction count or clock rate, the ratio of the CPU execution times is
CPU time with stalls / CPU time with perfect cache = (I × CPI_stall × Clock cycle) / (I × CPI_perfect × Clock cycle)
                                                   = CPI_stall / CPI_perfect = 5.44 / 2
The performance with the perfect cache is better by 5.44 / 2 = 2.72.
Example: Find the AMAT for a processor with a 1 ns clock cycle time, a miss penalty of 20 clock cycles,
a miss rate of 0.05 misses per instruction, and a cache access time (including hit detection) of 1 clock cycle.
Assume that the read and write miss penalties are the same and ignore other write stalls.
AMAT = Time for a hit + Miss rate × Miss penalty = 1 + 0.05 × 20 = 2 clock cycles
or 2 ns.
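Both worked examples above reduce to a few lines of arithmetic; here is a small C sketch using the values from the examples:

#include <stdio.h>

int main(void) {
    /* CPI with memory stalls: 2% I-cache misses, 4% D-cache misses,
     * 36% loads/stores, 100-cycle miss penalty, base CPI of 2. */
    double base_cpi = 2.0, miss_penalty = 100.0;
    double inst_stalls = 0.02 * miss_penalty;          /* 2.00 per instruction */
    double data_stalls = 0.36 * 0.04 * miss_penalty;   /* 1.44 per instruction */
    double cpi_stall = base_cpi + inst_stalls + data_stalls;
    printf("CPI with stalls = %.2f, speedup of perfect cache = %.2f\n",
           cpi_stall, cpi_stall / base_cpi);           /* 5.44 and 2.72 */

    /* AMAT: 1-cycle hit time, 0.05 miss rate, 20-cycle miss penalty. */
    double amat = 1.0 + 0.05 * 20.0;
    printf("AMAT = %.1f clock cycles\n", amat);        /* 2.0 */
    return 0;
}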
4.2 Performance Summary
• When CPU performance increases
– The miss penalty becomes more significant
• When the base CPI decreases
– A greater proportion of time is spent on memory stalls
Example:
Figure 9: The location of a memory block whose address is 12 in a cache with eight blocks varies for direct-
mapped, set-associative, and fully associative placement
Spectrum of associativity
The total size of the cache in blocks is equal to the number of sets times the associativity. Thus, for a fixed
cache size, increasing the associativity decreases the number of sets while increasing the number of elements
per set. With eight blocks, an eight-way set associative cache is the same as a fully associative cache.
Example: Assume there are three small caches, each consisting of four one-word blocks. One cache is
fully associative, a second is two-way set-associative, and the third is direct-mapped. Find the number of
misses for each cache organization given the following sequence of block addresses: 0, 8, 0, 6, and 8.
Figure 12: 2 way set associative
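A minimal C sketch (not the textbook's code) that simulates the three organizations for the block-address sequence above, assuming LRU replacement within each set, and counts the misses:

#include <stdio.h>

/* Count misses for a reference stream of block addresses in a 4-block cache,
 * for associativity 1 (direct-mapped), 2 (two-way), or 4 (fully associative). */
#define NUM_BLOCKS 4

static int count_misses(const int *seq, int n, int assoc) {
    int tag[NUM_BLOCKS] = {0}, last_used[NUM_BLOCKS] = {0}, valid[NUM_BLOCKS] = {0};
    int sets = NUM_BLOCKS / assoc;
    int misses = 0;

    for (int t = 0; t < n; t++) {
        int set = seq[t] % sets;            /* set index = block address mod number of sets */
        int base = set * assoc;             /* first way belonging to this set */
        int hit = -1;

        for (int w = base; w < base + assoc; w++)
            if (valid[w] && tag[w] == seq[t]) hit = w;

        if (hit < 0) {                      /* miss: pick an invalid way, else the LRU way */
            int victim = base;
            for (int w = base; w < base + assoc; w++) {
                if (!valid[w]) { victim = w; break; }
                if (last_used[w] < last_used[victim]) victim = w;
            }
            valid[victim] = 1;
            tag[victim] = seq[t];
            hit = victim;
            misses++;
        }
        last_used[hit] = t;                 /* update LRU information */
    }
    return misses;
}

int main(void) {
    int seq[] = {0, 8, 0, 6, 8};
    printf("direct-mapped:     %d misses\n", count_misses(seq, 5, 1));  /* expected 5 */
    printf("two-way:           %d misses\n", count_misses(seq, 5, 2));  /* expected 4 */
    printf("fully associative: %d misses\n", count_misses(seq, 5, 4));  /* expected 3 */
    return 0;
}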
Figure 15: Set Associative Cache Organization
Replacement policy (on a miss, which block do we replace?):
• Direct mapped: there is no choice
• Set associative
– Prefer a non-valid entry, if there is one
– Otherwise, choose among the entries in the set
• Least-recently used (LRU): choose the entry that has been unused for the longest time
Example: Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary
cache, and a clock rate of 4 GHz. Assume a main memory access time of 100 ns, including all the miss
handling. Suppose the miss rate per instruction at the primary cache is 2%. How much faster will the
processor be if we add a secondary cache that has a 5 ns access time for either a hit or a miss and is large
enough to reduce the miss rate to main memory to 0.5%?
The miss penalty to main memory is 100 ns / 0.25 ns per clock cycle = 400 clock cycles, so the effective CPI
with one level of caching is
Total CPI = Base CPI + Memory-stall cycles per instruction = 1.0 + 2% × 400 = 9.0
With a secondary cache, the miss penalty for an access to the secondary cache is 5 ns / 0.25 ns per clock
cycle = 20 clock cycles. If the miss is satisfied in the secondary cache, then this is the entire miss penalty.
If the miss needs to go to main memory, then the total miss penalty is the sum of the secondary cache access
time and the main memory access time.
Thus, for a two-level cache, total CPI is the sum of the stall cycles from both levels of cache and the base
CPI:
Total CPI = 1 + Primary stalls per instruction + Secondary stalls per instruction
= 1 + 2% × 20 + 0.5% × 400
= 1 + 0.4 + 2.0 = 3.4
Thus, the processor with the secondary cache is faster by 9.0 / 3.4 = 2.6
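A short C sketch reproducing the numbers in this example (4 GHz clock, 100 ns main memory, 5 ns secondary cache, 2% and 0.5% miss rates):

#include <stdio.h>

int main(void) {
    double cycle_ns    = 0.25;                  /* 4 GHz clock */
    double mem_penalty = 100.0 / cycle_ns;      /* 400 cycles */
    double l2_penalty  = 5.0 / cycle_ns;        /* 20 cycles  */

    double cpi_one_level = 1.0 + 0.02 * mem_penalty;                       /* 9.0 */
    double cpi_two_level = 1.0 + 0.02 * l2_penalty + 0.005 * mem_penalty;  /* 3.4 */

    printf("one level:  CPI = %.1f\n", cpi_one_level);
    printf("two levels: CPI = %.1f (speedup %.1f)\n",
           cpi_two_level, cpi_one_level / cpi_two_level);                  /* 2.6 */
    return 0;
}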
4.8 Interactions with Software
Misses depend on memory access patterns:
• Algorithm behavior
• Compiler optimization for memory access
Example: Comparing Quicksort and Radix Sort
Figure 16: Comparing Quicksort and Radix Sort by (a) instructions executed per item sorted, (b) time per
item sorted, and (c) cache misses per item sorted
Example: the inner loops of an unblocked DGEMM (double-precision matrix multiply, C = C + A × B), with
the matrices stored as one-dimensional column-major arrays:
void dgemm (int n, double* A, double* B, double* C)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
        {
            double cij = C[i+j*n]; /* cij = C[i][j] */
            for (int k = 0; k < n; k++)
                cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */
            C[i+j*n] = cij; /* C[i][j] = cij */
        }
}
For each iteration of the outermost loop (a single value of i), it reads all N-by-N elements of B, reads the same
N elements in what corresponds to one row of A repeatedly, and writes what corresponds to one row of N
elements of C.
Figure 18: The age of accesses to the arrays C, A, and B when BLOCKSIZE = 3
5 Dependable Memory Hierarchy
A system alternates between two states of delivered service: service accomplishment (state 1), where the
service is delivered as specified, and service interruption (state 2), where the delivered service differs from the
specified service. Transitions from state 1 to state 2 are caused by failures, and transitions from state 2 to
state 1 are called restorations.
Failures can be permanent or intermittent. A fault is a failure of a component that may or may not lead to
system failure.
5.2 The Hamming SEC Code
• The Hamming distance between two binary bit patterns is the number of positions where the bits are
different. Example: 1011 vs 1110 → Hamming distance is 2 (they differ in positions 2 and 4).
• Minimum distance = 2 can provide single-bit error detection
• Minimum distance = 3 can provide single-bit error correction or 2-bit error detection
To calculate a Hamming ECC code:
• Step 1: Number the bits from 1 on the left.
• Step 2: Mark all bit positions that are powers of 2 as parity bits (positions 1, 2, 4, 8, ...).
• Step 3: All other bit positions are used for data bits.
• Step 4: Each parity bit checks every bit position whose binary representation includes that parity bit's
position (bit 1 checks positions 1, 3, 5, 7, 9, 11, ...; bit 2 checks 2, 3, 6, 7, 10, 11, ...; bit 4 checks 4-7
and 12-15; bit 8 checks 8-15).
• Step 5: Set parity bits to create even parity for each group.
Figure 20: Parity bits, data bits, and field coverage in a Hamming ECC code for eight data bits
Example: Assume one byte data value is 10011010 (binary). First show the Hamming ECC code for that byte,
and then invert bit 10 and show that the ECC code finds and corrects the single-bit error.
Leaving spaces for the parity bits, the 12-bit pattern is:
__1_001_1010
Position 1 checks bits 1, 3, 5, 7, 9, and 11, which we highlight:
__1_001_1010
This group contains an even number of 1s (four), so to keep even parity we set bit 1 to 0.
Position 2 checks bits 2, 3, 6, 7, 10, 11:
0_1_001_1010
This group has an odd number of 1s (three), so we set bit 2 to 1.
Position 4 checks bits 4, 5, 6, 7, 12:
011_001_1010
Odd again (one 1), so we set bit 4 to 1.
Position 8 checks bits 8, 9, 10, 11, 12:
0111001_1010
This group contains an even number of 1s (two), so we set bit 8 to 0. The final code word is:
011100101010
Inverting bit 10 changes it to:
011100101110
Parity bit 1 is 0 (011100101110 has four 1s in its group, so even parity; this group is OK).
Parity bit 2 is 1 (011100101110 has five 1s in its group, so odd parity; there is an error somewhere).
Parity bit 4 is 1 (011100101110 has two 1s in its group, so even parity; this group is OK).
Parity bit 8 is 0 (011100101110 has three 1s in its group, so odd parity; there is an error somewhere).
Parity bits 2 and 8 are incorrect. As 2 + 8 = 10, bit 10 must be wrong. Hence, we can correct the error
by inverting bit 10:
011100101010
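A minimal C sketch (not from the text) that encodes a byte into a 12-bit Hamming SEC word and computes the syndrome, reproducing this example:

#include <stdio.h>
#include <stdint.h>

/* Even-parity Hamming SEC for 8 data bits: positions numbered 1..12 from the
 * left, parity bits at positions 1, 2, 4, 8; code[1..12] holds one bit per
 * position (code[0] is unused). */
static void hamming_encode(uint8_t data, int code[13]) {
    int data_pos[8] = {3, 5, 6, 7, 9, 10, 11, 12};   /* non-power-of-2 positions */
    for (int i = 0; i < 8; i++)                      /* MSB of the byte goes to position 3 */
        code[data_pos[i]] = (data >> (7 - i)) & 1;
    for (int p = 1; p <= 8; p <<= 1) {               /* parity positions 1, 2, 4, 8 */
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if ((pos & p) && pos != p)               /* data positions covered by this parity bit */
                parity ^= code[pos];
        code[p] = parity;                            /* even parity over the whole group */
    }
}

/* Returns 0 if all groups check out, otherwise the position of the flipped bit. */
static int hamming_syndrome(const int code[13]) {
    int syndrome = 0;
    for (int p = 1; p <= 8; p <<= 1) {
        int parity = 0;
        for (int pos = 1; pos <= 12; pos++)
            if (pos & p)
                parity ^= code[pos];
        if (parity) syndrome += p;                   /* failing checks sum to the error position */
    }
    return syndrome;
}

int main(void) {
    int code[13] = {0};
    hamming_encode(0x9A, code);                      /* 0x9A = 10011010, the byte from the example */
    for (int pos = 1; pos <= 12; pos++) printf("%d", code[pos]);
    printf("\n");                                    /* expected: 011100101010 */
    code[10] ^= 1;                                   /* inject an error in bit 10 */
    printf("syndrome = %d\n", hamming_syndrome(code));  /* expected: 10 */
    code[10] ^= 1;                                   /* correct it by flipping the bit back */
    return 0;
}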
Note: ECC DRAM uses SEC/DED (single error correcting, double error detecting) with 8 bits protecting each 64 bits.
6 Virtual Machines
6.1 Virtual Machines (VMs)
• It allows one computer (host) to emulate another (guest).
• Multiple guest systems can run on a single host - safely and independently.
• Benefits:
– Isolation: One VM crashing doesn’t affect others.
– Security: Safer testing environment for new software or malware.
– Resource Sharing: One machine can be used more efficiently.
6.2 Virtual Machine Monitor (VMM)
• It manages virtual machines and maps virtual resources to the real physical hardware.
• Guest OSes run in user mode, and switch to the VMM on privileged actions (like I/O or interrupt
control).
• The VMM handles privileged operations (such as I/O and interrupt control) on behalf of the guest OS,
mapping them onto the real hardware resources.
7 Virtual Memory
7.1 Virtual Memory
• Virtual memory allows programs to use more memory than what’s physically available (RAM).
• It gives each program its own private memory space.
• The OS and CPU work together to move data between fast RAM and slow disk storage.
7.3 Speeding Up Translation: TLB
• TLB (Translation Lookaside Buffer) is a small, fast memory inside the CPU that stores recent
page table lookups.
• A TLB hit means quick access; a miss means the CPU checks the page table (slower).
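A minimal C sketch of a tiny direct-mapped TLB; the page size, number of entries, and function names are illustrative assumptions, not taken from the text:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_BITS 12              /* 4 KiB pages: low 12 bits are the page offset */
#define TLB_ENTRIES 16            /* 16-entry direct-mapped TLB */

typedef struct {
    bool valid;
    uint32_t vpn_tag;             /* part of the virtual page number not used as the index */
    uint32_t ppn;                 /* physical page number */
} TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];

/* Translate a virtual address; returns true on a TLB hit. On a miss the caller
 * would walk the page table and refill the entry. */
bool tlb_translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;          /* virtual page number */
    uint32_t index  = vpn % TLB_ENTRIES;           /* which TLB slot to check */
    uint32_t tag    = vpn / TLB_ENTRIES;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    if (tlb[index].valid && tlb[index].vpn_tag == tag) {
        *paddr = (tlb[index].ppn << PAGE_BITS) | offset;   /* hit: fast path */
        return true;
    }
    return false;                                  /* miss: consult the page table (slower) */
}

/* Refill after a page-table walk (what the miss handler would do). */
void tlb_refill(uint32_t vaddr, uint32_t ppn) {
    uint32_t vpn   = vaddr >> PAGE_BITS;
    uint32_t index = vpn % TLB_ENTRIES;
    tlb[index] = (TlbEntry){ .valid = true, .vpn_tag = vpn / TLB_ENTRIES, .ppn = ppn };
}

int main(void) {
    uint32_t pa;
    if (!tlb_translate(0x00403ABC, &pa)) {         /* first access: TLB miss */
        tlb_refill(0x00403ABC, 0x1234);            /* pretend the page table maps it to PPN 0x1234 */
        tlb_translate(0x00403ABC, &pa);
    }
    printf("physical address = 0x%08X\n", pa);     /* 0x01234ABC */
    return 0;
}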
7.4 Page Faults and Their Cost
• On a page fault, the OS loads the page from disk into RAM - this takes millions of cycles.
• Page faults should be rare for good performance.
7.6 Page Replacement and Writes
• Pages changed in RAM are marked dirty and must be saved to disk before being replaced.
• To save time, only dirty pages are written back.
8 A common framework for Memory Hierarchies
8.1 Memory Hierarchy
• The computer memory is organized in levels: registers, cache, main memory (RAM), and disk.
• Lower levels (e.g., disk) are slower but larger; upper levels (e.g., registers) are faster but smaller.
• Each level stores data from the level below to make access faster - this is called caching.
• More associativity = fewer misses, but more hardware complexity and time.
• Cache uses tags to search for blocks quickly.
8.3 Replacement
• When the cache is full, it must decide which block to replace.
• Common strategies:
– Least Recently Used (LRU): Replace the block that hasn’t been used for the longest time.
– Random: Pick a block randomly (simpler, almost as good as LRU).
• Virtual memory systems also use similar ideas to choose which pages to remove from RAM.
• In virtual memory, only write-back is practical because writing to disk is very slow.
9 Simple Cache control
9.1 Cache Setup
• A cache is a small, fast memory that stores recently accessed data from the larger main memory, to
speed up future access.
• Let’s look at a common cache configuration:
– Direct-mapped: Each block of memory can go in only one place in the cache.
– Write-back: When the CPU changes data, it is only written to the cache. It will update the
main memory later when the cache block is replaced.
– Write-allocate: On a write miss, the cache loads the block from memory first, then writes to it.
• Example specs:
– Block size: 4 words = 16 bytes per block.
– Cache size: 16 KB = 1024 blocks.
– Addresses: 32-bit addresses are used to access memory.
• Each cache block has:
– A valid bit - tells if the block contains real data.
– A dirty bit - tells if the block was modified (used for write-back).
• This is a blocking cache, meaning the CPU waits (stalls) until the cache access is done.
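A minimal C sketch of a write on the cache configured above (direct-mapped, write-back, write-allocate, 4-word blocks, 1024 blocks); the memory-interface helpers and the small main-memory array are illustrative stand-ins, not from the text:

#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

#define BLOCK_WORDS 4
#define NUM_BLOCKS  1024

typedef struct {
    bool     valid;               /* does this block hold real data? */
    bool     dirty;               /* modified in the cache but not yet written to memory? */
    uint32_t tag;                 /* upper address bits identifying which block is cached */
    uint32_t data[BLOCK_WORDS];
} CacheBlock;

static CacheBlock cache[NUM_BLOCKS];
static uint32_t main_memory[1 << 20];     /* 4 MiB stand-in: only small addresses are covered */

static void mem_read_block(uint32_t block_addr, uint32_t *data) {
    memcpy(data, &main_memory[block_addr * BLOCK_WORDS], sizeof(uint32_t) * BLOCK_WORDS);
}

static void mem_write_block(uint32_t block_addr, const uint32_t *data) {
    memcpy(&main_memory[block_addr * BLOCK_WORDS], data, sizeof(uint32_t) * BLOCK_WORDS);
}

/* Write one word; the CPU would stall until this returns (blocking cache). */
void cache_write_word(uint32_t addr, uint32_t value) {
    uint32_t word_off   = (addr >> 2) & (BLOCK_WORDS - 1);   /* which word inside the block */
    uint32_t index      = (addr >> 4) & (NUM_BLOCKS - 1);    /* which cache block (10 bits) */
    uint32_t tag        = addr >> 14;                        /* remaining upper 18 bits */
    uint32_t block_addr = addr >> 4;                         /* block-aligned address */
    CacheBlock *b = &cache[index];

    if (!(b->valid && b->tag == tag)) {                      /* write miss */
        if (b->valid && b->dirty)                            /* write-back: flush the old dirty block */
            mem_write_block((b->tag << 10) | index, b->data);
        mem_read_block(block_addr, b->data);                 /* write-allocate: fetch the block first */
        b->valid = true;
        b->tag   = tag;
        b->dirty = false;
    }
    b->data[word_off] = value;                               /* update only the cache on a write hit */
    b->dirty = true;                                         /* remember to write it back later */
}

int main(void) {
    cache_write_word(0x1230, 42);    /* miss: allocate the block, then write */
    cache_write_word(0x1234, 43);    /* hit: same 16-byte block */
    printf("dirty = %d\n", cache[(0x1230 >> 4) & (NUM_BLOCKS - 1)].dirty);
    return 0;
}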
10 Cache Coherence
10.1 Cache Coherence
• In multi-core CPUs, each core has its own cache.
• All cores share the same main memory and might access the same data.
• If one core updates a value, but others still use old cached values, this causes inconsistency.
• Cache coherence ensures all cores see the most recent value of shared data.
• Two main classes of protocols:
– Snooping: Caches "listen" to a shared bus and watch for other reads/writes.
– Directory-based: A central directory tracks which caches have copies of each data block.
• Goals: migration of shared data into local caches (reducing bandwidth demand) and replication of
read-shared data (reducing access latency and contention).
• With an invalidating snooping protocol, a write invalidates all other cached copies; if another CPU wants
the value later, it fetches the new one (causing a cache miss).
10.5 Memory Consistency
• While coherence keeps values up-to-date, consistency controls when those updates are visible to other
CPUs.
• Example rule:
– If a CPU writes to X, then Y, other CPUs must see the new value of X before they see the new Y.
• CPUs may reorder reads for performance, but must not reorder writes.
• Memory consistency is especially important for programs where the order of operations matters.
12 RISC-V and special instructions
• RISC-V provides special instructions for managing system-level tasks, memory ordering, and communication
between user programs and the operating system.
• These are essential for OS design, system-level programming, and efficient processor control.
– ecall - Makes a system call (used to request services from the OS).
– ebreak - Triggers a breakpoint for debugging.
– sret - Returns from an exception in supervisor mode.
– wfi - Waits for an interrupt; puts CPU in low-power mode.
13.2 Cache Blocking
• Matrix multiplication needs a lot of memory access.
• Cache is much faster than main memory.
• Cache blocking breaks large matrices into smaller blocks (tiles) that fit in cache.
• This reduces cache misses and improves speed.
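A sketch of cache blocking applied to the DGEMM shown in Section 4.8, in the spirit of the textbook's blocked version; it assumes n is a multiple of BLOCKSIZE:

#define BLOCKSIZE 32

/* Compute a BLOCKSIZE x BLOCKSIZE tile of C starting at (si, sj), using the
 * tile of A starting at column sk and the tile of B starting at row sk. */
static void do_block (int n, int si, int sj, int sk, double *A, double *B, double *C)
{
    for (int i = si; i < si + BLOCKSIZE; ++i)
        for (int j = sj; j < sj + BLOCKSIZE; ++j)
        {
            double cij = C[i+j*n];              /* cij = C[i][j] */
            for (int k = sk; k < sk + BLOCKSIZE; k++)
                cij += A[i+k*n] * B[k+j*n];     /* cij += A[i][k]*B[k][j] */
            C[i+j*n] = cij;                     /* C[i][j] = cij */
        }
}

void dgemm_blocked (int n, double *A, double *B, double *C)
{
    /* n is assumed to be a multiple of BLOCKSIZE in this sketch. */
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}

Each tile of A, B, and C is reused many times while it is resident in the cache, which is what reduces the miss count relative to the unblocked version.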
14.2 Pitfalls
• Ignoring memory hierarchy when coding or compiling
⇒ Memory-unaware programs may run much slower. Optimizing for cache (e.g., loop blocking) can
double performance.
• Forgetting byte addressing or block size in cache simulation
⇒ A common mistake in exercises. Always break down addresses into block index and offset correctly.
• Using too-low associativity in shared caches
⇒ With more threads than cache ways, accidental conflicts can occur. This causes performance drops
in multi-core systems.
• Relying on average memory access time for out-of-order processors
⇒ These CPUs run multiple instructions in parallel, so AMAT doesn’t fully capture actual performance.
Use simulation instead.
• Expanding memory using visible segments instead of flat space
⇒ Segment-based addressing complicates pointers and arrays. Flat memory is more scalable and easier
to program.