
Computer Architecture - Chapter V

Large and Fast: Exploiting Memory Hierarchy


By Huynh Bao & Quynh Nhi from USTH Learning Support
June 2025

Contents

1 Introduction to Memory Hierarchy

2 Memory Technologies

3 The Basics of Caches
  3.1 Cache Memory
  3.2 Direct Mapped Cache
  3.3 Tags and Valid Bits
  3.4 Block Size Considerations
  3.5 Cache Misses
  3.6 Handling Writes
  3.7 Main Memory Supporting Caches

4 Measuring and Improving Cache Performance
  4.1 Measuring Cache Performance
  4.2 Performance Summary
  4.3 Associative Caches
  4.4 Replacement Policy
  4.5 Multilevel Caches
  4.6 Multilevel Cache Considerations
  4.7 Interactions with Advanced CPUs
  4.8 Interactions with Software
  4.9 Software Optimization via Blocking

5 Dependable memory hierarchy
  5.1 Defining Failure
  5.2 The Hamming SEC Code
  5.3 Encoding SEC
  5.4 Decoding SEC
  5.5 SEC/DED Code

6 Virtual Machines
  6.1 Virtual Machines (VMs)
  6.2 Virtual Machine Monitor (VMM)
  6.3 Instruction Set Support

7 Virtual Memory
  7.1 Virtual Memory
  7.2 Address Translation
  7.3 Speeding Up Translation: TLB
  7.4 Page Faults and Their Cost
  7.5 Handling TLB Misses and Page Faults
  7.6 Page Replacement and Writes
  7.7 Memory Protection

8 A common framework for Memory Hierarchies
  8.1 Memory Hierarchy
  8.2 Block Placement & Finding a Block
  8.3 Replacement
  8.4 Write Policy
  8.5 Sources of Misses

9 Simple Cache control
  9.1 Cache Setup
  9.2 Interface Signals
  9.3 Controlling the Cache with Finite State Machines (FSM)

10 Cache Coherence
  10.1 Cache Coherence
  10.2 Coherence Rules
  10.3 Cache Coherence Protocols
  10.4 Invalidating Snooping Protocols
  10.5 Memory Consistency

11 ARM Cortex A53 and Intel Core i7 Memory hierarchies
  11.1 Multilevel On-Chip Caches
  11.2 2-Level TLB Organization
  11.3 Supporting Multiple Issue

12 RISC-V and special instructions
  12.1 Memory and Instruction Fences
  12.2 CSR (Control and Status Register) Instructions
  12.3 System Control Instructions

13 Cache blocking and matrix multiply
  13.1 DGEMM
  13.2 Cache Blocking
  13.3 Subword Parallelism
  13.4 Combining Both for Speed

14 Fallacies and Pitfalls
  14.1 Fallacies
  14.2 Pitfalls
1 Introduction to Memory Hierarchy
The principle of locality states that programs access a relatively small portion of their address space
at any instant of time, just as you access only a small portion of a library's collection at once. There are two
different types of locality:
• Temporal locality (locality in time): if an item is accessed recently, it will tend to be referenced again
soon
• Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will
tend to be referenced soon

Figure 1: The basic structure of a memory hierarchy

By taking advantage of the principle of locality, we can implement the memory of a computer as a memory
hierarchy. A memory hierarchy:
• Is a structure that uses multiple levels of memories (of different speeds and sizes); everything is stored on
disk
• Copies recently accessed (and nearby) items from disk to a smaller DRAM memory (main memory)
• Copies more recently accessed (and nearby) items from DRAM to a smaller SRAM memory (the cache
memory attached to the CPU)

• As the distance from the processor increases, the size of the memories and the access time both increase
• The faster memories are more expensive per bit than the slower memories and thus are smaller
• The goal is to present the user with as much memory as is available in the cheapest technology, while
providing access at the speed offered by the fastest memory

Memory Hierarchy Levels

Within each level, the unit of information that is present or not is called a block or a line. Usually we
transfer an entire block when we copy something between levels.

Block (or line): the minimum unit of information that can be either present or not present in a cache.

Hit: the data requested by the processor appears in some block in the upper level. The hit ratio
(= hits/accesses) is often used as a measure of the performance of the memory hierarchy.

Miss: the data requested is not found in the upper level, in which case the lower level in the hierarchy is
accessed to retrieve the block containing the requested data. The miss ratio (= 1 − hit ratio) is the fraction
of memory accesses not found in the upper level.

Figure 2: The structure of a memory hierarchy

2 Memory Technologies
Static RAM (SRAM) technology: integrated circuits that are memory arrays with a single access port that
can provide either a read or a write; SRAM does not need to be refreshed and uses 6-8 transistors per bit to
prevent the information from being disturbed when read.

Dynamic RAM (DRAM) technology: the data kept in a cell is stored as a charge on a capacitor; because
DRAMs use only a single transistor per bit of storage, they are much denser and cheaper per bit than SRAM.
Since DRAMs store the charge on a capacitor, it cannot be kept indefinitely and must be refreshed periodically.
To refresh a cell, its contents are read and written back; the charge can be kept for several milliseconds.
DRAMs use a two-level decoding structure and refresh an entire row at a time.

An ideal memory would have the access time of SRAM with the capacity and cost/GB of disk.

Figure 3: Internal organization of a DRAM

Figure 4: DRAM generations

Modern DRAMs (DDR3) are organized in banks, where each bank consists of a series of rows. A DRAM can
access an entire row and, in burst mode, supply successive words from that row with reduced latency.

Double data rate (DDR) DRAM: data is transferred on both the rising and falling edges of the clock, thereby
getting twice as much bandwidth as expected from the clock rate and the data width. The latest version of
this technology is called DDR4.
Quad data rate (QDR) DRAM: separates the DDR inputs and outputs.
DRAM Performance Factors
• Row buffer: Allow several words to be read and refreshed in parallel (repeated access)
• Synchronous DRAM(SDRAMs): Allow for consecutive accesses in bursts without needing to send each
address, improve the memory bandwidth of the chip
• DRAM banking: Allow simultaneous access to multiple DRAMs and improve the memory bandwidth
of the chip
Flash memory: nonvolatile semiconductor storage that is 100x-1000x faster than disk, but smaller, lower
power, more robust, and higher cost per GB.
Types of flash:
• NOR flash: bit cell like a NOR gate
– Random read/write access
– Used for instruction memory in embedded systems
• NAND flash: bit cell like a NAND gate
– Denser (bits/area), but block-at-a-time access
– Cheaper per GB
– Used for USB keys, media storage, etc.
• Flash bits wear out after 1000’s of accesses
– Not suitable for direct RAM or disk replacement
– Wear leveling: remap data to less used blocks
Disk memory: nonvolatile, rotating magnetic storage

Each sector records:


• Sector ID
• Data (512 bytes, 4096 bytes proposed)
• Error correcting code (ECC), used to hide defects and recording errors
• Synchronization fields and gaps
Access to a sector involves:
• Queuing delay if other accesses are pending

• Seek: move the heads
• Rotational latency
• Data transfer

• Controller overhead
Some disk performance issues:
• Manufacturers quote average seek time
– Based on all possible seeks
– Locality and OS scheduling lead to smaller actual average seek times
• Smart disk controller allocates physical sectors on disk
– Present logical sector interface to host
– SCSI, ATA, SATA

• Disk drives include caches


– Prefetch sectors in anticipation of access
– Avoid seek and rotational delay

3 The Basics of Caches


3.1 Cache Memory
• Cache: the name chosen to represent the level of the memory hierarchy between the processor and main
memory in the first commercial computer to have this extra level

• Before the request, the cache contains a collection of recent references X1, X2, ..., Xn−1, and the processor
requests a word Xn that is not in the cache; this request results in a miss, and the word Xn is brought
from memory into the cache

Figure 5: The cache before and after a reference to a word Xn that is not initially in the cache

Figure 6: A direct-mapped cache with eight blocks showing the addresses of memory words between 0 and
31 that map to the same cache locations

3.2 Direct Mapped Cache


• The simplest way to assign a location in the cache for each word in memory is to assign the cache
location based on the address of the word in memory (which is called direct mapped).

• Almost all direct-mapped caches use this mapping to find a block:

(Block address) modulo (Number of blocks in the cache)

• The number of blocks in the cache is a power of 2

• The cache index is formed from the low-order log2(cache size in blocks) bits of the address

3.3 Tags and Valid Bits


How do we know which particular block is stored in a cache location?
• Add a set of tags to the cache; a tag stores the block address

• The tag needs only to contain the upper portion of the address, corresponding to the bits that are not
used as an index into the cache
What if there is no data in a location?

• Add a valid bit to indicate whether an entry contains a valid address


• Valid bit: 1 = present, 0 = not present
• Initially 0

Example: given eight blocks in the cache, the low-order three bits of the block address give the cache block number

For the following situation:

• 32-bit addresses
• A direct-mapped cache
• The cache size is 2^n blocks, so n bits are used for the index

• The block size is 2^m words (2^(m+2) bytes), so m bits are used for the word within the block, and two
bits are used for the byte part of the address
⇒ The size of the tag field is:
32 − (n + m + 2)
⇒ The total number of bits in a direct-mapped cache is:

2^n × (block size + tag size + valid field size)

Figure 7: 4 KiB cache

Example: How many total bits are required for a direct-mapped cache with 16 KiB of data and 4-word blocks,
assuming a 32-bit address?

We know that 16 KiB is 4096 (2^12) words. With a block size of 4 words (2^2), there are 1024 (2^10) blocks.
Each block has 4 × 32 = 128 bits of data plus a tag, which is 32 − 10 − 2 − 2 bits, plus a valid bit. Thus, the
total cache size is

2^10 × (4 × 32 + (32 − 10 − 2 − 2) + 1) = 2^10 × 147 = 147 Kibibits

or 18.4 KiB for a 16 KiB cache. For this cache, the total number of bits in the cache is about 1.15 times
as many as needed just for the storage of the data.
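
As a quick check of this arithmetic, the small C program below (our own sketch; the variable names are ours) recomputes the tag width and total size for this 16 KiB, 4-word-block configuration:

#include <stdio.h>

/* Recompute the tag width and total size of a direct-mapped cache with
   16 KiB of data, 4-word (16-byte) blocks, and 32-bit addresses. */
int main(void) {
    int addr_bits = 32;
    int block_words = 4;
    int data_bytes = 16 * 1024;

    int blocks = data_bytes / (block_words * 4);         /* 1024 blocks */
    int index_bits = 0, word_bits = 0;
    while ((1 << index_bits) < blocks) index_bits++;     /* n = 10 */
    while ((1 << word_bits) < block_words) word_bits++;  /* m = 2  */

    int tag_bits = addr_bits - index_bits - word_bits - 2;   /* 18 bits */
    int bits_per_block = block_words * 32 + tag_bits + 1;    /* data + tag + valid = 147 */

    printf("tag = %d bits, total = %d Kibibits\n",
           tag_bits, blocks * bits_per_block / 1024);        /* prints 18 and 147 */
    return 0;
}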

Example: Consider a cache with 64 blocks and a block size of 16 bytes. To what block number does byte
address 1200 map?

The block is given by

(Block address) modulo (Number of blocks in the cache)

where the address of the block is

⌊Byte address / Bytes per block⌋

Notice that this block address is the block containing all addresses between

⌊Byte address / Bytes per block⌋ × Bytes per block

and

⌊Byte address / Bytes per block⌋ × Bytes per block + (Bytes per block − 1)

Thus, with 16 bytes per block, byte address 1200 is block address

⌊1200 / 16⌋ = 75

which maps to cache block number 75 mod 64 = 11.

Start address = 75 × 16 = 1200
End address = 75 × 16 + 16 − 1 = 1215
⇒ This block maps all addresses between 1200 and 1215.
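
The same decomposition can be written as a short C sketch (a helper of our own for illustration, assuming a byte-addressed machine with 4-byte words):

#include <stdio.h>
#include <stdint.h>

/* Split an address for a direct-mapped cache with 2^n blocks of 2^m words. */
void split_address(uint32_t addr, int n, int m) {
    uint32_t byte_off = addr & 0x3;                          /* 2 bits          */
    uint32_t word_off = (addr >> 2) & ((1u << m) - 1);       /* m bits          */
    uint32_t index    = (addr >> (2 + m)) & ((1u << n) - 1); /* n bits          */
    uint32_t tag      = addr >> (2 + m + n);                 /* remaining bits  */
    printf("addr %u -> tag %u, index %u, word %u, byte %u\n",
           addr, tag, index, word_off, byte_off);
}

int main(void) {
    /* 64-block cache with 16-byte (4-word) blocks: n = 6, m = 2.
       Byte address 1200 is block address 75, which maps to cache index 11. */
    split_address(1200, 6, 2);
    return 0;
}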

3.4 Block Size Considerations


• Larger blocks exploit spatial locality to lower miss rates
• In a fixed-sized cache:
– Increasing the block size usually decreases the miss rate
– The number of blocks that can be held in the cache will become small, and there will be a great
deal of competition for those blocks
• The miss penalty is determined by the time required to fetch the block from the next lower level of the
hierarchy and load it into the cache
– For blocks that are too large, the increase in the miss penalty overwhelms the decrease in the miss
rate
– Cache performance thus decreases
– Requested word first (critical word first) and early restart can help

3.5 Cache Misses


A cache miss is a request for data from the cache that cannot be filled because the data is not present in the
cache.
• On a cache hit, the CPU proceeds normally
• On a cache miss
– Stall the CPU pipeline
– Fetch the block from the next level of the hierarchy
– An instruction cache miss causes the CPU to restart the instruction fetch
– A data cache miss causes the CPU to wait for the data block to be fetched before the data
access can complete

3.6 Handling Writes


Write-through
When the CPU writes data (a store instruction) and that data is in the cache (data-write hit), there are two
choices:
• Only update the cache (fast, but cache and memory may become inconsistent)
• Write-through: Update both the cache and the main memory at the same time, so they always match
Problems with write-through:
• Slower Writes: Every store by the CPU causes a memory write. Since writing to memory is slow (for
example, 100 cycles), this can significantly slow down the system
• Example: If 10% of instructions are stores and each write to memory takes 100 cycles, the Effective
CPI = 1 + 0.1 × 100 = 11. This means the CPU spends a lot more time waiting for writes to finish
Solution is using write buffer:
• A queue that holds data while the data is waiting to be written to memory
• The CPU puts the data in the write buffer and continues right away ⇒ it doesn’t have to wait for the
slow memory write
• The only time the CPU stalls is if the write buffer is full

Write-back
In a write-back scheme:
• On a data-write hit, just update the cache, not the memory
• The cache keeps track of which blocks are "dirty" (have been modified in cache but not yet written to
memory)
When a dirty block needs to be replaced
• If the cache needs to replace a dirty block (for example, to make room for new data), then it writes the
dirty block back to main memory
• Saves many memory writes compared to write-through, since only modified blocks are written back,
and only when necessary
• A write buffer can be used so the CPU doesn’t have to wait for the write to finish, especially if the
cache needs to read new data into the same location

Write allocation
A write miss occurs when the CPU wants to write data to a memory location, but that location is not
currently in the cache. There are two alternatives:
• For Write-Through Caches:
– Allocate on Miss (Fetch the Block)
– Write Around (Don’t Fetch the Block)
• For Write-Back Caches:
– Usually Fetch the Block (Write Allocate)

Intrinsity FastMATH

Figure 8: The 16 KiB caches in the Intrinsity FastMATH

Embedded MIPS Processor

• 12-stage pipeline
• Instruction and data access on each cycle

Split Cache: Separate I-cache and D-cache

• Each 16KB: 256 blocks × 16 words/block


• D-cache: write-through or write-back

SPEC2000 Miss Rates

• I-cache: 0.4%
• D-cache: 11.4%
• Weighted average: 3.2%

3.7 Main Memory Supporting Caches


Use DRAMs for Main Memory:
• DRAM is commonly used for a computer’s main memory

• Fixed width: each data transfer on the bus moves a fixed number of bytes (often 1 word, where a word
is typically 4 bytes for a 32-bit system)
• Memory is connected by a fixed-width clocked bus, which runs at its own clock speed, usually slower than
the CPU clock
Example: Cache Block Read

• 1 bus cycle for address transfer: the CPU sends the address of the desired data to the memory; this
takes one bus cycle
• 15 bus cycles per DRAM access: For each word read from DRAM, there’s a delay of 15 cycles (this is
the DRAM access time)

• 1 bus cycle per data transfer: After the DRAM supplies a word, it takes 1 bus cycle to transfer that
word from memory to the cache
Example: reading a 4-word block with a 1-word-wide DRAM. To fetch a 4-word block:
• Address transfer: 1 cycle (to tell memory what to fetch)

• DRAM access: each of the 4 words takes 15 cycles, so 4 × 15 = 60 cycles

• Data transfer: 1 cycle per word, so 4 × 1 = 4 cycles
⇒ Total miss penalty: 1 + 4 × 15 + 4 × 1 = 65 bus cycles. For a 4-word block, 16 bytes are transferred, and it
takes 65 bus cycles to fetch those 16 bytes
⇒ Bandwidth = 16 bytes / 65 cycles ≈ 0.25 bytes/cycle
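
The same bookkeeping, as a minimal C sketch using the assumed cycle counts from this example (the function name is ours):

#include <stdio.h>

/* Miss penalty in bus cycles for a 1-word-wide DRAM:
   1 address cycle + 15 access cycles per word + 1 transfer cycle per word. */
int miss_penalty(int words_per_block) {
    return 1 + words_per_block * 15 + words_per_block * 1;
}

int main(void) {
    int words = 4;
    int cycles = miss_penalty(words);            /* 65 bus cycles   */
    double bw = (words * 4.0) / cycles;          /* bytes per cycle */
    printf("penalty = %d cycles, bandwidth = %.2f B/cycle\n", cycles, bw);
    return 0;   /* prints: penalty = 65 cycles, bandwidth = 0.25 B/cycle */
}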

4 Measuring and Improving Cache Performance


4.1 Measuring Cache Performance
CPU time can be divided into the clock cycles that the CPU spends executing the program and the clock
cycles that the CPU spends waiting for the memory system:

CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time

The program execution cycles includes cache hit time and the memory-stall clock cycles come primarily from
cache misses. Memory-stall clock cycles can be defined as the sum of the stall cycles coming from reads plus
those coming from writes:

Memory-stall clock cycles = (Read-stall cycles + Write-stall cycles)

If we assume that the write buffer stalls are negligible, we can combine the reads and writes by using a single
miss rate and the miss penalty:

Memory-stall clock cycles = (Memory accesses / Program) × Miss rate × Miss penalty

Memory-stall clock cycles = (Instructions / Program) × (Misses / Instruction) × Miss penalty

The read-stall cycles can be defined in terms of the number of read accesses per program, the miss penalty
in clock cycles for a read, and the read miss rate:

Read-stall cycles = (Reads / Program) × Read miss rate × Read miss penalty

To capture the fact that the time to access data for both hits and misses affects performance, designers
sometimes use the average memory access time (AMAT) as a way to examine alternative cache designs:

AMAT = Time for a hit + Miss rate × Miss penalty

Example: Assume the miss rate of an instruction cache is 2% and the miss rate of the data cache is 4%. If a
processor has a CPI of 2 without any memory stalls and the miss penalty is 100 cycles for all misses, determine
how much faster a processor would run with a perfect cache that never missed. Assume the frequency of all
loads and stores is 36%.

The number of memory miss cycles for instructions in terms of the Instruction count (I) is

Instruction miss cycles = I × 2% × 100 = 2.00 × I


As the frequency of all loads and stores is 36%, we can find the number of memory miss cycles for data
references:

Data miss cycles = I × 36% × 4% × 100 = 1.44 × I


The total number of memory-stall cycles is

2.00I + 1.44I = 3.44I.

This is more than three cycles of memory stall per instruction. Accordingly, the total CPI including memory
stalls is
2 + 3.44 = 5.44.
Since there is no change in instruction count or clock rate, the ratio of the CPU execution times is

CPU time with stalls / CPU time with perfect cache = (I × CPI_stall × Clock cycle) / (I × CPI_perfect × Clock cycle) = CPI_stall / CPI_perfect = 5.44 / 2

The performance with the perfect cache is better by

5.44 / 2 = 2.72

Example: Find the AMAT for a processor with a 1 ns clock cycle time, a miss penalty of 20 clock cycles,
a miss rate of 0.05 misses per instruction, and a cache access time (including hit detection) of 1 clock cycle.
Assume that the read and write miss penalties are the same and ignore other write stalls.

The average memory access time per instruction is

AMAT = Time for a hit + Miss rate × Miss penalty = 1 + 0.05 × 20 = 2 clock cycles
or 2 ns.
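
Both of these examples can be reproduced with a few lines of C (a sketch using the numbers given above):

#include <stdio.h>

int main(void) {
    /* First example: CPI with memory stalls. */
    double base_cpi = 2.0, miss_penalty = 100.0;
    double i_miss = 0.02, d_miss = 0.04, ls_freq = 0.36;

    double stall_cpi = i_miss * miss_penalty              /* 2.00 */
                     + ls_freq * d_miss * miss_penalty;   /* 1.44 */
    double cpi = base_cpi + stall_cpi;                    /* 5.44 */
    printf("CPI with stalls = %.2f, speedup with perfect cache = %.2f\n",
           cpi, cpi / base_cpi);                          /* 5.44, 2.72 */

    /* Second example: AMAT = hit time + miss rate x miss penalty. */
    double hit_time = 1.0, miss_rate = 0.05, penalty = 20.0;
    printf("AMAT = %.1f clock cycles\n", hit_time + miss_rate * penalty);  /* 2.0 */
    return 0;
}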

4.2 Performance Summary
• When CPU performance increases
– The miss penalty becomes more significant
• Decreasing base CPI
– Greater proportion of time spent on memory stalls

• Increasing clock rate


– Memory stalls account for more CPU cycles
• Can’t neglect cache behavior when evaluating system performance

4.3 Associative Caches


• Fully associative
– Allow a given block to go in any cache entry
– Requires all entries to be searched at once
– Comparator per entry (expensive)
• n-way set associative
– Each set contains n entries
– Block number determines which set:

(Block number) mod (#Sets in cache)

– Search all entries in a given set at once


– n comparators (less expensive)

Example:

Figure 9: The location of a memory block whose address is 12 in a cache with eight blocks varies for direct-
mapped, set-associative, and fully associative placement

Spectrum of associativity

Figure 10: A cache with 8 entries

The total size of the cache in blocks is equal to the number of sets times the associativity. Thus, for a fixed
cache size, increasing the associativity decreases the number of sets while increasing the number of elements
per set. With eight blocks, an eight-way set associative cache is the same as a fully associative cache.
Example: Assume there are three small caches, each consisting of four one-word blocks. One cache is
fully associative, a second is two-way set-associative, and the third is direct-mapped. Find the number of
misses for each cache organization given the following sequence of block addresses: 0, 8, 0, 6, and 8.

Figure 11: Direct mapped

Figure 12: 2 way set associative

Figure 13: Fully associative

⇒ Increased associativity decreases miss rate but with diminishing returns
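
The miss counts shown in Figures 11-13 can be reproduced with a small simulation. The C sketch below (our own code, written just for this example) models four one-word blocks with LRU replacement and an adjustable associativity:

#include <stdio.h>

#define BLOCKS 4   /* four one-word blocks, as in the example */

/* Count misses for a stream of block addresses with associativity
   assoc = 1 (direct mapped), 2 (two-way), or 4 (fully associative),
   using LRU replacement within each set. */
int count_misses(const int *refs, int n, int assoc) {
    int sets = BLOCKS / assoc;
    int tag[BLOCKS] = {0}, age[BLOCKS] = {0}, valid[BLOCKS] = {0};
    int misses = 0, now = 0;

    for (int r = 0; r < n; r++) {
        int base = (refs[r] % sets) * assoc;
        int hit = 0;
        for (int w = 0; w < assoc; w++) {
            int e = base + w;
            if (valid[e] && tag[e] == refs[r]) { hit = 1; age[e] = ++now; break; }
        }
        if (!hit) {
            misses++;
            int victim = base;                 /* prefer an empty way, else LRU */
            for (int w = 0; w < assoc; w++) {
                int e = base + w;
                if (!valid[e]) { victim = e; break; }
                if (age[e] < age[victim]) victim = e;
            }
            valid[victim] = 1;
            tag[victim] = refs[r];
            age[victim] = ++now;
        }
    }
    return misses;
}

int main(void) {
    int refs[] = {0, 8, 0, 6, 8};
    printf("direct mapped:     %d misses\n", count_misses(refs, 5, 1));  /* 5 */
    printf("2-way set assoc:   %d misses\n", count_misses(refs, 5, 2));  /* 4 */
    printf("fully associative: %d misses\n", count_misses(refs, 5, 4));  /* 3 */
    return 0;
}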


Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000:

Figure 14: The data cache miss rates

Figure 15: Set Associative Cache Organization

4.4 Replacement Policy


• Direct mapped: no choice

• Set associative
– Prefer non-valid entry, if there is one
– Otherwise, choose among entries in the set
• Least-recently used (LRU)

– Choose the one unused for the longest time


– Simple for 2-way, manageable for 4-way, too hard beyond that
• Random
– Gives approximately the same performance as LRU for high associativity

4.5 Multilevel Caches


• Primary cache attached to CPU

– Small, but fast


• Level-2 cache services misses from primary cache
– Larger, slower, but still faster than main memory
• Main memory services L-2 cache misses

• Some high-end systems include L-3 cache

Example: Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary
cache, and a clock rate of 4 GHz. Assume a main memory access time of 100 ns, including all the miss
handling. Suppose the miss rate per instruction at the primary cache is 2%. How much faster will the
processor be if we add a secondary cache that has a 5 ns access time for either a hit or a miss and is large
enough to reduce the miss rate to main memory to 0.5%?

The miss penalty to main memory is

100 ns / (0.25 ns per clock cycle) = 400 clock cycles

The effective CPI with one level of caching is given by

Total CPI = Base CPI + Memory-stall cycles per instruction

For the processor with one level of caching,

Total CPI = 1.0 + Memory-stall cycles per instruction = 1.0 + 2% × 400 = 9

With two levels of caching, a miss in the primary (or first-level) cache can be satisfied either by the
secondary cache or by main memory. The miss penalty for an access to the second-level cache is

5 ns / (0.25 ns per clock cycle) = 20 clock cycles

If the miss is satisfied in the secondary cache, then this is the entire miss penalty. If the miss needs to
go to main memory, then the total miss penalty is the sum of the secondary cache access time and the main
memory access time.
Thus, for a two-level cache, the total CPI is the sum of the stall cycles from both levels of cache and the base
CPI:

Total CPI = 1 + Primary stalls per instruction + Secondary stalls per instruction
= 1 + 2% × 20 + 0.5% × 400
= 1 + 0.4 + 2.0 = 3.4

Thus, the processor with the secondary cache is faster by

9.0 / 3.4 ≈ 2.6
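
The same calculation, written out as a small C sketch with the example's numbers:

#include <stdio.h>

int main(void) {
    double clock_ns = 0.25;                 /* 4 GHz clock                 */
    double base_cpi = 1.0, l1_miss = 0.02;  /* L1 misses per instruction   */

    /* One level of cache: every L1 miss goes to main memory (100 ns). */
    double mem_penalty = 100.0 / clock_ns;                 /* 400 cycles   */
    double cpi_one = base_cpi + l1_miss * mem_penalty;     /* 9.0          */

    /* Two levels: every L1 miss pays the 5 ns L2 access; only the 0.5%
       that also miss in L2 pay the full trip to main memory. */
    double l2_penalty = 5.0 / clock_ns;                    /* 20 cycles    */
    double l2_miss = 0.005;
    double cpi_two = base_cpi + l1_miss * l2_penalty + l2_miss * mem_penalty;  /* 3.4 */

    printf("CPI: %.1f vs %.1f, speedup = %.1f\n",
           cpi_one, cpi_two, cpi_one / cpi_two);           /* 9.0, 3.4, 2.6 */
    return 0;
}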

4.6 Multilevel Cache Considerations


• Primary cache
– Focus on minimal hit time
• L-2 cache
– Focus on low miss rate to avoid main memory access
– Hit time has less overall impact
• Results
– L-1 cache usually smaller than a single cache
– L-1 block size smaller than L-2 block size

4.7 Interactions with Advanced CPUs


• Out-of-order CPUs can execute instructions during cache miss
– Pending store stays in load/store unit
– Dependent instructions wait in reservation stations
∗ Independent instructions continue
• Effect of miss depends on program data flow
– Much harder to analyse
– Use system simulation

4.8 Interactions with Software
Misses depend on memory access patterns:
• Algorithm behavior
• Compiler optimization for memory access
Example: Compare between Quicksort and Radix Sort

Figure 16: Comparing Quicksort and Radix Sort by (a) instructions executed per item sorted, (b) time per
item sorted, and (c) cache misses per item sorted

4.9 Software Optimization via Blocking


The goal is to maximize accesses to data before it is replaced.
Example: the two inner loops of DGEMM (these sit inside an outer loop over i):

for (int j = 0; j < n; ++j)
{
    double cij = C[i+j*n]; /* cij = C[i][j] */
    for( int k = 0; k < n; k++ )
        cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */
    C[i+j*n] = cij; /* C[i][j] = cij */
}
It reads all N-by-N elements of B, reads the same N elements in what corresponds to one row of A repeatedly,
and writes what corresponds to one row of N elements of C.

Figure 17: A snapshot of the three arrays C, A, and B when N = 6 and i = 1

A cache block version of DGEMM:


#define BLOCKSIZE 32
void do_block (int n, int si, int sj, int sk, double *A, double *B, double *C)
{
    for (int i = si; i < si+BLOCKSIZE; ++i)
        for (int j = sj; j < sj+BLOCKSIZE; ++j)
        {
            double cij = C[i+j*n];  /* cij = C[i][j] */
            for( int k = sk; k < sk+BLOCKSIZE; k++ )
                cij += A[i+k*n] * B[k+j*n];  /* cij += A[i][k]*B[k][j] */
            C[i+j*n] = cij;  /* C[i][j] = cij */
        }
}

void dgemm (int n, double* A, double* B, double* C)
{
    for ( int sj = 0; sj < n; sj += BLOCKSIZE )
        for ( int si = 0; si < n; si += BLOCKSIZE )
            for ( int sk = 0; sk < n; sk += BLOCKSIZE )
                do_block(n, si, sj, sk, A, B, C);
}

Figure 18: The age of accesses to the arrays C, A, and B when BLOCKSIZE = 3

Figure 19: Performance of unoptimized DGEMM versus cache blocked DGEMM

5 Dependable memory hierarchy


5.1 Defining Failure
Users can then see a system alternating between two states of delivered service with respect to the service
specification:
• Service accomplishment, where the service is delivered as specified
• Service interruption, where the delivered service is different from the specified service

Transitions from state 1 to state 2 are caused by failures, and transitions from state 2 to state 1 are called
restorations.
Failures can be permanent or intermittent. A fault is the failure of a component that may or may not lead to
a system failure.

• Reliability: mean time to failure (MTTF)


• Service interruption: mean time to repair (MTTR)
• Mean time between failures (MTBF) = MTTF + MTTR

• Availability = MTTF / (MTTF + MTTR) ⇒ Improving Availability by


– Increase MTTF: fault avoidance, fault tolerance, fault forecasting
– Reduce MTTR: improved tools and processes for diagnosis and repair

5.2 The Hamming SEC Code
• The Hamming distance between two binary bit patterns is the number of positions where the bits are
different. Example: 1011 vs 1110 → Hamming distance is 2 (they differ in positions 2 and 4).
• Minimum Distance = 2 can provide a single bit error detection
• Minimum Distance = 3 can provide single error correction and 2-bit error detection

5.3 Encoding SEC


To calculate the Hamming code:
• Step 1: Start numbering bits from 1 on the left, as opposed to the traditional numbering of the
rightmost bit being 0.
• Step 2: Mark all bit positions that are powers of 2 as parity bits (positions 1, 2, 4, 8, 16, ...).
• Step 3: All other bit positions are used for data bits (positions 3, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, ...).
• Step 4: Each parity bit at position 2^i checks every bit position whose position number has a 1 in bit i
of its binary representation (for a 12-bit code word: parity bit 1 checks positions 1, 3, 5, 7, 9, 11; parity
bit 2 checks 2, 3, 6, 7, 10, 11; parity bit 4 checks 4, 5, 6, 7, 12; parity bit 8 checks 8, 9, 10, 11, 12).

• Step 5: Set each parity bit to create even parity for its group.

Figure 20: Parity bits, data bits, and field coverage in a Hamming ECC code for eight data bits

5.4 Decoding SEC


You can then determine whether a bit is incorrect by recomputing the parity bits: the resulting pattern, read
as a binary number, gives the position of the flipped bit. Using the 12-bit code in Figure 20, parity bits = 0000
indicates no error, while parity bits = 1010 (decimal 10) indicates that bit 10 was flipped.

Example: Assume one byte data value is 10011010 (binary). First show the Hamming ECC code for that byte,
and then invert bit 10 and show that the ECC code finds and corrects the single bit error.

Leaving spaces for the parity bits, the 12-bit pattern is:

_ _ 1 _ 0 0 1 _ 1 0 1 0

Position 1 checks bits 1, 3, 5, 7, 9, and 11. Those data bits contain an even number of 1s (four), so to keep
even parity we set bit 1 to 0:

0 _ 1 _ 0 0 1 _ 1 0 1 0

Position 2 checks bits 2, 3, 6, 7, 10, and 11. They contain an odd number of 1s (three), so we set bit 2 to 1:

0 1 1 _ 0 0 1 _ 1 0 1 0

Position 4 checks bits 4, 5, 6, 7, and 12. Odd again (one 1), so we set bit 4 to 1:

0 1 1 1 0 0 1 _ 1 0 1 0

Position 8 checks bits 8, 9, 10, 11, and 12. They contain an even number of 1s (two), so we set bit 8 to 0.

The final code word is:

011100101010

Inverting bit 10 changes it to:

011100101110

Recomputing the parity checks on 011100101110:

Parity check 1 is 0 (its group has four 1s, even parity; this group is OK).
Parity check 2 is 1 (its group has five 1s, odd parity; there is an error somewhere).
Parity check 4 is 0 (its group has two 1s, even parity; this group is OK).
Parity check 8 is 1 (its group has three 1s, odd parity; there is an error somewhere).

Parity checks 2 and 8 fail. As 2 + 8 = 10, bit 10 must be wrong. Hence, we can correct the error
by inverting bit 10:

011100101010
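
The encoding and decoding steps can be checked with a short C sketch (our own helper functions; bit positions are numbered 1 to 12 from the left, as in the example):

#include <stdio.h>

/* Hamming SEC over 12 bits; code[0] is unused so indices match bit positions. */

void encode(const int data[8], int code[13]) {
    int d = 0;
    for (int pos = 1; pos <= 12; pos++)
        code[pos] = (pos & (pos - 1)) ? data[d++] : 0;   /* data bit unless pos is a power of 2 */
    for (int p = 1; p <= 8; p <<= 1) {                   /* parity positions 1, 2, 4, 8 */
        int ones = 0;
        for (int pos = p + 1; pos <= 12; pos++)
            if (pos & p) ones += code[pos];              /* bits covered by this parity group */
        code[p] = ones % 2;                              /* make the group even parity */
    }
}

int syndrome(const int code[13]) {
    int s = 0;
    for (int p = 1; p <= 8; p <<= 1) {
        int ones = 0;
        for (int pos = 1; pos <= 12; pos++)
            if (pos & p) ones += code[pos];
        if (ones % 2) s += p;        /* a failing check contributes its position value */
    }
    return s;                        /* 0 = no error, otherwise the flipped bit's position */
}

void print_code(const int code[13]) {
    for (int i = 1; i <= 12; i++) printf("%d", code[i]);
    printf("\n");
}

int main(void) {
    int data[8] = {1, 0, 0, 1, 1, 0, 1, 0};   /* the byte 10011010 */
    int code[13];

    encode(data, code);
    print_code(code);                         /* 011100101010 */

    code[10] ^= 1;                            /* invert bit 10 */
    int s = syndrome(code);
    printf("syndrome = %d\n", s);             /* 10, so bit 10 is wrong */
    if (s) code[s] ^= 1;                      /* correct the single-bit error */
    print_code(code);                         /* 011100101010 again */
    return 0;
}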

5.5 SEC/DED Code


Add an additional parity bit for the whole word (p_n)

This makes the Hamming distance 4.

Decoding: let H = the SEC parity check bits:

• H even, p_n even: no error

• H odd, p_n odd: correctable single-bit error
• H even, p_n odd: error in the p_n bit itself
• H odd, p_n even: double error occurred

Note: ECC DRAM uses SEC/DED with 8 bits protecting each 64 bits.

6 Virtual Machines
6.1 Virtual Machines (VMs)
• It allows one computer (host) to emulate another (guest).
• Multiple guest systems can run on a single host - safely and independently.
• Benefits:
– Isolation: One VM crashing doesn’t affect others.
– Security: Safer testing environment for new software or malware.
– Resource Sharing: One machine can be used more efficiently.

6.2 Virtual Machine Monitor (VMM)
• It manages virtual machines and maps virtual resources to the real physical hardware.
• Guest OSes run in user mode, and switch to the VMM on privileged actions (like I/O or interrupt
control).
• The VMM handles:

– Real I/O devices (keyboard, screen, disk)


– Virtual I/O devices (shown to guest OS)
• Example - Timer Virtualization:
– In normal OS: the timer interrupt switches between processes.
– In VM: the VMM switches between VMs and fakes a timer interrupt for the guest OS.

6.3 Instruction Set Support


• CPUs have two modes:
– User Mode: Used by apps and guest OS.
– System Mode: Full access for the real OS or VMM.
• Privileged instructions (like accessing memory or I/O) can only run in system mode.

• If guest OS tries to run privileged code in user mode:


– The CPU traps to the VMM.
– VMM then handles the request safely.
• Modern CPUs (like x86) now include hardware support to make virtualization faster and easier.

7 Virtual Memory
7.1 Virtual Memory
• Virtual memory allows programs to use more memory than what’s physically available (RAM).
• It gives each program its own private memory space.
• The OS and CPU work together to move data between fast RAM and slow disk storage.

7.2 Address Translation


• Programs use virtual addresses, but real data lives at physical addresses.

• Memory is divided into pages (e.g., 4 KB).


• A page table maps virtual pages to physical locations.
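
A minimal sketch of the page-table lookup described above, in C, assuming 4 KB pages and a single-level table (real page tables are multi-level and maintained by the OS; the structure and names below are ours):

#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS 12                     /* 4 KB pages */
#define NUM_PAGES 1024                   /* toy single-level table */

typedef struct {
    bool     valid;                      /* page present in physical memory? */
    uint32_t ppn;                        /* physical page number */
} pte_t;

pte_t page_table[NUM_PAGES];

/* Translate a virtual address to a physical address.
   Returns false on a page fault (the OS would then load the page from disk). */
bool translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;            /* virtual page number */
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return false;                                /* page fault */
    *paddr = (page_table[vpn].ppn << PAGE_BITS) | offset;
    return true;
}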

7.3 Speeding Up Translation: TLB
• TLB (Translation Lookaside Buffer) is a small, fast memory inside the CPU that stores recent
page table lookups.
• A TLB hit means quick access; a miss means the CPU checks the page table (slower).

7.4 Page Faults and Their Cost


• If a page isn’t in RAM, a page fault occurs.

• The OS loads the page from disk into RAM - this takes millions of cycles.
• Page faults should be rare for good performance.

7.5 Handling TLB Misses and Page Faults


• TLB Miss: If the page is in RAM, copy its info into the TLB.
• Page Fault: If not in RAM, the OS loads it from disk, updates the page table, and restarts the
instruction.

7.6 Page Replacement and Writes


• When RAM is full, the OS replaces a page - usually one not used recently (LRU).

• Pages changed in RAM are marked dirty and must be saved to disk.
• To save time, only dirty pages are written back.

7.7 Memory Protection


• Each program is kept separate to prevent crashes and data leaks.
• The OS uses hardware features:
– User mode: Limited access.
– Supervisor mode: Full access for OS.
– ecall is used by programs to request help from the OS.

8 A common framework for Memory Hierarchies
8.1 Memory Hierarchy
• The computer memory is organized in levels: registers, cache, main memory (RAM), and disk.
• Lower levels (e.g., disk) are slower but larger; upper levels (e.g., registers) are faster but smaller.
• Each level stores data from the level below to make access faster - this is called caching.

8.2 Block Placement & Finding a Block


• A block is a small chunk of memory stored in cache.

• Where can a block be placed? Depends on associativity:


– Direct-mapped: Block can go in only one place.
– Set-associative: Block can go in one of a few (e.g., 2 or 4) spots in a set.
– Fully-associative: Block can go anywhere in cache.

• More associativity = fewer misses, but more hardware complexity and time.
• Cache uses tags to search for blocks quickly.

8.3 Replacement
• When the cache is full, it must decide which block to replace.
• Common strategies:
– Least Recently Used (LRU): Replace the block that hasn’t been used for the longest time.
– Random: Pick a block randomly (simpler, almost as good as LRU).
• Virtual memory systems also use similar ideas to choose which pages to remove from RAM.

8.4 Write Policy


• Write-through: write to both cache and memory. Easy, but slower, so it needs a write buffer.
• Write-back: Write only to cache at first. Write to memory later when the block is replaced. More
efficient but needs extra state (e.g., dirty bit).

• In virtual memory, only write-back is practical because writing to disk is very slow.

8.5 Sources of Misses


• Cache miss: When the needed data is not found in the cache.
• Types of misses:
– Compulsory miss (cold start): First time data is accessed.
– Capacity miss: Cache is too small to hold all needed data.
– Conflict miss: Two or more blocks compete for the same spot in cache (common in low-
associativity caches).

9 Simple Cache control
9.1 Cache Setup
• A cache is a small, fast memory that stores recently accessed data from the larger main memory, to
speed up future access.
• Let’s look at a common cache configuration:
– Direct-mapped: Each block of memory can go in only one place in the cache.
– Write-back: When the CPU changes data, it is only written to the cache. It will update the
main memory later when the cache block is replaced.
– Write-allocate: On a write miss, the cache loads the block from memory first, then writes to it.
• Example specs:
– Block size: 4 words = 16 bytes per block.
– Cache size: 16 KB = 1024 blocks.
– Addresses: 32-bit addresses are used to access memory.
• Each cache block has:
– A valid bit - tells if the block contains real data.
– A dirty bit - tells if the block was modified (used for write-back).
• This is a blocking cache, meaning the CPU waits (stalls) until the cache access is done.

9.2 Interface Signals


• Communication between CPU, cache, and memory uses specific signals:
– Address (32-bit): the memory location the CPU wants.
– Read/Write: tells if the CPU is reading or writing data.
– Write Data: the data sent from the CPU to the cache on a write.
– Read Data: the data returned from the cache to the CPU on a read.
– Valid: indicates a valid memory operation is happening.
– Ready: signals when the operation is done and data is available.
• These signals help coordinate correct data flow between the CPU, the cache, and main memory.

9.3 Controlling the Cache with Finite State Machines (FSM)


• A Finite State Machine (FSM) is used to control the steps needed for the cache to work.
• The FSM moves through different states, one per clock cycle:
– For example: Idle ⇒ Check Hit ⇒ Access Memory on Miss ⇒ Return Data
• The FSM has:
– A state register - stores the current step (state).
– A next state function - decides what the next step is based on inputs (e.g., hit/miss).
– Control outputs - send signals to read/write memory, update valid/dirty bits, etc.
• FSM keeps everything organized and ensures that:
– The CPU gets the correct data.
– Cache and memory are updated correctly.
– Any cache misses are handled properly.
• In more advanced designs, the FSM can be split into smaller pieces to make cache control faster.
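
As an illustration only, the skeleton below sketches such a controller in C, using four states that fit the write-back, write-allocate cache described in Section 9.1 (Idle, Compare Tag, Write-Back, Allocate); the signal names are ours:

/* Skeleton of a blocking, write-back, write-allocate cache controller FSM.
   Only the state transitions are shown; tag checks and memory accesses are inputs. */
typedef enum { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE } state_t;

typedef struct {
    int valid_request;   /* CPU asserts a valid read or write       */
    int hit;             /* tag match and valid bit set             */
    int dirty;           /* selected block was modified             */
    int mem_ready;       /* main memory finished the current access */
} signals_t;

state_t next_state(state_t s, signals_t in) {
    switch (s) {
    case IDLE:
        return in.valid_request ? COMPARE_TAG : IDLE;
    case COMPARE_TAG:
        if (in.hit)   return IDLE;                    /* hit: data returned, done     */
        return in.dirty ? WRITE_BACK                  /* miss on a dirty block        */
                        : ALLOCATE;                   /* miss on a clean block        */
    case WRITE_BACK:
        return in.mem_ready ? ALLOCATE : WRITE_BACK;  /* old block written to memory  */
    case ALLOCATE:
        return in.mem_ready ? COMPARE_TAG : ALLOCATE; /* new block fetched, recheck   */
    }
    return IDLE;
}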

10 Cache Coherence
10.1 Cache Coherence
• In multi-core CPUs, each core has its own cache.
• All cores share the same main memory and might access the same data.
• If one core updates a value, but others still use old cached values, this causes inconsistency.

• Cache coherence ensures all cores see the most recent value of shared data.

10.2 Coherence Rules


• Self-read rule: A CPU must see its own last write.
• Shared-read rule: Other CPUs must see the updated value, if they read later.
• Write order rule: If two CPUs write to the same location, all CPUs must agree on the order.

10.3 Cache Coherence Protocols


• Protocols manage how caches keep shared data up to date.
• Two main types:

– Snooping: Caches "listen" to a shared bus and watch for other reads/writes.
– Directory-based: A central directory tracks which caches have copies of each data block.
• Goals:

– Replication: Allow multiple caches to hold read-only copies.


– Migration: Move updated data closer to the CPU using it.

10.4 Invalidating Snooping Protocols


• If a CPU wants to write, it must invalidate other copies first.
• It sends an invalidate message, and other caches mark their copy as invalid.
• This ensures only one valid version exists.

• If another CPU wants the value later, it fetches the new one (causing a cache miss).

10.5 Memory Consistency
• While coherence keeps values up-to-date, consistency controls when those updates are visible to other
CPUs.
• Example rule:
– If a CPU writes to X, then Y, other CPUs must see the new value of X before they see the new Y.
• CPUs may reorder reads for performance, but must not reorder writes.
• Memory consistency is especially important for programs where the order of operations matters.

11 ARM Cortex A53 and Intel Core i7 Memory hierarchies


11.1 Multilevel On-Chip Caches
• Both processors use multiple cache levels (L1, L2, sometimes L3).
• Caches store recently used data closer to the CPU, so it can be accessed faster.
• These caches are often multi-banked:
– A cache is split into banks that can be accessed independently.
– This allows multiple data accesses in the same cycle - unless two accesses try to use the same bank
(called a bank conflict).

11.2 2-Level TLB Organization


• TLBs help the CPU quickly translate virtual addresses into physical addresses.
• Both CPUs use a two-level TLB system:
– L1 TLB: Very fast but small (used first).
– L2 TLB: Larger and slower (used if L1 misses).
• This two-level design improves efficiency in memory access for programs.

11.3 Supporting Multiple Issue


• These CPUs support multiple instruction issues per cycle - meaning they do more than one thing at a
time. To keep up, memory access must also be fast and smart. Here’s how they manage that:
• Return Requested Word First
– When the CPU asks for data in a block, it receives the needed word first - before the rest.
– This reduces waiting time.
• Non-Blocking Caches
– Normal (blocking) caches pause the CPU during a cache miss.
– Non-blocking caches allow the CPU to keep going:
∗ Hit-under-miss: Cache can serve a hit even if a miss is still being handled.
∗ Miss-under-miss: Cache can start a new miss even if others are still pending.
• Data Prefetching
– The CPU tries to guess what data it will need soon - and loads it in advance.
– This reduces delays when the data is actually needed.

12 RISC-V and special instructions
• RISC-V provides special instructions for managing system-level tasks, memory ordering, and commu-
nication between user programs and the operating system.
• These are essential for OS design, system-level programming, and efficient processor control.

12.1 Memory and Instruction Fences


• Fence instructions control the ordering of memory and instruction operations, which is important in
multi-core and out-of-order processors.

– fence - Prevents memory operations from being reordered.


– fence.i - Ensures the CPU sees any changes to instructions in memory.
– sfence.vm - Ensures that changes to address translation (e.g., page tables) take effect properly.

12.2 CSR (Control and Status Register) Instructions


• CSR instructions allow access to special registers that control system features like interrupts, mode
control, and counters.

– csrrw - Read a CSR and write a new value.


– csrrs - Read a CSR and set specific bits.
– csrrc - Read a CSR and clear specific bits.
– csrrwi - Write an immediate value to a CSR.
– csrrsi - Set CSR bits using an immediate value.
– csrrci - Clear CSR bits using an immediate value.

12.3 System Control Instructions


• These are used for OS-level communication, debugging, and handling interrupts or exceptions.

– ecall - Makes a system call (used to request services from the OS).
– ebreak - Triggers a breakpoint for debugging.
– sret - Returns from an exception in supervisor mode.
– wfi - Waits for an interrupt; puts CPU in low-power mode.

13 Cache blocking and matrix multiply


13.1 DGEMM
• DGEMM = Double-Precision General Matrix Multiply.
• It performs:
C =A×B+C

• A, B, and C are matrices with double-precision numbers.

13.2 Cache Blocking
• Matrix multiplication needs a lot of memory access.
• Cache is much faster than main memory.
• Cache blocking breaks large matrices into smaller blocks (tiles) that fit in cache.
• This reduces cache misses and improves speed.

13.3 Subword Parallelism


• Also called SIMD: Single Instruction, Multiple Data.

• One instruction processes multiple numbers at once (e.g., 2 or 4 doubles).


• Saves time by doing several operations in parallel.
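
For example, x86 AVX intrinsics operate on four doubles at a time. The sketch below is our own, loosely following the AVX treatment of DGEMM in the textbook; it assumes the i + j*n element layout used in the earlier DGEMM code and that n is a multiple of 4:

#include <immintrin.h>

/* Four elements of a column of C are computed per iteration.
   Compile with AVX enabled (e.g., -mavx). */
void dgemm_avx(int n, double *A, double *B, double *C)
{
    for (int i = 0; i < n; i += 4)
        for (int j = 0; j < n; ++j) {
            __m256d c0 = _mm256_loadu_pd(&C[i + j * n]);      /* c0 = C[i..i+3][j] */
            for (int k = 0; k < n; ++k)
                c0 = _mm256_add_pd(c0,                        /* c0 += A[i..i+3][k] * B[k][j] */
                        _mm256_mul_pd(_mm256_loadu_pd(&A[i + k * n]),
                                      _mm256_broadcast_sd(&B[k + j * n])));
            _mm256_storeu_pd(&C[i + j * n], c0);
        }
}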

13.4 Combining Both for Speed


• Cache blocking makes better use of fast memory.
• Subword parallelism makes better use of the processor.

• Together, they make matrix multiplication (like DGEMM) much faster.

14 Fallacies and Pitfalls


14.1 Fallacies
• "Memory behavior doesn’t matter when writing code."
⇒ Ignoring memory access patterns leads to slower programs. Writing cache-friendly code (e.g., blocking
in matrix multiply) can significantly boost performance.
• "Cache simulation is just using addresses."
⇒ Always consider byte addressing and block size. Mistakes here lead to incorrect cache mappings.
Double-check whether the address is in bytes, words, or blocks.
• "Disk failure rates match the specs."
⇒ Real-world disks often fail 2-5x more frequently than datasheet claims. Always account for this risk
in system design.

• "Operating systems are best at scheduling disk accesses."


⇒ The disk controller knows its internal layout better than the OS. It can reorder requests more
efficiently to minimize seek and rotation time.
• "Any CPU can run virtual machines reliably."
⇒ Some CPUs weren’t designed for virtualization. Certain instructions leak privileged info or silently
fail. Hardware support (e.g., Intel VT-x, AMD-V) or OS changes (paravirtualization) are required.

14.2 Pitfalls
• Ignoring memory hierarchy when coding or compiling
⇒ Memory-unaware programs may run much slower. Optimizing for cache (e.g., loop blocking) can
double performance.
• Forgetting byte addressing or block size in cache simulation
⇒ A common mistake in exercises. Always break down addresses into block index and offset correctly.

• Using too-low associativity in shared caches
⇒ With more threads than cache ways, accidental conflicts can occur. This causes performance drops
in multi-core systems.
• Relying on average memory access time for out-of-order processors
⇒ These CPUs run multiple instructions in parallel, so AMAT doesn’t fully capture actual performance.
Use simulation instead.
• Expanding memory using visible segments instead of flat space
⇒ Segment-based addressing complicates pointers and arrays. Flat memory is more scalable and easier
to program.

• Running VMs on non-virtualizable ISAs


⇒ Some architectures (like early x86) allow unsafe instructions in user mode. This breaks isolation.
Use CPUs with virtualization support or modify the OS.
