CSE313 – Computer Architecture
Large and Fast: Exploiting Memory Hierarchy
The memory dilemma
Ch 6 assumption: on-chip instruction and data
memories hold the entire program and its data
and can be accessed in one cycle
Reality check
In high-performance machines, programs may require hundreds of
megabytes or even gigabytes of memory to run
Embedded processors have smaller needs but there is also less
room for on-chip memory
Basic problem
We need much more memory than can fit on the
microprocessor chip
But we do not want to incur stall cycles every time the pipeline
accesses instructions or data
At the same time, we need the memory to be economical for
the machine to be competitive in the market
Solution: a hierarchy of memories
Typical characteristics of each level
First level (L1) is separate on-chip instruction
and data caches placed where our instruction
and data memories reside
16-64KB for each cache (desktop/server machine)
Fast, power-hungry, not-so-dense, static RAM (SRAM)
Second level (L2) consists of another larger
unified cache
Holds both instructions and data
256KB-4MB
On or off-chip
SRAM
Third level is main memory
64-512MB
Slower, lower-power, denser dynamic RAM (DRAM)
Final level is I/O (e.g., disk)
Caches and the pipeline
L1 instruction and data caches and L2 cache
[Figure: pipeline with separate L1 instruction and data caches backed by a unified L2 cache, which connects to main memory]
Memory hierarchy operation
(1) Search L1 for the instruction or data
    If found (cache hit), done
(2) Else (cache miss), search the L2 cache
    If found, place it in L1 and repeat (1)
(3) Else, search main memory
    If found, place it in L2 and repeat (2)
(4) Else, get it from I/O (Chapter 8)
Steps (1)-(3) are performed in hardware
1-3 cycles to get from L1 caches
5-20 cycles to get from L2 cache
50-200 cycles to get from main memory
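For a rough sense of the payoff (illustrative numbers, not from the slides): with a 2-cycle L1, a 10-cycle L2, a 100-cycle main memory, a 95% L1 hit rate, and a 90% L2 hit rate for the accesses that miss in L1, the average access takes about 2 + 0.05 × (10 + 0.10 × 100) = 3 cycles, far closer to L1 speed than to main memory speed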
Principle of locality
Programs access a small portion of
memory within a short time period
Temporal locality: recently accessed
memory locations will likely be accessed
soon
Spatial locality: memory locations near
recently accessed locations will likely be
accessed soon
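A small C example (not from the slides) showing both kinds of locality:

#include <stdio.h>

int main(void) {
    int a[1024];
    int sum = 0;
    for (int i = 0; i < 1024; i++)
        a[i] = i;
    /* sum and i are reused on every iteration   -> temporal locality */
    /* a[0], a[1], ... are adjacent in memory    -> spatial locality  */
    for (int i = 0; i < 1024; i++)
        sum += a[i];
    printf("%d\n", sum);
    return 0;
}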
POL makes memory hierarchies work
A large percentage of the time (typically
>90%) the instruction or data is found in
L1, the fastest memory
Cheap, abundant main memory is accessed
more rarely
Memory hierarchy operates at nearly the
speed of expensive on-chip SRAM with
about the cost of main memory (DRAMs)
How caches exploit the POL
On a cache miss, a block of several instructions or data items, including the requested one, is returned
[Figure: a four-word block containing the requested instruction_i together with instruction_i+1, instruction_i+2, and instruction_i+3]
The entire block is placed into the cache so
that future searches for items in the block
will be successful
How caches exploit the POL
Consider the sequence of instruction and data accesses in this loop, with a block size of 4 words

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop

After the first iteration's misses bring the blocks in, the instruction fetches hit on every later iteration (temporal and spatial locality), the sw hits on the word just loaded by the lw (temporal locality), and successive iterations access adjacent words of the array within the same block (spatial locality)
Searching the cache
The cache is much smaller than main
memory
Multiple memory blocks must share the
same cache location
block
Searching the cache
Need a way to determine whether the
desired instruction or data is held in the
cache
Need a scheme for replacing blocks when a
new block needs to be brought in on a miss
block
Cache organization alternatives
Direct mapped: each block can be placed in
only one cache location
Set associative: each block can be placed
in any of n cache locations
Fully associative: each block can be placed
in any cache location
Cache organization alternatives
Searching for block 12 in caches of size 8 blocks
[Figure: placement of block 12 in direct mapped, 2-way set associative, and fully associative caches of 8 blocks]
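For example (using the usual placement rule, set = block number mod number of sets): in the direct mapped cache, block 12 can go only in set 12 mod 8 = 4; in a 2-way set associative cache (4 sets), it can go in either way of set 12 mod 4 = 0; in the fully associative cache, it can go in any of the 8 locations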
Searching a direct mapped cache
Need log2(number of sets) address bits (the index) to select the block location
The block offset bits are used to select the desired byte, half-word, or word within the block
The remaining bits (the tag) are used to determine if this is the desired block or another that shares the same cache location
Example: a data cache with 16 byte blocks and 8 sets has 4 block offset bits, 3 index bits, and 25 tag bits
[Figure: 32-bit memory address divided into tag, index, and block offset fields]
Searching a direct mapped cache
A block is placed in the set given by its index
Number of sets = cache size / block size
Example: a data cache with 16 byte blocks and 8 sets has 4 block offset bits, 3 index bits, and 25 tag bits
[Figure: 32-bit memory address divided into tag, index, and block offset fields]
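A minimal C sketch of this address breakdown for the example cache above (16 byte blocks, 8 sets, 32-bit addresses; the address value is made up):

#include <stdint.h>
#include <stdio.h>

#define BLOCK_OFFSET_BITS 4   /* 16-byte blocks */
#define INDEX_BITS        3   /* 8 sets         */

int main(void) {
    uint32_t addr   = 0x12345678;                               /* example address  */
    uint32_t offset = addr & ((1u << BLOCK_OFFSET_BITS) - 1);   /* low 4 bits       */
    uint32_t index  = (addr >> BLOCK_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (BLOCK_OFFSET_BITS + INDEX_BITS); /* remaining 25 bits */
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}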
Direct mapped cache organization
64KB instruction cache with 16 byte (4 word)
blocks
4K sets (64KB/16B); need 12 address bits (the index) to pick the set
Direct mapped cache organization
The data section of the cache holds the
instructions
Direct mapped cache organization
The tag section holds the part of the
memory address used to distinguish
different blocks
Direct mapped cache organization
A valid bit associated with each set
indicates if the instructions are valid or not
Direct mapped cache access
The index bits are used to select one of the
sets
Direct mapped cache access
The data, tag, and Valid bit from the
selected set are simultaneously accessed
Direct mapped cache access
The tag from the selected entry is
compared with the tag field of the address
Direct mapped cache access
A match between the tags and a Valid bit
that is set indicates a cache hit
Direct mapped cache access
The block offset selects the desired
instruction
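A simplified C sketch of this lookup sequence for the 64KB direct mapped instruction cache above (4K sets of 16-byte blocks, 32-bit addresses; the structure and names are illustrative, not the hardware itself):

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS    4096      /* 64KB / 16B blocks */
#define OFFSET_BITS 4
#define INDEX_BITS  12

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t data[4];         /* one 16-byte (4-word) block */
} cache_line_t;

static cache_line_t cache[NUM_SETS];

/* Returns true on a hit and places the requested word in *word. */
bool cache_lookup(uint32_t addr, uint32_t *word) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);   /* index selects the set */
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    if (cache[index].valid && cache[index].tag == tag) {       /* tag match + valid set */
        uint32_t word_in_block = (addr >> 2) & 0x3;            /* block offset picks word */
        *word = cache[index].data[word_in_block];
        return true;                                           /* cache hit  */
    }
    return false;                                              /* cache miss */
}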
Set associative cache
A block is placed in one of the ways of set index
Number of sets = cache size / (block size × ways)
[Figure: 4-way set associative cache with ways 0-3]
Set associative cache operation
The index bits are used to select one of the
sets
Set associative cache operation
The data, tag, and Valid bit from all ways of
the selected entry are simultaneously
accessed
Set associative cache operation
The tags from all ways of the selected entry
are compared with the tag field of the
address
Set associative cache operation
A match between the tags and a Valid bit
that is set indicates a cache hit (hit in way1
shown)
Set associative cache operation
The data from the way that had a hit is
returned through the MUX
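A minimal C sketch of the same lookup for a 4-way set associative cache (assuming a 64KB cache with 16-byte blocks, so 1024 sets; sizes and names are illustrative). The hardware checks all ways in parallel and a MUX returns the data from the way that hit; the loop below stands in for that parallel check:

#include <stdbool.h>
#include <stdint.h>

#define NUM_WAYS    4
#define NUM_SETS    1024      /* cache size / (block size * ways) */
#define OFFSET_BITS 4
#define INDEX_BITS  10

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t data[4];
} cache_line_t;

static cache_line_t cache[NUM_SETS][NUM_WAYS];

bool cache_lookup(uint32_t addr, uint32_t *word) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    for (int way = 0; way < NUM_WAYS; way++) {
        if (cache[index][way].valid && cache[index][way].tag == tag) {
            *word = cache[index][way].data[(addr >> 2) & 0x3];
            return true;      /* hit in this way; the MUX selects its data */
        }
    }
    return false;             /* miss in every way */
}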
Cache misses
A cache miss occurs when the block is not found in the cache
The block is requested from the next level of the hierarchy
When the block returns, it is loaded into the cache and provided to the requester
A copy of the block remains in the lower levels of the hierarchy
The cache miss rate is found by dividing the total number of misses by the total number of accesses (misses/accesses)
The hit rate is 1 - miss rate
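For example (illustrative numbers): 50 misses out of 1000 accesses gives a miss rate of 50/1000 = 5% and a hit rate of 95%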
Classifying cache misses
Compulsory misses
Caused by the first access to a block that has never
been in the cache
Capacity misses
Due to the cache not being big enough to hold all the
blocks that are needed
Conflict misses
Due to multiple blocks competing for the same set
A fully associative cache with a “perfect”
replacement policy has no conflict misses
Cache miss classification
Direct mapped examples: cache of size two blocks
Blocks A and B map to set 0, blocks C and D map to set 1
Access pattern 1: A, B, C, D, A, B, C, D
Access pattern 2: A, A, B, A
[Figure: contents of sets 0 and 1 after each access; classify each miss]
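One way to classify these misses (assuming the cache starts empty and, for the fully associative comparison, LRU replacement): in pattern 1, the first accesses to A, B, C, and D are compulsory misses; the second round of A, B, C, D also misses, but these would miss even in a fully associative cache of two blocks, so they are capacity misses. In pattern 2, the first A and the B are compulsory misses, the second A is a hit, and the final A is a conflict miss, because B evicted A from set 0 even though a fully associative two-block cache would still hold both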
Reducing capacity misses
Increase the cache size
More cache blocks can be simultaneously held in the
cache
Drawback: increased access time
Block replacement policy
Determines what block to replace on a cache
miss to make room for the new block
Least recently used (LRU)
Pick the one that has been unused for the longest time
Based on temporal locality
Requires ordering bits to be kept with each set
Too expensive beyond 4-way
Random
Pseudo-randomly pick a block
Generally not as effective as LRU (higher miss rates)
Simple even for highly associative organizations
Most recently used (MRU)
Keep track of which block was accessed last
Randomly pick a block from other than that one
Compromise between LRU and random
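As a minimal illustration of the ordering state LRU needs (assuming a 2-way set associative cache, where one bit per set suffices; names are illustrative):

#include <stdint.h>

#define NUM_SETS 1024

/* For 2 ways, one bit per set records which way was used
   least recently and is therefore the victim on a miss.   */
static uint8_t lru_way[NUM_SETS];

/* Call on every access that hits (or fills) way 'way' of set 'set'. */
void lru_update(uint32_t set, int way) {
    lru_way[set] = (uint8_t)(1 - way);   /* the other way is now LRU */
}

/* Returns the way to replace on a miss in 'set'. */
int lru_victim(uint32_t set) {
    return lru_way[set];
}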
Cache writes and block
replacement
With a write back cache, when a block is
written to, copies of the block in the lower
levels are not updated
If this block is chosen for replacement on a miss, we need to save it to the next level
Solution:
A dirty bit is associated with each cache block
The dirty bit is set if the block is written to
A block with a set dirty bit that is chosen for
replacement is written to the next level before being
overwritten with the new block
Virtual memory
A single program may require more main
memory than is present in the machine
Multiple programs must share main memory
without interfering with each other
With virtual memory, portions of different
programs are loaded from I/O to memory on
demand
When memory gets full, portions of
programs are swapped out to I/O
Implementing virtual memory
Separate different programs in memory by
assigning different memory addresses to
each
Identify when the desired instructions or
data are in memory or not
Generate an interrupt when they are not in
memory and must be retrieved from I/O
Provide support in the OS to retrieve the
desired instructions or data, replacing
others if necessary
Prevent users from accessing instructions
or data they do not own
Memory pages
Transfers between disk and memory occur in pages, whose size is defined in the ISA
Page size (4-64KB) is large compared to cache block size (16-128B) in order to amortize the high cost of a disk access (~5-10ms)
Tradeoffs in increasing page size are similar to those for cache block size
[Figure: memory hierarchy with cache blocks moving between the L2 cache and main memory, and pages moving between main memory and disk]
Virtual and physical addresses
The virtual addresses in your program are
translated during program execution into
the physical addresses used to address
memory
Some virtual addresses may refer to data
that is not in memory (not memory
resident)
Address translation
The virtual page number (vpn) part of the
virtual address is translated into a physical
page number (ppn) that points to the
desired page
The low-order page offset bits point to
which byte is being accessed within a page
The ppn + page offset form the physical
address
Example: 4GB of virtual address space, 2^12 = 4KB pages, 1GB of main memory in the machine
[Figure: translation of a virtual address (vpn + page offset) into a physical address (ppn + page offset)]
Address translation
Another view of the physical address
The ppn gives the location of the first byte of the page in memory; the page offset gives the location of the addressed byte within the 2^(page offset bits)-byte page
[Figure: physical address split into ppn and page offset fields, pointing into a page]
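A minimal C sketch of forming the physical address, assuming the 4KB pages (12 offset bits) of the example above and a small flat page table standing in for the real translation structures described next (sizes and values are made up):

#include <stdint.h>
#include <stdio.h>

#define PAGE_OFFSET_BITS 12            /* 2^12 = 4KB pages          */
#define NUM_VPAGES       16            /* tiny illustrative program */

/* page_table[vpn] holds the ppn for each memory resident page */
static uint32_t page_table[NUM_VPAGES] = { [0] = 7, [1] = 3, [2] = 12 };

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;              /* virtual page number  */
    uint32_t offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1); /* byte within the page */
    uint32_t ppn    = page_table[vpn];                        /* look up the ppn      */
    return (ppn << PAGE_OFFSET_BITS) | offset;                /* ppn + page offset    */
}

int main(void) {
    printf("0x%x\n", translate(0x1ABC));   /* vpn 1 -> ppn 3, offset 0xABC -> 0x3ABC */
    return 0;
}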
Address translation
Address translation is performed by the
hardware and the operating system (OS)
The OS maintains for each program
What pages are associated with it
Where on disk each page resides
What pages are memory resident
The ppn associated with the vpn of each memory
resident page
Address translation
For each program, the OS sets up a page
table in memory that holds the ppn
corresponding to the vpn of each memory
resident page
The page table register in hardware
provides the base address of the page
table for the currently running process
Each program has a unique page table and
page table register value that is loaded by
the OS
Address translation
Page table access
The address of the page table entry is the page table register value + vpn (the vpn is the offset into the page table)
The page table is located in memory, so it requires loads/stores to access!
The TLB: faster address translation
Major problem: for each instruction or data
access we have to first access the page
table in memory to get the physical address
Solution: cache the address translations in a
Translation Lookaside Buffer (TLB) in
hardware
The TLB holds the ppns of the most recently
accessed pages
Hardware first checks the TLB for the
vpn/ppn pair; if not found (TLB miss), then
the page table is accessed to get the ppn
The ppn is loaded into the TLB for later
access
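A minimal C sketch of this check-TLB-first flow (a small fully associative TLB with round-robin fill and a flat page table are assumed purely for illustration; sizes and names are not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 12
#define TLB_ENTRIES      8
#define NUM_VPAGES       1024

typedef struct {
    bool     valid;
    uint32_t vpn;
    uint32_t ppn;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];          /* most recently used translations */
static uint32_t    page_table[NUM_VPAGES];    /* in memory: ppn for each vpn     */
static unsigned    next_fill;                 /* simple round-robin TLB fill     */

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1);

    /* (1) Check the TLB for the vpn/ppn pair. */
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << PAGE_OFFSET_BITS) | offset;   /* TLB hit */

    /* (2) TLB miss: access the page table in memory to get the ppn. */
    uint32_t ppn = page_table[vpn];

    /* (3) Load the translation into the TLB for later accesses. */
    tlb[next_fill] = (tlb_entry_t){ .valid = true, .vpn = vpn, .ppn = ppn };
    next_fill = (next_fill + 1) % TLB_ENTRIES;

    return (ppn << PAGE_OFFSET_BITS) | offset;
}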