CSE313 – Computer Architecture
Large and Fast: Exploiting Memory Hierarchy
The memory dilemma
Ch 6 assumption: on-chip instruction and data
memories hold the entire program and its data
and can be accessed in one cycle
Reality check
In high-performance machines, programs may require hundreds of
megabytes or even gigabytes of memory to run
Embedded processors have smaller needs but there is also less
room for on-chip memory
Basic problem
We need much more memory than can fit on the
microprocessor chip
But we do not want to incur stall cycles every time the pipeline
accesses instructions or data
At the same time, we need the memory to be economical for
the machine to be competitive in the market
Solution: a hierarchy of memories
Typical characteristics of each level
First level (L1) is separate on-chip instruction
and data caches placed where our instruction
and data memories reside
16-64KB for each cache (desktop/server machine)
Fast, power-hungry, not-so-dense, static RAM (SRAM)
Second level (L2) consists of another larger
unified cache
Holds both instructions and data
256KB-4MB
On or off-chip
SRAM
Third level is main memory
64-512MB
Slower, lower-power, denser dynamic RAM (DRAM)
Final level is I/O (e.g., disk)
Caches and the pipeline
L1 instruction and data caches and L2 cache
[Figure: pipeline with separate L1 instruction and data caches backed by a unified L2 cache, which connects to main memory]
Memory hierarchy operation
(1) Search L1 for the instruction or data
    If found (cache hit), done
(2) Else (cache miss), search the L2 cache
    If found, place it in L1 and repeat (1)
(3) Else, search main memory
    If found, place it in L2 and repeat (2)
(4) Else, get it from I/O (Chapter 8)
Steps (1)-(3) are performed in hardware
1-3 cycles to get from L1 caches
5-20 cycles to get from L2 cache
50-200 cycles to get from main memory
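For a rough sense of the payoff (illustrative numbers, not from the slides): with a 2-cycle L1, a 10-cycle L2, a 100-cycle main memory, a 95% L1 hit rate, and a 90% L2 hit rate for the accesses that miss in L1, the average access takes about 2 + 0.05 × (10 + 0.10 × 100) = 3 cycles, far closer to L1 speed than to main memory speed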
Principle of locality
Programs access a small portion of
memory within a short time period
Temporal locality: recently accessed
memory locations will likely be accessed
soon
Spatial locality: memory locations near
recently accessed locations will likely be
accessed soon
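A small C example (not from the slides) showing both kinds of locality:

#include <stdio.h>

int main(void) {
    int a[1024];
    int sum = 0;
    for (int i = 0; i < 1024; i++)
        a[i] = i;
    /* sum and i are reused on every iteration   -> temporal locality */
    /* a[0], a[1], ... are adjacent in memory    -> spatial locality  */
    for (int i = 0; i < 1024; i++)
        sum += a[i];
    printf("%d\n", sum);
    return 0;
}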
POL makes memory hierarchies work
A large percentage of the time (typically
>90%) the instruction or data is found in
L1, the fastest memory
Cheap, abundant main memory is accessed
more rarely
Memory hierarchy operates at nearly the
speed of expensive on-chip SRAM with
about the cost of main memory (DRAMs)
How caches exploit the POL
On a cache miss, a block of several instructions or data items, including the requested one, is returned
[Figure: a four-word block containing the requested instruction_i together with instruction_i+1, instruction_i+2, and instruction_i+3]
The entire block is placed into the cache so
that future searches for items in the block
will be successful
How caches exploit the POL
Consider the sequence of instruction and data accesses in this loop, with a block size of 4 words

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop

After the first iteration's misses bring the blocks in, the instruction fetches hit on every later iteration (temporal and spatial locality), the sw hits on the word just loaded by the lw (temporal locality), and successive iterations access adjacent words of the array within the same block (spatial locality)
Searching the cache
The cache is much smaller than main
memory
Multiple memory blocks must share the
same cache location
block
Searching the cache
Need a way to determine whether the
desired instruction or data is held in the
cache
Need a scheme for replacing blocks when a
new block needs to be brought in on a miss
block
Cache organization alternatives
Direct mapped: each block can be placed in
only one cache location
Set associative: each block can be placed
in any of n cache locations
Fully associative: each block can be placed
in any cache location
Cache organization alternatives
Searching for block 12 in caches of size 8 blocks
[Figure: placement of block 12 in direct mapped, 2-way set associative, and fully associative caches of 8 blocks]
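For example (using the usual placement rule, set = block number mod number of sets): in the direct mapped cache, block 12 can go only in set 12 mod 8 = 4; in a 2-way set associative cache (4 sets), it can go in either way of set 12 mod 4 = 0; in the fully associative cache, it can go in any of the 8 locations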
Searching a direct mapped cache
Need log2(number of sets) address bits (the index) to select the block location
The block offset bits are used to select the desired byte, half-word, or word within the block
The remaining bits (the tag) are used to determine if this is the desired block or another that shares the same cache location
Example: a data cache with 16 byte blocks and 8 sets has 4 block offset bits, 3 index bits, and 25 tag bits
[Figure: 32-bit memory address divided into tag, index, and block offset fields]
Searching a direct mapped cache
A block is placed in the set given by its index
Number of sets = cache size / block size
Example: a data cache with 16 byte blocks and 8 sets has 4 block offset bits, 3 index bits, and 25 tag bits
[Figure: 32-bit memory address divided into tag, index, and block offset fields]
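A minimal C sketch of this address breakdown for the example cache above (16 byte blocks, 8 sets, 32-bit addresses; the address value is made up):

#include <stdint.h>
#include <stdio.h>

#define BLOCK_OFFSET_BITS 4   /* 16-byte blocks */
#define INDEX_BITS        3   /* 8 sets         */

int main(void) {
    uint32_t addr   = 0x12345678;                               /* example address  */
    uint32_t offset = addr & ((1u << BLOCK_OFFSET_BITS) - 1);   /* low 4 bits       */
    uint32_t index  = (addr >> BLOCK_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (BLOCK_OFFSET_BITS + INDEX_BITS); /* remaining 25 bits */
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}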
Direct mapped cache organization
64KB instruction cache with 16 byte (4 word)
blocks
4K sets (64KB/16B); need 12 address bits (the index) to pick the set
Direct mapped cache organization
The data section of the cache holds the
instructions
Direct mapped cache organization
The tag section holds the part of the
memory address used to distinguish
different blocks
Direct mapped cache organization
A valid bit associated with each set
indicates if the instructions are valid or not
Direct mapped cache access
The index bits are used to select one of the
sets
Direct mapped cache access
The data, tag, and Valid bit from the
selected set are simultaneously accessed
Direct mapped cache access
The tag from the selected entry is
compared with the tag field of the address
Direct mapped cache access
A match between the tags and a Valid bit
that is set indicates a cache hit
Direct mapped cache access
The block offset selects the desired
instruction
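A simplified C sketch of this lookup sequence for the 64KB direct mapped instruction cache above (4K sets of 16-byte blocks, 32-bit addresses; the structure and names are illustrative, not the hardware itself):

#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS    4096      /* 64KB / 16B blocks */
#define OFFSET_BITS 4
#define INDEX_BITS  12

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t data[4];         /* one 16-byte (4-word) block */
} cache_line_t;

static cache_line_t cache[NUM_SETS];

/* Returns true on a hit and places the requested word in *word. */
bool cache_lookup(uint32_t addr, uint32_t *word) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);   /* index selects the set */
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    if (cache[index].valid && cache[index].tag == tag) {       /* tag match + valid set */
        uint32_t word_in_block = (addr >> 2) & 0x3;            /* block offset picks word */
        *word = cache[index].data[word_in_block];
        return true;                                           /* cache hit  */
    }
    return false;                                              /* cache miss */
}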
Set associative cache
A block is placed in one of the ways of set index
Number of sets = cache size / (block size × ways)
[Figure: 4-way set associative cache with ways 0-3]
Set associative cache operation
The index bits are used to select one of the
sets
Set associative cache operation
The data, tag, and Valid bit from all ways of
the selected entry are simultaneously
accessed
Set associative cache operation
The tags from all ways of the selected entry
are compared with the tag field of the
address
Set associative cache operation
A match between the tags and a Valid bit
that is set indicates a cache hit (hit in way1
shown)
Set associative cache operation
The data from the way that had a hit is
returned through the MUX
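A minimal C sketch of the same lookup for a 4-way set associative cache (assuming a 64KB cache with 16-byte blocks, so 1024 sets; sizes and names are illustrative). The hardware checks all ways in parallel and a MUX returns the data from the way that hit; the loop below stands in for that parallel check:

#include <stdbool.h>
#include <stdint.h>

#define NUM_WAYS    4
#define NUM_SETS    1024      /* cache size / (block size * ways) */
#define OFFSET_BITS 4
#define INDEX_BITS  10

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t data[4];
} cache_line_t;

static cache_line_t cache[NUM_SETS][NUM_WAYS];

bool cache_lookup(uint32_t addr, uint32_t *word) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    for (int way = 0; way < NUM_WAYS; way++) {
        if (cache[index][way].valid && cache[index][way].tag == tag) {
            *word = cache[index][way].data[(addr >> 2) & 0x3];
            return true;      /* hit in this way; the MUX selects its data */
        }
    }
    return false;             /* miss in every way */
}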
Cache misses
A cache miss occurs when the block is not found in the cache
The block is requested from the next level of the hierarchy
When the block returns, it is loaded into the cache and provided to the requester
A copy of the block remains in the lower levels of the hierarchy
The cache miss rate is found by dividing the total number of misses by the total number of accesses (misses/accesses)
The hit rate is 1 - miss rate
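For example (illustrative numbers): 50 misses out of 1000 accesses gives a miss rate of 50/1000 = 5% and a hit rate of 95%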
Classifying cache misses
Compulsory misses
Caused by the first access to a block that has never
been in the cache
Capacity misses
Due to the cache not being big enough to hold all the
blocks that are needed
Conflict misses
Due to multiple blocks competing for the same set
A fully associative cache with a “perfect”
replacement policy has no conflict misses
Cache miss classification
Direct mapped examples: cache of size two blocks
Blocks A and B map to set 0, blocks C and D map to set 1
Access pattern 1: A, B, C, D, A, B, C, D
Access pattern 2: A, A, B, A
[Figure: contents of sets 0 and 1 after each access; classify each miss]
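One way to classify these misses (assuming the cache starts empty and, for the fully associative comparison, LRU replacement): in pattern 1, the first accesses to A, B, C, and D are compulsory misses; the second round of A, B, C, D also misses, but these would miss even in a fully associative cache of two blocks, so they are capacity misses. In pattern 2, the first A and the B are compulsory misses, the second A is a hit, and the final A is a conflict miss, because B evicted A from set 0 even though a fully associative two-block cache would still hold both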
Reducing capacity misses
Increase the cache size
More cache blocks can be simultaneously held in the
cache
Drawback: increased access time
Block replacement policy
Determines what block to replace on a cache
miss to make room for the new block
Least recently used (LRU)
Pick the one that has been unused for the longest time
Based on temporal locality
Requires ordering bits to be kept with each set
Too expensive beyond 4-way
Random
Pseudo-randomly pick a block
Generally not as effective as LRU (higher miss rates)
Simple even for highly associative organizations
Most recently used (MRU)
Keep track of which block was accessed last
Randomly pick a block from other than that one
Compromise between LRU and random
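As a minimal illustration of the ordering state LRU needs (assuming a 2-way set associative cache, where one bit per set suffices; names are illustrative):

#include <stdint.h>

#define NUM_SETS 1024

/* For 2 ways, one bit per set records which way was used
   least recently and is therefore the victim on a miss.   */
static uint8_t lru_way[NUM_SETS];

/* Call on every access that hits (or fills) way 'way' of set 'set'. */
void lru_update(uint32_t set, int way) {
    lru_way[set] = (uint8_t)(1 - way);   /* the other way is now LRU */
}

/* Returns the way to replace on a miss in 'set'. */
int lru_victim(uint32_t set) {
    return lru_way[set];
}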
Cache writes and block
replacement
With a write back cache, when a block is
written to, copies of the block in the lower
levels are not updated
If this block is chosen for replacement on a miss, we need to save it to the next level
Solution:
A dirty bit is associated with each cache block
The dirty bit is set if the block is written to
A block with a set dirty bit that is chosen for
replacement is written to the next level before being
overwritten with the new block
Virtual memory
A single program may require more main
memory than is present in the machine
Multiple programs must share main memory
without interfering with each other
With virtual memory, portions of different
programs are loaded from I/O to memory on
demand
When memory gets full, portions of
programs are swapped out to I/O
Implementing virtual memory
Separate different programs in memory by
assigning different memory addresses to
each
Identify when the desired instructions or
data are in memory or not
Generate an interrupt when they are not in
memory and must be retrieved from I/O
Provide support in the OS to retrieve the
desired instructions or data, replacing
others if necessary
Prevent users from accessing instructions
or data they do not own
Memory pages
Transfers between disk and memory occur in pages, whose size is defined in the ISA
Page size (4-64KB) is large compared to cache block size (16-128B) in order to amortize the high cost of a disk access (~5-10ms)
Tradeoffs in increasing page size are similar to those for cache block size
[Figure: memory hierarchy with cache blocks moving between the L2 cache and main memory, and pages moving between main memory and disk]
Virtual and physical addresses
The virtual addresses in your program are
translated during program execution into
the physical addresses used to address
memory
Some virtual addresses may refer to data
that is not in memory (not memory
resident)
Address translation
The virtual page number (vpn) part of the
virtual address is translated into a physical
page number (ppn) that points to the
desired page
The low-order page offset bits point to
which byte is being accessed within a page
The ppn + page offset form the physical
address
Example: 4GB of virtual address space, 2^12 = 4KB pages, 1GB of main memory in the machine
[Figure: translation of a virtual address (vpn + page offset) into a physical address (ppn + page offset)]
Address translation
Another view of the physical address
The ppn gives the location of the first byte of the page in memory; the page offset gives the location of the addressed byte within the 2^(page offset bits)-byte page
[Figure: physical address split into ppn and page offset fields, pointing into a page]
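A minimal C sketch of forming the physical address, assuming the 4KB pages (12 offset bits) of the example above and a small flat page table standing in for the real translation structures described next (sizes and values are made up):

#include <stdint.h>
#include <stdio.h>

#define PAGE_OFFSET_BITS 12            /* 2^12 = 4KB pages          */
#define NUM_VPAGES       16            /* tiny illustrative program */

/* page_table[vpn] holds the ppn for each memory resident page */
static uint32_t page_table[NUM_VPAGES] = { [0] = 7, [1] = 3, [2] = 12 };

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;              /* virtual page number  */
    uint32_t offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1); /* byte within the page */
    uint32_t ppn    = page_table[vpn];                        /* look up the ppn      */
    return (ppn << PAGE_OFFSET_BITS) | offset;                /* ppn + page offset    */
}

int main(void) {
    printf("0x%x\n", translate(0x1ABC));   /* vpn 1 -> ppn 3, offset 0xABC -> 0x3ABC */
    return 0;
}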
Address translation
Address translation is performed by the
hardware and the operating system (OS)
The OS maintains for each program
What pages are associated with it
Where on disk each page resides
What pages are memory resident
The ppn associated with the vpn of each memory
resident page
Address translation
For each program, the OS sets up a page
table in memory that holds the ppn
corresponding to the vpn of each memory
resident page
The page table register in hardware
provides the base address of the page
table for the currently running process
Each program has a unique page table and
page table register value that is loaded by
the OS
Address translation
Page table access
The address of the page table entry is the page table register value + vpn (the vpn is the offset into the page table)
The page table is located in memory, so it requires loads/stores to access!
The TLB: faster address translation
Major problem: for each instruction or data
access we have to first access the page
table in memory to get the physical address
Solution: cache the address translations in a
Translation Lookaside Buffer (TLB) in
hardware
The TLB holds the ppns of the most recently
accessed pages
Hardware first checks the TLB for the
vpn/ppn pair; if not found (TLB miss), then
the page table is accessed to get the ppn
The ppn is loaded into the TLB for later
access
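A minimal C sketch of this check-TLB-first flow (a small fully associative TLB with round-robin fill and a flat page table are assumed purely for illustration; sizes and names are not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define PAGE_OFFSET_BITS 12
#define TLB_ENTRIES      8
#define NUM_VPAGES       1024

typedef struct {
    bool     valid;
    uint32_t vpn;
    uint32_t ppn;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];          /* most recently used translations */
static uint32_t    page_table[NUM_VPAGES];    /* in memory: ppn for each vpn     */
static unsigned    next_fill;                 /* simple round-robin TLB fill     */

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1);

    /* (1) Check the TLB for the vpn/ppn pair. */
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << PAGE_OFFSET_BITS) | offset;   /* TLB hit */

    /* (2) TLB miss: access the page table in memory to get the ppn. */
    uint32_t ppn = page_table[vpn];

    /* (3) Load the translation into the TLB for later accesses. */
    tlb[next_fill] = (tlb_entry_t){ .valid = true, .vpn = vpn, .ppn = ppn };
    next_fill = (next_fill + 1) % TLB_ENTRIES;

    return (ppn << PAGE_OFFSET_BITS) | offset;
}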