Memory System Architecture Guide

The total number of bits required for the cache is: - Number of blocks in cache = $2^{10}$ = 1024 blocks - Words per block = $2^2$ = 4 words - Word size = 32 bits - So bits per block = 4 * 32 = 128 bits - Address size = 32 bits - So tag size = 32 - (10 + 2 + 2) = 18 bits (the last 2 bits are the byte offset within a word) - Valid bit = 1 bit - Total bits = 1024 * (128 + 18 + 1) = 150,528 bits = 147 Kbits ≈ 18.4 KB. Therefore a cache that holds 16 KB of data needs about 18.4 KB of storage in total.

Uploaded by

Surya Sunder
MODULE 4: MEMORY SYSTEM

ORGANIZATION & ARCHITECTURE


Ideal Memory Characteristics
• CPU should have rapid, uninterrupted access to external
memories.
• Memory speed should match CPU speed.
• Unfortunately, such high speed memories are very
expensive.
• So the general approach is to distribute information over
various memory types that have different performance
and cost.
Principles of Locality
• SPATIAL LOCALITY
• The locality principle stating that if a data location is referenced,
data locations with nearby addresses will tend to be referenced
soon

• TEMPORAL LOCALITY
• The principle stating that if a data location is referenced then it will
tend to be referenced again soon.
MEMORY HIERARCHY
Conceptual Organisation of Multilevel memory

[Figure: CPU register file → cache (level 1) → cache (level 2) → main memory → secondary memory (hard disk, etc.). Chip labels in the figure: IC1 (microprocessor), ICs 2:m, ICs m:n.]


Memory Types
• CPU registers
• High speed -- Costly
• Working memory for temporary storage of instruction and data.
• Usually General Purpose Registers are used.
• Size: comparatively small, for example 32 data words.
• Can be accessed within one clock cycle.
• Main (primary) memory
• Fairly fast external memory
• Storage addressed by load and store instructions
• Sizes were once around 1 megabyte; nowadays they are in gigabytes.
• 1 MB = $2^{20}$ bytes and 1 GB = $2^{10}$ MB.
• Access time: five or more clock cycles.
Continued…
• Secondary Memory
• Larger capacity – Many gigabytes
• Slower
• Acts as overflow memory when capacity of main memory
exceeded.
• Accessed via I/O programs
• E.g. hard disks, CD-ROMs, etc.
• Cache
• Positioned logically between registers and main memory.
• Capacity less than main memory and greater than registers.
• Speed the other way round: faster than main memory and slower than registers.
• External memory → main memory and cache memory together
Continued…
• Random Access Memory

• Sequential Access Memory

• Semi Random Access memory


Understanding RAM and ROM (Video Available in
Youtube)
• Refer to the video uploaded Memory Concepts 1

• Or refer to the link


• https://www.youtube.com/watch?v=ufGRoLOvM9I
Questions for discussion
• When will RAM be used?

• When will ROM be used?


Random Access in RAM

• Refer to the link


• https://www.youtube.com/watch?v=Kav6oOFDQSA
Question for discussion
• How is random access possible in RAM?

• How can a random access memory be modified so that it is accessed sequentially?
Addressing and Data(read/write) in RAM
• Refer to the video uploaded Memory Concepts 2 ( in
MOODLE)

• Or refer to the link


• https://www.youtube.com/watch?v=UaFSsD0LPS8
Questions for discussion
• When an address is given, how is only that address accessed for a read or write?
Basic Structure of Memory hierarchy
Continued…
Continued…
• There are three primary technologies used in building memory hierarchies.
• Main memory is implemented from DRAM (Dynamic Random Access Memory)
• Caches use SRAM (Static Random Access Memory)
• The largest and slowest level, secondary memory, is usually magnetic disk


Continued…
What is SRAM?
• SRAM is a type of RAM that holds data in a static form, that is, for as long as the memory has power
• SRAM stores each bit on four transistors that form two cross-coupled inverters
• SRAM is best suited to small, fast storage close to the CPU, such as cache memory and registers
• SRAM is also found in hard drives as disk cache
• SRAM access time is about 10 nanoseconds
• SRAM's cycle time is much shorter than DRAM's because it does not need to pause between accesses to refresh
What is DRAM?
• Dynamic random-access memory (DRAM) is a type of random-access memory that stores each bit of data in a separate capacitor within an integrated circuit.
• The capacitor can be either charged or discharged; these two states are taken to represent the two values of a bit, conventionally called 0 and 1.
• The capacitors will slowly discharge, and the information eventually fades unless the capacitor charge is refreshed periodically
• The main memory (the "RAM") in personal computers is dynamic RAM (DRAM)
Memory Hierarchy and the components used
• Register files → fast static RAMs with multiple ports. Such RAMs are distinguished by having dedicated read and write ports, whereas ordinary multi-ported SRAMs will usually read and write through the same ports
• Cache → SRAM with a single read/write port
• Main memory → DRAM
• Secondary memory → hard disks and CDs


Upper and Lower level in Memory Hierarchy
Continued…
• Block: the minimum unit of information that can be either present or not present in a level of the hierarchy
• Hit: the data requested by the processor appears in some block in the upper level
• L1 cache hit, L2 cache hit, memory hit, etc.
• Miss: the data is not found in the upper level
• L1 cache miss, L2 cache miss, memory miss, etc.
• The lower level in the hierarchy is then accessed to retrieve the block containing the requested data.
Continued…
• Hit rate: the fraction of memory accesses found in a level of the memory hierarchy
• Miss rate: the fraction of memory accesses not found in a level of the memory hierarchy (1 - hit rate)
• Hit time: the time required to access a level of the memory hierarchy, including the time needed to determine whether the access is a hit or a miss
• Miss penalty: the time required to fetch a block into a level of the memory hierarchy from the lower level, including the time to access the block, transmit it from one level to the other, and insert it in the level that experienced the miss.
CACHES
Basics of Cache
• Cache was the name chosen to represent the level of the
memory hierarchy between the processor and main
memory

• The term is also used to refer to any storage managed to take advantage of locality of access.

• Let’s consider a cache with blocks of one word


Cache before and after reference to word Xn
Continued…
• Before the request, Xn is not in the cache.

• This results in a cache miss

• So Xn is brought into cache memory from main memory


Continued…
• How do we know if a data item is in the cache?

• How do we find it?


Mapping between Cache & Main Memory
• Direct Mapped

• Associative Mapped

• Set Associative
Direct Mapping
• A cache structure in which each memory location is mapped to
exactly one location in the cache.

• Most popular mapping:


(Block address) modulo (Number of cache blocks in the cache)

• This mapping is attractive because if the number of entries in the cache is a power of two, then modulo can be computed simply by using the low-order $\log_2$(cache size in blocks) bits of the address

• Hence the cache may be accessed directly with the low-order bits
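As a small illustration of the mapping rule above, the sketch below computes the cache index in two ways, with modulo and by masking the low-order bits, and shows that they agree when the number of blocks is a power of two. The function names are illustrative and not taken from the slides.

```python
def index_by_modulo(block_address: int, num_blocks: int) -> int:
    """(Block address) modulo (number of cache blocks)."""
    return block_address % num_blocks


def index_by_low_bits(block_address: int, num_blocks: int) -> int:
    """Keep only the low-order log2(num_blocks) bits of the block address."""
    # The mask trick is only valid when num_blocks is a power of two.
    assert num_blocks > 0 and num_blocks & (num_blocks - 1) == 0
    return block_address & (num_blocks - 1)


# 8-block cache, 5-bit block addresses: both methods give the same index.
for address in (1, 9, 17, 22, 29):
    print(address, index_by_modulo(address, 8), format(index_by_low_bits(address, 8), "03b"))
```

For a block count that is not a power of two, only the modulo form applies.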
Example
Continued…
• Each cache location can contain the contents of a number of different memory locations
• How do we know whether the data in the cache corresponds to a requested word?
• Tags are added for this purpose
• Tags contain the address information required to identify whether a word in the cache corresponds to the requested word
• The tag is a field in a table (a part of the cache) that holds the above information
Continued….
• In the previous example, the upper 2 bits of the 5-bit address are used as the tag
• Now we know whether it is the requested block or not.
• But how do we know whether the requested block in the cache is valid (not corrupt, and up to date with respect to a particular program)?
• Valid bits are used for this purpose
• Valid bit: a field in the tables of a memory hierarchy that indicates that the associated block in the hierarchy contains valid data
• If the bit is not set, there cannot be a match for this block.
Example Continued….
• For the above example, the block references in main memory are as follows (in the same order)
1. Initial state of the cache after power-on
Continued…
2. First reference is for 22, so 22 mod 8 = 6 = $110_2$ (where the data from main memory goes)
$(22)_{10} = (10110)_2$, so tag = $10_2$. The valid bit is set.
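The lookup described so far (index by modulo, then check the valid bit and the tag) can be sketched as a tiny simulator. This is a minimal model under stated assumptions: one-word blocks, an 8-entry cache, and a class name `DirectMappedCache` invented for illustration. The reference to block 30 is a hypothetical extra access, added only to show a replacement.

```python
class DirectMappedCache:
    """Direct-mapped cache of one-word blocks with a valid bit and tag per entry."""

    def __init__(self, num_blocks: int = 8):
        self.num_blocks = num_blocks
        self.valid = [False] * num_blocks  # all entries invalid after power-on
        self.tag = [0] * num_blocks

    def access(self, block_address: int) -> bool:
        """Return True on a hit; on a miss, fill the entry and return False."""
        index = block_address % self.num_blocks   # low-order bits
        tag = block_address // self.num_blocks    # remaining upper bits
        if self.valid[index] and self.tag[index] == tag:
            return True
        self.valid[index] = True
        self.tag[index] = tag
        return False


cache = DirectMappedCache(8)
print(cache.access(22))  # False: miss, entry 6 (110) now holds tag 2 (10)
print(cache.access(22))  # True: hit
print(cache.access(30))  # False: 30 mod 8 is also 6, tag 3 replaces tag 2
```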
Continued…
Continued…
Direct mapping hardware
Question???
• How to map if number of blocks in cache is not a power of
2?

• Example:
• No of Cache blocks =9
• No of blocks in main memory = 32
Multi-word cache
• If a cache has $2^n$ blocks and each block has $2^m$ words, the number of bits required for representing the cache address will be "n + m"
• n → to address the block within the cache
• m → to address the word within the block

• No. of address bits required for the above cache, including the bytes within words, will be "n + m + 2"
• The extra 2 bits select the byte within the word (4 bytes per word, so 2 bits are enough)
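To make the field layout concrete, here is a short sketch that splits a byte address into tag, block index, word offset and byte offset. The function name and the sample address 0x12345678 are illustrative assumptions; the field widths follow the n, m and 2-bit byte offset described above.

```python
def split_address(address: int, n: int, m: int, byte_bits: int = 2):
    """Split a byte address into (tag, block index, word offset, byte offset).

    n: log2(number of blocks), m: log2(words per block),
    byte_bits: log2(bytes per word), 2 for 4-byte words.
    """
    byte_offset = address & ((1 << byte_bits) - 1)
    word_offset = (address >> byte_bits) & ((1 << m) - 1)
    index = (address >> (byte_bits + m)) & ((1 << n) - 1)
    tag = address >> (byte_bits + m + n)
    return tag, index, word_offset, byte_offset


# 1024 blocks (n = 10) of 4 words (m = 2): 18-bit tag for a 32-bit address.
tag, index, word, byte = split_address(0x12345678, n=10, m=2)
print(hex(tag), index, word, byte)  # 0x48d1 359 2 0
```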
Bits in a cache
• The total number of bits needed for a cache is a function of the cache size and the address size because the cache includes both the storage for the data and the tags.

• So if a cache has $2^n$ blocks, each block has $2^m$ words, and the main memory address is X bits, then the tag size can be calculated as

Tag size = $X - (n + m + 2)$

• Total number of bits in the above cache will be

= No. of cache blocks * ((No. of words in a block * 32) + Tag size + Valid field size)
= $2^n \times ((2^m \times 32) + \text{Tag size} + \text{Valid field size})$
= $2^n \times ((2^m \times 32) + (X - (n + m + 2)) + 1)$
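The formula can be checked numerically with a few lines of code. This is a sketch that assumes 32-bit words (hence the 2-bit byte offset) and a 1-bit valid field, as in the formula above; the function name is illustrative.

```python
def cache_total_bits(n: int, m: int, address_bits: int = 32) -> int:
    """Total storage bits of a direct-mapped cache with 2**n blocks of 2**m words."""
    word_bits = 32        # assumed word size
    byte_offset_bits = 2  # 4 bytes per 32-bit word
    valid_bits = 1
    tag_bits = address_bits - (n + m + byte_offset_bits)
    return (1 << n) * ((1 << m) * word_bits + tag_bits + valid_bits)


bits = cache_total_bits(n=10, m=2, address_bits=32)
print(bits, bits / 1024, bits / 8 / 1024)  # 150528 147.0 18.375
```

With n = 10, m = 2 and X = 32 this gives 150,528 bits, that is 147 Kbits or 18.375 Kbytes.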
Example
• How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address?
• Solution:
• Data in cache: 16 KB = 4K words = $2^{12}$ words
• No. of words in a block = 4 words = $2^2$ words
• If $2^{12}$ words are held in blocks of $2^2$ words,
No. of blocks in cache = $2^{12} / 2^2 = 2^{10}$
• So in the above problem,
• n = 10, m = 2 and X = 32 bits
• So tag size = 32 - 10 - 2 - 2 = 18 bits
• Valid field = 1 bit
• Total bits for cache in the given problem will be
• $2^{10} \times ((2^2 \times 32) + 18 + 1)$
= $2^{10} \times (128 + 18 + 1)$ = $2^{10} \times 147$ = 147 Kbits
= 18.375 Kbytes

• From the above problem it can be seen that a cache holding 16 KB of data actually needs about 18.4 KB of storage. Most naming conventions quote only the data storage (16 KB) and exclude the tag and valid bits.
Three types of misses in Cache
• Referred to as the Three C's

• Compulsory misses: these are cache misses caused by the first access to a block that has never been in the cache. These are also called cold-start misses.

• Capacity misses: these are cache misses caused when the cache cannot contain all the blocks needed during execution of a program. Capacity misses occur when blocks are replaced and then later retrieved.

• Conflict misses: these are cache misses that occur in set-associative or direct-mapped caches when multiple blocks compete for the same set. Conflict misses are those misses in a direct-mapped or set-associative cache that are eliminated in a fully associative cache of the same size. These cache misses are also called collision misses.
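One way to see the three categories is to classify each miss of a direct-mapped cache against a fully associative cache of the same size. The sketch below assumes one-word blocks and LRU replacement in the fully associative reference cache; that reference policy and the function name are assumptions made for illustration, not something the slides specify.

```python
from collections import OrderedDict


def classify_misses(references, num_blocks):
    """Count hits and compulsory/capacity/conflict misses of a direct-mapped cache."""
    seen = set()                  # blocks referenced at least once
    direct = [None] * num_blocks  # direct-mapped contents, one block per entry
    fully = OrderedDict()         # fully associative LRU cache, oldest first
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}
    for block in references:
        # Update the fully associative reference cache.
        fully_hit = block in fully
        if fully_hit:
            fully.move_to_end(block)
        else:
            if len(fully) == num_blocks:
                fully.popitem(last=False)  # evict the least recently used
            fully[block] = True
        # Look up the direct-mapped cache and classify the outcome.
        index = block % num_blocks
        if direct[index] == block:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1   # first ever access to this block
        elif not fully_hit:
            counts["capacity"] += 1     # even full associativity would miss
        else:
            counts["conflict"] += 1     # only the mapping caused the miss
        direct[index] = block
        seen.add(block)
    return counts


# Blocks 0 and 8 compete for entry 0 of an 8-block cache.
print(classify_misses([0, 8, 0, 8], num_blocks=8))
```

In the sample trace the first two misses are compulsory and the last two are conflict misses, because a fully associative cache of 8 blocks would have kept both blocks.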
Ideal Block Size
• Compulsory misses reduce with increased block size

• So how big can a block be?

• The miss rate may eventually go up if the block size becomes a significant fraction of the cache size, because the cache then holds few blocks and capacity misses increase
• In addition, the miss penalty also increases with increased block size

• So blocks should ideally be big enough to reduce compulsory misses, but not so large that capacity misses increase
Continued…
Early Restart
• Miss penalty cannot be reduced, but it can be masked with early restart

• Early restart: resume execution as soon as the requested word of the block is returned, rather than waiting for the entire block.

• Early restart is particularly useful for instruction accesses and less effective for the data cache.
• Because instruction fetch is largely sequential (at least in "in-order" and "out-of-order" pipelines)

• Requested word first: more sophisticated than early restart. The requested word is transferred to the cache first, and the rest of the block is transferred afterwards
Handling Cache Misses
• Cache miss: a request for data from the cache that cannot be filled because the data is not present in the cache.

• The control unit must detect a miss and process the miss by fetching the requested data from memory

• Let the memories in the datapath discussed earlier be replaced by caches (an instruction cache and a data cache)

• Modifying the control of a processor to handle a hit is trivial
Continued…
• Cache miss handling is done by the processor control unit

• A separate controller initiates the memory access and refills the cache

• The processing of a cache miss creates a stall (no-op)

When a Cache miss occurs…
• Stall the entire processor

• Freeze the contents of the temporary and programmer-visible registers
• What are programmer-invisible registers?
Steps to perform with instruction miss
• If an instruction access results in a miss, then the content of the instruction register is invalid

• To get the proper instruction into the cache, we must be able to instruct the lower level in the memory hierarchy to perform a read.

• The address of the instruction that generates an instruction cache miss is equal to the value of the program counter minus 4 (since the PC was incremented by 4 during the instruction fetch, before the access was identified as a miss)
Continued…
1. Send the original PC value (current PC – 4) to the memory.

2. Instruct main memory to perform a read and wait for the memory to complete its access.

3. Write the cache entry, putting the data from memory in the data portion of the entry, writing the upper bits of the address (from the ALU) into the tag field, and turning the valid bit on.

4. Restart the instruction execution at the first step, which will re-fetch the instruction, this time finding it in the cache.
When there is Data Miss
• A data miss happens with a load or store

• After the effective address (EA) is calculated and the access misses in the cache, stall the processor until the memory access has fetched the required data
Handling Writes
• When there is a write to the cache (to some address), how do we maintain consistency between the cache and memory?
• Write-through cache: always write the data into both the memory and the cache

• When there is a write miss (write-through with no buffer):
• Fetch the words of the block from memory
• After the block is fetched and placed into the cache, overwrite the word that caused the miss in the cache block
• Also write the word to main memory.
Continued…
• But write-through with no buffer takes more time
• To improve write performance and solve the above problem,
• Use a write buffer: a queue that holds data while the data are waiting to be written to memory

• After writing the data into the cache and into the write buffer, the processor can continue execution

• When a write to main memory completes, the entry in the write buffer is freed.
Continued…
• If the write buffer is full when the processor reaches a write, the processor must stall until there is an empty position in the write buffer

• Problems with the write buffer:
• If memory completes writes more slowly than the processor generates them, the buffer does not help much

• Even if the average write rate is lower than the rate memory can accept, stalls may still occur when the writes come in bursts
Continued…
• Solution to the above problem:
• Use of a write-back cache

• In a write-back scheme, when a write occurs, the new value is written only to the block in the cache.

• The modified block is written to the lower level of the hierarchy when it is replaced

• Write-back schemes can improve performance, especially when processors can generate writes as fast or faster than the writes can be handled by main memory

• A write-back scheme is more complex to implement than write-through.
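The difference between the two write policies can be seen by counting how many writes reach main memory. The sketch below is a simplified model, assuming a direct-mapped cache with one-word blocks and a dirty bit per entry for write-back; the function names and the sample trace are illustrative assumptions.

```python
def memory_writes_write_through(write_addresses):
    """Write-through: every store is also sent to main memory."""
    return len(write_addresses)


def memory_writes_write_back(write_addresses, num_blocks):
    """Write-back: a modified (dirty) block goes to memory only when it is replaced."""
    blocks = [None] * num_blocks
    dirty = [False] * num_blocks
    memory_writes = 0
    for block in write_addresses:
        index = block % num_blocks
        if blocks[index] != block:
            if dirty[index]:
                memory_writes += 1   # write the replaced, modified block back
            blocks[index] = block
        dirty[index] = True          # the new value lives only in the cache
    # Flush the remaining dirty blocks so the two totals are comparable.
    memory_writes += sum(dirty)
    return memory_writes


trace = [5] * 10 + [13]              # ten stores to block 5, then one to block 13
print(memory_writes_write_through(trace))              # 11
print(memory_writes_write_back(trace, num_blocks=8))   # 2
```

Repeated stores to the same block are where write-back saves memory traffic; the price is the extra dirty-bit bookkeeping.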


Designing Memory System to support Caches
• Cache misses are satisfied from main memory, which is constructed from DRAMs.

• DRAMs are designed with the primary emphasis on density rather than access time

• It is difficult to reduce the latency to fetch the first word from memory

• But we can reduce the miss penalty if we increase the bandwidth from the memory to the cache.

• This reduction allows larger block sizes to be used (with reduced miss penalty)
Design to increase bandwidth of main memory
• The processor is typically connected to memory over a
bus

• The clock rate of the bus is usually much slower than the
processor, by as much as a factor of 10.

• The speed of this bus affects the miss penalty.


Continued…
• Widening the memory and the buses between the
processor and memory

(or)

• widening the memory but not the interconnection bus


One word wide organization
Example
• Assume
■ 1 memory bus clock cycle to send the address
■ 15 memory bus clock cycles for each DRAM access initiated
■ 1 memory bus clock cycle to send a word of data

If we have a cache block of four words and a one-word-wide bank of DRAMs, then

Miss penalty = $1 + 4 \times 15 + 4 \times 1 = 65$ memory bus clock cycles


Wide Memory Organization
Continued…
• Allows parallel access to all words of the block

• So if the bus is 4 words wide and the memory is also 4 words wide

• Miss penalty = $1 + 1 \times 15 + 1 \times 1 = 17$ memory bus clock cycles

• Drawback:
• Cost overhead due to the increased bus width and the control logic (a multiplexer) needed to select the appropriate word
Interleaved Memory Organization
Continued…
• So if there are 4 memory banks, each one word wide, and the bus is still 1 word wide, the miss penalty would be

• Miss penalty = $1 + 1 \times 15 + 4 \times 1 = 20$ memory bus clock cycles
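The three miss-penalty calculations can be compared side by side. The sketch below uses the timing assumptions from the example (1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per bus transfer); the function names are illustrative, and the interleaved case assumes four one-word-wide banks that are accessed in parallel.

```python
def miss_penalty_one_word_wide(words, addr=1, access=15, transfer=1):
    """Each word needs its own DRAM access and its own bus transfer."""
    return addr + words * access + words * transfer


def miss_penalty_wide(words, width_words, addr=1, access=15, transfer=1):
    """Memory and bus are width_words wide, so a whole group moves at once."""
    groups = -(-words // width_words)   # ceiling division
    return addr + groups * access + groups * transfer


def miss_penalty_interleaved(words, banks, addr=1, access=15, transfer=1):
    """Banks are read in parallel, but the one-word bus sends words one by one."""
    rounds = -(-words // banks)         # ceiling division
    return addr + rounds * access + words * transfer


print(miss_penalty_one_word_wide(4))         # 65
print(miss_penalty_wide(4, width_words=4))   # 17
print(miss_penalty_interleaved(4, banks=4))  # 20
```

Interleaving gets most of the benefit of the wide organization while keeping the narrow bus.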


Split Cache
• A scheme in which a level of the memory hierarchy is
composed of two independent caches that operate in
parallel with each other with one handling instructions and
one handling data.
Measuring Cache Performance
• CPU time = (CPU execution clock cycles + Memory-stall clock cycles) * Clock cycle time

• Memory-stall clock cycles = Read-stall cycles + Write-stall cycles

• Read-stall cycles = (Reads / Program) * Read miss rate * Read miss penalty

• For a write-through scheme, write stalls have two sources:
• write misses, which usually require that we fetch the block before continuing the write
• write buffer stalls, which occur when the write buffer is full when a write occurs
Continued…
• Write-stall cycles = (Writes / Program) * Write miss rate * Write miss penalty + Write buffer stalls
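The performance equations translate directly into code. The numbers in the usage lines are made up purely for illustration (they are not from the slides), and the function names are assumptions.

```python
def memory_stall_cycles(reads, read_miss_rate, read_miss_penalty,
                        writes, write_miss_rate, write_miss_penalty,
                        write_buffer_stalls=0):
    """Read-stall cycles plus write-stall cycles for one program."""
    read_stalls = reads * read_miss_rate * read_miss_penalty
    write_stalls = writes * write_miss_rate * write_miss_penalty + write_buffer_stalls
    return read_stalls + write_stalls


def cpu_time(execution_cycles, stall_cycles, clock_cycle_time):
    """CPU time = (execution cycles + memory-stall cycles) * clock cycle time."""
    return (execution_cycles + stall_cycles) * clock_cycle_time


# Hypothetical program: 1000 reads and 500 writes, 100-cycle miss penalty.
stalls = memory_stall_cycles(reads=1000, read_miss_rate=0.05, read_miss_penalty=100,
                             writes=500, write_miss_rate=0.10, write_miss_penalty=100)
print(stalls)                          # about 10000 stall cycles
print(cpu_time(20000, stalls, 1e-9))   # about 3e-05 seconds at a 1 ns clock
```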
Fully Associative Mapping
• Cache structure in which a block can be placed in any location in the cache.

• To find a given block in a fully associative cache, all the entries in the cache must be searched because a block can be placed in any one.

• To make the search practical, it is done in parallel with a comparator associated with each cache entry

• These comparators significantly increase the hardware cost, effectively making fully associative placement practical only for caches with small numbers of blocks.
SET ASSOCIATIVE
• A cache that has a fixed number of locations (at least two)
where each block can be placed.

• (Block number) modulo (Number of sets in the cache)


Direct Mapped or One-Way Set Associative
Two Way Set Associative
Four Way Set Associative
8-Way Set Associative
Locating a block in the Set Associative cache

• Tag: compared to see if it matches the block address from the processor

• The index value is used to select the set containing the address of interest

• Because speed is of the essence, all the tags in the selected set are searched in parallel
Continued…
• If the total cache size is kept the same,
• Increasing the associativity increases the number of blocks per set
• This increases the number of simultaneous compares needed to
perform the search in parallel
No. of Comparators needed
• In direct mapped cache,
• 1 comparator needed

• In m-way set associative,


• m comparators needed
• m to 1 mux is also needed

• In fully associative
• As many comparators as there are blocks
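The relationship between associativity, number of sets and comparator count can be tabulated for a fixed cache size. The sketch assumes an 8-block cache and uses illustrative function names; block number 12 is an arbitrary sample.

```python
def cache_organization(num_blocks: int, ways: int) -> dict:
    """Sets, comparators and multiplexer inputs for a cache of num_blocks blocks."""
    assert num_blocks % ways == 0, "ways must divide the number of blocks"
    return {"sets": num_blocks // ways, "comparators": ways, "mux_inputs": ways}


def set_index(block_number: int, num_blocks: int, ways: int) -> int:
    """(Block number) modulo (number of sets in the cache)."""
    return block_number % (num_blocks // ways)


# Same 8-block cache organised as 1-way (direct mapped) up to 8-way (fully associative).
for ways in (1, 2, 4, 8):
    print(ways, cache_organization(8, ways), set_index(12, 8, ways))
```

Direct mapped is the 1-way case (8 sets, 1 comparator) and fully associative is the 8-way case (1 set, 8 comparators).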
Implementation of 4-way set associative cache
Replacement in Fully Associative and Set
Associative Caches
• What should be replaced when there is a miss in a set-associative or fully associative cache?
E.g. consider 16 blocks grouped into 4 sets in a 4-way set-associative cache. Each block contains 1 byte.
4-Way Set Associative

Set 0: Block 0 | Block 1 | Block 2 | Block 3
Set 1: Block 0 | Block 1 | Block 2 | Block 3
Set 2: Block 0 | Block 1 | Block 2 | Block 3
Set 3: Block 0 | Block 1 | Block 2 | Block 3
Continued…
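• Suppose the references are 0, 1, 2, 3, 7, 16, 20, 24
• Since the largest reference is 24, we consider an address of 5 bits. Of the 5 bits, the last 2 bits select the set (block number modulo 4)
1. 0 → 00000 {Set 0; any block} (Block 0)
2. 1 → 00001 {Set 1; any block} (Block 0)
3. 2 → 00010 {Set 2; any block} (Block 0)
4. 3 → 00011 {Set 3; any block} (Block 0)
5. 7 → 00111 {Set 3; any block except 0} (Block 1)
6. 16 → 10000 {Set 0; any block except 0} (Block 1)
7. 20 → 10100 {Set 0; any block except 0, 1} (Block 2)
8. 24 → 11000 {Set 0; any block except 0, 1, 2} (Block 3)
• Set 0 is now full, so the next new reference that maps to Set 0 needs a replacement
What to replace?
• Many algorithms exist → LRU is popular
• LRU: Least Recently Used
• So in the above scenario replace Block 0 of Set 0 (it holds reference 0, the least recently used)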
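LRU replacement in a set-associative cache can be sketched with an ordered dictionary per set. This is a minimal model assuming 4 sets of 4 ways, one-byte blocks, and a set chosen as (block number) modulo (number of sets); the class name is illustrative, and the final reference to address 4 is a hypothetical extra access added only to force a replacement.

```python
from collections import OrderedDict


class SetAssociativeLRUCache:
    """Set-associative cache that evicts the least recently used block of a set."""

    def __init__(self, num_sets: int = 4, ways: int = 4):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered dict per set; the oldest (least recently used) entry comes first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block_address: int):
        """Return (set index, hit flag, evicted block address or None)."""
        index = block_address % self.num_sets
        lines = self.sets[index]
        if block_address in lines:
            lines.move_to_end(block_address)   # mark as most recently used
            return index, True, None
        evicted = None
        if len(lines) == self.ways:
            evicted, _ = lines.popitem(last=False)  # drop the least recently used
        lines[block_address] = True
        return index, False, evicted


cache = SetAssociativeLRUCache(num_sets=4, ways=4)
for address in (0, 1, 2, 3, 7, 16, 20, 24):
    print(address, cache.access(address))
print(4, cache.access(4))   # (0, False, 0): set 0 is full, address 0 is evicted
```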
Virtual Memory
• A virtual memory block is called a page, and a virtual memory miss is called a page fault

• With virtual memory, the processor produces a virtual address, which is translated by a combination of hardware and software to a physical address, which in turn can be used to access main memory.

• This process is called address mapping or address translation
Mapping from Virtual Address to Physical Address
Page Table
• The table containing the virtual to physical address translations in a
virtual memory system.

• The table, which is stored in memory, is typically indexed by the virtual page number

• Each entry in the table contains the physical page number for that virtual
page if the page is currently in memory.

• Page table register : To indicate the location of the page table in memory,
the hardware includes a register that points to the start of the page table

• Assume for now that the page table is in a fixed and contiguous area of
memory.
TLB
• A cache that keeps track of recently used address
mappings to avoid an access to the page table.
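Address translation with a page table and a TLB can be sketched as follows. The 4 KB page size, the dictionary-based page table, the unbounded TLB and the `PageFault` exception are all assumptions made for illustration; real hardware uses fixed-size structures.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageFault(Exception):
    """Raised when the virtual page is not currently in main memory."""


def translate(virtual_address: int, page_table: dict, tlb: dict) -> int:
    """Translate a virtual address to a physical address, consulting the TLB first."""
    virtual_page, offset = divmod(virtual_address, PAGE_SIZE)
    if virtual_page in tlb:                  # TLB hit: no page table access needed
        physical_page = tlb[virtual_page]
    else:                                    # TLB miss: look in the page table
        physical_page = page_table.get(virtual_page)
        if physical_page is None:
            raise PageFault(f"virtual page {virtual_page} is not in memory")
        tlb[virtual_page] = physical_page    # remember the recent mapping
    return physical_page * PAGE_SIZE + offset


page_table = {0: 5, 1: 9}   # virtual page number -> physical page number
tlb = {}
print(translate(4100, page_table, tlb))  # 36868: page 1 maps to page 9, offset 4
print(tlb)                               # {1: 9}
```

The page offset passes through unchanged; only the page number is translated.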
Error Detection Code → Parity bit
• When a word is written into memory, the parity bit is also written
• Then, when the word is read out, the parity bit is read and checked.
• If the parity of the memory word and the stored parity bit do not match, an error has occurred
• A 1-bit parity scheme can reliably detect only a single-bit error in a data item (more generally, any odd number of bit errors)
• If there are 2 bits of error, a 1-bit parity scheme will not detect any error, since the parity of the data with two errors still matches the stored parity bit
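The behaviour of a single parity bit with one and with two bit errors can be demonstrated in a few lines. The sketch assumes even parity and an arbitrary 8-bit sample word; the function names are illustrative.

```python
def even_parity_bit(bits):
    """Parity bit that makes the total number of 1s (data + parity) even."""
    return sum(bits) % 2


def parity_ok(bits, stored_parity):
    """True if the recomputed parity matches the parity bit stored with the word."""
    return even_parity_bit(bits) == stored_parity


word = [1, 0, 1, 1, 0, 0, 1, 0]
parity = even_parity_bit(word)           # stored alongside the word

one_error = word.copy()
one_error[3] ^= 1                        # flip one bit
two_errors = one_error.copy()
two_errors[6] ^= 1                       # flip a second bit

print(parity_ok(word, parity))           # True: no error
print(parity_ok(one_error, parity))      # False: single error detected
print(parity_ok(two_errors, parity))     # True: double error goes undetected
```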
What is a code word?
• An n-bit unit containing data and check-bits is often
referred to as an n-bit codeword
Continued…
• In the above table, moving from one code word to another requires changing at least 2 bits, i.e. the minimum (Hamming) distance is 2

• A parity code cannot tell which bit in a data item is in error

• A 1-bit parity scheme is an error detection code (EDC)

• To correct errors, an error correcting code (ECC) is used

• A 1-bit parity code is a distance-2 code
• There is a distance of two between legal combinations of parity and data
Error Correcting Code
• These codes work by using more bits to encode the data

• To detect more than one error or correct an error, we need a distance-3 code
Hamming Code
• Hamming code contains redundant bits (r) and data bits (d)

• To calculate the number of redundant bits (r) required to correct d data bits:
• Total no. of bits to be transmitted → d + r
• So r must be able to indicate at least d + r + 1 different values (an error in any one of the d + r positions, or no error)
• $2^r \ge d + r + 1$

• Example:
• If d is 7, the smallest value of r that satisfies the above relation is 4
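The smallest r for a given d can be found by simply trying r = 0, 1, 2, ... until the inequality holds. A minimal sketch, with an illustrative function name:

```python
def min_redundant_bits(d: int) -> int:
    """Smallest r with 2**r >= d + r + 1 for d data bits."""
    r = 0
    while 2 ** r < d + r + 1:
        r += 1
    return r


for d in (1, 4, 7, 8, 32, 64):
    print(d, min_redundant_bits(d))
```

For d = 7 the loop stops at r = 4, since $2^3 = 8 < 11$ but $2^4 = 16 \ge 12$.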
Continued…
Example
• Let us consider 4 data bits. So r will be 3

• Let the data to be sent be 1010 (even parity)

• So

Bit position:  7   6   5   4   3   2   1
Bit name:      d4  d3  d2  r4  d1  r2  r1
Data bits:     1   0   1   -   0   -   -
Code word:     1   0   1   0   0   1   0   (after adding the redundant bits)
Continued…
• If the received data is 1110010

C1: parity of bit positions 1, 3, 5, 7 → 0
C2: parity of bit positions 2, 3, 6, 7 → 1
C3: parity of bit positions 4, 5, 6, 7 → 1
So C3 C2 C1 → $110_2$ → 6
Error in bit position 6
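The worked example can be reproduced with a small encoder and checker for 4 data bits and 3 redundant bits. The sketch assumes even parity and the layout used in the example (position 7 on the left, redundant bits at positions 1, 2 and 4); the function names are illustrative.

```python
def hamming74_encode(d4: int, d3: int, d2: int, d1: int) -> str:
    """Return the 7-bit code word as a string, position 7 first and position 1 last."""
    r1 = (d1 + d2 + d4) % 2   # even parity over positions 1, 3, 5, 7
    r2 = (d1 + d3 + d4) % 2   # even parity over positions 2, 3, 6, 7
    r4 = (d2 + d3 + d4) % 2   # even parity over positions 4, 5, 6, 7
    return "".join(str(b) for b in (d4, d3, d2, r4, d1, r2, r1))


def hamming74_error_position(code_word: str) -> int:
    """Return the position (1 to 7) of a single-bit error, or 0 if the checks pass."""
    bit = {7 - i: int(ch) for i, ch in enumerate(code_word)}  # position -> bit value
    c1 = (bit[1] + bit[3] + bit[5] + bit[7]) % 2
    c2 = (bit[2] + bit[3] + bit[6] + bit[7]) % 2
    c3 = (bit[4] + bit[5] + bit[6] + bit[7]) % 2
    return c3 * 4 + c2 * 2 + c1


def hamming74_correct(code_word: str) -> str:
    """Flip the bit named by the parity checks, if any."""
    position = hamming74_error_position(code_word)
    if position == 0:
        return code_word
    i = 7 - position                      # string index of that bit position
    flipped = "1" if code_word[i] == "0" else "0"
    return code_word[:i] + flipped + code_word[i + 1:]


print(hamming74_encode(1, 0, 1, 0))         # 1010010
print(hamming74_error_position("1110010"))  # 6
print(hamming74_correct("1110010"))         # 1010010
```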
Continued…
• The basic approach for error detection and correction using Hamming code is as follows:
• To each group of m information bits, k parity bits are added to form an (m + k)-bit code
• The location of each of the (m + k) digits is assigned a decimal value
• The k parity bits are placed in positions $1, 2, 4, \ldots, 2^{k-1}$
• k parity checks are performed on selected digits of each code word
• At the receiving end the parity bits are recalculated. The decimal value of the k parity checks gives the bit position in error, if any
