Computer Organization and Architecture
Designing for Performance
11th Edition
                                          Chapter 5
                                          Cache Memory
               Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Figure 5.1
Cache and Main Memory
          Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Cache Memory Principles
• Block
  – The minimum unit of transfer between cache and main memory
• Frame
  – To distinguish between the data transferred and the chunk of physical
    memory, the term frame, or block frame, is sometimes used with reference
    to caches
• Line
  – A portion of cache memory capable of holding one block, so-called
    because it is usually drawn as a horizontal object
• Tag
  – A portion of a cache line that is used for addressing purposes
• Line size
  – The number of data bytes, or block size, contained in a line
                    Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Figure 5.2
Cache/Main Memory Structure
           Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Figure 5.3
Cache Read Operation
           Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Figure 5.4
Typical Cache Organization
            Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Table 5.1
Elements of Cache Design
Cache Addresses                               Write Policy
  Logical                                       Write through
  Physical                                      Write back
Cache Size                                    Line Size
Mapping Function                              Number of Caches
  Direct                                        Single or two level
  Associative                                   Unified or split
  Set associative
Replacement Algorithm
  Least recently used (LRU)
  First in first out (FIFO)
  Least frequently used (LFU)
  Random
                    Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Cache Addresses
                   •Virtual Memory
• Virtual memory
  – Facility that allows programs to address memory from a
    logical point of view, without regard to the amount of main
    memory physically available
  – When used, the address fields of machine instructions
    contain virtual addresses
  – For reads to and writes from main memory, a hardware
    memory management unit (MMU) translates each virtual
    address into a physical address in main memory
                    Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Figure 5.5
Logical and Physical Caches
            Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Cache Size
• Preferable for the size of the cache to be:
   – Small enough so that the overall average cost per bit is close to that of
     main memory alone
   – Large enough so that the overall average access time is close to that of
     the cache alone
▪ Motivations for minimizing cache size:
   – The larger the cache, the larger the number of gates involved in
     addressing the cache resulting in large caches being slightly slower than
     small ones
   – The available chip and board area also limits cache size
▪ Because the performance of the cache is very sensitive to the
  nature of the workload, it is impossible to arrive at a single
  “optimum” cache size
                       Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Table 5.2
Cache Sizes of Some Processors
                                     Year of
   Processor          Type        Introduction      L1 Cachea           L2 cache      L3 Cache
   IBM 360/85       Mainframe        1968           16 to 32 kB            –             –
   PDP-11/70      Minicomputer       1968              1 kB                –             –
   IBM 3033         Mainframe        1968             64 kB                –             –
   IBM 3090         Mainframe        1968          128 to 256 kB           –             –
   Intel 80486         PC            1968              8 kB                –             –
    Pentium            PC            1968            8 kB/8 kB        256 to 512 kB      –
  PowerPC 620          PC            1968          32 kB/32 kB             –             –
  IBM S/390 G6      Mainframe        1968             256 kB              8 MB           –
   Pentium 4        PC/server        1968            8 kB/8 kB           256 kB          –
    Itanium         PC/server        1968          16 kB/16 kB           96 kB          4 MB
    Itanium 2       PC/server        1968             32 kB              256 kB         6 MB
  IBM POWER5        High-end         1968             64 kB              1.9 MB        36 MB
                     server
  CRAY XD-1       Supercomputer      1968          64 kB/64 kB            1 MB           –
  IBM POWER6        PC/server        1968          64 kB/64 kB            4 MB         32 MB
    IBM z10         Mainframe        1968          64 kB/128 kB           3 MB        24-48 MB
  Intel Core i7    Workstaton/       1968         6 × 32 kB/32 kB      6 × 1.5 MB      12 MB      a
                                                                                                   Two values separated by a
     EE 990         Server
                                                                                                  slash refer to instruction and
                                                                                                  data caches.
      IBM          Mainframe/        1968        24 × 64 kB/128 kB    24 × 1.5 MB      24 MB L3
   zEnterprise      Server                                                            192 MB L4
      196
    IBM z13        Mainframe/        1968        24 × 96 kB/128 kB   24 × 2 MB/2 MB    64 MB L3
                    server                                                            480 MB L4   (Table can be found on page
   Intel Core      Workstation/      1968         8 × 32 kB/32 kB       8 × 1 MB       14 MB      145 in the textbook.)
    i0-7900X         server
                                       Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
 Table 5.3
 Cache Access Methods
                                             Mapping of Main Memory                    Access using Main
    Method             Organization             Blocks to Cache                         Memory Address
 Direct Mapped      Sequence of m          Each block of main memory       Line portion of address used
                    lines                  maps to one unique line of      to access cache line; Tag
                                           cache.                          portion used to check for hit
                                                                           on that line.
Fully Associative   Sequence of m          Each block of main memory       Tag portion of address used
                    lines                  can map to any line of cache.   to check every line for hit on
                                                                           that line.
Set Associative     Sequence of m          Each block of main memory       Line portion of address used to
                    lines organized as v   maps to one unique cache set.   access cache set; Tag portion
                    sets of k lines each                                   used to check every line in that
                    (m = v × k)                                            set for hit on that line.
                                    Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Figure 5.6
Mapping from Main Memory to Cache: Direct and
Associative
               Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
  Direct Mapping: M mod N
                                                            Block/Line = 4 = 22    Block size = 4 = 22
32 words in Main Memory        32 = 25 =           Tag    Cache Block/Line #         Word Offset
16 words in Cache Memory       physical address
                                                  1 bit          2 bits                  2 bits
Block size = 4 words
                                                                         00 – 1
                                                               0         01 – 2
                                                                         10 – 3   0 mod 4 = 0
                                                                         11 – 4
             0   1                  0, 4
                      0                                        1
                                    1, 5
                                                               2                  2 mod 4 = 2
  Tag                 1
                                                               3
                           00 – 1
                 0                  2, 6
                      2    01 – 2
                           10 – 3
                           11 – 4
                                    3, 7                       4                  4 mod 4 = 0
                      3
                     Cache Memory                              5
                     # blocks (N) = 4
                                                               6
MM blocks that map to CM block 0 are: 0, 4                     7
MM blocks that map to CM block 1 are: 1, 5
                                                              Main Memory
MM blocks that map to CM block 2 are: 2, 6                    # blocks (M) = 8
MM blocks that map to CM block 3 are: 3, 7
Figure 5.7
Direct-Mapping Cache Organization
           Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Figure 5.8
Direct Mapping Example
           Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Content-Addressable Memory (CAM)
•   Also known as associative storage
•   Content-addressable memory is constructed of Static RAM (SRAM)
    cells but is considerably more expensive and holds much less data
    than regular SRAM chips
•   A CAM with the same data capacity as a regular SRAM is about
    60% larger
•   A CAM is designed such that when a bit string is supplied, the CAM
    searches its entire memory in parallel for a match
    – If the content is found, the CAM returns the address where the match
      is found and, in some architectures, also returns the associated data
      word
    – This process takes only one clock cycle
                       Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Figure 5.9
Content-Addressable Memory
           Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
  Fully Associative Cache
                            32 = 25 =       Total = 5 bits
32 words in Main Memory     physical address
                                                                  Tag           Word Offset
16 words in Cache Memory                                          3 bits         2 bits
Block size = 4 words
                                                             0
              000
                      0                                      1
  Tag
              010
              000
                      1                                      2
              111
              000
                      2                                      3
              000
                      3                                      4
                     Cache Memory                            5
                     # blocks (N) = 4
                                                             6
                                                             7
                                                             Main Memory
                                                             # blocks (M) = 8
                              Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Figure 5.10
Fully Associative Cache Organization
            Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Figure 5.11
Associative Mapping Example
           Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Set Associative Mapping
• Compromise that exhibits the strengths of both the direct
  and associative approaches while reducing their
  disadvantages
• Cache consists of a number of sets
• Each set contains a number of lines
• A given block maps to any line in a given set
• e.g. 2 lines per set
  – 2 way associative mapping
  – A given block can be in one of 2 lines in only one set
                    Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
  Set Associative Cache
                         32 = 25 =
32 words in Main Memory physical address        Tag           Set field              Word Offset
16 words in Cache Memory    Total = 5 bits     2 bits              1 bit               2 bits
Block size = 4 words
                                                              0                  0 mod 2 = 0
                                      Set 0
              00
                      0                                       1
              01
                      1                                       2                  2 mod 2 = 0
  Tag
              11                      Set 1                   3                  3 mod 2 = 1
                      2
              01
                      3                                       4                  4 mod 2 = 0
                      Cache Memory                            5
                      # blocks (N) = 4
                                                              6
                                                              7                  7 mod 2 = 1
                                                             Main Memory
                                                             # blocks (M) = 8
                                Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Figure 5.13 k-Way
Set Associative Cache Organization
            Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Figure 5.14
Two-Way Set-Associative Mapping Example
            Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Figure 5.15
Varying Associativity over Cache Size
            Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Replacement Algorithms
• Once the cache has been filled, when a new block is
  brought into the cache, one of the existing blocks must
  be replaced
• For direct mapping there is only one possible line for any
  particular block and no choice is possible
• For the associative and set-associative techniques a
  replacement algorithm is needed
• To achieve high speed, an algorithm must be
  implemented in hardware
                  Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
The most common replacement algorithms
are:
•   Least recently used (LRU)
    – Most effective
    – Replace that block in the set that has been in the cache longest with no
      reference to it
    – Because of its simplicity of implementation, LRU is the most popular
      replacement algorithm
•   First-in-first-out (FIFO)
    – Replace that block in the set that has been in the cache longest
    – Easily implemented as a round-robin or circular buffer technique
•   Least frequently used (LFU)
    – Replace that block in the set that has experienced the fewest references
    – Could be implemented by associating a counter with each line
                        Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
 Write Policy
When a block that is resident
    in the cache is to be                            There are two problems to
replaced there are two cases                               contend with:
         to consider:
 If the old block in the cache has not
      been altered, then it may be                    More than one device may have
overwritten with a new block without                     access to main memory
      first writing out the old block
                                                      A more complex problem occurs
  If at least one write operation has                  when multiple processors are
been performed on a word in that line               attached to the same bus and each
of the cache then main memory must                  processor has its own local cache - if
   be updated by writing the line of               a word is altered in one cache it could
  cache out to the block of memory                    conceivably invalidate a word in
   before bringing in the new block                             other caches
                         Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Write Through
         and Write Back
• Write through
  – Simplest technique
  – All write operations are made to main memory as well as to the cache
  – The main disadvantage of this technique is that it generates substantial
    memory traffic and may create a bottleneck
• Write back
  – Minimizes memory writes
  – Updates are made only in the cache
  – Portions of main memory are invalid and hence accesses by I/O modules
    can be allowed only through the cache
  – This makes for complex circuitry and a potential bottleneck
                      Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Write Miss Alternatives
•   There are two alternatives in the event of a write miss at a cache
    level:
    – Write allocate
      –   The block containing the word to be written is fetched from main memory (or
          next level cache) into the cache and the processor proceeds with the write
          cycle
    – No write allocate
      –   The block containing the word to be written is modified in the main memory
          and not loaded into the cache
•   Either of these policies can be used with either write through or
    write back
•   No write allocate is most commonly used with write through
•   Write allocate is most commonly used with write back
                          Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Cache Coherency
•   A new problem is introduced in a bus organization in which more than one device has
    a cache and main memory is shared
•   If data in one cache are altered, this invalidates not only the corresponding word in
    main memory, but also that same word in other caches
•   Even if a write-through policy is used, the other caches may contain invalid data
•   Possible approaches to cache coherency include:
    –   Bus watching with write through
        •   Each cache controller monitors the address lines to detect write operations to memory by other bus
            masters
        •   If another master writes to a location in shared memory that also resides in the cache memory, the cache
            controller invalidates that cache entry
        •   This strategy depends on the use of a write-through policy by all cache controllers
    –   Hardware transparency
        •   Additional hardware is used to ensure that all updates to main memory via cache are reflected in all caches
        •   If one processor modifies a word in its cache, this update is written to main memory
    –   Noncacheable memory
        •   Only a portion of main memory is shared by more than one processor, and this is designated as noncacheable
        •   All accesses to shared memory are cache misses, because the shared memory is never copied into the cache
        •   The noncacheable memory can be identified using chip-select logic or high-address bits
                                   Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Line Size
 When a block of                                                                    Two specific effects
 data is retrieved                                                                  come into play:
 and placed in the                                                                  • Larger blocks reduce the
cache not only the                           As the block size                        number of blocks that fit
 desired word but                             increases more                          into a cache
                                                                                    • As a block becomes larger
also some number                               useful data are                        each additional word is
of adjacent words                            brought into the                         farther from the requested
   are retrieved                                   cache                              word
                      As the block size                             The hit ratio will
                      increases the hit                           begin to decrease
                       ratio will at first                            as the block
                     increase because                              becomes bigger
                     of the principle of                          and the probability
                            locality                              of using the newly
                                                                 fetched information
                                                                 becomes less than
                                                                   the probability of
                                                                       reusing the
                                                                    information that
                                                                  has to be replaced
                                      Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Multilevel Caches
•   As logic density has increased it has become possible to have a cache on the same chip as
    the processor
•   The on-chip cache reduces the processor’s external bus activity and speeds up execution
    time and increases overall system performance
    –   When the requested instruction or data is found in the on-chip cache, the bus access is
        eliminated
    –   On-chip cache accesses will complete appreciably faster than would even zero-wait state bus
        cycles
    –   During this period the bus is free to support other transfers
•   Two-level cache:
    –   Internal cache designated as level 1 (L1)
    –   External cache designated as level 2 (L2)
•   Potential savings due to the use of an L2 cache depends on the hit rates in both the L1 and
    L2 caches
•   The use of multilevel caches complicates all of the design issues related to caches, including
    size, replacement algorithm, and write policy
                                Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Figure 5.16
Total Hit Ratio (L1 and L2) for 8-kB and
16-kB L1
                Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Unified Versus Split Caches
•   Has become common to split cache:
    – One dedicated to instructions
    – One dedicated to data
    – Both exist at the same level, typically as two L1 caches
•   Advantages of unified cache:
    – Higher hit rate
       ▪ Balances load of instruction and data fetches automatically
       ▪ Only one cache needs to be designed and implemented
•   Trend is toward split caches at the L1 and unified caches for
    higher levels
•   Advantages of split cache:
    – Eliminates cache contention between instruction fetch/decode unit and execution unit
       ▪ Important in pipelining
                            Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Inclusion Policy
•   Inclusive policy
    –   Dictates that a piece of data in one cache is guaranteed to be also found in all lower levels of caches
    –   Advantage is that it simplifies searching for data when there are multiple processors in the computing
        system
    –   This property is useful in enforcing cache coherence
•   Exclusive policy
    –   Dictates that a piece of data in one cache is guaranteed not to be found in all lower levels of caches
    –   The advantage is that it does not waste cache capacity since it does not store multiple copies of the
        same data in all of the caches
    –   The disadvantage is the need to search multiple cache levels when invalidating or updating a block
    –   To minimize the search time, the highest-level tag sets are typically duplicated at the lowest cache
        level to centralize searching
•   Noninclusive policy
    –   With the noninclusive policy a piece of data in one cache may or may not be found in lower levels of
        caches
    –   As with the exclusive policy, this policy will generally maintain all higher-level cache sets at the lowest
        cache level
                                 Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Figure 1. Inclusive Policy
Figure 2. Exclusive policy
Table 5.4
Intel Cache Evolution
                                                                                           Processor on Which
      Problem                                                   Solution                  Feature First Appears
      External memory slower than the             Add external cache using faster                 386
      system bus.                                 memory technology.
      Increased processor speed results in        Move external cache on-chip,                    486
      external bus becoming a bottleneck for      operating at the same speed as the
      cache access.                               processor.
      Internal cache is rather small, due to      Add external L2 cache using faster              486
      limited space on chip.                      technology than main memory.
      Contention occurs when both the             Create separate data and                      Pentium
      Instruction Prefetcher and the              instruction caches.
      Execution Unit simultaneously require
      access to the cache. In that case, the
      Prefetcher is stalled while the Execution
      Unit’s data access takes place.
                                                  Create separate back-side bus that          Pentium Pro
                                                  runs at higher speed than the main
                                                  (front-side) external bus. The BSB is
      Increased processor speed results in        dedicated to the L2 cache.
      external bus becoming a bottleneck for
      L2 cache access.
                                                  Move L2 cache on to the                      Pentium II
                                                  processor chip.
                                                  Add external L3 cache.                       Pentium III
      Some applications deal with massive
      databases and must have rapid access
                                                  Move L3 cache on-chip.                       Pentium 4
      to large amounts of data. The on-Chip
      caches are too small.
                             Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Figure 5.17
Pentium 4 Block Diagram
           Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Copyright
    This work is protected by United States copyright laws and is provided solely
      for the use of instructions in teaching their courses and assessing student
       learning. dissemination or sale of any part of this work (including on the
          World Wide Web) will destroy the integrity of the work and is not permit-
           ted. The work and materials from it should never be made available to
             students except by instructors using the accompanying text in their
              classes. All recipients of this work are expected to abide by these
    restrictions and to honor the intended pedagogical purposes and the needs of
    other instructors who rely on these materials.
                     Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved