09 Memory Building Blocks

The document discusses SRAM technology, detailing its characteristics, advantages, and various implementations in computer architecture. It explains the differences between SRAM and latches, the structure of SRAM cells, and the operation of read and write ports. Additionally, it covers topics such as multi-porting, banking, and the use of content-addressable memory (CAM) in cache designs, emphasizing the importance of latency and power considerations in SRAM design.

Advanced Computer Architecture I

SRAM Technology
• SRAM: static RAM
• Static: bits directly connected to power/ground
• Naturally/continuously “refreshed”, never decay (contrast DRAM)
• Designed for speed

• Implements all storage arrays in real processors
  • Register file, caches, branch predictor, etc.
  • Everything except pipeline latches

• Latches vs. SRAM
  • Latches: singleton word, always read/write the same one
  • SRAM: array of words, can read/write different ones
    • Address indicates which one

4
(CMOS) Memory Components

• Interface
  • N-bit address bus (on an N-bit machine)
  • Data bus
    • Can have read/write on the same data bus
    • Or dedicated read/write buses
  • Can have multiple ports: address/data bus pairs

5
SRAM: First Cut
• 4x2 (4 2-bit words) RAM
  • 2-bit address
• First cut: bits are D-latches
• Write port
  • Address decodes to enable signals
• Read port
  • Address decodes to mux selectors
  – 1024-input OR gate?
  – Physical layout of output wires
    • RAM width ∝ M
    • Wire delay ∝ wire length

(Figure: 4x2 latch array with write-addr decoder, write-data1/0 inputs, read-addr mux selectors, read-data1/0 outputs.)

6
SRAM: Second Cut
• Second cut: tri-state wired-OR
• Read mux using tri-states
  + Scalable, distributed "muxes"
  + Better layout of output wires
    • RAM width independent of M
• Standard RAM
  • Bits in word connected by wordline
    • 1-hot decode address
  • Bits in position connected by bitline
    • Shared input/output wires
  • Port: one set of wordlines/bitlines
  • Grid-like design

(Figure: same 4x2 array, now with wordlines per word and shared bitlines carrying read-data1/0.)

7
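The 1-hot address decode described above can be modeled behaviorally. A minimal sketch (the function name is illustrative, not from the slides):

```python
def one_hot_decode(addr, num_words):
    """1-hot address decode: exactly one of num_words wordlines goes high."""
    wordlines = [0] * num_words
    wordlines[addr] = 1
    return wordlines

# A 2-bit address selects one of four wordlines
print(one_hot_decode(2, 4))  # [0, 0, 1, 0]
```

In hardware this is a row of AND gates on the address bits, one per wordline; the key property is that exactly one output is ever high.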
SRAM: Third Cut
• Third cut: replace latches with …
  – 28 transistors per bit
• Cross-coupled inverters (CCI)
  + 4 transistors
• Convention
  • Right node is bit, left is ~bit
• Non-digital interface
  • What is the input and output?
  • Where is write enable?
• Implement ports in "analog" way
  • Transistors, not full gates

8
SRAM: Register Files and Caches
• Two different SRAM port styles
• Regfile style
• Modest size: <4KB
• Many ports: some read-only, some write-only
• Write and read both take half a cycle (write first, read second)
• Cache style
• Larger size: >8KB
• Few ports: read/write in a single port
• Write and read can both take full cycle

9
Regfile-Style Read Port
• Two-phase read
  • Phase I: clk = 0
    • Pre-charge bitlines to 1
    • Negated bitlines are 0
  • Phase II: clk = 1
    • One wordline goes high
    • All "1" bits in that row discharge their bitlines to 0
    • Negated bitlines go to 1

(Figure: 2x2 cell array; raddr decodes to wordline0/wordline1, cells drive bitline1/bitline0, which feed rdata1/rdata0.)

10
Read Port In Action: Phase I
• CLK = 0
  • p-transistors conduct
  • Bitlines "pre-charge" to 1
  • rdata1-0 are 0
11
Read Port In Action: Phase II
• raddr = 1
• CLK = 1
  • p-transistors close
  • wordline1 = 1
• "1" bits on wordline1 create a path from bitline to ground
  • SRAM[1]
• Corresponding bitlines discharge
  • bitline1
• Corresponding rdata bits go to 1
  • rdata1
• That's a read

12
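The two read phases above can be modeled behaviorally. A sketch under the slide's conventions (the cell contents and names are illustrative):

```python
def regfile_read(cells, raddr):
    """Behavioral model of the two-phase regfile-style read port."""
    width = len(cells[0])
    bitlines = [1] * width                  # Phase I: pre-charge bitlines to 1
    for i, bit in enumerate(cells[raddr]):  # Phase II: selected wordline high
        if bit == 1:                        # "1" bits discharge their bitline
            bitlines[i] = 0
    return [1 - b for b in bitlines]        # negated bitlines are the read data

cells = [[0, 1], [1, 0]]                    # the 2x2 array from the figures
print(regfile_read(cells, 1))  # [1, 0]
```

Note the double inversion: a stored 1 pulls its bitline to 0, so the negated bitline recovers the stored value.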
Regfile-Style Write Port
• Two-phase write
  • Phase I: clk = 1
    • Stabilize one wordline high
  • Phase II: clk = 0
    • Open pass-transistors
    • "Overwhelm" bits in the selected word
• Actually: two clocks here
  • Both phases in the first half
• Pass transistor: like a tri-state buffer

(Figure: waddr decoder drives wordlines; wdata1/wdata0 enter through pass transistors gated by CLK.)
13
A 2-Read Port 1-Write Port Regfile
(Figure: a 2-read, 1-write port regfile built from SRAM cells; RD selects the write wordline for wdata1/wdata0, while RS1 and RS2 select read wordlines driving rdata11/rdata10 and rdata21/rdata20.)
14
Cache-Style Read/Write Port
• Double-ended bitlines
  • Connect to both sides of bit (wdata and ~wdata)
• Two-phase write
  • Just like a register file
• Two-phase read
  • Phase I: clk = 1
    • Equalize bitline pair voltage
  • Phase II: clk = 0
    • One wordline high
    • "1 side" bitline swings up
    • "0 side" bitline swings down
    • Sense-amp translates swing

(Figure: cells with bit/~bit bitline pairs, write enables WE&~CLK, read enables RE&CLK, sense-amplifiers producing rdata1/rdata0.)
15
Read/Write Port in Read Action: Phase I
• Phase I: clk = 1
  • Equalize voltage on bitline pairs
  • To (nominally) 0.5

(Figure: both bitlines of each bit/~bit pair equalized to 0.5 above the sense-amplifiers.)
16
Read/Write Port in Read Action: Phase II
• Phase II: clk = 0
  • wordline1 goes high
  • "1 side" bitlines swing high (0.6)
  • "0 side" bitlines swing low (0.4)
  • Sense-amps interpret the swing

(Figure: bit/~bit pairs at 0.6/0.4; sense-amplifiers resolve them to rdata1 = 1, rdata0 = 0.)
17
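The differential read above can also be sketched behaviorally, with the sense-amp reduced to a comparison of the two bitline voltages (the 0.5/0.1 numbers mirror the slides; everything else is illustrative):

```python
def cache_read(cells, addr, vmid=0.5, swing=0.1):
    """Behavioral model of a differential cache-style read: Phase I
    equalizes each bitline pair to vmid; in Phase II the selected bits
    swing the pair apart; the sense-amp compares the two sides."""
    out = []
    for bit in cells[addr]:
        bl = vmid + swing if bit else vmid - swing   # "bit side"
        nbl = vmid - swing if bit else vmid + swing  # "~bit side"
        out.append(1 if bl > nbl else 0)             # sense-amp decision
    return out

print(cache_read([[1, 0], [0, 1]], 1))  # [0, 1]
```

The point of the small swing is speed and power: the sense-amp only needs the pair to separate slightly, not to reach full rail.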
Cache-Style SRAM Latency
• Assume
  • M N-bit words
  • Some minimum wire spacing L
  • CCIs occupy no space
• 4 major latency components, taken in series
  • Decoder: ∝ log2 M
  • Wordlines: ∝ 2NL (cross 2N bitlines)
  • Bitlines: ∝ ML (cross M wordlines)
  • Muxes + sense-amps: constant
• 32KB SRAM: the wordline and bitline components contribute about equally

• Latency: ∝ (2N+M)L
  • Maximize storage for some max latency: make SRAMs as square as possible (minimize 2N+M)
  • Latency: ∝ √(#total bits)
18
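The "as square as possible" claim can be checked numerically. A sketch using the slide's 32KB size; the sweep over power-of-two shapes is just arithmetic, not from the slides:

```python
# Latency ∝ (2N + M)L for M words of N bits.  For a fixed capacity
# (32KB = 2**18 bits), sweep power-of-two shapes and find the one
# minimizing 2N + M: the best shapes sit near "square" (M ≈ 2N).
total_bits = 32 * 1024 * 8
shapes = [(2 * (total_bits // m) + m, total_bits // m, m)
          for m in (2 ** k for k in range(1, 18))]
cost, n, m = min(shapes)
print(f"N={n} bits/word, M={m} words, 2N+M={cost}")
```

Very wide or very deep shapes (e.g. M = 2) make one of the two wire terms dominate, which is exactly what the √(#total bits) scaling expresses.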
Multi-Ported Cache-Style SRAM Latency
• Previous calculation had a hidden constant
  • Number of ports P
• Recalculate latency components
  • Decoder: ∝ log2 M (unchanged)
  • Wordlines: ∝ 2NLP (cross 2NP bitlines)
  • Bitlines: ∝ MLP (cross MP wordlines)
  • Muxes + sense-amps: constant (unchanged)

• Latency: ∝ (2N+M)LP
  • Latency: ∝ √(#bits) * #ports
• How does latency scale? ∝ P
• How does power scale? ∝ P²
  • Both wire length and the number of active wires increase
19
Multi-Porting an SRAM
• Why multi-porting?
  • Multiple accesses per cycle
• True multi-porting (physically adding a port) not good
  + Any combination of accesses will work
  – Increases access latency, energy ∝ P, area ∝ P²
• Another option: pipelining
  • Timeshare a single port on clock edges (wave pipelining: no latches)
  + Negligible area, latency, energy increase
  – Not scalable beyond 2 ports
• Yet another option: replication
  • Don't laugh: used for register files, even caches (Alpha 21164)
  • Smaller and faster than true multi-porting: 2·P² < (2·P)²
  + Adds read bandwidth; any combination of reads will work
  – Doesn't add write bandwidth; not really scalable beyond 2 ports

20
Banking an SRAM
• Divide SRAM into banks, interleave the addresses
• Allow parallel access to different banks
• Two accesses to the same bank? Bank conflict: one waits
• Low area/latency overhead for routing requests to banks
• Few bank conflicts given a sufficient number of banks
  • Rule of thumb: N simultaneous accesses → 2N banks

• How to divide words among banks?
  • Round robin: using address LSBs (least significant bits)
  • Example: 16-word RAM divided into 4 banks
    • b0: 0,4,8,12; b1: 1,5,9,13; b2: 2,6,10,14; b3: 3,7,11,15
  • Why? Spatial locality

21
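The round-robin interleaving above is a one-line mapping. A sketch reproducing the slide's 16-word, 4-bank example:

```python
def bank_of(addr, num_banks):
    """Round-robin interleaving: the bank is chosen by the address LSBs
    (addr mod num_banks, when num_banks is a power of two)."""
    return addr % num_banks

# The slide's example: a 16-word RAM divided into 4 banks
banks = {b: [a for a in range(16) if bank_of(a, 4) == b] for b in range(4)}
print(banks)  # {0: [0, 4, 8, 12], 1: [1, 5, 9, 13], 2: [2, 6, 10, 14], 3: [3, 7, 11, 15]}
```

Using the LSBs means consecutive addresses land in different banks, so spatially local access streams rarely conflict.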
Full-Associativity with CAMs
• CAM: content-addressable memory
  • Array of words with built-in comparators
  • Matchlines instead of bitlines
  • Output is a "one-hot" encoding of the match

• FA cache?
  • Tags as CAM
  • Data as RAM

• Hardware is not software
  • No such thing as a software CAM

(Figure: a tag CAM with 1024 comparators (= 0 … = 1023) driving a data RAM; address split into tag [31:2] and offset 1:0; any matchline firing signals a cache hit.)
22
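The CAM lookup can be modeled in software, though only behaviorally; as the slide says, hardware compares every entry in parallel in one cycle, while software must iterate. A sketch with illustrative tag values:

```python
def cam_match(entries, key):
    """Software model of a CAM lookup: compare the key against every
    entry and return a one-hot (or all-zero) match vector.  This loop
    is sequential; the hardware comparators all fire simultaneously."""
    return [1 if entry == key else 0 for entry in entries]

tags = [0x12, 0x7f, 0x3a, 0x9c]
print(cam_match(tags, 0x3a))  # [0, 0, 1, 0]
```

In a fully-associative cache, this one-hot vector directly drives the data RAM's wordlines, so no separate decoder is needed.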
CAM Circuit
• CAM: reverse RAM
  • Bitlines are inputs
    • Called matchlines
  • Wordlines are outputs
• Two-phase match
  • Phase I: clk = 0
    • Pre-charge wordlines
  • Phase II: clk = 1
    • Enable matchlines
    • Non-matching bits discharge wordlines

(Figure: cell array with match1/~match1 and match0/~match0 inputs running vertically, wordline outputs running horizontally.)
23
CAM Circuit In Action: Phase I
• Phase I: clk = 0
  • Pre-charge wordlines
24
CAM Circuit In Action: Phase II
• Phase II: clk = 1
  • Enable matchlines
  • Note: bits flipped
  • Non-matching bit discharges its wordline
• ANDs matches
• NORs non-matches
• Similar technique for doing a fast OR for hit detection

25
CAM Upshot
• CAMs: effective but expensive
  – Matchlines are very expensive (for nasty EE reasons)
• Used, but only up to 16- or 32-way associativity
  • Not for 1024-way associativity

26
Bonus

27
Multi-Ported Cache-Style SRAM Power
• Same four components for power
  • Decoder: ∝ log2 M
  • Wordlines: ∝ 2NLP
    – Huge capacitance (C) per wordline (drives 2N gates)
    + But only one is ever high at any time (overall consumption low)
  • Bitlines: ∝ MLP
    – C lower than wordlines, but still large
    + Vswing << VDD (power ∝ C · Vswing² · f)
  • Muxes + sense-amps: constant

• 32KB SRAM: sense-amps are 60–70% of power

• How does power scale?
28
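The C · Vswing² · f point above is why the reduced bitline swing matters so much. A sketch of the arithmetic; the capacitance and frequency numbers are illustrative, not from the slides:

```python
# Dynamic power ∝ C * Vswing**2 * f.  Bitline capacitance is large, but
# swing enters quadratically, so a small differential swing wins big.
def dynamic_power(c, vswing, f):
    return c * vswing ** 2 * f

full_rail = dynamic_power(1e-12, 1.0, 2e9)  # full-rail bitline swing
reduced = dynamic_power(1e-12, 0.1, 2e9)    # 10% swing, sensed by sense-amps
print(full_rail / reduced)                  # ≈ 100x less bitline power
```

The quadratic term is also why the sense-amps themselves end up dominating: they must amplify the tiny swing back to full rail on every access.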
A Banked Cache
• Banking a cache
  • Simple: bank the SRAMs
  • Which address bits determine the bank? LSBs of the index
  • Bank network assigns accesses to banks, resolves conflicts
    – Adds some latency too

(Figure: two cache banks; each access splits its address into tag [31:12], index [11:3], bank bit, and offset 1:0, producing data0/hit0 and data1/hit1.)
29
SRAM Summary
• Large storage arrays are not implemented “digitally”
• SRAM implementation exploits analog transistor properties
• Inverter pair bits much smaller than latch/flip-flop bits
• Wordline/bitline arrangement gives simple “grid-like” routing
• Basic understanding of read, write, read/write ports
• Wordlines select words
• Overwhelm inverter-pair to write
• Drain pre-charged line or swing voltage to read
• Latency proportional to √#bits * #ports

30
Aside: Physical Cache Layout I
• Logical layout
  • Data and tags mixed together
• Physical layout
  • Data and tags in separate RAMs

(Figure: separate tag and data arrays holding sets 0–511 and 512–1023; address split into tag [31:11], index [10:2], offset 1:0; comparators produce hit, the way mux produces data.)
31
Physical Cache Layout II
• Logical layout
  • Data array is monolithic
• Physical layout
  • Each data "way" in a separate array

(Figure: per-way data arrays for sets 0–511 and 512–1023; address split into [31:11], [10:2], 1:0.)
Physical Cache Layout III
• Logical layout
  • Data blocks are contiguous
• Physical layout
  • Only if a full block is needed on read
    • E.g., I$ (read consecutive words)
    • E.g., L2 (read block to fill D$, I$)
  • For D$ (access size is 1 word) …
    • Words in the same data block are bit-interleaved
    • Word0.bit0 adjacent to word1.bit0
    + Builds word selection logic into the array
    + Avoids duplicating sense-amps/muxes

(Figure: columns ordered word3 word2 word1 word0 within each bit position; address split into [31:11], [10:2], 1:0.)
33
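The bit-interleaved column ordering above can be written as a small index formula. A sketch (the function name and the 4-words-per-block default are illustrative):

```python
def interleaved_column(word, bit, words_per_block=4):
    """Physical column of logical (word, bit) under bit-interleaving:
    the same bit position of every word in a block sits in adjacent
    columns, so one mux per bit position selects the word."""
    return bit * words_per_block + word

# word0.bit0 .. word3.bit0 occupy adjacent columns 0..3
print([interleaved_column(w, 0) for w in range(4)])  # [0, 1, 2, 3]
print([interleaved_column(w, 1) for w in range(4)])  # [4, 5, 6, 7]
```

Because each group of adjacent columns holds one bit position of all four words, a single sense-amp and 4:1 mux per position suffices, which is the sharing the slide highlights.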
Physical Cache Layout IV
• Logical layout
  • Arrays are vertically contiguous
• Physical layout
  • Vertical partitioning to minimize wire lengths
  • H-tree: horizontal/vertical partitioning layout
    • Applied recursively
    • Each node looks like an H

(Figure: array split into quadrants 0–255/256–511 and 512–767/768–1023 around a central H-shaped routing spine.)
34
Physical Cache Layout
• Arrays and H-trees make caches easy to spot in die micrographs

35
Full-Associativity
(Figure: a fully-associative lookup: address tag [31:2] compared against all 1024 stored tags at once, offset 1:0.)

• How to implement full (or at least high) associativity?
  • 1K tag matches? Unavoidable, but at least tags are small
  • 1K data reads? Terribly inefficient

36
