Advanced Computer Architecture I
SRAM Technology
• SRAM: static RAM
• Static: bits directly connected to power/ground
• Naturally/continuously “refreshed”, never decay (contrast DRAM)
• Designed for speed
• Implements all storage arrays in real processors
• Register file, caches, branch predictor, etc.
• Everything except pipeline latches
• Latches vs. SRAM
• Latches: singleton word, always read/write same one
• SRAM: array of words, can read/write different ones
• Address indicates which one
(CMOS) Memory Components
• Interface: address in → data out
  • N-bit address bus (on an N-bit machine)
  • Data bus
  • Can have read/write on the same data bus
  • Or dedicated read and write buses
  • Can have multiple ports: address/data bus pairs
SRAM: First Cut
• 4×2 (four 2-bit words) RAM
• 2-bit address
• First cut: bits are D-latches
• Write port
  • Address decodes to one-hot enable signals
• Read port
  • Address decodes to mux selectors
  – 1024-input OR gate?
  – Physical layout of output wires
  • RAM width ∝ M
  • Wire delay ∝ wire length
[Figure: 4×2 latch-based RAM; write-addr enables select a row for write-data1/write-data0, read-addr muxes select read-data1/read-data0; stored words 00, 11, 10, 01]
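To make the decode-and-mux behavior concrete, here is a minimal behavioral sketch of the 4×2 first-cut RAM in Python (the class and method names are mine, purely illustrative):

```python
class TinyRAM:
    """Behavioral model of the 4x2 first-cut RAM."""
    def __init__(self, num_words=4, width=2):
        self.words = [0] * num_words   # each entry models one row of D-latches
        self.num_words = num_words
        self.width = width

    def write(self, addr, data):
        # The write address decodes to one-hot enables; only the
        # enabled row of latches captures write-data.
        for i in range(self.num_words):
            if i == addr:
                self.words[i] = data & ((1 << self.width) - 1)

    def read(self, addr):
        # The read address drives mux selectors: an M-to-1 mux per bit.
        return self.words[addr]

ram = TinyRAM()
ram.write(0b10, 0b10)          # store "10" in word 2
assert ram.read(0b10) == 0b10  # mux selects word 2
```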
SRAM: Second Cut
• Second cut: tri-state wired-OR
  • Read mux built from tri-states
  + Scalable, distributed “muxes”
  + Better layout of output wires
  + RAM width independent of M
• Standard RAM
  • Bits in a word connected by a wordline
  • One-hot decode of the address
  • Bits in a position connected by a bitline
  • Shared input/output wires
  • Port: one set of wordlines/bitlines
  • Grid-like design
[Figure: the 4×2 RAM with tri-state read buffers driving shared read bitlines; write-addr and read-addr each one-hot decoded; stored words 00, 11, 10, 01]
SRAM: Third Cut
• Third cut: replace the latches with …
  – Latches: 28 transistors per bit
• Cross-coupled inverters (CCI)
  + Only 4 transistors per bit
• Convention
  • Right node is bit, left node is ~bit
• Non-digital interface
  • What is the input and what is the output?
  • Where is the write enable?
• Implement ports in an “analog” way
  • Transistors, not full gates
[Figure: D-latch (IN, WE, OUT) beside a cross-coupled inverter pair with ~bit and bit nodes; the IN?/OUT? labels mark the ambiguous interface]
SRAM: Register Files and Caches
• Two different SRAM port styles
• Regfile style
• Modest size: <4KB
• Many ports: some read-only, some write-only
• Write and read both take half a cycle (write first, read second)
• Cache style
• Larger size: >8KB
• Few ports: read/write in a single port
• Write and read can both take full cycle
Regfile-Style Read Port
• Two-phase read
• Phase I: clk = 0
  • Pre-charge bitlines to 1
  • Negated bitlines are 0
• Phase II: clk = 1
  • One wordline goes high
  • All “1” bits in that row discharge their bitlines to 0
  • Negated bitlines go to 1
[Figure: two-word read port; raddr decodes to wordline0/wordline1, cells pull bitline1/bitline0 low, inverters drive rdata1/rdata0; CLK gates the pre-charge p-transistors; stored words 01 and 10]
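A behavioral sketch of the two-phase read in Python; only the pre-charge/discharge logic follows the slides, the names are mine:

```python
def regfile_read(words, raddr, width):
    """Model one regfile-style read: pre-charge, then discharge."""
    # Phase I (clk = 0): pre-charge every bitline to 1;
    # the negated bitlines (which become rdata) are 0.
    bitlines = [1] * width

    # Phase II (clk = 1): the wordline for raddr goes high, and
    # every stored "1" bit opens a path from its bitline to ground.
    for b in range(width):
        if (words[raddr] >> b) & 1:
            bitlines[b] = 0

    # rdata is the negation of the bitlines.
    return [1 - bl for bl in bitlines]

words = [0b01, 0b10]               # two 2-bit words
print(regfile_read(words, 1, 2))   # [0, 1], i.e. rdata1=1, rdata0=0
```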
Read Port In Action: Phase I
• CLK = 0
  • p-transistors conduct
  • Bitlines “pre-charge” to 1
  • rdata1 and rdata0 are 0
[Figure: the same port with CLK = 0; both bitlines at 1, rdata1 = rdata0 = 0]
Read Port In Action: Phase II
• CLK = 1
  • p-transistors close
• raddr = 1
  • wordline1 = 1
• “1” bits on wordline1 create a path from bitline to ground
  • Here, the bits of SRAM[1]
• Corresponding bitlines discharge
  • Here, bitline1
• Corresponding rdata bits go to 1
  • Here, rdata1
• That’s a read
[Figure: CLK = 1, wordline1 high; word 1 holds “10”, so bitline1 discharges and rdata1 = 1, rdata0 = 0]
Regfile-Style Write Port
• Two-phase write
• Phase I: clk = 1
  • Stabilize one wordline high
• Phase II: clk = 0
  • Open the pass-transistors
  • “Overwhelm” the bits in the selected word
• Actually, two clocks here: both phases fit in the first half-cycle
• Pass transistor: behaves like a tri-state buffer
[Figure: two-word write port; waddr decodes to wordlines, wdata1/wdata0 drive the cells through CLK-gated pass transistors]
A 2-Read Port 1-Write Port Regfile
[Figure: 2-read/1-write regfile array; one write port (RD wordline, wdata1/wdata0, CLK) and two read ports (RS1, RS2 wordlines) per SRAM cell, producing rdata11/rdata10 and rdata21/rdata20]
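A behavioral sketch of this 2R/1W regfile in Python. Per the earlier port-style slide, the write uses the first half-cycle and the reads the second, so reading the register written in the same cycle returns the new value (names like rs1/rs2/rd are mine):

```python
class Regfile2R1W:
    """Behavioral 2-read/1-write register file: write first, read second."""
    def __init__(self, num_regs=4, width=2):
        self.regs = [0] * num_regs
        self.width = width

    def cycle(self, we, rd, wdata, rs1, rs2):
        if we:                          # first half-cycle: write port
            self.regs[rd] = wdata & ((1 << self.width) - 1)
        # second half-cycle: the two read ports
        return self.regs[rs1], self.regs[rs2]

rf = Regfile2R1W()
r1, r2 = rf.cycle(we=True, rd=1, wdata=0b11, rs1=1, rs2=0)
assert r1 == 0b11   # read sees the value written earlier in the cycle
```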
Cache-Style Read/Write Port
• Double-ended bitlines
  • Connect to both sides of each bit
• Two-phase write
  • Just like a register file
• Two-phase read
  • Phase I: clk = 1
    • Equalize the bitline pair voltages
  • Phase II: clk = 0
    • One wordline goes high
    • The “1 side” bitline swings up
    • The “0 side” bitline swings down
    • A sense-amplifier translates the swing into a digital output
[Figure: read/write port with wdata/~wdata drivers gated by WE & ~CLK / WE & CLK, read enables RE & CLK / RE & ~CLK, and a sense-amplifier per bitline pair driving rdata1/rdata0]
Read/Write Port in Read Action: Phase I
• Phase I: clk = 1
  • Equalize the voltage on each bitline pair
  • To (nominally) 0.5
[Figure: RE & CLK asserted; all bit/~bit bitlines equalized at 0.5, sense-amplifiers not yet resolving]
Read/Write Port in Read Action: Phase II
• Phase II: clk = 0
  • wordline1 goes high
  • “1 side” bitlines swing high (to ~0.6)
  • “0 side” bitlines swing low (to ~0.4)
  • Sense-amps interpret the swing
[Figure: RE & ~CLK asserted, wordline1 high; the bit/~bit pairs split to 0.6/0.4 and 0.4/0.6 and the sense-amplifiers output rdata1 = 1, rdata0 = 0]
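A behavioral sketch of the differential read in Python; the 0.1 swing and the stored contents are illustrative, and the sense-amp is modeled as a bare comparator:

```python
def cache_read(words, addr, width, swing=0.1):
    """Model one cache-style read: equalize, swing, sense."""
    rdata = []
    for b in range(width):
        bit, nbit = 0.5, 0.5                  # Phase I: equalized pair
        if (words[addr] >> b) & 1:            # Phase II: small swing
            bit, nbit = bit + swing, nbit - swing
        else:
            bit, nbit = bit - swing, nbit + swing
        rdata.append(1 if bit > nbit else 0)  # sense-amp: compare the sides
    return rdata

words = [0b11, 0b10]             # illustrative contents
print(cache_read(words, 1, 2))   # [0, 1], i.e. rdata1=1, rdata0=0
```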
Cache-Style SRAM Latency
• Assume
  • M words of N bits each
  • Some minimum wire spacing L
  • CCIs occupy no space
• 4 major latency components, incurred in series
  • Decoder: ∝ log2(M)
  • Wordlines: ∝ 2NL (each crosses 2N bitlines)
  • Bitlines: ∝ ML (each crosses M wordlines)
  • Muxes + sense-amps: constant
• 32KB SRAM: the wire components (wordlines, bitlines) contribute about equally
• Latency: ∝ (2N + M)L
• To maximize storage for a given maximum latency, make SRAMs as square as possible: minimize 2N + M
• Latency: ∝ √(#total bits)
[Figure: M×N cell array with the decoder on the left and sense-amps at the bottom]
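A toy latency model in Python built from the slide’s proportionalities; the constants are invented, so only the scaling trends are meaningful:

```python
from math import log2

def sram_latency(M, N, L=1.0):
    """Latency proxy: decoder + wordline + bitline + sense-amp."""
    decoder   = log2(M)      # ∝ log2(M)
    wordline  = 2 * N * L    # crosses 2N bitlines
    bitline   = M * L        # crosses M wordlines
    sense_amp = 1.0          # roughly constant
    return decoder + wordline + bitline + sense_amp

# For a fixed capacity M*N, a near-square array minimizes 2N + M:
bits = 32 * 1024 * 8         # 32KB
for N in (32, 128, 512, 2048):
    M = bits // N
    print(f"N={N:5d} M={M:6d} latency ~ {sram_latency(M, N):8.1f}")
# Best near 2N == M, consistent with latency ∝ sqrt(#total bits).
```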
Multi-Ported Cache-Style SRAM Latency
• The previous calculation had a hidden constant
  • The number of ports, P
• Recalculate the latency components
  • Decoder: ∝ log2(M) (unchanged)
  • Wordlines: ∝ 2NLP (each crosses 2NP bitlines)
  • Bitlines: ∝ MLP (each crosses MP wordlines)
  • Muxes + sense-amps: constant (unchanged)
• Latency: ∝ (2N + M)LP
• Latency: ∝ √(#bits) × #ports
• How does latency scale? Linearly, with P
• How does power scale? With P²: both wire length and the number of active wires increase
[Figure: each cell now crossed by two sets of wordlines/bitlines, with sense-amps per port]
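Extending the same toy model with the port count P (again, invented constants):

```python
from math import log2

def sram_latency_ports(M, N, P, L=1.0):
    """Wordline and bitline terms each stretch by P."""
    return log2(M) + (2 * N * L * P) + (M * L * P) + 1.0

for P in (1, 2, 4):
    print(f"P={P}: latency ~ {sram_latency_ports(512, 512, P):7.1f}")
# Latency grows ~linearly in P; energy grows ~P^2, since the wires are
# P times longer AND P times as many of them switch.
```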
Multi-Porting an SRAM
• Why multi-porting?
• Multiple accesses per cycle
• True multi-porting (physically adding a port) not good
+ Any combination of accesses will work
– Increases access latency and energy ∝ P, area ∝ P²
• Another option: pipelining
• Timeshare single port on clock edges (wave pipelining: no latches)
+ Negligible area, latency, energy increase
– Not scalable beyond 2 ports
• Yet another option: replication
• Don’t laugh: used for register files, even caches (Alpha 21164)
• Smaller and faster than true multi-porting: two P-ported copies cost 2·P² area, vs. (2P)² for one 2P-ported array
+ Adds read bandwidth, any combination of reads will work
– Doesn’t add write bandwidth, not really scalable beyond 2 ports
Banking an SRAM
• Divide the SRAM into banks, interleave the addresses
• Allow parallel access to different banks
• Two accesses to the same bank? Bank conflict: one waits
[Figure: words 1020–1023 spread across four banks]
• Low area/latency overhead for routing requests to banks
• Few bank conflicts given a sufficient number of banks
  • Rule of thumb: N simultaneous accesses → 2N banks
• How to divide words among banks?
  • Round-robin, using the address LSBs (least significant bits)
  • Example: 16-word RAM divided into 4 banks
    • b0: 0,4,8,12; b1: 1,5,9,13; b2: 2,6,10,14; b3: 3,7,11,15
• Why? Spatial locality
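A sketch of LSB bank interleaving and conflict detection in Python, matching the 16-word/4-bank example (function names are mine):

```python
NUM_BANKS = 4   # power of two, so bank select is a cheap bit slice

def bank_of(addr):
    return addr & (NUM_BANKS - 1)   # low 2 bits pick the bank

def index_in_bank(addr):
    return addr >> 2                # remaining bits index within the bank

# Round-robin assignment: bank 1 holds 1, 5, 9, 13
assert [a for a in range(16) if bank_of(a) == 1] == [1, 5, 9, 13]

def has_conflict(addrs):
    # Same-cycle accesses to the same bank conflict; one must wait.
    banks = [bank_of(a) for a in addrs]
    return len(banks) != len(set(banks))

print(has_conflict([3, 7]))   # True: both map to bank 3
print(has_conflict([3, 4]))   # False: banks 3 and 0
```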
Full-Associativity with CAMs
• CAM: content-addressable memory
  • Array of words with built-in comparators
  • Matchlines instead of bitlines
  • Output is a one-hot encoding of the match
• Fully-associative (FA) cache?
  • Tags in a CAM
  • Data in an ordinary RAM
• Hardware is not software
  • There is no such thing as a software CAM
[Figure: 1024-entry tag CAM ([31:2]) whose one-hot match output selects the corresponding data RAM row; any match raises the cache-hit signal]
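A behavioral sketch of a CAM-backed fully-associative lookup in Python: every stored tag is compared in parallel, yielding a one-hot match vector that selects a row of an ordinary data RAM (tags and data below are invented):

```python
def cam_lookup(tags, key):
    """One comparator per word; output is a one-hot match vector."""
    return [1 if t == key else 0 for t in tags]

tags = [0x12, 0x34, 0x56, 0x78]   # the CAM contents
data = ["A", "B", "C", "D"]       # the companion RAM

match = cam_lookup(tags, 0x56)    # all comparisons happen "at once"
hit = any(match)                  # fast OR for hit detection
value = data[match.index(1)] if hit else None
print(hit, value)                 # True C
```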
CAM Circuit
• CAM: a reverse RAM
  • Bitlines are inputs
    • Called matchlines
  • Wordlines are outputs
• Two-phase match
  • Phase I: clk = 0
    • Pre-charge the wordlines
  • Phase II: clk = 1
    • Enable the matchlines
    • Non-matching bits discharge their wordlines
[Figure: two-word CAM array driven by match1/~match1 and match0/~match0, with CLK-gated pre-charge on the wordlines; stored words 01 and 10]
CAM Circuit In Action: Phase I
• Phase I: clk = 0
  • Pre-charge the wordlines
[Figure: search key “01” on the matchlines (match1/~match1/match0/~match0 = 0/1/1/0); both wordlines pre-charged high]
CAM Circuit In Action: Phase II
• Phase II: clk = 1
  • Enable the matchlines
  • Note: the bits are flipped (each cell compares against the complemented matchline)
  • A non-matching bit discharges its wordline
    • ANDs the matches
    • NORs the non-matches
  • A similar technique gives a fast OR for hit detection
[Figure: with key “01”, word 0 (“01”) keeps its wordline at 1 (match) while word 1 (“10”) has its wordline discharged to 0]
CAM Upshot
• CAMs: effective but expensive
– Matchlines are very expensive (for nasty EE reasons)
• Used, but only for 16- or 32-way (max) associativity
• Not for 1024-way associativity
Bonus
Multi-Ported Cache-Style SRAM Power
• Same four components for power
  • Decoder: ∝ log2(M)
  • Wordlines: ∝ 2NLP
    – Huge capacitance (C) per wordline (drives 2N gates)
    + But only one is ever high at a time (overall consumption low)
  • Bitlines: ∝ MLP
    – C lower than the wordlines’, but still large
    + Vswing << VDD (power ∝ C · Vswing² · f)
  • Muxes + sense-amps: constant
• 32KB SRAM: sense-amps are 60–70% of the power
• How does power scale?
[Figure: the multi-ported array again, with the sense-amps highlighted]
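A quick illustration of why the small bitline swing saves power, using dynamic power ≈ C · Vswing² · f (all numbers invented):

```python
def dynamic_power(C, vswing, f):
    """Dynamic switching power of one wire."""
    return C * vswing**2 * f

C, f, VDD = 1e-12, 2e9, 1.0                 # 1 pF, 2 GHz, normalized VDD
full  = dynamic_power(C, VDD, f)            # full-swing wire
small = dynamic_power(C, 0.1 * VDD, f)      # bitline with 10% swing
print(f"full: {full:.1e} W  reduced: {small:.1e} W  "
      f"ratio: {full / small:.0f}x")        # quadratic in the swing: 100x
```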
A Banked Cache
• Banking a cache
• Simple: bank SRAMs
• Which address bits determine the bank? The LSBs of the index
• Bank network assigns accesses to banks, resolves conflicts
– Adds some latency too
[Figure: two-banked two-way cache; each bank has its own tag compare (tag [31:12], index [11:3], bank bit, offset [1:0]) and way mux, producing address0/data0/hit0? and address1/data1/hit1?]
SRAM Summary
• Large storage arrays are not implemented “digitally”
• SRAM implementation exploits analog transistor properties
• Inverter pair bits much smaller than latch/flip-flop bits
• Wordline/bitline arrangement gives simple “grid-like” routing
• Basic understanding of read, write, read/write ports
• Wordlines select words
• Overwhelm inverter-pair to write
• Drain pre-charged line or swing voltage to read
• Latency proportional to √(#bits) × #ports
Aside: Physical Cache Layout I
• Logical layout
  • Data and tags mixed together
• Physical layout
  • Data and tags in separate RAMs
[Figure: two-way cache drawn with separate tag and data RAMs (rows 0–511 and 512–1023); tag [31:11] and index [10:2] drive the comparators and way mux that produce hit?, address, data]
Physical Cache Layout II
• Logical layout
  • Data array is monolithic
• Physical layout
  • Each data “way” in a separate array
[Figure: the same cache with the data array split into one array per way]
Physical Cache Layout III
• Logical layout
  • Data blocks are contiguous
• Physical layout
  • Contiguous only if the full block is needed on a read
  • E.g., I$ (reads consecutive words)
  • E.g., L2 (reads a whole block to fill the D$ or I$)
• For the D$ (access size is one word) …
  • Words in the same data block are bit-interleaved
  • word0.bit0 adjacent to word1.bit0, and so on (see the sketch below)
  + Builds word-selection logic into the array
  + Avoids duplicating sense-amps/muxes
[Figure: a block’s columns interleaved as word3/word2/word1/word0; address split [31:11], [10:2], [1:0]]
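A toy model of the bit-interleaved layout in Python: physical column b·W + w holds bit b of word w, so selecting a word is just picking every W-th column (sizes are illustrative):

```python
W, BITS = 4, 8   # words per block, bits per word

def interleave(block):
    """Physical column order: [w0b0, w1b0, w2b0, w3b0, w0b1, ...]."""
    cols = []
    for b in range(BITS):
        for w in range(W):
            cols.append((block[w] >> b) & 1)
    return cols

def select_word(cols, w):
    """Word select = every W-th column starting at offset w."""
    bits = cols[w::W]
    return sum(bit << i for i, bit in enumerate(bits))

block = [0x11, 0x22, 0x33, 0x44]
cols = interleave(block)
assert select_word(cols, 2) == 0x33   # the shared muxes pick word 2
```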
Physical Cache Layout IV
• Logical layout
  • Arrays are vertically contiguous
• Physical layout
  • Vertical partitioning to minimize wire lengths
  • H-tree: horizontal/vertical partitioning layout
    • Applied recursively
    • Each node looks like an H
[Figure: arrays split vertically (rows 0–255 and 256–511 per way) and joined by an H-tree to the address/data ports]
Physical Cache Layout
• Arrays and H-trees make caches easy to spot in die micrographs
Full-Associativity
[Figure: all 1024 tags ([31:2]) compared in parallel against the address; every comparator output feeds the data read]
• How to implement full (or at least high) associativity?
• 1K tag matches? Unavoidable, but at least tags are small
• 1K data reads? Terribly inefficient