Computer Science 146
Computer Architecture
Spring 2004
Harvard University
Instructor: Prof. David Brooks
dbrooks@eecs.harvard.edu
Lecture 17: Main Memory
Course Outline Revisited
W1   Feb 4      Introduction                                     Ch. 1
W2   Feb 9/11   Measuring Perf.; ISA Design                      Ch. 2, A.1-3
W3   Feb 16/18  Holiday/No Class; Basic Pipelining               A.4-11, Ch. 3
W4   Feb 23/25  Multicycle/Scoreboard; Tomasulo's Algorithm      Ch. 3
W5   Mar 1/3    Branch Pred./Fetch; Mult. Issue/Speculation      Ch. 3
W6   Mar 8/10   Processor Case Studies; Static Issue             Ch. 3/4
W7   Mar 15/17  Static ILP vs. HW ILP; IA64 Study/Review         Ch. 4
W8   Mar 22/24  IA64 Study/Review; Midterm
W9   Mar 29/31  Spring Break
W10  Apr 5/7    Caches; Caches                                   Ch. 5
W11  Apr 12/14  Caches; Main Memory                              Ch. 5
W12  Apr 19/21  Virtual Memory; Shared Memory MPs                Papers
W13  Apr 26/28  Multithreading (SMT/MP); Storage, I/O, Clusters  Ch. 6/7
W14  May 3/5    Security Processors; Network/GPU Processors      Papers
Lecture Outline
Main Memory
Main Memory Background
Random Access Memory
Different flavors at different levels
Physical Makeup (CMOS, DRAM)
Low Level Architectures (FPM,EDO,SDRAM,RAMBUS)
Cache uses SRAM: Static Random Access Memory
No refresh (6 transistors/bit vs. 1 transistor/bit)
Size: DRAM/SRAM = 4-8x
Cost and cycle time: SRAM/DRAM = 8-16x
Main Memory is DRAM: Dynamic Random Access Memory
Dynamic since needs to be refreshed periodically (8 ms, 1% time)
Addresses divided into 2 halves (Memory as a 2D matrix):
RAS or Row Access Strobe
CAS or Column Access Strobe
Static RAM (SRAM)
Six transistors in a cross-connected fashion
Provides regular and inverted outputs
Implemented in a CMOS process
(Figure: single-port 6-T SRAM cell)
Dynamic RAM
SRAM cells exhibit high speed but poor density
DRAM: simple transistor/capacitor pairs in high-density form
(Figure: DRAM cells with word lines, a shared bit line, and a sense amp)
DRAM Operations
Write
  Charge the bit line HIGH or LOW and set the word line HIGH
Read
  Bit line is precharged to a voltage halfway between HIGH and LOW, then the word line is set HIGH
  Depending on the charge in the capacitor, the precharged bit line is pulled slightly higher or lower
  Sense amp detects the change
Reads are destructive (must follow with a write)
Must refresh the capacitor every so often
Access Time = time to read
Cycle Time = time between reads
(Figure: DRAM cell with word line, bit line, and sense amp)
DRAM logical organization
Square row/column matrix
Multiplex address lines to save pins
Internal row buffer
Put row address on the address lines
Set RAS
Read row into the row buffer
Put column address on the address lines
Set CAS
Read column bits out of the row buffer
(Figure: 2,048 x 2,048 memory array with row decoder, column decoder, sense amps & I/O, 11-bit address buffer, and word-line/storage-cell detail)
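As a concrete illustration of the row/column multiplexing above, here is a minimal C sketch (my own, not from the slide) that splits a 22-bit word address into an 11-bit row half (sent with RAS) and an 11-bit column half (sent with CAS) for a 2,048 x 2,048 array; the exact bit assignment is an assumption for illustration.

#include <stdio.h>

/* Assumed split: high 11 bits = row (RAS phase), low 11 bits = column (CAS phase).
 * Real DRAMs may permute the address bits differently. */
#define ROW_BITS 11
#define COL_BITS 11

int main(void) {
    unsigned addr = 0x2ABCD;                           /* an example 22-bit word address */
    unsigned row  = addr >> COL_BITS;                  /* driven while RAS is asserted */
    unsigned col  = addr & ((1u << COL_BITS) - 1);     /* driven while CAS is asserted */
    printf("addr=0x%05X -> row=%u, col=%u\n", addr, row, col);
    return 0;
}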
Vanilla DRAM Read
(Timing diagram of a basic DRAM read, showing access time and cycle time; timing diagrams from the Ars Technica RAM Guide)
Fast Page DRAM
Extended Data Out (EDO) DRAM
Synchronous DRAM
DDR SDRAM: transmit data on both clock edges
Comparison with SRAM
By its nature, DRAM isn't built for speed
  Response times depend on capacitive circuit properties, which get worse as density increases
  The DRAM process isn't easy to integrate into a standard CMOS process
SRAM:
  Optimized for speed (8x-16x vs. DRAM), not density
  Bits are not erased on a read
  No refresh, so access time = cycle time
Main Memory Organizations
Simple:
  CPU, cache, bus, and memory are all the same width (32 or 64 bits)
Wide:
  CPU/mux is 1 word; mux/cache, bus, and memory are N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512 bits)
Interleaved:
  CPU, cache, and bus are 1 word; memory is N modules (4 modules here); the example is word-interleaved
Main Memory Configurations
Simple Main Memory
32-bit DRAM (1 word of data at a time)
Access time: 2 cycles (A)
Transfer time: 1 cycle (T)
Cycle time: 4 cycles (B = cycle time - access time)
Miss penalty for a 4-word block?
Simple Main Memory
(Timing diagram: per-cycle address and memory activity, showing access, transfer (T), and busy (B) phases for each of the four words)
4-word access = 15 cycles
4-word cycle = 16 cycles
How to improve?
  Lower latency? A, B, T are fixed
  Higher bandwidth?
Bandwidth: Wider DRAMs
(Timing diagram: same parameters with a 64-bit DRAM and bus, so only two accesses are needed)
64-bit DRAM instead
4-word access = 7 cycles
4-word cycle = 8 cycles
64-bit buses are more expensive (Pentium vs. 486)
Bandwidth: Interleaving/Banking
Use multiple DRAMs and exploit their aggregate bandwidth
Each DRAM is called a bank
M 32-bit banks
Word A in bank (A % M) at (A div M)
Simple interleaving: banks share address lines
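A minimal C illustration (mine, not from the slide) of this mapping for M = 4 word-interleaved banks, using the 4-word block starting at word address 12 that appears in the timing diagrams on the following slides:

#include <stdio.h>

#define M 4   /* number of word-interleaved 32-bit banks */

int main(void) {
    /* bank = A mod M, offset within bank = A div M */
    for (unsigned a = 12; a < 16; a++)
        printf("word %2u -> bank %u, offset %u\n", a, a % M, a / M);
    /* Words 12..15 land in banks 0..3, so a sequential 4-word block
     * can be accessed by all four banks in parallel. */
    return 0;
}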
Simple Interleaving
(Timing diagram: one address broadcast to all four banks, parallel access, then one word transferred per cycle over the bus)
4-word access = 6 cycles
4-word cycle = 4 cycles
Can start a new access in cycle 5
Overlap access with transfer (and still use a 32-bit bus!)
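As a sanity check on the cycle counts quoted for the simple, wide, and interleaved organizations, here is a small C sketch (my own reading of the diagrams, not taken from the slides) assuming A = 2, T = 1, a 4-cycle bank cycle time, and that back-to-back accesses to one bank must be a full cycle time apart:

#include <stdio.h>

enum { A = 2, T = 1, CYCLE = 4 };   /* access, transfer, bank cycle time (in bus cycles) */

/* One bank handling n back-to-back accesses (simple 32-bit or wide 64-bit memory). */
static int serial_access(int n) { return (n - 1) * CYCLE + A + T; }
static int serial_cycle(int n)  { return n * CYCLE; }

/* Simple interleaving, sequential block: banks access in parallel, then one word per cycle. */
static int interleaved_access(int words) { return A + words * T; }

int main(void) {
    printf("32-bit simple:      access=%2d  cycle=%2d\n", serial_access(4), serial_cycle(4));   /* 15, 16 */
    printf("64-bit wide:        access=%2d  cycle=%2d\n", serial_access(2), serial_cycle(2));   /*  7,  8 */
    printf("4-bank interleaved: access=%2d  cycle=%2d\n", interleaved_access(4), CYCLE);        /*  6,  4 */
    return 0;
}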
Complex Interleaving
Simple interleaving: banks share address lines
Complex interleaving: banks are independent
More expensive (separate address lines for each bank)
(Figure: simple interleaving with one shared address bus to banks B0-B3 vs. complex interleaving with a separate address per bank; the data bus is shared in both)
Complex Interleaving
(Timing diagram: independent banks, addresses 12-15 issued on successive cycles)
4-word access = 6 cycles
4-word cycle = 4 cycles
Same as simple interleaving
Simple Interleaving (Non-Sequential)
(Timing diagram: stride-3 word addresses 12, 15, 18, 21; the shared address lines limit overlap between accesses)
Non-sequential access, e.g. stride = 3
4-word access = 4-word cycle = 12 cycles
Complex Interleaving (Non-Sequential)
(Timing diagram: stride-3 word addresses 12, 15, 18, 21 issued to independent banks on successive cycles)
4-word access = 6 cycles
4-word cycle = 4 cycles
DMA (I/O), Multiprocessors are non-sequential
Want more banks than words in a cache line
Multiple cache misses in parallel (non-blocking caches)
Interleaving Problem
(Timing diagram: power-of-2 stride, e.g. word addresses 12, 20, ...; every access hits the same bank)
Power-of-2 strides are a problem: all addresses, same bank
4-word access = 15 cycles, 4-word cycle = 16 cycles
Solution: use a prime number of banks (e.g. 17)
Avoiding Bank Conflicts
Lots of banks
int x[256][512];
for (int j = 0; j < 512; j = j + 1)
    for (int i = 0; i < 256; i = i + 1)
        x[i][j] = 2 * x[i][j];   /* walks down a column: successive accesses are 512 words apart */
Even with 128 banks, since 512 is a multiple of 128, word accesses conflict
SW: loop interchange, or declare the array dimensions not a power of 2 (array padding)
HW: add more banks, or use a prime number of banks
  bank number = address mod number of banks
  address within bank = address / number of words in bank
  Modulo and divide on every memory access with a prime number of banks?
  Trick: address within bank = address mod number of words in bank
  Easy if there are 2^N words per bank (just the low-order address bits); the bank number is still address mod the prime number of banks
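A small C check (mine, not from the slides) of why this trick is safe: with a prime number of banks (17 here) and a power-of-2 number of words per bank, taking the bank number as address mod 17 and the within-bank index as just the low address bits still gives every address its own slot; this is the Chinese-remainder-theorem argument behind the shortcut.

#include <stdio.h>

#define BANKS          17    /* prime number of banks */
#define WORDS_PER_BANK 64    /* 2^6 words per bank */

int main(void) {
    static char used[BANKS][WORDS_PER_BANK];             /* zero-initialized */
    for (unsigned addr = 0; addr < BANKS * WORDS_PER_BANK; addr++) {
        unsigned bank   = addr % BANKS;                   /* mod with the prime */
        unsigned offset = addr & (WORDS_PER_BANK - 1);    /* low bits only, no divide */
        if (used[bank][offset]) { printf("collision at address %u\n", addr); return 1; }
        used[bank][offset] = 1;
    }
    printf("all %d addresses map to distinct (bank, offset) pairs\n", BANKS * WORDS_PER_BANK);
    return 0;
}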
Independent Memory Banks
How many banks?
Number of banks >= number of clocks to access a word in a bank
  Needed for sequential accesses; otherwise the request returns to the original bank before it has the next word ready
Increasing DRAM density => fewer chips => fewer banks
4 banks/chip * 1 rank = 4 total banks
16 banks/chip * 8 chips = 128 banks
Independent Memory Banks
DIMM (Dual-Inline Memory Module) configuration
Banking occurs at the chip, module, and system levels
1 Rank of devices responds to each access
All devices respond similarly
Single-Sided DIMM
4 banks per device => DIMM has 4 banks
512MB DIMM = 8x64Mx8, 4 Banks
RAMBUS (RDRAM)
Protocol-based RAM w/ a narrow (16-bit) bus
High clock rate (400 MHz), but long latency
Pipelined operation
Multiple arrays w/ data transferred on both edges of the clock
(Figures: RAMBUS bank; RDRAM memory system)
RDRAM Timing
Independent Memory Banks
Standard PC Upgrade Path
Traditional DIMMs => 8 devices at a time with 8-bit chips
Rambus RIMMs => one at a time
Successful markets: PlayStation 2 (high bandwidth, small memory)
Rambus: 400 MHz, 16 bits per channel, 2 bits per clock
  1.6 GB/sec per channel (only 1 chip needed)
  2 Rambus channels in parallel => 3.2 GB/sec memory bandwidth
Traditional: PC100 SDRAM: 100 MHz, 1 bit per clock
Would need 32 chips to achieve 3.2GB/sec bandwidth
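(Arithmetic behind these figures: 400 MHz x 2 transfers/clock x 16 bits = 1.6 GB/sec per RDRAM channel, so two channels give 3.2 GB/sec; a PC100 chip with an 8-bit interface delivers 100 MHz x 1 byte = 100 MB/sec, so 32 such chips are needed to reach 3.2 GB/sec.)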
Interleaving Summary
Banks
Method to get high bandwidth with cheap (narrow) bus
Bandwidth determines memory capacity
Hard to make many banks from narrow DIMMs
  32 64-bit banks from 1x64MB DRAMs => 2048 DIMMs => 4GB
  Can't force customers to buy so much memory to get good bandwidth
Must use wider DRAMs
RAMBUS does better with small memory systems (PS2)
Big servers have lots of memory so traditional banking works
Next Time
Multiprocessors