Lecture 2: Processor Design,
Single-Processor Performance
G63.2011.002/G22.2945.001 September 14, 2010
Outline
Intro
The Basic Subsystems
Machine Language
The Memory Hierarchy
Pipelines
Admin Bits
Lec. 1 slides posted
New here? Welcome!
Please send in survey info (see lec. 1 slides) via email
PASI
Please subscribe to mailing list
Near end of class: 5-min, 3-question concept check
Introduction
Goal for Today
High Performance Computing:
Discuss the actual computer end of this...
...and its influence on performance
What's in a computer?

Processor: Intel Q6600 Core2 Quad, 2.4 GHz
Die: (2×) 143 mm², 2×2 cores, 582,000,000 transistors, 100 W

Memory
The Basic Subsystems
A Basic Processor

[Block diagram, loosely based on the Intel 8086: Memory Interface with Address Bus and Data Bus, Address ALU, Register File, Flags, Internal Bus, Instruction Fetch, PC, Control Unit, Data ALU.]
A Basic Processor

(Same block diagram.) Bonus question: What's a bus?
How all of this fits together

Everything synchronizes to the Clock.

Control Unit (CU): the "brains" of the operation. Everything connects to it.
Bus entries/exits are gated and (potentially) buffered.
The CU controls the gates and tells the other units what to do and how:
  What operation?
  Which register?
  Which addressing mode?
What is... an ALU?

Arithmetic Logic Unit
One or two operands A, B
Operation selector (Op):
  (Integer) Addition, Subtraction
  (Logical) And, Or, Not
  (Bitwise) Shifts (equivalent to multiplication by a power of two)
  (Integer) Multiplication, Division
Specialized ALUs:
  Floating Point Unit (FPU)
  Address ALU
Operates on binary representations of numbers. Negative numbers are represented by two's complement.
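To make the shift-as-multiplication and two's-complement points concrete, here is a minimal C sketch (the values are illustrative, not from the lecture):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
  int32_t x = 5;

  /* A left shift by k multiplies by 2^k: 5 << 3 == 5 * 8 == 40. */
  printf("5 << 3 = %d\n", x << 3);

  /* Two's complement: -x is obtained by inverting all bits and adding 1. */
  int32_t minus_x = ~x + 1;
  printf("~5 + 1 = %d\n", minus_x);              /* prints -5 */
  printf("bits of -5: 0x%08x\n", (uint32_t)-5);  /* 0xfffffffb */

  return 0;
}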
What is... a Register File?

Registers are on-chip memory
Directly usable as operands in machine language
Often general-purpose
Sometimes special-purpose: Floating point, Indexing, Accumulator
Small: x86-64 has 16 64-bit GPRs
Very fast (near-zero latency)

[Figure: a register file %r0 through %r7.]
How does computer memory work?

One (reading) memory transaction (simplified):

[Animated timing diagram: Processor and Memory connected by data lines D0..15, address lines A0..15, a R/W line, and the clock CLK.]

Observation: Access (and addressing) happens in bus-width-size chunks.
What is... a Memory Interface?

The Memory Interface gets and stores binary words in off-chip memory.
Smallest granularity: bus width
Tells the outside memory
  where (through the address bus)
  what (through the data bus)
Computer main memory is Dynamic RAM (DRAM): slow, but small (per bit) and cheap.
Machine Language
A Very Simple Program

int a = 5;
int b = 17;
int z = a * b;

  4:  c7 45 f4 05 00 00 00   movl   $0x5,-0xc(%rbp)
  b:  c7 45 f8 11 00 00 00   movl   $0x11,-0x8(%rbp)
 12:  8b 45 f4               mov    -0xc(%rbp),%eax
 15:  0f af 45 f8            imul   -0x8(%rbp),%eax
 19:  89 45 fc               mov    %eax,-0x4(%rbp)
 1c:  8b 45 fc               mov    -0x4(%rbp),%eax

Things to know:
Addressing modes (Immediate, Register, Base plus Offset)
0x... means hexadecimal
AT&T Form (we'll use this): <opcode><size> <source>, <dest>
Another Look

[Same processor block diagram.]
Another Look

[Same block diagram, annotated with the disassembly of the simple program above.]
A Very Simple Program: Intel Form

  4:  c7 45 f4 05 00 00 00   mov    DWORD PTR [rbp-0xc],0x5
  b:  c7 45 f8 11 00 00 00   mov    DWORD PTR [rbp-0x8],0x11
 12:  8b 45 f4               mov    eax,DWORD PTR [rbp-0xc]
 15:  0f af 45 f8            imul   eax,DWORD PTR [rbp-0x8]
 19:  89 45 fc               mov    DWORD PTR [rbp-0x4],eax
 1c:  8b 45 fc               mov    eax,DWORD PTR [rbp-0x4]

Intel Form (you might see this on the net):
<opcode> <sized dest>, <sized source>

Goal: Reading comprehension.
Don't understand an opcode? Google "<opcode> intel instruction".
Machine Language Loops

int main()
{
  int y = 0, i;
  for (i = 0; y < 10; ++i)
    y += i;
  return y;
}

  0:  55                      push   %rbp
  1:  48 89 e5                mov    %rsp,%rbp
  4:  c7 45 f8 00 00 00 00    movl   $0x0,-0x8(%rbp)
  b:  c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
 12:  eb 0a                   jmp    1e <main+0x1e>
 14:  8b 45 fc                mov    -0x4(%rbp),%eax
 17:  01 45 f8                add    %eax,-0x8(%rbp)
 1a:  83 45 fc 01             addl   $0x1,-0x4(%rbp)
 1e:  83 7d f8 09             cmpl   $0x9,-0x8(%rbp)
 22:  7e f0                   jle    14 <main+0x14>
 24:  8b 45 f8                mov    -0x8(%rbp),%eax
 27:  c9                      leaveq
 28:  c3                      retq

Things to know:
Condition Codes (Flags): Zero, Sign, Carry, etc.
Call Stack: Stack frame, stack pointer, base pointer
ABI: Calling conventions
Want to make those yourself? Write myprogram.c, then:

$ cc -c myprogram.c
$ objdump --disassemble myprogram.o
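(If you prefer the Intel syntax from the earlier slide, GNU objdump also accepts -M intel, e.g. objdump --disassemble -M intel myprogram.o.)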
We know how a computer works!
All of this can be built in about 4000 transistors.
(e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600)
So what exactly is Intel doing with the other 581,996,000
transistors?
Answer: Make things go faster!
Goal now:
Understand sources of slowness, and how they get addressed.
Remember: High Performance Computing
The High-Performance Mindset

Writing high-performance codes

Mindset: What is going to be the limiting factor?
  ALU?
  Memory?
  Communication? (if multi-machine)
Benchmark the assumed limiting factor right away (a minimal timing sketch follows below).
Evaluate:
  Know your peak throughputs (roughly)
  Are you getting close?
  Are you tracking the right limiting factor?
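As a concrete starting point for "benchmark right away", here is a minimal timing sketch in C; the kernel, problem size, and the 3-doubles-per-element traffic estimate are illustrative assumptions, not part of the lecture:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical memory-bound kernel whose throughput we want to check. */
static void axpy(int n, double a, const double *x, double *y)
{
  for (int i = 0; i < n; ++i)
    y[i] += a * x[i];
}

int main(void)
{
  const int n = 1 << 24;                 /* 16 M doubles */
  double *x = malloc(n * sizeof *x);
  double *y = malloc(n * sizeof *y);
  for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  axpy(n, 3.0, x, y);
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
  /* Roughly 3 doubles of traffic per element (read x, read and write y). */
  printf("%.3f s, ~%.1f GB/s\n", secs, 3.0 * n * sizeof(double) / secs / 1e9);

  free(x); free(y);
  return 0;
}

Compare the printed number against the machine's theoretical memory bandwidth to see whether memory really is the limiting factor.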
The Memory Hierarchy
Source of Slowness: Memory

Memory is slow.

Distinguish two different versions of "slow":
  Bandwidth
  Latency
Memory has long latency, but can have large bandwidth.

Size of die vs. distance to memory: big!
Dynamic RAM: long intrinsic latency!

Idea: Put a look-up table of recently-used data onto the chip. ⇒ A Cache
The Memory Hierarchy

Hierarchy of increasingly bigger, slower memories:

  Registers:                    1 kB,  1 cycle
  L1 Cache:                    10 kB,  10 cycles
  L2 Cache:                     1 MB,  100 cycles
  DRAM:                         1 GB,  1000 cycles
  Virtual Memory (hard drive):  1 TB,  1 M cycles

How might data locality factor into this?
What is a working set?
Cache: Actual Implementation

Demands on cache implementation:
  Fast, small, cheap, low-power
  Fine-grained
  High hit-rate (few misses)

Main Memory            Cache Memory
Index  Data            Index  Tag  Data
0      xyz             0      2    abc
1      pdq             1      0    xyz
2      abc
3      rgf

Problem: Goals at odds with each other: access matching logic is expensive!
Solution 1: More data per unit of access matching logic ⇒ Larger Cache Lines
Solution 2: Simpler/less access matching logic ⇒ Less than full Associativity
Other choices: Eviction strategy, size
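To make the tag/index bookkeeping concrete, here is a minimal sketch of a direct-mapped cache lookup in C; the line size, set count, and names are illustrative, not those of any particular CPU:

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64   /* cache line size        */
#define NUM_SETS   64   /* number of sets (lines) */

typedef struct {
  bool     valid;
  uint64_t tag;
  uint8_t  data[LINE_BYTES];
} cache_line_t;

cache_line_t cache[NUM_SETS];

/* Split an address into offset-within-line, set index, and tag,
 * then check whether the addressed line is currently present. */
bool cache_lookup(uint64_t addr)
{
  uint64_t line_addr = addr / LINE_BYTES;     /* drop the offset bits */
  uint64_t index     = line_addr % NUM_SETS;  /* which set to look in */
  uint64_t tag       = line_addr / NUM_SETS;  /* what must match      */

  return cache[index].valid && cache[index].tag == tag;
}

A k-way set-associative cache simply keeps k such lines per index and compares the tag against all of them.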
Cache: Associativity

[Diagram: in a direct-mapped cache, each memory location maps to exactly one cache location; in a 2-way set associative cache, each memory location may go to either of two cache locations in its set.]
Cache: Associativity

[Plot: miss rate versus cache size (1K to 1M, and infinite) for direct-mapped, 2-way, 4-way, 8-way, and fully associative caches.]

Miss rate versus cache size on the Integer portion of SPEC CPU2000 [Cantin, Hill 2003]
Cache Example: Intel Q6600/Core2 Quad

--- L1 data cache ---
fully associative cache     = false
threads sharing this cache  = 0x0 (0)
processor cores on this die = 0x3 (3)
system coherency line size  = 0x3f (63)
ways of associativity       = 0x7 (7)
number of sets - 1 (s)      = 63

--- L1 instruction cache ---
fully associative cache     = false
threads sharing this cache  = 0x0 (0)
processor cores on this die = 0x3 (3)
system coherency line size  = 0x3f (63)
ways of associativity       = 0x7 (7)
number of sets - 1 (s)      = 63

--- L2 unified cache ---
fully associative cache     = false
threads sharing this cache  = 0x1 (1)
processor cores on this die = 0x3 (3)
system coherency line size  = 0x3f (63)
ways of associativity       = 0xf (15)
number of sets - 1 (s)      = 4095

More than you care to know about your CPU:
http://www.etallen.com/cpuid.html
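A quick sanity check on these numbers (assuming, as for cpuid leaf 4, that line size, ways, and sets are reported as "value minus 1"): the L1 data cache holds (63+1) bytes/line × (7+1) ways × (63+1) sets = 32 kB, and the L2 holds 64 × 16 × 4096 = 4 MB, matching the Q6600's advertised 32 kB L1 caches and 4 MB of L2 per pair of cores.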
Measuring the Cache I

void go(unsigned count, unsigned stride)
{
  const unsigned arr_size = 64 * 1024 * 1024;
  int *ary = (int *) malloc(sizeof(int) * arr_size);

  for (unsigned it = 0; it < count; ++it)
  {
    for (unsigned i = 0; i < arr_size; i += stride)
      ary[i] *= 17;
  }

  free(ary);
}
[Plot: run time (roughly 0.02 s to 0.16 s) versus stride (2^0 to 2^10) for the code above.]
Measuring the Cache II

void go(unsigned array_size, unsigned steps)
{
  int *ary = (int *) malloc(sizeof(int) * array_size);
  unsigned asm1 = array_size - 1;

  for (unsigned i = 0; i < steps; ++i)
    ary[(i * 16) & asm1]++;

  free(ary);
}
[Plot: effective bandwidth [GB/s] versus array size [bytes] (2^12 to 2^26) for the code above.]
Measuring the Cache III

void go(unsigned array_size, unsigned stride, unsigned steps)
{
  char *ary = (char *) malloc(sizeof(int) * array_size);

  unsigned p = 0;
  for (unsigned i = 0; i < steps; ++i)
  {
    ary[p]++;
    p += stride;
    if (p >= array_size)
      p = 0;
  }

  free(ary);
}
[Plot: array size [MB] (roughly 10 to 20) versus stride [bytes] (100 to 600) for the code above.]
Programming for the Cache

How can we rearrange programs to be cache-friendly?

Examples:
  Large vectors x, a, b
  Compute x ← x + 3a − 5b
  Matrix-Matrix Multiplication

Homework 1, posted.
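For the vector example, one cache-friendly arrangement is to make a single pass over the data instead of forming 3a and 5b in separate passes; a minimal sketch (names and types are illustrative):

/* One pass: each element of x, a, b is touched exactly once per update,
 * so the data streams through the cache instead of being re-read. */
void update(int n, double *x, const double *a, const double *b)
{
  for (int i = 0; i < n; ++i)
    x[i] = x[i] + 3.0 * a[i] - 5.0 * b[i];
}

Doing the same computation as several separate whole-vector operations (t = 3a, u = 5b, then x = x + t − u) would read and write the large vectors multiple times and spend the savings on memory bandwidth.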
Pipelines
Source of Slowness: Sequential Operation

IF   Instruction Fetch
ID   Instruction Decode
EX   Execution
MEM  Memory Read/Write
WB   Result Writeback
Solution: Pipelining

[Diagram of a pipelined five-stage processor (MIPS, 110,000 transistors).]
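A quick way to see the payoff: with the five stages above and one instruction entering per cycle, a single instruction still takes five cycles to complete, but once the pipeline is full one instruction finishes every cycle, so N instructions take roughly N + 4 cycles instead of 5N (ignoring stalls, which the next slide is about).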
Issues with Pipelines

Pipelines generally help performance, but not always.

Possible issues:
  Stalls
  Dependent Instructions
  Branches (+ Prediction)
  Self-Modifying Code

Solution: Bubbling, extra circuitry

[Diagram: waiting instructions flowing through a four-stage pipeline (Fetch, Decode, Execute, Write-back) over successive clock cycles, with completed instructions leaving at the other end.]
Intel Q6600 Pipeline

[Block diagram of the Q6600 pipeline: instruction fetch over a 128-bit path into a 32-byte pre-decode/fetch buffer (6 instructions), an 18-entry instruction queue, one complex decoder (4 µops, backed by microcode) plus three simple decoders (1 µop each), a 7+-entry µop buffer, the register alias table and allocator, a 96-entry reorder buffer (ROB) with retirement register file (program-visible state), and a 32-entry reservation station issuing to Ports 0-5: integer ALUs, SSE/shuffle units, 128-bit FMUL/FDIV and FADD, branch, load address, store address, and store data, plus a memory ordering buffer (MOB) with 128-bit load/store paths to the memory interface.]
Intel Q6600 Pipeline (cont'd)

New concept: Instruction-level parallelism ("superscalar")

[Same pipeline diagram as above.]
Programming for the Pipeline

How to upset a processor pipeline:

for (int i = 0; i < 1000; ++i)
  for (int j = 0; j < 1000; ++j)
  {
    if (j % 2 == 0)
      do_something(i, j);
  }

...why is this bad?
A Puzzle

int steps = 256 * 1024 * 1024;
int[] a = new int[2];

// Loop 1
for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }

Which is faster? ...and why?
Two useful Strategies

Loop unrolling:

for (int i = 0; i < 1000; ++i)
  do_something(i);

⇒

for (int i = 0; i < 1000; i += 2)
{
  do_something(i);
  do_something(i+1);
}

Software pipelining:

for (int i = 0; i < 1000; ++i)
{
  do_a(i);
  do_b(i);
}

⇒

for (int i = 0; i < 1000; i += 2)
{
  do_a(i);
  do_a(i+1);
  do_b(i);
  do_b(i+1);
}
SIMD

Control Units are large and expensive.
Functional Units are simple and cheap.

⇒ Increase the Function/Control ratio: control several functional units with one control unit. All execute the same operation.

[Diagram: one instruction pool feeding several processing units (PUs), each drawing from a shared data pool.]

GCC vector extensions:

typedef int v4si __attribute__ ((vector_size (16)));

v4si a, b, c;
c = a + b;
// +, -, *, /, unary minus, ^, |, &, ~, %

Will revisit for OpenCL, GPUs.
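As a small usage sketch of these extensions (assuming GCC or Clang; the function, array length, and the memcpy-based loads are illustrative choices, not from the lecture):

#include <stdio.h>
#include <string.h>

typedef int v4si __attribute__ ((vector_size (16)));

/* Add two int arrays four elements at a time using the vector type.
 * n is assumed to be a multiple of 4 for simplicity. */
void vec_add(int n, const int *a, const int *b, int *out)
{
  for (int i = 0; i < n; i += 4)
  {
    v4si va, vb, vc;
    memcpy(&va, a + i, sizeof va);   /* unaligned-safe loads */
    memcpy(&vb, b + i, sizeof vb);
    vc = va + vb;                    /* one SIMD addition    */
    memcpy(out + i, &vc, sizeof vc);
  }
}

int main(void)
{
  int a[8] = {1,2,3,4,5,6,7,8}, b[8] = {10,20,30,40,50,60,70,80}, c[8];
  vec_add(8, a, b, c);
  for (int i = 0; i < 8; ++i) printf("%d ", c[i]);
  printf("\n");
  return 0;
}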
About HW1
Open-ended! Want: coherent thought about caches, memory
access ordering, latencies, bandwidths, etc.
See Berkeley CS267 (lec. 3, ...) for more hints.
Also: Introduction to the machinery.
(Clusters, running on them, SSH, git, forge)
Why so much machinery?
Linux lab machines: WWH 229/230
Rearranging Matrix-Matrix Multiplication

Matrix Multiplication:

    C_ij = Σ_k A_ik B_kj

[Sequence of diagrams illustrating different ways to traverse A, B, and C while computing the product.]
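As a minimal sketch of the kind of rearrangement involved (a simple blocked version; the block size, loop order, and names are illustrative assumptions, not the homework's required solution):

/* Naive triple loop: C += A * B, all matrices n-by-n, row-major. */
void matmul_naive(int n, const double *A, const double *B, double *C)
{
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      for (int k = 0; k < n; ++k)
        C[i*n + j] += A[i*n + k] * B[k*n + j];
}

#define BS 64  /* block (tile) size, chosen so a few tiles fit in cache */

static int min(int a, int b) { return a < b ? a : b; }

/* Blocked version: work on BS-by-BS tiles so each tile of A, B, C
 * stays in cache and is reused many times before being evicted. */
void matmul_blocked(int n, const double *A, const double *B, double *C)
{
  for (int ii = 0; ii < n; ii += BS)
    for (int jj = 0; jj < n; jj += BS)
      for (int kk = 0; kk < n; kk += BS)
        for (int i = ii; i < min(ii + BS, n); ++i)
          for (int k = kk; k < min(kk + BS, n); ++k)
            for (int j = jj; j < min(jj + BS, n); ++j)
              C[i*n + j] += A[i*n + k] * B[k*n + j];
}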
Questions?
Image Credits
Q6600: Wikimedia Commons
Mainboard: Wikimedia Commons
DIMM: sxc.hu/gobran11
Q6600 back: Wikimedia Commons
Core 2 die: Intel Corp. / lesliewong.us
Basic cache: Wikipedia
Cache associativity: based on Wikipedia
Cache associativity vs miss rate: Wikipedia
Q6600: Wikimedia Commons
Cache Measurements: Igor Ostrovsky
Pipeline stuff: Wikipedia
Bubbly Pipeline: Wikipedia
Q6600 Pipeline: Wikipedia
SIMD concept picture: Wikipedia