Processor Architecture
Victor Eijkhout
Fall 2023
1 Eijkhout – Processor Architecture – Fall 2023
Justification
The performance of a parallel code has as one component the
behaviour of the single processor or single-threaded code. In this
section we discuss the basics of how a processor executes
instructions, and how it handles the data these instructions operate on.
2 Eijkhout – Processor Architecture – Fall 2023
Structure of a modern processor
3 Eijkhout – Processor Architecture – Fall 2023
Von Neumann machine
The ideal processor:
(Stored program)
An instruction contains the operation and two operand locations
Processor decodes instruction, gets operands, computes and
writes back the result
Repeat
4 Eijkhout – Processor Architecture – Fall 2023
The actual state of affairs
Single instruction stream versus multiple cores / floating point
units
Single instruction stream versus Instruction Level Parallelism
Unit-time-addressable memory versus large latencies
Modern processors contain lots of magic to make them seem like Von
Neumann machines.
5 Eijkhout – Processor Architecture – Fall 2023
Complexity measures
Traditional: processor speed was paramount. Operation counting.
Nowadays: memory is slower than processors
This course
Study data movement aspects
Algorithm design for processor reality
6 Eijkhout – Processor Architecture – Fall 2023
A first look at a processor
7 Eijkhout – Processor Architecture – Fall 2023
Structure of a core
8 Eijkhout – Processor Architecture – Fall 2023
Motivation for pipelining
An operation consists of several stages.
Addition:
Decoding the instruction operands.
Data fetch into register
Aligning the exponents:
.35 × 10⁻¹ + .6 × 10⁻² becomes .35 × 10⁻¹ + .06 × 10⁻¹
Adding mantissas, giving .41.
Normalizing the result, giving .41 × 10⁻¹.
Storing the result.
⇒ these are the pipeline stages
9 Eijkhout – Processor Architecture – Fall 2023
Pipelining, pictorially
Discrete hardware for each stage:
10 Eijkhout – Processor Architecture – Fall 2023
Analysis
Operation timing:
n operations
ℓ number of stages ⇒ t (n) = nℓτ
τ clock cycle
With pipelining:
t (n) = [s + ℓ + n − 1]τ
where s is a setup cost
⇒ Asymptotic speedup is ℓ
n₁/₂: the value of n for which the speedup is ℓ/2
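A worked instance (assuming ℓ = 6 stages, n = 100 operations, s = 0):
without pipelining: t(100) = 100 · 6 · τ = 600τ
with pipelining: t(100) = (6 + 100 − 1)τ = 105τ
speedup ≈ 5.7, approaching the asymptotic value ℓ = 6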
11 Eijkhout – Processor Architecture – Fall 2023
Applicability of pipelining
Pipelining works for:
vector addition/multiplication
Division/square root maybe pipelined, but much slower
12 Eijkhout – Processor Architecture – Fall 2023
Recurrences
Pipelining does not immediately work:
for (i) {
x[i+1] = a[i]*x[i] + b[i];
}
Transform:
x_{n+2} = a_{n+1} x_{n+1} + b_{n+1}
        = a_{n+1} (a_n x_n + b_n) + b_{n+1}
        = a_{n+1} a_n x_n + a_{n+1} b_n + b_{n+1}
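A minimal sketch of the transformed loop in C (assuming the arrays from the loop
above and an even loop bound n): both statements in the body depend only on x[i],
so they are independent of each other and can be pipelined together.
for (i=0; i<n-1; i+=2) {
  x[i+1] = a[i]*x[i] + b[i];                            /* uses only x[i] */
  x[i+2] = (a[i+1]*a[i])*x[i] + a[i+1]*b[i] + b[i+1];   /* also only x[i] */
}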
13 Eijkhout – Processor Architecture – Fall 2023
Instruction pipeline
Instruction-Level Parallelism: more general notion of independent
instructions
Requires independent instructions
As frequency goes up, pipeline gets longer: more demands on
compiler
14 Eijkhout – Processor Architecture – Fall 2023
Instruction-Level Parallelism
multiple-issue of independent instructions
branch prediction and speculative execution
out-of-order execution
prefetching
Problems: complicated circuitry, hard to maintain performance
15 Eijkhout – Processor Architecture – Fall 2023
Implications
Long pipeline needs many independent instructions:
demands on compiler
Conditionals break the stream of independent instructions
Processor tries to predict branches
branch misprediction penalty:
pipeline needs to be flushed and refilled
avoid conditionals in inner loops!
16 Eijkhout – Processor Architecture – Fall 2023
Instructions
Addition/multiplication: pipelined
Division (and square root): much slower
for ( i )
a[i] = b[i] / c
Can you improve on this?
Fused Multiply-Add (FMA) s += a*b
where can you use this?
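A hedged sketch of one answer (n, and the vectors x, y in the second loop, are
illustrative): replace the repeated division by one reciprocal and pipelined
multiplies; an inner product maps each iteration onto a single FMA.
double cinv = 1./c;            // one division, done once
for (i=0; i<n; i++)
  a[i] = b[i] * cinv;          // pipelined multiplies; rounding can differ
                               //   slightly from true division
double s = 0.;
for (i=0; i<n; i++)
  s += x[i]*y[i];              // each iteration is one fused multiply-add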
17 Eijkhout – Processor Architecture – Fall 2023
Peak performance
Performance is a function of
Clock frequency,
SIMD width
Load/store unit behavior
Floating point capabilities of several processor architectures
DAXPY cycle number for 8 operands
Processor            year   add/mult/fma units    daxpy cycles
                            (count×width)         (arith vs load/store)
MIPS R10000          1996   1×1 + 1×1 + 0         8/24
Alpha EV5            1996   1×1 + 1×1 + 0         8/12
IBM Power5           2004   0 + 0 + 2×1           4/12
AMD Bulldozer        2011   2×2 + 2×2 + 0         2/4
Intel Sandy Bridge   2012   1×4 + 1×4 + 0         2/4
Intel Haswell        2014   0 + 0 + 2×4           1/2
18 Eijkhout – Processor Architecture – Fall 2023
Memory hierarchy: caches, registers, TLB.
19 Eijkhout – Processor Architecture – Fall 2023
The Big Story
DRAM memory is slow, so let’s put small SRAM close to the
processor
This helps if data is reused
Does the algorithm have reuse?
Does the implementation reuse data?
20 Eijkhout – Processor Architecture – Fall 2023
Bandwidth and latency
Important theoretical concept:
latency is delay between request for data and availability
bandwidth is rate at which data arrives thereafter
21 Eijkhout – Processor Architecture – Fall 2023
Memory hierarchy
22 Eijkhout – Processor Architecture – Fall 2023
Registers
23 Eijkhout – Processor Architecture – Fall 2023
Computing out of registers
a := b + c
load the value of b from memory into a register,
load the value of c from memory into another register,
compute the sum and write that into yet another register, and
write the sum value back to the memory location of a.
24 Eijkhout – Processor Architecture – Fall 2023
Register usage
Assembly code
(note: two-operand x86 instruction, in AT&T assembler syntax)
addl %eax, %edx
Registers are named
Can be explicitly addressed by the programmer
. . . as opposed to caches.
Assembly coding or inline assembly (compiler dependent)
. . . but typically generated by compiler
25 Eijkhout – Processor Architecture – Fall 2023
Examples of register usage
1. Resident in register
a := b + c
d := a + e
a stays resident in register, avoid store and load
2. subexpression elimination:
t1 = sin(alpha) * x + cos(alpha) * y;
t2 = -cos(alpha) * x + sin(alpha) * y;
becomes:
s = sin(alpha); c = cos(alpha);
t1 = s * x + c * y;
t2 = -c * x + s * y;
often done by compiler
26 Eijkhout – Processor Architecture – Fall 2023
Caches
27 Eijkhout – Processor Architecture – Fall 2023
Cache basics
Fast SRAM in between memory and registers: mostly serves data
reuse
... = ... x ..... // instruction using x
......... // several instructions not involving x
... = ... x ..... // instruction using x
load x from memory into cache, and from cache into register;
operate on it;
do the intervening instructions;
request x from memory, but since it is still in the cache, load it
from the cache into register; operate on it.
essential concept: data reuse
28 Eijkhout – Processor Architecture – Fall 2023
Cache levels
Levels 1,2,3(,4): L1, L2, etc.
Increasing size, increasing latency, decreasing bandwidth
(Note: L3/L4 can be fairly big; beware benchmarking)
Cache hit / cache miss: one level is consulted, then the next
L1 has separate data / instruction cache, other levels mixed
Caches do not have enough bandwidth to serve the processor:
coding for reuse on all levels.
29 Eijkhout – Processor Architecture – Fall 2023
Cache misses
Compulsory miss: first time data is referenced
Capacity miss: data was in cache, but has been flushed
(overwritten) by LRU policy
Conflict miss: two items get mapped to the same cache location,
even if there are no capacity problems
Invalidation miss: data becomes invalid because of activity of
another core
30 Eijkhout – Processor Architecture – Fall 2023
Cache hits
Data has been requested, used a second time: temporal locality
⇒ Can’t wait too long between uses
(Data can be loaded because it’s close to data requested: spatial
locality. Later.)
31 Eijkhout – Processor Architecture – Fall 2023
Capacity miss
(Why is that last block going where it is going?)
32 Eijkhout – Processor Architecture – Fall 2023
Cache capacity
Loading data multiple times
LRU: oldest item evicted if needed
Reuse if not too much data
for ( lots of times ) // sequential loop
load and process data // probably parallel loop
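As a concrete sketch (array name and sizes illustrative): if the n doubles of x
fit in cache, only the first pass pays for the memory traffic.
for (int k=0; k<K; k++)          // sequential outer loop: K passes
  for (int i=0; i<n; i++)        // each pass touches all of x
    x[i] = 2.3*x[i] + 1.2;       // passes 2..K find x in cache if n*8 bytes fit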
33 Eijkhout – Processor Architecture – Fall 2023
Illustration of capacity
[figure: cache miss fraction and cycles per operation as a function of dataset size]
34 Eijkhout – Processor Architecture – Fall 2023
Replacement policies
What determines where new data goes /
what old data is overwritten?
Least Recently Used (LRU): most common
First-In / First-Out (FIFO): IBM Power4. Not a good idea.
Random Replacement. Sometimes used.
It’s actually more subtle than pure LRU . . .
35 Eijkhout – Processor Architecture – Fall 2023
Cache lines
Memory requests go by byte or word
Memory transfers go by cache line:
typically 64 bytes / 8 double precision numbers
Cache line transfer costs bandwidth
⇒ important to use all elements
36 Eijkhout – Processor Architecture – Fall 2023
Effects of striding
Always 8 numbers transferred
With stride s > 1: 8/s elements used
Loss of efficiency if bandwidth-limited
37 Eijkhout – Processor Architecture – Fall 2023
Cache line use
for (i=0; i<N; i++)
... = ... x[i] ...
for (i=0; i<N; i+=stride)
... = ... x[i] ...
38 Eijkhout – Processor Architecture – Fall 2023
Stride effects
for (i=0,n=0; i<L1WORDS; i++,n+=stride)
array[n] = 2.3*array[n]+1.2;
[figure: cache line utilization and total kcycles as a function of stride]
39 Eijkhout – Processor Architecture – Fall 2023
Cache mapping
Cache is smaller than memory, so we need a mapping scheme
memory address ↦ cache address
Ideal: any address can go anywhere; LRU policy for replacement
pro: optimal; con: slow, expensive to manufacture
Simple: direct mapping by truncating addresses
pro: fast and cheap; con: I’ll show you in a minute
Practical: limited associativity
‘enough but not too much’
40 Eijkhout – Processor Architecture – Fall 2023
Direct mapping
Direct mapping of 32-bit addresses into a 64K cache
Use last number of bits to find cache address
If you traverse an array, a contiguous chunk will be mapped to
cache without conflict.
If (memory) addresses are cache size apart, they get mapped to
the same cache location
41 Eijkhout – Processor Architecture – Fall 2023
Conflicts
Mapping conflicts in a direct-mapped cache.
42 Eijkhout – Processor Architecture – Fall 2023
The problem with direct mapping
real*8 A(8192,3);
do i=1,512
a(i,3) = ( a(i,1)+a(i,2) )/2
end do
In each iteration 3 elements map to the same cache location:
constant overwriting (‘eviction’, cache thrashing):
low performance
43 Eijkhout – Processor Architecture – Fall 2023
Associative cache mapping
Allow each memory address to go to multiple (but not all) cache
addresses; typically 2,4,8
Prevents problems with multiple arrays
Reasonably fast
Often lower associativity for L1 than L2, L3
Associativity         L1   L2
Intel (Woodcrest)      8    8
AMD (Bulldozer)        2    8
44 Eijkhout – Processor Architecture – Fall 2023
Associativity
Associative cache structure
45 Eijkhout – Processor Architecture – Fall 2023
Illustration of associativity
Two caches of 12 elements: direct mapped (left) and 3-way
associative (right)
Direct map: addresses 0 and 12 conflict
Associative: no conflict
46 Eijkhout – Processor Architecture – Fall 2023
Associativity in practice
∀j : y_j = y_j + ∑_{i=1}^{m} x_{i,j}
47 Eijkhout – Processor Architecture – Fall 2023
[figure: the number of L1 cache misses and the number of cycles, for each j]
One remedy
Do not use powers of 2.
[figure: the number of L1 cache misses and the number of cycles for each j;
column accumulation, vector length 4096 + 8]
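A sketch of the remedy in C (the names, M, and the padding of 8 are
illustrative): padding the vector length so that the accumulation terms are no
longer a power of two apart removes the conflict misses.
#define N   4096
#define LD  (N+8)                    /* padded length: not a power of two  */
#define M   8
static double x[M][LD], y[N];
for (int j=0; j<N; j++)
  for (int i=0; i<M; i++)
    y[j] += x[i][j];                 /* the M terms are LD*8 bytes apart,  */
                                     /*   no longer hitting one cache set  */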
48 Eijkhout – Processor Architecture – Fall 2023
Exercise
Write a small cache simulator in your favorite language. Assume a
k-way associative cache of 32 entries and an architecture with 16-bit
addresses.
49 Eijkhout – Processor Architecture – Fall 2023
Exercise: vectorsum
Compare sequential performance to single-threaded OMP
For some problem sizes observe a difference in performance
Use Intel option -qopt-report=3 and inspect the report.
Compare different compilers: Intel 19 behaves differently from Intel 24!
Also try gcc 13.
for ( int iloop=0; iloop<nloops; ++iloop ) {
for ( int i=0; i<vectorsize; ++i ) {
outvec[i] += invec[i]*loopcoeff[iloop];
}
}
Analyze and report
50 Eijkhout – Processor Architecture – Fall 2023
More memory system topics
51 Eijkhout – Processor Architecture – Fall 2023
Bandwidth / latency
Simple model for sending n words:
t = α + βn
Quoted bandwidth figures are always optimistic:
bandwidth shared between cores
not enough bandwidth for all cores:
⇒ speedup less than linear
bandwidth wasted on coherence
NUMA: pulling data from other socket
assumes optimal scheduling of DRAM banks
52 Eijkhout – Processor Architecture – Fall 2023
Prefetch
Do you have to wait for every item from memory?
Memory controller can infer streams: prefetch
Sometimes controllable through assembly, directives, libraries
(AltiVec)
One form of latency hiding
53 Eijkhout – Processor Architecture – Fall 2023
Memory pages
Memory is organized in pages:
Translation between the logical address, as used by the program,
and the physical address in memory
This serves virtual memory and relocatable code
so we need another translation stage.
54 Eijkhout – Processor Architecture – Fall 2023
Page translation: TLB
General page translation: slowish and expensive
Translation Look-aside Buffer (TLB) is a small list of frequently
used pages
Example of spatial locality: items on an already referenced page
are found faster
55 Eijkhout – Processor Architecture – Fall 2023
TLB misses
#define INDEX(i,j,m,n) i+j*m
array = (double*) malloc(m*n*sizeof(double));
/* traversal #2 */
for (i=0; i<m; i++)
for (j=0; j<n; j++)
array[INDEX(i,j,m,n)] = array[INDEX(i,j,m,n)]+1;
56 Eijkhout – Processor Architecture – Fall 2023
TLB hits
#define INDEX(i,j,m,n) i+j*m
array = (double*) malloc(m*n*sizeof(double));
/* traversal #1 */
for (j=0; j<n; j++)
for (i=0; i<m; i++)
array[INDEX(i,j,m,n)] = array[INDEX(i,j,m,n)]+1;
57 Eijkhout – Processor Architecture – Fall 2023
Little’s Law
Item loaded from memory, processed, new item loaded in
response
But this can only happen after latency wait
Items during latency are independent, therefore
Concurrency = Bandwidth × Latency.
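A worked instance, using the latency and bandwidth figures quoted later in this
lecture (80 ns, 51 GB/s):
Concurrency = 51 GB/s × 80 ns ≈ 4080 bytes ≈ 64 cache lines of 64 bytes,
so roughly 64 memory requests need to be in flight to keep the bandwidth busy.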
58 Eijkhout – Processor Architecture – Fall 2023
Multicore issues
59 Eijkhout – Processor Architecture – Fall 2023
Why multicore
Quest for higher performance:
Not enough instruction parallelism for long pipelines
Two cores at half speed more energy-efficient than one at full
speed.
Multicore solution:
More theoretical performance
Burden for parallelism is now on the programmer
60 Eijkhout – Processor Architecture – Fall 2023
Dennard scaling
Scale down feature size by s:
Feature size ∼s
Voltage ∼s
Current ∼s
Frequency ∼ s⁻¹
Miracle conclusion:
Power = V · I ∼ s², power density ∼ 1
Everything gets better, cooling problem stays the same
Opportunity for more components, higher frequency
61 Eijkhout – Processor Architecture – Fall 2023
Dynamic power
Charge q = CV
Work W = qV = CV²
Power = W/time = W · F = CV²F
Two cores at half frequency and voltage:
C_multi = 2C, V_multi = V/2, F_multi = F/2
⇒ P_multi = P/4
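Checking by direct substitution into P = CV²F:
P_multi = (2C) · (V/2)² · (F/2) = 2C · V²/4 · F/2 = CV²F/4 = P/4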
Same computation, less power
62 Eijkhout – Processor Architecture – Fall 2023
Multicore caches
63 Eijkhout – Processor Architecture – Fall 2023
The coherence problem
64 Eijkhout – Processor Architecture – Fall 2023
Cache coherence
Modified-Shared-Invalid (MSI) coherence protocol:
Modified: the cacheline has been modified
Shared: the line is present in at least one cache and is
unmodified.
Invalid: the line is not present, or it is present but a copy in
another cache has been modified.
65 Eijkhout – Processor Architecture – Fall 2023
Coherence issues
Coherence is automatic, so you don’t have to worry about it. . .
. . . except when it saps performance
Beware false sharing
writes to different elements of a cache line
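A minimal OpenMP sketch of false sharing (names and the iteration count are
illustrative, not from the slides): every thread updates its own counter, but
the counters sit in one cache line, which keeps being invalidated in the other
cores.
#include <omp.h>
#define NTHREADS 4
double partial[NTHREADS];            /* adjacent doubles: one cache line   */
int main(void) {
#pragma omp parallel num_threads(NTHREADS)
  {
    int t = omp_get_thread_num();
    for (long i=0; i<100000000L; i++)
      partial[t] += 1.;              /* each write invalidates the line in */
  }                                  /*   the other cores' caches          */
  return 0;
}
/* remedy: pad, e.g. double partial[NTHREADS][8] and update partial[t][0]  */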
66 Eijkhout – Processor Architecture – Fall 2023
Balance analysis
A Sandy Bridge core can absorb 300 GB/s
4 DDR3/1600 channels provide 51 GB/s; the difference has to come
from reuse
It gets worse: latency 80ns, bandwidth 51 GB/s,
Little’s law: parallelism 64 cache lines
However, each core only has 10 line fill buffers,
so we need 6–7 cores to provide the data for one core
Power: cores take 72%, uncore 17%, DRAM 11%.
Core power goes 40% to instruction handling, not arithmetic
Time for a redesign of processors and programming; see my
research presentation
67 Eijkhout – Processor Architecture – Fall 2023
Programming strategies for performance
68 Eijkhout – Processor Architecture – Fall 2023
How much performance is possible?
Performance limited by
Processor peak performance: absolute limit
Bandwidth: linear correlation with performance
Arithmetic intensity: ratio of operations per transfer
If AI high enough: processor-limited
otherwise: bandwidth-limited
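For instance (a sketch for the vector update y[i] += a*x[i], with 8-byte
doubles): 2 operations per iteration versus 24 bytes moved (load x[i], load
y[i], store y[i]) gives AI = 2/24 ≈ 0.08 op/byte, so this kernel is firmly
bandwidth-limited.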
69 Eijkhout – Processor Architecture – Fall 2023
Performance depends on algorithm:
70 Eijkhout – Processor Architecture – Fall 2023
Insufficient utilization of functional units:
71 Eijkhout – Processor Architecture – Fall 2023
Imperfect data transfer:
72 Eijkhout – Processor Architecture – Fall 2023
Spatial and temporal locality
Temporal locality: use an item, then use it again, but now from cache:
efficient because the second transfer is cheaper.
Spatial locality: use an item, then use one ‘close to it’
(for instance from the same cacheline):
efficient because the neighboring item is already in cache,
even though it was not referenced before.
73 Eijkhout – Processor Architecture – Fall 2023
Architecture-aware programming
Cache size: block loops
pipelining and vector instructions: expose streams of instructions
reuse: restructure code (loop merging and splitting, unrolling)
TLB: don’t jump all over memory
associativity: watch out for powers of 2
74 Eijkhout – Processor Architecture – Fall 2023
Loop blocking
Multiple passes over data
for ( k< small bound )
for ( i < N )
x[i] = f( x[i], k, .... )
Block to be cache contained
for ( ii < N; ii+= blocksize )
for ( k< small bound )
for ( i=ii; i<ii+blocksize; i++ )
x[i] = f( x[i], k, .... )
This requires independence of operations
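The blocked version written out as compilable C (blocksize, the pass count nk,
and the update function f are placeholders; assume N is a multiple of
blocksize):
for (int ii=0; ii<N; ii+=blocksize)        /* pick a block of x ...          */
  for (int k=0; k<nk; k++)                 /* ... run all k passes over it   */
    for (int i=ii; i<ii+blocksize; i++)    /* block stays resident in cache  */
      x[i] = f( x[i], k );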
75 Eijkhout – Processor Architecture – Fall 2023
The ultimate in performance programming: DGEMM
Matrix-matrix product C = A · B
∀i ∀j ∀k : c_{ij} += a_{ik} b_{kj}
Three independent loops i, j, k
all three can be blocked, with block indices i′, j′, k′
Many loop permutations, blocking factors to choose
76 Eijkhout – Processor Architecture – Fall 2023
DGEMM variant
Inner products
for ( i )
for ( j )
for ( k )
c[i,j] += a[i,k] * b[k,j]
77 Eijkhout – Processor Architecture – Fall 2023
DGEMM variant
Outer product: rank-1 updates, column of A times row of B
for ( k )
for ( i )
for ( j )
c[i,j] += a[i,k] * b[k,j]
78 Eijkhout – Processor Architecture – Fall 2023
DGEMM variant
Building up rows by linear combinations
for ( i )
for ( k )
for ( j )
c[i,j] += a[i,k] * b[k,j]
Exchanging i, j: building up columns
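The row-building variant as actual C on row-major arrays (a sketch; the
matrices are assumed n × n and preallocated): with j innermost, both c[i][j]
and b[k][j] are traversed with stride 1, so every cacheline is fully used.
for (int i=0; i<n; i++)
  for (int k=0; k<n; k++) {
    double aik = a[i][k];                 /* scalar, reused over the j loop */
    for (int j=0; j<n; j++)
      c[i][j] += aik * b[k][j];           /* unit stride in c and b         */
  }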
79 Eijkhout – Processor Architecture – Fall 2023
Rank 1 updates
C_{∗∗} = ∑_k A_{∗k} B_{k∗}
80 Eijkhout – Processor Architecture – Fall 2023
Matrix-panel multiply
Block of A times ‘sliver’ of B
81 Eijkhout – Processor Architecture – Fall 2023
Inner algorithm
For inner i:
// compute C[i,*] :
for k:
C[i,*] += A[i,k] * B[k,*]
82 Eijkhout – Processor Architecture – Fall 2023
Tuning
For inner i:
// compute C[i,*] :
for k:
C[i,*] += A[i,k] * B[k,*]
C[i,*] stays in register
A[i,k] and B[k,*] stream from L1
blocksize of A for L2 size
A stored by rows to prevent TLB problems
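A hedged sketch of that inner kernel (the width of 4 is illustrative): a few
entries of C[i,*] are kept in scalar variables, which the compiler can hold in
registers, while A[i,k] and B[k,*] stream in.
double c0=0., c1=0., c2=0., c3=0.;        /* C[i][j..j+3] held in registers */
for (int k=0; k<K; k++) {
  double aik = A[i][k];                   /* streams from L1                */
  c0 += aik*B[k][j+0];  c1 += aik*B[k][j+1];
  c2 += aik*B[k][j+2];  c3 += aik*B[k][j+3];
}
C[i][j+0] += c0;  C[i][j+1] += c1;  C[i][j+2] += c2;  C[i][j+3] += c3;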
83 Eijkhout – Processor Architecture – Fall 2023
Cache-oblivious programming
Observation: recursive subdivision will ultimately make a problem
small / well-behaved enough
84 Eijkhout – Processor Architecture – Fall 2023
Cache-oblivious matrix-matrix multiply
( C11 C12 )   ( A11 A12 ) ( B11 B12 )
( C21 C22 ) = ( A21 A22 ) ( B21 B22 )
with C11 = A11 B11 + A12 B21
Recursive approach will be cache contained.
Not as high performance as being cache-aware. . .
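A recursive sketch (square matrices whose size n is a power of two, stored
row-major with leading dimension ld; the cutoff of 32 is illustrative): once
the blocks fit in cache, the base-case product is cache-contained.
void rec_mm( int n, int ld, double *A, double *B, double *C ) {
  if ( n<=32 ) {                                  /* base case fits in cache */
    for (int i=0; i<n; i++)
      for (int k=0; k<n; k++)
        for (int j=0; j<n; j++)
          C[i*ld+j] += A[i*ld+k]*B[k*ld+j];
  } else {
    int h = n/2;                                  /* quadrant offsets        */
    double *A11=A, *A12=A+h, *A21=A+h*ld, *A22=A+h*ld+h;
    double *B11=B, *B12=B+h, *B21=B+h*ld, *B22=B+h*ld+h;
    double *C11=C, *C12=C+h, *C21=C+h*ld, *C22=C+h*ld+h;
    rec_mm(h,ld,A11,B11,C11); rec_mm(h,ld,A12,B21,C11); /* C11 += A11 B11 + A12 B21 */
    rec_mm(h,ld,A11,B12,C12); rec_mm(h,ld,A12,B22,C12);
    rec_mm(h,ld,A21,B11,C21); rec_mm(h,ld,A22,B21,C21);
    rec_mm(h,ld,A21,B12,C22); rec_mm(h,ld,A22,B22,C22);
  }
}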
85 Eijkhout – Processor Architecture – Fall 2023
The power question
86 Eijkhout – Processor Architecture – Fall 2023