Performance
Victor Eijkhout
Fall 2023
1 Eijkhout – Performance – Fall 2023
Justification
Programming for performance is an art.
Here are some examples.
2 Eijkhout – Performance – Fall 2023
Peak performance
Requires all floating point units to be active
Requires all data in L1 cache, or even register
Rightarrow very hard to achieve.
3 Eijkhout – Performance – Fall 2023
Arithmetic intensity
How many operations per word?
Equivalently: reuse factor = ratio between operations and data
Reuse: algorithm vs implementation
4 Eijkhout – Performance – Fall 2023
Bandwidth-limited operations
‘Streaming’ operations
// bandwidth.cxx
vector<double> results(nthreads,0.);
for ( int t=0; t<nthreads; t++) {
auto start_point = t*stream_length;
threads.push_back
( thread( [=,&results] () {
results[t] = memory.sumstream(
how_many_repeats,stream_length,start_point);
} ) );
}
for ( auto &t : threads )
t.join();
5 Eijkhout – Performance – Fall 2023
Bandwidth measurement
Aggregate bandwidth
12K
256K
25M
1,000
bandwidth
500
0
0 20 40 60
cores
6 Eijkhout – Performance – Fall 2023
Bandwidth numbers strictly a posteriori
Cache size effects
Basic idea: go many times over a small data set.
The following code is too simple:
for (int irepeat=0; irepeat<how_many_repeats; irepeat++)
{
for (int iword=0; iword<cachesize_in_words; iword++)
memory[iword] += 1.1;
}
7 Eijkhout – Performance – Fall 2023
Random traversal
Emulate randomness by pointer chasing
// setup
for (int iword=0; iword<cachesize_in_words; iword++)
memory[iword] = (iword+1) % cachesize_in_words
// use:
ptr = 0
for (int iword=0; iword<cachesize_in_words; iword++)
ptr = memory[ptr];
8 Eijkhout – Performance – Fall 2023
Measurement
Bandwidth
2,500
2,000
bandwidth
1,500
1,000
frontera
500 ls6
linear
0
104 105 106 107
dataset size
9 Eijkhout – Performance – Fall 2023
Associativity
Words at certain distance map to the same associativity class
Example: Ice Like has 48KiB cache, 12-way associative
Rightarrow stride 4KiB gives conflict; 12 conflicts can be resolved
Cascade Lake: 8-way associative
Access time per element
skx
5 clx
icx
4
nsec
4 6 8 10 12 14
collisions
10 Eijkhout – Performance – Fall 2023
Loop tiling
Multiple passes over array
Rewrite as by-block
Rightarrow One extra loop level
for (n=0; n<10; n++) bs = ... /* the blocksize */
for (i=0; i<100000; i++) for (b=0; b<100000/bs; b++)
... = ...x[i] ... for (n=0; n<10; n++)
for (i=b*bs; i<(b+1)*bs;
i++)
... = ...x[i] ...
11 Eijkhout – Performance – Fall 2023
Example transpose
// regular.c
for (int i=0; i<N; i++)
for (int j=0; j<N; j++)
A[i][j] = B[j][i];
Does this have any reuse of input or output?
12 Eijkhout – Performance – Fall 2023
Rewrite
// blocked.c
for (int ii=0; ii<N; ii+=blocksize)
for (int jj=0; jj<N; jj+=blocksize)
for (int i=ii*blocksize; i<MIN(N,(ii+1)*blocksize); i++)
for (int j=jj*blocksize; j<MIN(N,(jj+1)*blocksize); j++)
A[i][j] = B[j][i];
13 Eijkhout – Performance – Fall 2023