Optimising Serial Code
Advancing Science with
Pawsey
Learning Objectives
• Choose an algorithm for good performance
• Choose a language for good performance
• Understand the importance of standard
conformance
• Write code that can be optimised for a modern
CPU
• Locate bottlenecks in a serial code and address
them
Course Roadmap
• Algorithm: correctness/novelty, complexity, memory scaling, etc.
• Programming: language selection, code efficiency, standardization, etc.
• Optimization: cache performance, FLOPS, IPC, data locality, loop transformation, vectorization, etc.
• Profiling
• Parallelization
• Performance
Motivation
• Science will be limited by the code and
computer.
• Development effort may be secondary to
code performance.
• Many gains come with small effort.
• Code optimisation can force good
programming styles.
Is it worth the time?
From XKCD webcomic: https://xkcd.com/1205/
Development Effort
1 Magnus day = 49 years on a dual-core
desktop.
It can be worth spending months of development
time to save years or decades of runtime!
This might go against common opinion.
You still need to consider unit testing, ease
of collaboration and code extension.
Know When to Stop Developing
Focus your effort!
1. Start with a good algorithm.
2. Write code with performance in mind.
3. Profile the code to determine where to spend
your optimising effort.
4. Optimise the code, then profile again, and repeat
until satisfactory performance achieved.
Know when to stop trying! Set an effort limit or a
runtime goal.
MEASURING PERFORMANCE
Busy vs Effective
• Low processor utilisation does mean sub-optimal performance.
• High processor utilisation does not mean optimal performance.
• Why?
  • the processor could be doing something inefficient
  • the processor could be polling (busy-waiting)
Time
• What matters is real (elapsed) time, i.e. time on the wall clock; we call this walltime.
• You can use routines like gettimeofday,
MPI_Wtime to measure program runtime.
• Don’t use many timing calls to profile the
code. This is slow and alters the profile.
Instead use a profiler, which we will do
later!
• CPU clock speed can vary between runs!
Timing routines
• C / C++
clock_t clock(void): Returns CPU time in "clock ticks" since an implementation-defined start (use
CLOCKS_PER_SEC to convert to seconds). Resolution is typically microseconds.
time_t time(time_t *second): Returns real time in seconds since the Unix epoch, 00:00:00 Jan
1, 1970. (The output argument "second" is also assigned that value if it is not a NULL
pointer.) Resolution is in seconds.
• Fortran
CPU_TIME(TIME): Returns the processor time in seconds since the start of the program
through "TIME", which is a REAL with INTENT(OUT). Resolution is milliseconds.
DATE_AND_TIME([date,time,zone,values]): Returns character and binary data from the
real-time clock and date. Resolution is milliseconds.
• glibc
int gettimeofday(struct timeval *tp, struct timezone *tzp): Returns the current calendar
time as the elapsed time since the epoch in the structure "tp". Resolution is microseconds.
• MPI
double MPI_Wtime(void): Returns time in seconds since an arbitrary time in the past.
The resolution is given by double MPI_Wtick(void).
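As a minimal illustration (a sketch, not one of the course exercises), gettimeofday can bracket a region of interest to measure walltime:

  #include <stdio.h>
  #include <sys/time.h>

  /* Return the current walltime in seconds. */
  static double walltime(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
  }

  int main(void)
  {
      double t0 = walltime();

      /* Region of interest: a stand-in workload. */
      double sum = 0.0;
      for (long i = 0; i < 100000000L; i++)
          sum += (double)i;

      double t1 = walltime();
      printf("sum = %g, elapsed = %.3f s\n", sum, t1 - t0);
      return 0;
  }

Only two timing calls bracket the whole region; for per-routine breakdowns, use a profiler as described later.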
CHOOSING AN ALGORITHM
Algorithms
Many common algorithms have:
• a period of time
• increments within that period
• elements of interest
• ways to visit these elements
• a calculation

Algorithm A (grid):
  For t = t1 .. tn {over time increments}
    For i = x1 .. xn
      For j = y1 .. yn
        For k = z1 .. zn
          <do something>

Algorithm B (N-body):
  For t = t1 .. tn {over time increments}
    For each element
      For each other element
        <do something>
Complexity
At this stage, we can make a few observations:
Algorithm A will do #t * #x * #y * #z calculations = O(t*n³)
Algorithm B will do #t * n * (n-1) calculations = O(t*n²)
… this is the order, or algorithmic complexity.
Scaling
• Consider memory scaling as well as compute scaling. Memory is finite.

                      Compute time   Memory
  3D grid with time   O(t*n³)        O(n³)
  N-body              O(t*n²)        O(n)

• Other examples: the QR tridiagonal eigensolver is O(n²) in memory, while the MR³ tridiagonal eigensolver is O(n) in memory.
Example Alternative Algorithm
• Instead of calculating every pairwise interaction (brute force) in the N-body problem, represent hierarchies of bodies with multipoles in a spatial decomposition (the influence of very distant bodies is negligible).
• O(N²) interactions become O(N log N), or even O(N) for adaptive expansion.

  N       Brute force O(N²)   Multipole model O(N log N)
  1,000   1,000,000           7,000
Choosing an Algorithm
There may be multiple algorithms to solve a
problem. Consider:
• E.g. O(n³) vs O(n log n), or brute force vs a clever approach with adequate results.
• Parallelisability constraints imposed by
the algorithm.
• Domain decomposition often leads to
good scaling and parallelisability. Load
balancing may become an issue.
Prefactor
• If two algorithms have the same complexity, does one have fewer expensive operations? E.g. square roots, exponentials, complex numbers, sin, cos, etc.
• Algorithms that scale well might have a larger prefactor and be slower for small n. Perhaps use two algorithms and set a transition point based on n, as sketched below.
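Here, as an illustrative sketch, insertion sort (tiny prefactor, O(n²)) is used below a hypothetical threshold and the C library qsort (O(n log n)) above it; the threshold should be tuned by benchmarking on the target machine.

  #include <stdlib.h>

  #define CROSSOVER_N 32   /* hypothetical value; tune by benchmarking */

  /* O(n^2) but with a very small prefactor: fastest for short arrays. */
  static void insertion_sort(double *a, size_t n)
  {
      for (size_t i = 1; i < n; i++) {
          double key = a[i];
          size_t j = i;
          while (j > 0 && a[j-1] > key) {
              a[j] = a[j-1];
              j--;
          }
          a[j] = key;
      }
  }

  static int cmp_double(const void *p, const void *q)
  {
      double x = *(const double *)p, y = *(const double *)q;
      return (x > y) - (x < y);
  }

  /* Dispatch on n: small prefactor below the crossover, better scaling above it. */
  void sort_doubles(double *a, size_t n)
  {
      if (n < CROSSOVER_N)
          insertion_sort(a, n);
      else
          qsort(a, n, sizeof *a, cmp_double);
  }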
Rewrite your Maths
Simple Tricks
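As an illustrative sketch (a hypothetical routine, not the original slide's example), two common rewrites: hoist loop-invariant work out of the loop, and replace a division by multiplication with a precomputed reciprocal.

  #include <math.h>

  /* Naive form: one divide and one sqrt on every iteration. */
  void scale_shift_naive(const double *x, double *y, int n, double h, double c)
  {
      for (int i = 0; i < n; i++)
          y[i] = x[i] / h + sqrt(c);
  }

  /* Rewritten form: the loop-invariant sqrt is hoisted and the divide
   * becomes a multiply by a precomputed reciprocal. */
  void scale_shift_fast(const double *x, double *y, int n, double h, double c)
  {
      const double inv_h  = 1.0 / h;
      const double sqrt_c = sqrt(c);
      for (int i = 0; i < n; i++)
          y[i] = x[i] * inv_h + sqrt_c;
  }

Compilers can perform some of these rewrites, but only when floating-point reordering rules allow, so expressing them in the source is often the safer option.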
CHOOSING A LANGUAGE
Choosing a language
Choose the right language for the job.
• Performance.
• Potential to use OpenMP, OpenACC, CUDA, and/or
MPI.
• Flexible to new technologies.
• Portable.
• Available performance and debugging tools.
• Ease of programming.
Perhaps a hybrid approach, e.g. python/C or
python/Fortran
Other Language Considerations
Our Goals:
• Want to convert algorithms into code (without too
much effort).
• Want to help the compiler produce the fastest
code (without too much effort).
Some Considerations:
• Array / matrix support
• Complex number support
• Aliasing
• Static typing
Aliasing
Aliasing is when some variables *might* refer to the same memory location. This introduces constraints on compiler optimisations (e.g. it prevents code reordering).
• Fortran
  Assumes no aliasing. Avoid unnecessary use of pointers (stick to allocatable).
• C
  Use the restrict keyword, standard since C99 (see the sketch below).
• C++
  __restrict__ etc. are non-standard but available in some compilers. Not portable.
Avoid unnecessary use of pointers.
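For instance, a minimal C99 example (illustrative function name and signature): restrict promises the compiler that the arrays do not overlap, freeing it to reorder and vectorise the loop.

  /* Without restrict the compiler must assume y might overlap x1 or x2,
   * forcing it to reload and store conservatively. */
  void axpy2(int n, double a,
             const double *restrict x1,
             const double *restrict x2,
             double *restrict y)
  {
      for (int i = 0; i < n; i++)
          y[i] = a * x1[i] + x2[i];
  }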
Static Typing
• With static typing, the type is known at
compile-time.
• Compiler can check that variables passed to
functions are compatible with the functions.
• Permits optimisations at compile-time.
• Improves reproducibility, reduces errors.
• API is well documented for others.
• Fortran, C, C++ all support static typing.
Language performance
Benchmark          Fortran     Julia   Python   R       Matlab   Octave   Mathematica   JavaScript   Go      SciLua      Java
                   gcc 7.3.1   1.0.0   3.6.3    3.5.0   R2018a   4.2.2    11.3          V8 6.2.414   go1.9   1.0.0-b12   1.8.0_17
fibonacci          0.98        1.32    94.2     263     17.6     10045    132.1         3.51         1.80    1.18        3.63
parse_integer      6.87        2.19    19.76    50.37   178.2    574.4    22.66         5.03         0.96    0.97        3.17
quicksort          1.18        0.99    37.57    57.93   2.36     2221.3   44.48         4.28         1.24    1.56        2.98
mandelbrot         0.70        0.68    65.66    195.5   9.84     5812     18.29         1.134        0.77    1.00        1.42
pi_sum             1.00        1.01    14.77    11.69   1.01     317.5    1.45          1.08         1.00    0.99        1.08
matrix_statistics  1.54        1.63    17.73    20.97   8.09     46.24    7.49          13.97        6.08    1.70        5.02
matrix_multiply    1.15        0.97    1.18     8.26    1.16     1.21     1.18          31.77        1.43    1.08        8.07
Comparison from https://julialang.org/benchmarks/.
Times relative to C (gcc 7.3.1). Lower is better.
Ran on a single core (serial execution) on an Intel® Core™ i7-3960X 3.30GHz CPU with
64GB of 1600MHz DDR3 RAM, running openSUSE LEAP 15.0 Linux.
These are not optimal results, just how a typical researcher might program in those
languages.
Fortran Compiler Comparison
Benchmark          GCC   Intel   Cray
fibonacci          1     4.41    3.16
parse_integer      1     0.77    0.77
mandelbrot         1     0.50    -
quicksort          1     0.91    0.99
pi_sum             1     0.55    0.55
matrix_statistics  1     0.80    0.47
matrix_multiply    1     0.75    0.15

Comparison code perf.f90 from https://julialang.org/benchmarks/.
Runtimes normalised to the GNU runtimes. Lower is better.
Results on Magnus, all using "ftn -O3 perf.f90" under the relevant compiler environments.
STANDARDS CONFORMANCE
Standard Conformance
• Your code will run on different systems and
different compilers over the years.
• Standards conformance significantly improves
the chance of reproducibility. (good for
science!)
• Reproducibility means “apples vs apples”
performance comparisons between systems
and compilers, and between profiling runs.
• Do not rely on compiler extensions!
How to write portable code (1)
• Use your compiler to check.
• gcc -std=c99 -pedantic myfile.c
• gcc -std=c++98 -pedantic myfile.cxx
• gfortran -std=f95 -pedantic myfile.f90
• Develop with multiple compilers. Cray
compilers are very pedantic.
• Download a copy of the standard (or a
draft).
How to write portable code (2)
• Try initialising variables to other values; do you get the same answer?
• gfortran -finit-real=inf -finit-logical=false -finit-
character=t …
• Try runtime checking. This is slower than
compile-time checking.
• gfortran -fcheck=all
Portable Makefiles
• Don’t hard code compiler-specific flags.
• There is no standard for compiler flags, just for the
code itself.
• -Wall is not portable among compilers
• Use Makefile variables, make them easy to find
and modify:
CFLAGS=-g -Wall
#CFLAGS=-O3
http://www.gnu.org/software/make/manual/
Exercise 1
Exercises are publicly available via Github:
git clone https://github.com/PawseySC/Optimising-Serial-Code.git
The URL is case sensitive. Download them!
Have a look at the various Makefiles.
Exercise 2: matmul (1)
In the exercise directory:
cd matrix && make matrix
This produces the executables matrix.O0,
matrix.O1, matrix.O2, matrix.O3
(Have a quick look at the Makefile).
Run them through the queue:
sbatch run_matrices.slurm
Output is in the SLURM output file slurm-JOBID.out
Exercise 2: matmul (2)
Compare the timings:
• What effect does optimisation level have on
calls to the external math routine dgemm, to
the intrinsic matmul, to manual looping of
matrix multiplication?
• Is there a single best method for any matrix
size at high optimisation?
• What is the main difference between matmul3 and matmul4?
• If you have time, try with a different compiler.
MODERN COMPUTERS AND
OPTIMISATION
CPUs
• Most CPUs have multi-level caches, to reduce the time
taken to access data.
• A CPU can do a lot of work in the time it takes to access
main memory.
Data Locality
Location          Access time   Access time (cycles)
Register          <1 ns         -
L1 cache          1 ns          4
L2 cache          4 ns          10
L3 cache          15-30 ns      40-75
Memory            60 ns         150
Solid state disk  50 us         130,000
Hard disk         10 ms         26,000,000
Tape              10 s          26,000,000,000
Source: https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
Cache: Access Patterns
• Sequential access results in a high rate of cache hits.
• Strided access has a lower hit rate, but modern processors can detect striding access patterns and pre-fetch cache lines (compare the two loops sketched below).
• Random access is typically the worst, with a low hit rate and no ability to predict subsequent access locations.
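A small illustrative example in C (where arrays are row-major): both functions sum the same matrix, but the first walks memory sequentially while the second strides through it and touches a new cache line almost every iteration.

  #define N 1024
  static double a[N][N];   /* row-major in C */

  /* Cache-friendly: the inner loop visits consecutive memory locations. */
  double sum_sequential(void)
  {
      double s = 0.0;
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              s += a[i][j];
      return s;
  }

  /* Strided: the inner loop jumps N doubles ahead each iteration. */
  double sum_strided(void)
  {
      double s = 0.0;
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              s += a[i][j];
      return s;
  }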
Translation Lookaside Buffer
• The page table maps the memory seen by the program (virtual memory) onto physical memory. It sits in main memory and is slow to access.
• The Translation Lookaside Buffer (TLB) is a cache of recently used mappings (not of the actual content). As with the main caches, aim to work within virtual memory pages. These are typically 4 kB.
• On a Cray you can use larger pages:
  • Before compiling: module load craype-hugepages2M (the size chosen here does not matter)
  • At runtime: module load craype-hugepagesXXXM (here the size does matter)
Being cache-friendly
• Read and write to contiguous chunks of
memory. Data is transferred in a cache
line.
• Avoid cache misses.
Writing to Memory
• Don’t store data in RAM if you don’t need
to. Use local temporary variables instead.
• These could be optimised into registers.
• In particular, don’t use global variables as
local temporary variables.
• Similarly, avoid using array sections for temporary storage (see the sketch below).
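For example (hypothetical routine names), accumulating in a local variable lets the running value live in a register, with a single store at the end:

  /* The partial sum is stored back to rowsum[i] on every iteration; because
   * rowsum might alias a, the compiler must keep writing it to memory. */
  void row_sums_memory(const double *a, double *rowsum, int n, int m)
  {
      for (int i = 0; i < n; i++) {
          rowsum[i] = 0.0;
          for (int j = 0; j < m; j++)
              rowsum[i] += a[i*m + j];
      }
  }

  /* The partial sum lives in a local temporary (a register); only one store
   * per row reaches memory. */
  void row_sums_register(const double *a, double *rowsum, int n, int m)
  {
      for (int i = 0; i < n; i++) {
          double s = 0.0;
          for (int j = 0; j < m; j++)
              s += a[i*m + j];
          rowsum[i] = s;
      }
  }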
Instruction Pipelining
• Modern CPUs can complete multiple
Instructions Per Cycle (IPC), also known
as Instructions Per Clock.
• A typical Xeon core can complete up to 4 instructions per cycle. You need to keep the pipeline full to achieve this.
Loop Unrolling
Keep data in registers/cache via loop unrolling /
blocking with inlining. This can also improve IPC.
e.g. a(n)=somefunc(n) + a(n-1)
Unroll with stride 2:
do n=1,nmax,2
a(n)=somefunc(n) + a(n-1)
a(n+1)=somefunc(n+1) + a(n)
end do
Loop Counts
• Inner loops should have high iteration counts,
since loops themselves have a non-negligible
cost.
• Counter to this, very small inner loops may get
unrolled away. In this case do it manually.
Good:
  do j=1,10
    do n=1,10000
      work
    end do
  end do

Bad:
  do n=1,10000
    do j=1,10
      work
    end do
  end do
Knowing Loop Counts
• There is more potential for compiler
optimisation when the loop count is
known before the loop is started.
• Use “do”, “for” loops. Avoid “while” and
“do … while” loops.
Contiguous Memory
• Aim to work through contiguous chunks of
memory. Avoid unnecessary striding.
• In Fortran, the consecutive elements of a column
reside next to each other (column-major order).
• In C/C++, the consecutive elements of a row
reside next to each other (row-major order)
Fortran:
  do j=1,10
    do i=1,10
      A(i,j)=something
    end do
  end do

C / C++:
  for (i=0;i<10;i++)
    for (j=0;j<10;j++)
      a[i][j]=something;
Loop Blocking (1)
Loop blocking / tiling / strip-mining …
This code potentially contains many cache misses:
  do J=1,Jmax
    do I=1,Imax
      A(I,J) = A(I,J) + B(J,I)
    end do
  end do
A has stride 1, B has stride Jmax.
• Make the stride of B smaller so that small blocks of A and B sit in cache.
• If Imax and Jmax are very large, neither A nor B can sit in cache in full.
Loop Blocking (2)
[Diagram: before blocking, the A(I,J) access pattern is contiguous while the B(I,J) access pattern is non-contiguous; after blocking, both arrays are accessed in small blocks.]
• The Cray Fortran compiler does this for you.
Most other compilers currently do not.
Loop Blocking (3)
After applying the loop blocking technique:
  do J=1,Jmax,Jblksize
    do I=1,Imax,Iblksize
      do JJ=J,min(J+Jblksize-1,Jmax)
        do II=I,min(I+Iblksize-1,Imax)
          A(II,JJ) = A(II,JJ) + B(JJ,II)
        end do
      end do
    end do
  end do
• A is still accessed with stride 1 and B is still strided, but only an Iblksize x Jblksize block of each array is in use at a time.
• These small blocks of A and B sit in cache, so the cache is not exhausted.
Branching
• Remove branches from loops and change the loop
bounds. Branches are bad for IPC and pre-emptive
cache fetching.
• Avoid GOTO statements (in Fortran, C, C++). They can
affect cache/register use.
Before:
  do n=1,nmax
    a(n)=somefunc(n)
    if (n==1) a(n)=0
    if (n==nmax) a(n)=1
  end do

After:
  a(1)=0
  a(nmax)=1
  do n=2,nmax-1
    a(n)=somefunc(n)
  end do
Loop Fission
• Too much work in loops means that registers and/or
instruction cache may get exhausted.
• Perhaps only part of a loop is vectorisable (execute a
number of loop-iterations at the same time).
• Break these loops up.
Before:
  do n=1,nmax
    lots of work
    some I/O
  end do

After:
  do n=1,nmax
    lots of work
  end do
  do n=1,nmax
    some I/O
  end do
Loop Fusion
Before:
  do n=1,nmax
    a(n)=somefunc(n)
  end do
  do n=1,nmax
    b(n)=someotherfunc(n)
  end do

After:
  do n=1,nmax
    a(n)=somefunc(n)
    b(n)=someotherfunc(n)
  end do
• This is usually used in conjunction with other
techniques.
• Whether this is beneficial depends on the
work inside the loops. Too much work may
exhaust registers or cache.
Pointers
Avoid unnecessary use of pointers.
• Pointers might prevent some compiler
optimisations. They should be fine if they
have local scope.
• Pointers make it difficult to copy data to
GPU memory.
• Hard to optimise cache use, e.g. with linked lists vs contiguous arrays (contrasted in the sketch below).
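Illustrative types and names: every step of a linked-list traversal chases a pointer to a potentially distant address, while the equivalent array walk is contiguous and prefetchable.

  #include <stddef.h>

  /* Linked list: nodes can be scattered across the heap, so each step is a
   * potential cache miss. */
  struct node {
      double value;
      struct node *next;
  };

  double sum_list(const struct node *head)
  {
      double s = 0.0;
      for (const struct node *p = head; p != NULL; p = p->next)
          s += p->value;
      return s;
  }

  /* Contiguous array: the same data packed together streams through the
   * cache, and the hardware prefetcher can stay ahead of the loop. */
  double sum_array(const double *values, size_t n)
  {
      double s = 0.0;
      for (size_t i = 0; i < n; i++)
          s += values[i];
      return s;
  }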
MMX / 3DNOW / SSE / AVX
• These are Single-Instruction-Multiple-Data (SIMD)
operations.
• These are not part of standard Fortran / C / C++. They may
get deprecated over time.
• The compiler should insert these automatically. Write SIMD-friendly code (see the sketch below).
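A hedged example: a unit-stride loop with no branches and no possible aliasing, which compilers typically auto-vectorise at higher optimisation levels without vendor-specific intrinsics in the source.

  /* With restrict and a simple unit-stride body, the compiler is free to
   * generate SSE/AVX instructions for this loop automatically. */
  void saxpy(int n, float a, const float *restrict x, float *restrict y)
  {
      for (int i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }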
Inlining
• Function calls take time. You can remove this
time by placing the code into the calling routine.
• Compilers can inline code for you when using
high optimization levels. This is not guaranteed,
but you can make it more likely:
• In C / C++, put the code in the same file as the
calling routine. Use static functions.
• In C99 / C++, use the inline keyword (see the sketch below).
• In Fortran, put the code in the same file / module
as the calling routine. Use the compiler directive
“forceinline”.
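A minimal C sketch (hypothetical names): a small static inline helper defined in the same file as its caller is easy for the compiler to inline into the hot loop.

  /* Defined in the same translation unit and marked static inline (C99),
   * so the call overhead can be removed entirely. */
  static inline double squared_distance(double dx, double dy, double dz)
  {
      return dx*dx + dy*dy + dz*dz;
  }

  double max_squared_distance(const double (*p)[3], int n)
  {
      double best = 0.0;
      for (int i = 0; i < n; i++) {
          double d2 = squared_distance(p[i][0], p[i][1], p[i][2]);
          if (d2 > best)
              best = d2;
      }
      return best;
  }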
COMPILER OUTPUT
Cray Compiler Output
The Cray compiler will output an annotated version of the source file:
  ftn -rm mycode.f90
This outputs mycode.lst. Examine the annotated file to figure out what's going on.
Intel Compiler Output
• Optimisation reports.
• Compiler flag: -qopt-report=3
• Have a look at the man page for other
values to opt-report.
Exercise 3: Cray compiler output
• Run:
cd compiler_reports
make matrix.cray
• Check out the manpage if needed
• man crayftn
• Examine the output in matrix.lst.O*
• What has the compiler done with routine calls and
loops?
• Can you identify the reason for the timing results from the previous exercise?
Exercise 4: Intel compiler output
• Run:
module swap PrgEnv-cray PrgEnv-intel
cd compiler_reports
make matrix.intel
• Seek help: ifort -help reports
• Examine the output in matrix.optrpt.O*
• This might be a bit too much information. Scale back the reporting options in the Makefile.
PROFILING
Profiling phases
• Instrumentation: compile the source code
with extra compiler flags that enable the
recording of performance-relevant events.
• Measurement & analysis: run the
instrumented application on a
representative test case. Usually the
instrumented application is much slower
than the original one.
• Performance examination: collect and
analyse the measurement results.
Profilers
• On Cray supercomputers: Cray Tools
https://pubs.cray.com/content/S-2376/7.0.0/cray-performance-measurement-and-analysis-tools-user-guide/craypat
https://support.pawsey.org.au/documentation/display/US/Profiling+with+Cray+Tools
• Intel VTune: https://software.intel.com/en-us/intel-vtune-amplifier-xe
https://support.pawsey.org.au/documentation/display/US/Profiling+with+Intel+VTune
• Others: gprof, Arm MAP & Performance reports, etc.
https://support.pawsey.org.au/documentation/display/US/User+Training
https://support.pawsey.org.au/documentation/display/US/Training+Material
Profilers
• Profiling guide at Pawsey:
https://support.pawsey.org.au/documentation/display/US/Profiling
Full Profiling with CrayPAT
Sampling experiment
Instrumentation
• module load perftools
• Compile code, using Cray compiler wrappers (ftn, cc, CC) &
preserving object (.o) files
• pat_build myapp
• Generates executable named myapp+pat
Measurement & analysis
• Run ./myapp+pat as normal; this will generate an output directory myapp+pat+XX+YYs/ (or a .xf file for small runs)
• pat_report myapp+pat+XX+YYs/ > myapp.sampling.report
(this also generates a .ap2 file that can be viewed with Apprentice2, and a build-options.apa file to be used in a tracing experiment)
Performance examination
• Read myapp.sampling.report file
• or use Apprentice2 (with X11 forwarding activated: "ssh -X"):
app2 myapp+pat+XX+YYs/ &
Full Profiling with CrayPat
Tracing experiment
Instrumentation
• First, perform a sampling experiment to generate the file:
myapp+pat+XX+YYs/build-options.apa
• pat_build -O myapp+pat+XX+YYs/build-options.apa
• Essentially pat_build -w -T funcs -g grps -u myapp (can be edited to change -T funcs and -g grps)
• Generates executable named myapp+apa
Measurement & analysis
• Run ./myapp+apa as normal; this will generate an output directory myapp+apa+XX+YYt/ (or a .xf file for small runs)
• pat_report myapp+apa+XX+YYt/ > myapp.tracing.report
(this also generates a .ap2 file that can be viewed with Apprentice2)
Performance examination
• Read myapp.tracing.report file
• or use Apprentice2 (with X11 forwarding activated: "ssh -X"):
app2 myapp+apa+XX+YYt/ &
Exercise 5: profiling game of life (1)
Profile (sampling) the game_of_life code.
• module load perftools
• cd game_of_life
• make game_of_life.cray
• pat_build game_of_life.O3.cray
• sbatch run_game_profile.slurm
• pat_report game_of_life.O3.cray+pat+XX-YYs/ >
game_of_life.O3.cray.sampling.report
Exercise 5: profiling game of life (2)
Examine
game_of_life.O3.cray.sampling.report
• Where is all the time spent? What occurs on
these lines of the code?
• How good is our cache utilisation?
• Check optimisations in
game_of_life.O3.lst
• Try again with other levels of optimisation
(edit the Makefile)
Exercise 6: profiling matrix
multiplication (1) - sampling
Sampling experiment
• module load perftools
• cd profiling
• ftn -c -O2 matrix.f90
• ftn -o matrix.O2 matrix.o
• pat_build matrix.O2
• sbatch run_matrix_profile.slurm
• Generate the report:
pat_report matrix.O2+pat+XX+YYs/ > matrix.O2.sampling.report
• Examine matrix.O2.sampling.report
• You can also use Apprentice2: app2 matrix.O2+pat+XX+YYs/ &
Exercise 6: profiling matrix
multiplication (5) - tracing
Tracing experiment
• Edit the build_options file to define tracing options:
vim matrix.O2+pat+XX-YYs/build_options.apa
Have the following:
-g mpi,blas,io,heap
-T test_matmul1$timeit_
-T test_matmul2$timeit_
-T test_matmul3$timeit_
-T test_matmul4$timeit_
-T test_matmul5$timeit_
• Build the executable with the defined tracing options:
pat_build -O matrix.O2+pat+XX-YYs/build_options.apa
Exercise 6: profiling matrix (6) tracing
• Edit the job script: vim run_matrix_profile.slurm
Have: expType=apa
• sbatch run_matrix_profile.slurm
• Generate the report:
pat_report matrix.O2+apa+XX-YYt/ > matrix.O2.tracing.report
• Examine: matrix.O2.tracing.report
• You can also use Apprentice2: app2 matrix.O2+apa+XX+YYt/ &
• Use the previous exercises of the matrix multiplication (timing and
optimisation listing) to understand the results.
• Repeat the exercise with lower levels of optimisation.
Seek help
• man pat_build
• man pat_report
• pat_help all . > all_pat_help
CODING HABITS
Global variables
• Avoid global variables unless necessary.
They may make it difficult to convert to
multithreaded code in further development.
• Pass variables through routine calls (there is a slight performance overhead).
  • In Fortran, arguments can be given intent(in) and intent(out) attributes, which assists the compiler.
  • Scoping in OpenMP becomes much easier.
  • It may also assist auto-threading by compilers.
Parentheses in Fortran
• Try to avoid parentheses in Fortran; they force an evaluation and prevent arithmetic rearrangements. Use temporary variables instead.
• The compiler is not permitted to rearrange this:
  a = 2 * (c + d) - 2 * e
• The compiler is allowed to rearrange this:
  tmp = c + d
  a = 2 * tmp - 2 * e
Fortran Array Notation
Fortran array notation is convenient and easy to read, but current compilers may not optimise it well.
Some compilers are unlikely to fuse these operations:
  A(:,:)=1.0
  C(:,:)=A(:,:)+B(:,:)
In the meantime:
  do j=1,n
    do i=1,m
      A(i,j)=1.0
      C(i,j)=A(i,j)+B(i,j)
    end do
  end do
The Cray compiler does fuse array notation!
Low Level Object Oriented
Low-level object-oriented programming has the potential for poor performance, e.g. the strided (non-contiguous) memory accesses below.
  type atom_type
    integer :: atomic_number
    double precision :: mass
    double precision, dimension(3) :: position
  end type atom_type

  type(atom_type), dimension(n_atoms) :: atom_list

  total_mass=0
  do i=1,n_atoms
    total_mass=total_mass+atom_list(i)%mass
  end do
Low Level Object Oriented (2)
Less organised but faster code:
  integer, dimension(n_atoms) :: atomic_numbers
  double precision, dimension(n_atoms) :: masses
  double precision, dimension(3,n_atoms) :: positions

  total_mass=sum(masses)
Set a level for the trade-off between
maintainability/extensibility and performance.
Special Case Code
Assume we have a code that handles arrays of
varying length, and;
• the code creates temporary arrays;
• in practice an array of length 1 is the most
common situation.
Optimisation: write a separate routine for arrays of length 1, and use temporary variables rather than temporary arrays (sketched below).
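A hedged version of that idea in C (the routines and the "work" are hypothetical; error handling omitted):

  #include <stdlib.h>

  /* General routine: needs a heap-allocated temporary array. */
  static void process_general(double *data, size_t n)
  {
      double *work = malloc(n * sizeof *work);
      for (size_t i = 0; i < n; i++)
          work[i] = 2.0 * data[i];      /* stand-in for the real work */
      for (size_t i = 0; i < n; i++)
          data[i] = work[i] + 1.0;
      free(work);
  }

  /* Special case for the common n == 1 call: a scalar temporary, no
   * allocation, and a body the compiler can optimise fully. */
  static void process_one(double *data)
  {
      double work = 2.0 * data[0];
      data[0] = work + 1.0;
  }

  void process(double *data, size_t n)
  {
      if (n == 1)
          process_one(data);
      else
          process_general(data, n);
  }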
I/O
• Often the processor is doing little while
waiting for I/O.
• Ways to reduce I/O overhead:
  • Use buffering (don't disable it, and avoid unnecessary flushes).
  • Output in binary, not formatted text (see the sketch below).
  • Use I/O libraries.
• Hierarchical Data Format (HDF5) is the
name of a set of file formats and libraries
designed to store and organize large
amounts of numerical data.
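As a small illustration of the binary-versus-formatted point (hypothetical helper functions; error handling omitted):

  #include <stdio.h>

  /* Formatted text: every value is converted to characters, which is slow
   * and produces a much larger file. */
  void write_text(const char *path, const double *x, size_t n)
  {
      FILE *f = fopen(path, "w");
      for (size_t i = 0; i < n; i++)
          fprintf(f, "%.17g\n", x[i]);
      fclose(f);
  }

  /* Unformatted binary: one buffered bulk write, no conversion. */
  void write_binary(const char *path, const double *x, size_t n)
  {
      FILE *f = fopen(path, "wb");
      fwrite(x, sizeof *x, n, f);
      fclose(f);
  }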
Exercise 7: I/O
Observe the effects of I/O techniques on
performance.
• module load cray-hdf5
• cd iobench && make iobench_hdf5
• sbatch run_iobench_hdf5.slurm
Look at the SLURM output file.
Use version control
• Some of your attempts at optimisation will
need to be undone. E.g. due to:
• Incorrect results.
• Slower performance.
• Use version control software, e.g. git or subversion. You should be using this anyway.
• Use informative commit messages.
MATH LIBRARIES
Popular libraries
• BLAS: basic linear algebra such as matrix-vector or matrix-matrix
operations.
• LAPACK:
• Simple matrix/vector operations
• Linear equations solvers
• Linear least squares
• Eigensolvers
• Singular value decomposition
• Real + Complex
• FFTW: fast Fourier transforms, real/complex.
Optimised vendor versions are available, e.g. Intel MKL, Cray LibSci, SGI SCSL, IBM ESSL. Some are multi-threaded.
Other libraries
• PLAPACK - better scaling eigensolver
(MRRR algorithm)
• PARPACK – sparse eigensolver
• MUMPS – parallel sparse direct solver
• Hypre – parallel linear solver
• Scotch – graph partitioning
• SuperLU – parallel sparse linear solver
• Many of these are available via the cray-tpsl module.
Intel Math Libraries
Intel MKL includes BLAS, LAPACK and FFT
libraries. It consists of multiple libraries – use the
Intel advisor to work out the compiler link options:
http://software.intel.com/sites/products/mkl/MKL_Link_Line_Advisor.html
Example output:
-L$(MKLROOT)/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm
$(MKLROOT) is set by “module load intel”
PETSc / SLEPc
• An attempt at a black-box solver suite: it makes it easy to swap between various solvers in various libraries, and it links to common libraries.
• C/C++ and Fortran interfaces
• Linear equation solvers
• Eigensolvers
• Dense and Sparse
• various finite element solver add-ons
Finish
• What’s next?:
• Come to the parallel course.
• Read optimisation guides from vendors.
Intel and Cray in particular.
• Slides are available at
https://support.pawsey.org.au/documentation/display/US/Training+Material