Optimising Serial Code
Advancing Science with
Pawsey
Learning Objectives
• Choose an algorithm for good performance
• Choose a language for good performance
• Understand the importance of standard
conformance
• Write code that can be optimised for a modern
CPU
• Locate bottlenecks in a serial code and address
them
Course Roadmap
• Algorithm: correctness/novelty, complexity, memory scaling, etc.
• Programming: language selection, code efficiency, standardization, etc.
• Optimization: cache performance, FLOPS, IPC, data locality, loop transformation, vectorization, etc.
• Profiling
• Parallelization
• Performance
Motivation
• Science will be limited by the code and
computer.
• Development effort may be secondary to
code performance.
• Many gains come with small effort.
• Code optimisation can force good
programming styles.
Is it worth the time?
From XKCD webcomic: https://xkcd.com/1205/
Development Effort
1 Magnus day = 49 years on a dual-core
desktop.
It can be worth spending months of development
time to save years or decades of runtime!
This might go against common opinion.
You still need to consider unit testing, ease
of collaboration and code extension.
Know When to Stop Developing
Focus your effort!
1. Start with a good algorithm.
2. Write code with performance in mind.
3. Profile the code to determine where to spend
your optimising effort.
4. Optimise the code, then profile again, and repeat
until satisfactory performance achieved.
Know when to stop trying! Set an effort limit or a
runtime goal.
MEASURING PERFORMANCE
Busy vs Effective
• Low processor utilisation does mean sub-optimal performance.
• High processor utilisation does not mean optimal performance.
• Why?
  • the processor could be doing something inefficient
  • the processor could be polling (busy-waiting)
Time
• What matters is real (elapsed) time, i.e. time on the wall clock; we call this walltime.
• You can use routines like gettimeofday,
MPI_Wtime to measure program runtime.
• Don’t use many timing calls to profile the
code. This is slow and alters the profile.
Instead use a profiler, which we will do
later!
• CPU clock speed can vary between runs!
Timing routines
• C / C++
clock_t clock(void): Returns CPU time in "clock ticks" since an implementation-defined start (use
CLOCKS_PER_SEC to convert to seconds). Resolution is typically microseconds.
time_t time(time_t *second): Returns real time in seconds since the Unix epoch, 00:00:00 Jan
1, 1970. (The output argument "second" is also assigned that value if it is not a NULL
pointer.) Resolution is in seconds.
• Fortran
CPU_TIME(TIME): Returns the processor time in seconds since the start of the program
through "TIME", which is a REAL with INTENT(OUT). Resolution is milliseconds.
DATE_AND_TIME([date,time,zone,values]): Returns character and binary data from the
real-time clock and date. Resolution is milliseconds.
• glibc
int gettimeofday(struct timeval *tp, struct timezone *tzp): Returns the current calendar
time as the elapsed time since the epoch in the structure "tp". Resolution is microseconds.
• MPI
double MPI_Wtime(void): Returns time in seconds since an arbitrary time in the past.
The resolution is given by double MPI_Wtick(void).
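As a minimal illustration (a sketch, not one of the course exercises), gettimeofday can bracket a region of interest to measure walltime:

  #include <stdio.h>
  #include <sys/time.h>

  /* Return the current walltime in seconds. */
  static double walltime(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
  }

  int main(void)
  {
      double t0 = walltime();

      /* Region of interest: a stand-in workload. */
      double sum = 0.0;
      for (long i = 0; i < 100000000L; i++)
          sum += (double)i;

      double t1 = walltime();
      printf("sum = %g, elapsed = %.3f s\n", sum, t1 - t0);
      return 0;
  }

Only two timing calls bracket the whole region; for per-routine breakdowns, use a profiler as described later.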
CHOOSING AN ALGORITHM
Algorithms
Many common algorithms have:
• a period of time
• increments within that period
• elements of interest
• ways to visit these elements
• a calculation

Algorithm A (grid):
  For t = t1 .. tn {over time increments}
    For i = x1 .. xn
      For j = y1 .. yn
        For k = z1 .. zn
          <do something>

Algorithm B (N-body):
  For t = t1 .. tn {over time increments}
    For each element
      For each other element
        <do something>
Complexity
At this stage, we can make a few observations:
Algorithm A will do #t * #x * #y * #z calculations = O(t*n³)
Algorithm B will do #t * n * (n-1) calculations = O(t*n²)
… this is the order, or algorithmic complexity.
Scaling
• Consider memory scaling as well as compute scaling. Memory is finite.

                      Compute time   Memory
  3D grid with time   O(t*n³)        O(n³)
  N-body              O(t*n²)        O(n)

• Other examples: the QR tridiagonal eigensolver is O(n²) in memory, while the MR³ tridiagonal eigensolver is O(n) in memory.
Example Alternative Algorithm
• Instead of calculating every pairwise interaction (brute force) in the N-body problem, represent hierarchies of bodies with multipoles in a spatial decomposition (the influence of very distant bodies is negligible).
• O(N²) interactions become O(N log N), or even O(N) for adaptive expansion.

  N       Brute force O(N²)   Multipole model O(N log N)
  1,000   1,000,000           7,000
Choosing an Algorithm
There may be multiple algorithms to solve a
problem. Consider:
• E.g. O(n³) vs O(n log n), or brute force vs a clever approach with adequate results.
• Parallelisability constraints imposed by
the algorithm.
• Domain decomposition often leads to
good scaling and parallelisability. Load
balancing may become an issue.
Prefactor
• If two algorithms have the same complexity, does one have fewer expensive operations? E.g. square roots, exponentials, complex numbers, sin, cos, etc.
• Algorithms that scale well might have a larger prefactor and be slower for small n. Perhaps use two algorithms and set a transition point based on n, as sketched below.
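Here, as an illustrative sketch, insertion sort (tiny prefactor, O(n²)) is used below a hypothetical threshold and the C library qsort (O(n log n)) above it; the threshold should be tuned by benchmarking on the target machine.

  #include <stdlib.h>

  #define CROSSOVER_N 32   /* hypothetical value; tune by benchmarking */

  /* O(n^2) but with a very small prefactor: fastest for short arrays. */
  static void insertion_sort(double *a, size_t n)
  {
      for (size_t i = 1; i < n; i++) {
          double key = a[i];
          size_t j = i;
          while (j > 0 && a[j-1] > key) {
              a[j] = a[j-1];
              j--;
          }
          a[j] = key;
      }
  }

  static int cmp_double(const void *p, const void *q)
  {
      double x = *(const double *)p, y = *(const double *)q;
      return (x > y) - (x < y);
  }

  /* Dispatch on n: small prefactor below the crossover, better scaling above it. */
  void sort_doubles(double *a, size_t n)
  {
      if (n < CROSSOVER_N)
          insertion_sort(a, n);
      else
          qsort(a, n, sizeof *a, cmp_double);
  }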
Rewrite your Maths
Simple Tricks
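As an illustrative sketch (a hypothetical routine, not the original slide's example), two common rewrites: hoist loop-invariant work out of the loop, and replace a division by multiplication with a precomputed reciprocal.

  #include <math.h>

  /* Naive form: one divide and one sqrt on every iteration. */
  void scale_shift_naive(const double *x, double *y, int n, double h, double c)
  {
      for (int i = 0; i < n; i++)
          y[i] = x[i] / h + sqrt(c);
  }

  /* Rewritten form: the loop-invariant sqrt is hoisted and the divide
   * becomes a multiply by a precomputed reciprocal. */
  void scale_shift_fast(const double *x, double *y, int n, double h, double c)
  {
      const double inv_h  = 1.0 / h;
      const double sqrt_c = sqrt(c);
      for (int i = 0; i < n; i++)
          y[i] = x[i] * inv_h + sqrt_c;
  }

Compilers can perform some of these rewrites, but only when floating-point reordering rules allow, so expressing them in the source is often the safer option.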
CHOOSING A LANGUAGE
Choosing a language
Choose the right language for the job.
• Performance.
• Potential to use OpenMP, OpenACC, CUDA, and/or
MPI.
• Flexible to new technologies.
• Portable.
• Available performance and debugging tools.
• Ease of programming.
Perhaps a hybrid approach, e.g. python/C or
python/Fortran
Other Language Considerations
Our Goals:
• Want to convert algorithms into code (without too
much effort).
• Want to help the compiler produce the fastest
code (without too much effort).
Some Considerations:
• Array / matrix support
• Complex number support
• Aliasing
• Static typing
Aliasing
Aliasing is when some variables *might* refer to the same memory location. This introduces constraints on compiler optimisations (e.g. it prevents code reordering).
• Fortran
  Assumes no aliasing. Avoid unnecessary use of pointers (stick to allocatable).
• C
  Use the restrict keyword, standard since C99 (see the sketch below).
• C++
  __restrict__ etc. are non-standard but available in some compilers. Not portable.
Avoid unnecessary use of pointers.
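For instance, a minimal C99 example (illustrative function name and signature): restrict promises the compiler that the arrays do not overlap, freeing it to reorder and vectorise the loop.

  /* Without restrict the compiler must assume y might overlap x1 or x2,
   * forcing it to reload and store conservatively. */
  void axpy2(int n, double a,
             const double *restrict x1,
             const double *restrict x2,
             double *restrict y)
  {
      for (int i = 0; i < n; i++)
          y[i] = a * x1[i] + x2[i];
  }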
Static Typing
• With static typing, the type is known at
compile-time.
• Compiler can check that variables passed to
functions are compatible with the functions.
• Permits optimisations at compile-time.
• Improves reproducibility, reduces errors.
• API is well documented for others.
• Fortran, C, C++ all support static typing.
Language performance
Benchmark          Fortran     Julia   Python   R       Matlab   Octave   Mathematica   JavaScript   Go      SciLua      Java
                   gcc 7.3.1   1.0.0   3.6.3    3.5.0   R2018a   4.2.2    11.3          V8 6.2.414   go1.9   1.0.0-b12   1.8.0_17
fibonacci          0.98        1.32    94.2     263     17.6     10045    132.1         3.51         1.80    1.18        3.63
parse_integer      6.87        2.19    19.76    50.37   178.2    574.4    22.66         5.03         0.96    0.97        3.17
quicksort          1.18        0.99    37.57    57.93   2.36     2221.3   44.48         4.28         1.24    1.56        2.98
mandelbrot         0.70        0.68    65.66    195.5   9.84     5812     18.29         1.134        0.77    1.00        1.42
pi_sum             1.00        1.01    14.77    11.69   1.01     317.5    1.45          1.08         1.00    0.99        1.08
matrix_statistics  1.54        1.63    17.73    20.97   8.09     46.24    7.49          13.97        6.08    1.70        5.02
matrix_multiply    1.15        0.97    1.18     8.26    1.16     1.21     1.18          31.77        1.43    1.08        8.07
Comparison from https://julialang.org/benchmarks/.
Times relative to C (gcc 7.3.1). Lower is better.
Ran on a single core (serial execution) on an Intel® Core™ i7-3960X 3.30GHz CPU with
64GB of 1600MHz DDR3 RAM, running openSUSE LEAP 15.0 Linux.
These are not optimal results, just how a typical researcher might program in those
languages.
Fortran Compiler Comparison
Benchmark          GCC   Intel   Cray
fibonacci          1     4.41    3.16
parse_integer      1     0.77    0.77
mandelbrot         1     0.50    -
quicksort          1     0.91    0.99
pi_sum             1     0.55    0.55
matrix_statistics  1     0.80    0.47
matrix_multiply    1     0.75    0.15

Comparison code perf.f90 from https://julialang.org/benchmarks/.
Runtimes normalised to the GNU runtimes. Lower is better.
Results on Magnus, all using "ftn -O3 perf.f90" under the relevant compiler environments.
STANDARDS CONFORMANCE
Standard Conformance
• Your code will run on different systems and
different compilers over the years.
• Standards conformance significantly improves
the chance of reproducibility. (good for
science!)
• Reproducibility means “apples vs apples”
performance comparisons between systems
and compilers, and between profiling runs.
• Do not rely on compiler extensions!
How to write portable code (1)
• Use your compiler to check.
• gcc -std=c99 -pedantic myfile.c
• gcc -std=c++98 -pedantic myfile.cxx
• gfortran -std=f95 -pedantic myfile.f90
• Develop with multiple compilers. Cray
compilers are very pedantic.
• Download a copy of the standard (or a
draft).
How to write portable code (2)
• Try initialising variables to other values; do you get the same answer?
• gfortran -finit-real=inf -finit-logical=false -finit-
character=t …
• Try runtime checking. This is slower than
compile-time checking.
• gfortran -fcheck=all
Portable Makefiles
• Don’t hard code compiler-specific flags.
• There is no standard for compiler flags, just for the
code itself.
• -Wall is not portable among compilers
• Use Makefile variables, make them easy to find
and modify:
CFLAGS=-g -Wall
#CFLAGS=-O3
http://www.gnu.org/software/make/manual/
Exercise 1
Exercises are publicly available via Github:
git clone https://github.com/PawseySC/Optimising-Serial-Code.git
The URL is case sensitive. Download them!
Have a look at the various Makefiles.
Exercise 2: matmul (1)
In the exercise directory:
cd matrix && make matrix
This produces the executables matrix.O0,
matrix.O1, matrix.O2, matrix.O3
(Have a quick look at the Makefile).
Run them through the queue:
sbatch run_matrices.slurm
Output is in the SLURM output file slurm-JOBID.out
Exercise 2: matmul (2)
Compare the timings:
• What effect does optimisation level have on
calls to the external math routine dgemm, to
the intrinsic matmul, to manual looping of
matrix multiplication?
• Is there a single best method for any matrix
size at high optimisation?
• What is the main difference between matmul3 and matmul4?
• If you have time, try with a different compiler.
MODERN COMPUTERS AND
OPTIMISATION
CPUs
• Most CPUs have multi-level caches, to reduce the time
taken to access data.
• A CPU can do a lot of work in the time it takes to access
main memory.
Data Locality
Location          Access time   Access time (cycles)
Register          <1 ns         -
L1 cache          1 ns          4
L2 cache          4 ns          10
L3 cache          15-30 ns      40-75
Memory            60 ns         150
Solid state disk  50 us         130,000
Hard disk         10 ms         26,000,000
Tape              10 s          26,000,000,000
Source: https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
Cache: Access Patterns
• Sequential access results in a high rate of cache hits.
• Strided access has a lower hit rate, but modern processors can detect striding access patterns and pre-fetch cache lines (compare the two loops sketched below).
• Random access is typically the worst, with a low hit rate and no ability to predict subsequent access locations.
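A small illustrative example in C (where arrays are row-major): both functions sum the same matrix, but the first walks memory sequentially while the second strides through it and touches a new cache line almost every iteration.

  #define N 1024
  static double a[N][N];   /* row-major in C */

  /* Cache-friendly: the inner loop visits consecutive memory locations. */
  double sum_sequential(void)
  {
      double s = 0.0;
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              s += a[i][j];
      return s;
  }

  /* Strided: the inner loop jumps N doubles ahead each iteration. */
  double sum_strided(void)
  {
      double s = 0.0;
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              s += a[i][j];
      return s;
  }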
Translation Lookaside Buffer
• The page table maps the memory seen by the program (virtual memory) onto physical memory. It sits in main memory and is slow to access.
• The Translation Lookaside Buffer (TLB) is a cache of recently used mappings (not of the actual content). As with the main caches, aim to work within virtual memory pages. These are typically 4 kB.
• On a Cray you can use larger pages:
  • Before compiling: module load craype-hugepages2M (the size chosen here does not matter)
  • At runtime: module load craype-hugepagesXXXM (here the size does matter)
Being cache-friendly
• Read and write to contiguous chunks of
memory. Data is transferred in a cache
line.
• Avoid cache misses.
Writing to Memory
• Don’t store data in RAM if you don’t need
to. Use local temporary variables instead.
• These could be optimised into registers.
• In particular, don’t use global variables as
local temporary variables.
• Similarly, avoid using array sections for temporary storage (see the sketch below).
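For example (hypothetical routine names), accumulating in a local variable lets the running value live in a register, with a single store at the end:

  /* The partial sum is stored back to rowsum[i] on every iteration; because
   * rowsum might alias a, the compiler must keep writing it to memory. */
  void row_sums_memory(const double *a, double *rowsum, int n, int m)
  {
      for (int i = 0; i < n; i++) {
          rowsum[i] = 0.0;
          for (int j = 0; j < m; j++)
              rowsum[i] += a[i*m + j];
      }
  }

  /* The partial sum lives in a local temporary (a register); only one store
   * per row reaches memory. */
  void row_sums_register(const double *a, double *rowsum, int n, int m)
  {
      for (int i = 0; i < n; i++) {
          double s = 0.0;
          for (int j = 0; j < m; j++)
              s += a[i*m + j];
          rowsum[i] = s;
      }
  }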
Instruction Pipelining
• Modern CPUs can complete multiple
Instructions Per Cycle (IPC), also known
as Instructions Per Clock.
• A typical Xeon core can complete up to 4 instructions per cycle. You need to keep the pipeline full to achieve this.
Loop Unrolling
Keep data in registers/cache via loop unrolling /
blocking with inlining. This can also improve IPC.
e.g. a(n)=somefunc(n) + a(n-1)
Unroll with stride 2:
do n=1,nmax,2
a(n)=somefunc(n) + a(n-1)
a(n+1)=somefunc(n+1) + a(n)
end do
Loop Counts
• Inner loops should have high iteration counts,
since loops themselves have a non-negligible
cost.
• Counter to this, very small inner loops may get
unrolled away. In this case do it manually.
Good:
  do j=1,10
    do n=1,10000
      work
    end do
  end do

Bad:
  do n=1,10000
    do j=1,10
      work
    end do
  end do
Knowing Loop Counts
• There is more potential for compiler
optimisation when the loop count is
known before the loop is started.
• Use “do”, “for” loops. Avoid “while” and
“do … while” loops.
Contiguous Memory
• Aim to work through contiguous chunks of
memory. Avoid unnecessary striding.
• In Fortran, the consecutive elements of a column
reside next to each other (column-major order).
• In C/C++, the consecutive elements of a row
reside next to each other (row-major order)
Fortran:
  do j=1,10
    do i=1,10
      A(i,j)=something
    end do
  end do

C / C++:
  for (i=0;i<10;i++)
    for (j=0;j<10;j++)
      a[i][j]=something;
Loop Blocking (1)
Loop blocking / tiling / strip-mining …
This code potentially contains many cache misses:
  do J=1,Jmax
    do I=1,Imax
      A(I,J) = A(I,J) + B(J,I)
    end do
  end do
A has stride 1, B has stride Jmax.
• Make the stride of B smaller so that small blocks of A and B sit in cache.
• If Imax and Jmax are very large, neither A nor B can sit in cache in full.
Loop Blocking (2)
[Diagram: before blocking, the A(I,J) access pattern is contiguous while the B(I,J) access pattern is non-contiguous; after blocking, both arrays are accessed in small blocks.]
• The Cray Fortran compiler does this for you.
Most other compilers currently do not.
Loop Blocking (3)
After applying the loop blocking technique:
  do J=1,Jmax,Jblksize
    do I=1,Imax,Iblksize
      do JJ=J,min(J+Jblksize-1,Jmax)
        do II=I,min(I+Iblksize-1,Imax)
          A(II,JJ) = A(II,JJ) + B(JJ,II)
        end do
      end do
    end do
  end do
• A is still accessed with stride 1 and B is still strided, but only an Iblksize x Jblksize block of each array is in use at a time.
• These small blocks of A and B sit in cache, so the cache is not exhausted.
Branching
• Remove branches from loops and change the loop
bounds. Branches are bad for IPC and pre-emptive
cache fetching.
• Avoid GOTO statements (in Fortran, C, C++). They can
affect cache/register use.
Before:
  do n=1,nmax
    a(n)=somefunc(n)
    if (n==1) a(n)=0
    if (n==nmax) a(n)=1
  end do

After:
  a(1)=0
  a(nmax)=1
  do n=2,nmax-1
    a(n)=somefunc(n)
  end do
Loop Fission
• Too much work in loops means that registers and/or
instruction cache may get exhausted.
• Perhaps only part of a loop is vectorisable (execute a
number of loop-iterations at the same time).
• Break these loops up.
Before:
  do n=1,nmax
    lots of work
    some I/O
  end do

After:
  do n=1,nmax
    lots of work
  end do
  do n=1,nmax
    some I/O
  end do
Loop Fusion
Before:
  do n=1,nmax
    a(n)=somefunc(n)
  end do
  do n=1,nmax
    b(n)=someotherfunc(n)
  end do

After:
  do n=1,nmax
    a(n)=somefunc(n)
    b(n)=someotherfunc(n)
  end do
• This is usually used in conjunction with other
techniques.
• Whether this is beneficial depends on the
work inside the loops. Too much work may
exhaust registers or cache.
Pointers
Avoid unnecessary use of pointers.
• Pointers might prevent some compiler
optimisations. They should be fine if they
have local scope.
• Pointers make it difficult to copy data to
GPU memory.
• Hard to optimise cache use, e.g. with linked lists vs contiguous arrays (contrasted in the sketch below).
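Illustrative types and names: every step of a linked-list traversal chases a pointer to a potentially distant address, while the equivalent array walk is contiguous and prefetchable.

  #include <stddef.h>

  /* Linked list: nodes can be scattered across the heap, so each step is a
   * potential cache miss. */
  struct node {
      double value;
      struct node *next;
  };

  double sum_list(const struct node *head)
  {
      double s = 0.0;
      for (const struct node *p = head; p != NULL; p = p->next)
          s += p->value;
      return s;
  }

  /* Contiguous array: the same data packed together streams through the
   * cache, and the hardware prefetcher can stay ahead of the loop. */
  double sum_array(const double *values, size_t n)
  {
      double s = 0.0;
      for (size_t i = 0; i < n; i++)
          s += values[i];
      return s;
  }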
MMX / 3DNOW / SSE / AVX
• These are Single-Instruction-Multiple-Data (SIMD)
operations.
• These are not part of standard Fortran / C / C++. They may
get deprecated over time.
• The compiler should insert these automatically. Write SIMD-friendly code (see the sketch below).
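A hedged example: a unit-stride loop with no branches and no possible aliasing, which compilers typically auto-vectorise at higher optimisation levels without vendor-specific intrinsics in the source.

  /* With restrict and a simple unit-stride body, the compiler is free to
   * generate SSE/AVX instructions for this loop automatically. */
  void saxpy(int n, float a, const float *restrict x, float *restrict y)
  {
      for (int i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }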
Inlining
• Function calls take time. You can remove this
time by placing the code into the calling routine.
• Compilers can inline code for you when using
high optimization levels. This is not guaranteed,
but you can make it more likely:
• In C / C++, put the code in the same file as the
calling routine. Use static functions.
• In C99 / C++, use the inline keyword (see the sketch below).
• In Fortran, put the code in the same file / module
as the calling routine. Use the compiler directive
“forceinline”.
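A minimal C sketch (hypothetical names): a small static inline helper defined in the same file as its caller is easy for the compiler to inline into the hot loop.

  /* Defined in the same translation unit and marked static inline (C99),
   * so the call overhead can be removed entirely. */
  static inline double squared_distance(double dx, double dy, double dz)
  {
      return dx*dx + dy*dy + dz*dz;
  }

  double max_squared_distance(const double (*p)[3], int n)
  {
      double best = 0.0;
      for (int i = 0; i < n; i++) {
          double d2 = squared_distance(p[i][0], p[i][1], p[i][2]);
          if (d2 > best)
              best = d2;
      }
      return best;
  }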
COMPILER OUTPUT
Cray Compiler Output
The Cray compiler will output an annotated version of the source file:
  ftn -rm mycode.f90
This outputs mycode.lst. Examine the annotated file to figure out what's going on.
Intel Compiler Output
• Optimisation reports.
• Compiler flag: -qopt-report=3
• Have a look at the man page for other
values to opt-report.
Exercise 3: Cray compiler output
• Run:
cd compiler_reports
make matrix.cray
• Check out the manpage if needed
• man crayftn
• Examine the output in matrix.lst.O*
• What has the compiler done with routine calls and
loops?
• Can you identify the reason for the timing results from the previous exercise?
Exercise 4: Intel compiler output
• Run:
module swap PrgEnv-cray PrgEnv-intel
cd compiler_reports
make matrix.intel
• Seek help: ifort -help reports
• Examine the output in matrix.optrpt.O*
• This might be a bit too much information. Scale back the reporting options in the Makefile.
PROFILING
Profiling phases
• Instrumentation: compile the source code
with extra compiler flags that enable the
recording of performance-relevant events.
• Measurement & analysis: run the
instrumented application on a
representative test case. Usually the
instrumented application is much slower
than the original one.
• Performance examination: collect and
analyse the measurement results.
Profilers
• On Cray supercomputers: Cray Tools
https://pubs.cray.com/content/S-2376/7.0.0/cray-performance-measurement-and-analysis-tools-user-guide/craypat
https://support.pawsey.org.au/documentation/display/US/Profiling+with+Cray+Tools
• Intel VTune: https://software.intel.com/en-us/intel-vtune-amplifier-xe
https://support.pawsey.org.au/documentation/display/US/Profiling+with+Intel+VTune
• Others: gprof, Arm MAP & Performance reports, etc.
https://support.pawsey.org.au/documentation/display/US/User+Training
https://support.pawsey.org.au/documentation/display/US/Training+Material
Profilers
• Profiling guide at Pawsey:
https://support.pawsey.org.au/documentation/display/US/Profiling
Full Profiling with CrayPAT
Sampling experiment
Instrumentation
• module load perftools
• Compile code, using Cray compiler wrappers (ftn, cc, CC) &
preserving object (.o) files
• pat_build myapp
• Generates executable named myapp+pat
Measurement & analysis
• Run ./myapp+pat as normal; this will generate an output directory myapp+pat+XX+YYs/ (or a .xf file for small runs)
• pat_report myapp+pat+XX+YYs/ > myapp.sampling.report
(this also generates a .ap2 file that can be viewed with Apprentice2, and a build-options.apa file to be used in a tracing experiment)
Performance examination
• Read myapp.sampling.report file
• or use Apprentice2 (with X11 forwarding activated: "ssh -X"):
app2 myapp+pat+XX+YYs/ &
Full Profiling with CrayPat
Tracing experiment
Instrumentation
• First, perform a sampling experiment to generate the file:
myapp+pat+XX+YYs/build-options.apa
• pat_build -O myapp+pat+XX+YYs/build-options.apa
• Essentially pat_build -w -T funcs -g grps -u myapp (can be edited to change -T funcs and -g grps)
• Generates executable named myapp+apa
Measurement & analysis
• Run ./myapp+apa as normal; this will generate an output directory myapp+apa+XX+YYt/ (or a .xf file for small runs)
• pat_report myapp+apa+XX+YYt/ > myapp.tracing.report
(this also generates a .ap2 file that can be viewed with Apprentice2)
Performance examination
• Read myapp.tracing.report file
• or use Apprentice2 (with X11 forwarding activated: "ssh -X"):
app2 myapp+apa+XX+YYt/ &
Exercise 5: profiling game of life (1)
Profile (sampling) the game_of_life code.
• module load perftools
• cd game_of_life
• make game_of_life.cray
• pat_build game_of_life.O3.cray
• sbatch run_game_profile.slurm
• pat_report game_of_life.O3.cray+pat+XX-YYs/ >
game_of_life.O3.cray.sampling.report
Exercise 5: profiling game of life (2)
Examine
game_of_life.O3.cray.sampling.report
• Where is all the time spent? What occurs on
these lines of the code?
• How good is our cache utilisation?
• Check optimisations in
game_of_life.O3.lst
• Try again with other levels of optimisation
(edit the Makefile)
Exercise 6: profiling matrix
multiplication (1) - sampling
Sampling experiment
• module load perftools
• cd profiling
• ftn -c -O2 matrix.f90
• ftn -o matrix.O2 matrix.o
• pat_build matrix.O2
• sbatch run_matrix_profile.slurm
• Generate the report:
pat_report matrix.O2+pat+XX+YYs/ > matrix.O2.sampling.report
• Examine matrix.O2.sampling.report
• You can also use Apprentice2: app2 matrix.O2+pat+XX+YYs/ &
Exercise 6: profiling matrix
multiplication (5) - tracing
Tracing experiment
• Edit the build_options file to define tracing options:
vim matrix.O2+pat+XX-YYs/build_options.apa
Have the following:
-g mpi,blas,io,heap
-T test_matmul1$timeit_
-T test_matmul2$timeit_
-T test_matmul3$timeit_
-T test_matmul4$timeit_
-T test_matmul5$timeit_
• Build the executable with the defined tracing options:
pat_build -O matrix.O2+pat+XX-YYs/build_options.apa
Exercise 6: profiling matrix (6) tracing
• Edit the job script: vim run_matrix_profile.slurm
Have: expType=apa
• sbatch run_matrix_profile.slurm
• Generate the report:
pat_report matrix.O2+apa+XX-YYt/ > matrix.O2.tracing.report
• Examine: matrix.O2.tracing.report
• You can also use Apprentice2: app2 matrix.O2+apa+XX+YYt/ &
• Use the previous exercises of the matrix multiplication (timing and
optimisation listing) to understand the results.
• Repeat the exercise with lower levels of optimisation.
Seek help
• man pat_build
• man pat_report
• pat_help all . > all_pat_help
CODING HABITS
Global variables
• Avoid global variables unless necessary.
They may make it difficult to convert to
multithreaded code in further development.
• Pass variables through routine calls (there is a slight performance overhead).
  • In Fortran, arguments can be given intent(in) and intent(out) attributes, which assists the compiler.
  • Scoping in OpenMP becomes much easier.
  • It may also assist auto-threading by compilers.
Parentheses in Fortran
• Try to avoid parentheses in Fortran; they force an evaluation and prevent arithmetic rearrangements. Use temporary variables instead.
• The compiler is not permitted to rearrange this:
  a = 2 * (c + d) - 2 * e
• The compiler is allowed to rearrange this:
  tmp = c + d
  a = 2 * tmp - 2 * e
Fortran Array Notation
Fortran array notation is convenient and easy to read, but current compilers may not optimise it well.
Some compilers are unlikely to fuse these operations:
  A(:,:)=1.0
  C(:,:)=A(:,:)+B(:,:)
In the meantime:
  do j=1,n
    do i=1,m
      A(i,j)=1.0
      C(i,j)=A(i,j)+B(i,j)
    end do
  end do
The Cray compiler does fuse array notation!
Low Level Object Oriented
Low-level object-oriented programming has the potential for poor performance, e.g. the strided (non-contiguous) memory accesses below.
  type atom_type
    integer :: atomic_number
    double precision :: mass
    double precision, dimension(3) :: position
  end type atom_type

  type(atom_type), dimension(n_atoms) :: atom_list

  total_mass=0
  do i=1,n_atoms
    total_mass=total_mass+atom_list(i)%mass
  end do
Low Level Object Oriented (2)
Less organised but faster code:
  integer, dimension(n_atoms) :: atomic_numbers
  double precision, dimension(n_atoms) :: masses
  double precision, dimension(3,n_atoms) :: positions

  total_mass=sum(masses)
Set a level for the trade-off between
maintainability/extensibility and performance.
Special Case Code
Assume we have a code that handles arrays of
varying length, and;
• the code creates temporary arrays;
• in practice an array of length 1 is the most
common situation.
Optimisation: write a separate routine for arrays of length 1, and use temporary variables rather than temporary arrays (sketched below).
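A hedged version of that idea in C (the routines and the "work" are hypothetical; error handling omitted):

  #include <stdlib.h>

  /* General routine: needs a heap-allocated temporary array. */
  static void process_general(double *data, size_t n)
  {
      double *work = malloc(n * sizeof *work);
      for (size_t i = 0; i < n; i++)
          work[i] = 2.0 * data[i];      /* stand-in for the real work */
      for (size_t i = 0; i < n; i++)
          data[i] = work[i] + 1.0;
      free(work);
  }

  /* Special case for the common n == 1 call: a scalar temporary, no
   * allocation, and a body the compiler can optimise fully. */
  static void process_one(double *data)
  {
      double work = 2.0 * data[0];
      data[0] = work + 1.0;
  }

  void process(double *data, size_t n)
  {
      if (n == 1)
          process_one(data);
      else
          process_general(data, n);
  }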
I/O
• Often the processor is doing little while
waiting for I/O.
• Ways to reduce I/O overhead:
  • Use buffering (don't disable it, and avoid unnecessary flushes).
  • Output in binary, not formatted text (see the sketch below).
  • Use I/O libraries.
• Hierarchical Data Format (HDF5) is the
name of a set of file formats and libraries
designed to store and organize large
amounts of numerical data.
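As a small illustration of the binary-versus-formatted point (hypothetical helper functions; error handling omitted):

  #include <stdio.h>

  /* Formatted text: every value is converted to characters, which is slow
   * and produces a much larger file. */
  void write_text(const char *path, const double *x, size_t n)
  {
      FILE *f = fopen(path, "w");
      for (size_t i = 0; i < n; i++)
          fprintf(f, "%.17g\n", x[i]);
      fclose(f);
  }

  /* Unformatted binary: one buffered bulk write, no conversion. */
  void write_binary(const char *path, const double *x, size_t n)
  {
      FILE *f = fopen(path, "wb");
      fwrite(x, sizeof *x, n, f);
      fclose(f);
  }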
Exercise 7: I/O
Observe the effects of I/O techniques on
performance.
• module load cray-hdf5
• cd iobench && make iobench_hdf5
• sbatch run_iobench_hdf5.slurm
Look at the SLURM output file.
Use version control
• Some of your attempts at optimisation will
need to be undone. E.g. due to:
• Incorrect results.
• Slower performance.
• Use version control software, e.g. git or subversion. You should be using this anyway.
• Use informative commit messages.
MATH LIBRARIES
Popular libraries
• BLAS: basic linear algebra such as matrix-vector or matrix-matrix
operations.
• LAPACK:
• Simple matrix/vector operations
• Linear equations solvers
• Linear least squares
• Eigensolvers
• Singular value decomposition
• Real + Complex
• FFTW: fast Fourier transforms, real/complex.
Optimised vendor versions are available, e.g. Intel MKL, Cray LibSci, SGI SCSL, IBM ESSL. Some are multi-threaded.
Other libraries
• PLAPACK - better scaling eigensolver
(MRRR algorithm)
• PARPACK – sparse eigensolver
• MUMPS – parallel sparse direct solver
• Hypre – parallel linear solver
• Scotch – graph partitioning
• SuperLU – parallel sparse linear solver
• Many of these are available via the cray-tpsl module.
Intel Math Libraries
Intel MKL includes BLAS, LAPACK and FFT
libraries. It consists of multiple libraries – use the
Intel advisor to work out the compiler link options:
http://software.intel.com/sites/products/mkl/MKL_Link_Line_Advisor.html
Example output:
-L$(MKLROOT)/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm
$(MKLROOT) is set by “module load intel”
PETSc / SLEPc
• An attempt at a black-box solver suite: it makes it easy to swap between various solvers in various libraries, and it links to common libraries.
• C/C++ and Fortran interfaces
• Linear equation solvers
• Eigensolvers
• Dense and Sparse
• various finite element solver add-ons
Finish
• What’s next?:
• Come to the parallel course.
• Read optimisation guides from vendors.
Intel and Cray in particular.
• Slides are available at
https://support.pawsey.org.au/documentation/display/US/Training+Material