Independent University, Bangladesh
Department of Computer Science and Engineering
Course Title: Introduction to High Performance Computing
Course Code: Autumn-2024-CSC471
SECTION 1: (T) 06:30 PM – 09:30 PM
Presented by
Dr. Rubaiyat Islam
Adjunct Faculty, IUB.
Omdena Bangladesh Chapter Lead
Crypto-economist Consultant
Sifchain Finance, USA.
WHAT IS APPLICATION TIMING?
• Analysis of a program’s behavior using information gathered
as the program runs
• Why do it?
• Good way to improve efficiency of scripts
• Identify performance problems
• Often required for allocation requests on shared systems
HOW WOULD YOU DO IT?
• Measure time of execution of an entire program or simply
a code snippet
• Loops
• Timing functions within programs
• Python, Fortran, C++, R have functions that allow you to
measure the execution time of small code snippets
• Can also do this with the Linux “time” command
• Can make changes to code to improve efficiency
• Or…it’s just informational
THE LINUX TIME UTILITY
• The first place to start when profiling your program
time mpirun -np 4 ./prog.mpi

real  0m17.801s   wall-clock time
user  0m58.125s   CPU time summed over all threads (≈ threads × wall-clock)
sys   0m0.081s    system (kernel) overhead
FINE-GRAINED TIMING
• Often useful to time portions of a program
• Good idea when developing your own code
• Tough when it’s 3rd-party software
• Useful functions:
• Fortran: system_clock()
• C++: clock()
• Python: time.time()
• R: Sys.time()
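A minimal Python sketch of fine-grained timing with time.time(); the snippet being timed (a sum of squares) is purely illustrative:

```python
import time

start = time.time()
total = sum(i * i for i in range(100_000))  # the code snippet being timed
elapsed = time.time() - start
print(f"snippet took {elapsed:.6f} s")
```

For very short snippets, time.perf_counter() offers higher resolution than time.time().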
SERIAL VS. PARALLEL PROCESSING
• Serial processing
• A problem is broken into a set of discrete instructions
• These instructions are carried out sequentially on a single
processor
• Only one instruction is executed at a time
• Parallel processing
• Idea where many instructions are carried out simultaneously
across a computing system
• Can divide a large problem up into many smaller problems
WHY PARALLELIZE?
• Single core too slow for solving the problem in a
“reasonable” time
• “Reasonable” time: overnight, over lunch, duration of a PhD
thesis
• Memory requirements
• Larger problem
• More physics
• More particles
• Larger images
• Larger neural networks
BASIC COMPUTER ARCHITECTURE
• Old computers: one unit to execute instructions
• New computers: 4 or more CPU cores
SERIAL PROCESSING – THOUGHT EXPERIMENT
• Let’s say you own a lawn service company
• You have one hundred clients, each of whom wants their lawn mowed in a particular pattern
• Each wants their lawn mowed by the end of the week
• A serial process would be for you to mow all one hundred lawns yourself
• You cannot mow lawn 2 until you finish lawn 1, etc.
• Let’s say doing this takes you the full 7 days, working 16-hour days
SERIAL PROCESSING
• Instructions are executed on
one core
• The other cores sit idle
• Tasks run one at a time: Task 2 waits for Task 1 to complete, etc.
• Wasting resources
• Want to instead parallelize and
use all cores
PARALLEL PROCESSING – THOUGHT EXPERIMENT
• Let’s say that you decide that 100 lawns is too many for one
person to mow in a week
• Or you want to finish it faster
• Therefore you hire one additional person to help you
• How long (in theory) should it take you to finish the lawns?
• Either 3.5 days working 16-hour days, or 7 days working 8-hour days
• You could accomplish this either by both working on one lawn
together or each of you working on a different lawn at the
same time (more on this later)
PARALLEL PROCESSING – THOUGHT EXPERIMENT
• Similarly, you could hire three more people
• Now five total
• How long should it take you to finish?
• In theory, one-fifth the time (five times faster)
• However, it doesn’t actually work out this way. Why?
• Overhead
• Communication
• Who is mowing which lawn?
• If you split a lawn, who mows which parts?
• How do you make sure the patterns match up?
PARALLEL PROCESSING – THOUGHT EXPERIMENT (CONT.)
• However, it doesn’t actually work out this way. Why?
• Resource contention
• Fights over who gets to use the best lawn mower
• So maybe instead of five times as fast it’s four times as fast
• Still faster
• More people?
• Too many people slows down the process too much to make it
worthwhile
• Diminishing return
• 100 might be too many
PARALLEL OVERHEAD
• Should you convert your serial code to parallel?
• Usually done to speed up execution
• But you need to weigh the overhead
• Overhead comes from:
• Startup time
• Synchronization
• Communication
• Libraries and compilers
• Termination time
https://computing.llnl.gov/tutorials/parallel_comp/#ModelsShared
PROGRAMMING TO USE PARALLELISM
• Parallelism across cores/threads within a node – OpenMP
• Parallelism across multiple nodes – MPI
PARALLEL PROCESSING MUSTS AND TRICKS
• Need to be able to break the problem up into parts that
can work independently of each other
• Can’t have the results from one CPU depend on another at
each time step
• Do loops are a great place to start looking for bottlenecks
in your code
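The independence requirement above can be sketched in Python (arrays are illustrative):

```python
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# Parallelizable: each iteration touches only its own index,
# so iterations could run on different cores independently.
c = [0] * len(a)
for i in range(len(a)):
    c[i] = a[i] + b[i]

# Not directly parallelizable: iteration i needs the result of
# iteration i - 1 (a loop-carried dependence).
s = [0] * len(a)
s[0] = a[0]
for i in range(1, len(a)):
    s[i] = s[i - 1] + a[i]

print(c)  # [11, 22, 33, 44]
print(s)  # [1, 3, 6, 10]
```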
MEMORY MODELS
• There are three common kinds of parallel memory models
• Shared
• Distributed
• Hybrid
SHARED MEMORY MODEL
SHARED MEMORY MODEL
• All cores share the same pool of memory
• HPC Architecture – we talked about the memory available on one node
• Memory changes are seen by all processors
THOUGHT EXPERIMENT
• Let’s go back to our lawn mowing example
• From the serial vs. parallel processing
• In this example, the lawns are the memory
• The workers are the cores
• When all the workers are working on one lawn, they are sharing the memory
• Every “core” is impacted by changes to “memory”
BENEFITS AND DRAWBACK
• Benefit:
• Data sharing is fast
• Drawback:
• Adding more processors may lead to performance issues when accessing the same
shared memory resource (memory contention)
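A small Python threading sketch of the shared memory model: every thread sees the same counter, and a lock is needed precisely because the memory is shared (the counter and loop counts are illustrative):

```python
import threading

counter = 0              # one pool of memory, visible to every thread
lock = threading.Lock()

def work(n):
    global counter
    for _ in range(n):
        with lock:       # synchronization: threads contend for shared data
            counter += 1

threads = [threading.Thread(target=work, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 4000: all updates landed in the one shared pool
```

The lock is also where the drawback shows up: the more threads fight over it, the more time is lost to contention.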
DISTRIBUTED MEMORY MODEL
DISTRIBUTED MEMORY MODEL
• In a distributed memory model, each core has its own memory
• Processors share data only through a network connection and/or communication protocol (e.g., MPI)
• Changes to a processor’s local memory have no impact on other processors
• Remote-memory access must be explicitly managed by the programmer
THOUGHT EXPERIMENT
• Let’s go back to our lawn mowing example
• From the serial vs. parallel processing
• In this example, the lawns are the memory
• The workers are the cores
• When each worker is working on a different lawn, it is distributed memory
BENEFITS AND DRAWBACKS
• Biggest benefit is scalability
• Adding more processors doesn’t result in resource contention as far as memory is
concerned
• Biggest Drawback
• Can be tedious to program for distributed memory models
• All data relocation must be programmed by hand
HYBRID MEMORY MODEL
HYBRID MEMORY MODEL
• As the name implies, the hybrid memory model is a combination of the shared
and distributed memory models
• Most large, fast clusters today use a hybrid memory model
• A certain number of cores share the memory on one node, but are connected to
the cores sharing memory on other nodes through a network
THOUGHT EXPERIMENT
• Let’s go back to our lawn mowing example
• From the serial vs. parallel processing
• In this example, the lawns are the memory
• The workers are the cores
• Several workers share one lawn – like cores sharing memory within a node
• Workers on different lawns must communicate to coordinate the work – like nodes exchanging messages over the network
• E.g., to make the patterns match up
BENEFITS AND DRAWBACKS
• Benefit:
• Scalability
• Drawback
• Must know how to program communication between
nodes (e.g., MPI)
DATA AND TASK PARALLELISM
• Earlier we discussed parallel memory models
• One of them was distributed memory, wherein each core accesses its own pool of memory
• Data and task parallelism are related concepts
• Data parallelism
• Distribute the data across processors
• Task parallelism
• Distribute the compute tasks across processors
DATA PARALLELISM
• Different parts of a dataset are distributed across nodes
array1 = [a, b, c, d]

NODE 1: a, b        NODE 2: c, d
TASK PARALLELISM
• Each processor executes a different task on the same
dataset
• Tasks (code, instructions) are spread out among the cores
• Might be same instructions/code or different
• Distributed programming
• Example: Calculating wind speed from vector components
across a geographic area. Divide vector calculation among
processors
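The wind example above can be sketched in Python with a thread pool; the component values and the worker split are illustrative, not from the slides:

```python
import math
from concurrent.futures import ThreadPoolExecutor

u = [3.0, 6.0, 0.0, 5.0]   # east-west wind components
v = [4.0, 8.0, 2.0, 12.0]  # north-south wind components

def wind_speed(pair):
    ui, vi = pair
    return math.hypot(ui, vi)  # sqrt(ui**2 + vi**2)

# Divide the per-point calculation among worker threads
with ThreadPoolExecutor(max_workers=4) as pool:
    speeds = list(pool.map(wind_speed, zip(u, v)))

print(speeds)  # [5.0, 10.0, 2.0, 13.0]
```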
DATA PARALLELISM - SIMD
• Two types of data parallelism we’ll discuss here
• SIMD – Single Instruction, Multiple Data
• SPMD – Single Program, Multiple Data
• SIMD
• Carry out the same instruction simultaneously multiple times across
different elements of a dataset
• Vector operation
• Addition, subtraction, multiplication, division
• Have to prepare your data to be vectorized
VECTORIZATION
• Simply put, performing multiple math operations at once

Non-vectorized:

    a = rand(1,4)
    b = rand(1,4)
    for i = 1:length(a)
        c(i) = a(i) + b(i)
    end

Vectorized:

    a = rand(1,4)
    b = rand(1,4)
    c = a + b

• Interpreted languages (Python, R, etc.): call vectorized functions
• Compiled languages: the compiler can often vectorize loops for you
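The same contrast in Python, assuming NumPy is installed (the fixed seed is only for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(4)
b = rng.random(4)

# Non-vectorized: explicit element-by-element loop
c_loop = np.empty_like(a)
for i in range(len(a)):
    c_loop[i] = a[i] + b[i]

# Vectorized: one array operation replaces the whole loop
c_vec = a + b

assert np.allclose(c_loop, c_vec)
```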
DATA PARALLELISM - SPMD
• SPMD
• Carry out the same program multiple times on different elements of a dataset
• Calculate the wind direction from wind components
(a1, b1) → Program → c1
(a2, b2) → Program → c2
WHY DO THIS?
• Cleaner code
• Faster execution time
• Eliminating loops!
• Usually not too challenging
• Many languages have functions that make this easy to perform
HIGH THROUGHPUT COMPUTING
• Thus far: High Performance Computing (HPC)
• Typical HPC: employ multiple processors to
• Solve a problem faster
• Solve larger problems
• Today: High Throughput Computing (HTC)
• Typical HTC:
• Multiple small jobs spread across many processors
HIGH THROUGHPUT COMPUTING
• HTC useful when have many small jobs that require little computational power
or memory
• Jobs are typically serial, not parallel
• HTC advantage:
• Small serial jobs can fill in the “gaps” left by large parallel jobs
• E.g., Open Science Grid
• Effectively parallel: batch of jobs completes faster when spread across multiple cores
• Example: Image analysis
ADVANTAGES AND DISADVANTAGES
• Advantages
• Simplicity
• Much easier to match one task to one CPU rather than many at
once
• Doesn’t require the programmer to know parallel programming
• Disadvantage
• Your HPC center might not be set up ideally for HTC
• Might not allow for node sharing
• No batch submission system to manage multiple small jobs
• Possibly requires heavy scripting to manage workflow
HTC MECHANICS
• No real tricks
• Break down your problem and then submit a lot of smaller jobs
• For example, if you are analyzing 1 million images, rather than submitting
one job to analyze all 1 million images, submit one thousand jobs that
analyze 1000 images each
• The resource manager (e.g., Slurm) should take care of the rest
• Is HTC appropriate? Problem dependent!
• Is the serial execution time reasonable?
• Can the problem fit into one core’s worth of memory?
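The image-analysis split above can be sketched in Python (file names and batch size are illustrative):

```python
# Split 1,000,000 image names into 1,000 batches of 1,000 each;
# one job per batch would then be submitted to the scheduler (e.g., Slurm).
images = [f"img_{i:07d}.png" for i in range(1_000_000)]
batch_size = 1000
batches = [images[i:i + batch_size]
           for i in range(0, len(images), batch_size)]
print(len(batches))     # 1000 jobs
print(len(batches[0]))  # 1000 images per job
```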
THANK YOU