Parallel Programming
Course Introduction
Phạm Trọng Nghĩa
ptnghia@fit.hcmus.edu.vn
Why do we need parallelism?
• Many applications demand more execution speed and resources
• The rate of single-instruction-stream performance scaling has decreased
  • Frequency scaling is limited by power
  • ILP scaling has tapped out
• Architects now build faster processors by adding more execution units that run in parallel
• Software must be written to be parallel to see performance gains
CPU vs GPU

CPU (multicore):
• Has a few cores; each core is powerful and complex
• Focuses on execution speed

GPU (many-core):
• Has many, many cores; each core is weak and simple
• Focuses on throughput

Image source: http://www.nvidia.com/object/what-is-gpu-computing.html
CPU vs GPU

CPU:
• Has a few cores; each core is powerful and complex
• Focuses on optimizing latency; latency = the amount of time to complete one task

GPU:
• Has many, many cores; each core is weak and simple
• Focuses on optimizing throughput; throughput = the number of tasks completed per unit of time

Example: the task is transporting people from location A to location B; the distance from A to B is 4500 km.
• Car: carries 2 people, travels at 200 km/h. Latency = ? h; Throughput = ? people/h
• Bus: carries 40 people, travels at 50 km/h. Latency = ? h; Throughput = ? people/h
CPU vs GPU

CPU:
• Has a few cores; each core is powerful and complex
• Focuses on optimizing latency; latency = the amount of time to complete one task

GPU:
• Has many, many cores; each core is weak and simple
• Focuses on optimizing throughput; throughput = the number of tasks completed per unit of time

Example: the task is transporting people from location A to location B; the distance from A to B is 4500 km.
• Car: carries 2 people, travels at 200 km/h. Latency = 22.5 h; Throughput = 0.09 people/h
• Bus: carries 40 people, travels at 50 km/h. Latency = 90 h; Throughput = 0.44 people/h

So, is the car or the bus better?
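As a quick check of the arithmetic, using the definitions above:

$$\text{latency} = \frac{\text{distance}}{\text{speed}}, \qquad \text{throughput} = \frac{\text{passengers carried}}{\text{latency}}$$

$$\text{Car: } \frac{4500}{200} = 22.5\text{ h}, \quad \frac{2}{22.5} \approx 0.09\text{ people/h} \qquad \text{Bus: } \frac{4500}{50} = 90\text{ h}, \quad \frac{40}{90} \approx 0.44\text{ people/h}$$

The car is better at latency (one trip finishes sooner); the bus is better at throughput (more people delivered per hour). This is exactly the CPU vs GPU trade-off.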
CPU vs GPU

CPU: a 24-core Intel multicore server microprocessor
• 0.33 TFLOPS for double precision and 0.66 TFLOPS for single precision

GPU: NVIDIA Tesla A100
• 108 SMs, 6912 CUDA cores, and 432 Tensor cores
• 9.7 TFLOPS for 64-bit double precision, 156 TFLOPS for 32-bit single precision, and 312 TFLOPS for 16-bit half precision

FLOPS = FLoating-point Operations Per Second; TFLOPS = TeraFLOPS
CPU: Latency-oriented design
• Powerful ALUs
  • Reduce operation latency
  • At the cost of increased chip area and power
• Large caches
  • Convert long-latency memory accesses into short-latency cache accesses
• Sophisticated control
  • Branch prediction for reduced branch latency (the branch is predicted in advance so execution does not have to wait)
  • Data forwarding for reduced data latency

The CPU reduces the execution latency of each individual thread.
GPU: Throughput-oriented design
• Small caches
  • To boost memory throughput
• Simple control
  • No branch prediction
  • No data forwarding
• Energy-efficient ALUs
  • Many, with long latency, but heavily pipelined for high throughput
• Requires a massive number of threads to tolerate latencies
  • Threading logic
  • Thread state

The GPU maximizes the total execution throughput of all threads.
CPU + GPU
CUDA (Compute Unified Device Architecture) C/C++ is an extension of C/C++ that lets us write a single program taking advantage of both the CPU and an NVIDIA GPU: the sequential parts run on the CPU, and the massively parallel parts run on the GPU.
Image source: John Cheng et al. Professional CUDA C Programming. 2014
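As a first taste of this split (vector addition is the opening example of the course), here is a minimal CUDA C sketch; the kernel and variable names are illustrative, not a required style:

#include <stdio.h>

// Parallel part: each GPU thread adds one pair of elements
__global__ void addVec(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t nBytes = n * sizeof(float);

    // Sequential part on the CPU: allocate and initialize input data
    float *a = (float *)malloc(nBytes);
    float *b = (float *)malloc(nBytes);
    float *c = (float *)malloc(nBytes);
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0f * i; }

    // Copy the inputs to GPU memory
    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, nBytes);
    cudaMalloc((void **)&dB, nBytes);
    cudaMalloc((void **)&dC, nBytes);
    cudaMemcpy(dA, a, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, nBytes, cudaMemcpyHostToDevice);

    // Massively parallel part on the GPU: one thread per element
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
    addVec<<<gridSize, blockSize>>>(dA, dB, dC, n);

    // Copy the result back and spot-check it on the CPU
    cudaMemcpy(c, dC, nBytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %f (expected %f)\n", c[100], 100.0f + 200.0f);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(a); free(b); free(c);
    return 0;
}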
CPU + GPU
• Core ("pit") portions: sequential code
  • These portions are very hard to parallelize
  • CPUs tend to do a very good job on them
  • They take up a large portion of the code, but only a small portion of the execution time
• "Peach flesh" portions:
  • Easy to parallelize
  • Parallel programming on heterogeneous computing systems can drastically improve the speed of these applications
Applications of parallel programming on GPU
Image source: http://www.nvidia.com/object/gpu-
Challenges in parallel programming
• Question: Is parallel programming easy or hard?
• Answer:
  • Easy: if you do not care about performance and just want the program to run
  • Hard: when you want to optimize it and get higher performance
Challenges in parallel programming
• It is challenging to design parallel algorithms with the same level of algorithmic (computational) complexity as that of sequential algorithms
  • Some parallel algorithms do more work than their sequential counterparts (for example, a naive parallel prefix sum performs O(n log n) additions, versus the n - 1 additions of a sequential scan)
  • Parallelizing often requires non-intuitive ways of thinking about the problem and may require redundant work during execution
• The execution speed of many applications is limited by memory access latency and/or throughput
  • This requires methods for improving memory access speed
Challenges in parallel programming
• The execution speed of parallel programs is often more sensitive to input data characteristics than is the case for their sequential counterparts
  • Unpredictable data sizes and uneven data distributions
• Some applications require threads to collaborate with each other
  • Using synchronization operations such as barriers or atomic operations (a small CUDA sketch follows below)

Most of these challenges have been addressed by researchers.
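In CUDA (covered later in the course), this collaboration shows up as calls such as __syncthreads() (a barrier) and atomicAdd() (an atomic operation); a small illustrative kernel:

// Each block counts how many of its elements are positive;
// threads collaborate through an atomic update and barriers.
__global__ void countPositive(const float *data, int n, int *count) {
    __shared__ int blockCount;         // shared by all threads in the block
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();                   // barrier: wait until blockCount is initialized

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)
        atomicAdd(&blockCount, 1);     // atomic: safe concurrent increment
    __syncthreads();                   // barrier: wait until all threads have added

    if (threadIdx.x == 0)
        atomicAdd(count, blockCount);  // one thread publishes the block's total
}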
3 Ways to Accelerate Applications
Libraries: Easy, High-Quality
• Ease of use: enables GPU acceleration without in-depth knowledge of GPU programming
• "Drop-in": many GPU-accelerated libraries follow standard APIs, thus enabling acceleration with minimal code changes
• Quality: libraries offer high-quality implementations of functions encountered in a broad range of applications
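For example, replacing a hand-written SAXPY loop with the corresponding cuBLAS routine is close to a drop-in change. A minimal sketch, with illustrative data and sizes (compile with !nvcc file-name.cu -o run-file-name -lcublas):

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1024;
    float x[1024], y[1024];
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // Move the vectors to GPU memory
    float *dX, *dY;
    cudaMalloc((void **)&dX, n * sizeof(float));
    cudaMalloc((void **)&dY, n * sizeof(float));
    cudaMemcpy(dX, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dY, y, n * sizeof(float), cudaMemcpyHostToDevice);

    // y = alpha * x + y, computed on the GPU by the library;
    // no kernel code is written by hand
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 3.0f;
    cublasSaxpy(handle, n, &alpha, dX, 1, dY, 1);
    cublasDestroy(handle);

    cudaMemcpy(y, dY, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expected 5.0)\n", y[0]);

    cudaFree(dX); cudaFree(dY);
    return 0;
}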
NVIDIA GPU Accelerated Libraries
https://developer.nvidia.com/gpu-accelerated-libraries
Compiler Directives: Easy, Portable
• Ease of use: the compiler takes care of the details of parallelism management and data movement
• Portable: the code is generic, not specific to any type of hardware, and the directives are available in multiple languages (C, C++, Fortran)
• Uncertain: the performance of the code can vary across compiler versions
Compiler Directives: OpenACC
https://ulhpc-tutorials.readthedocs.io/en/latest/gpu/openacc/basics/
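A brief illustration of the directive style, assuming an OpenACC-capable compiler such as NVIDIA's nvc (compiled with nvc -acc); the loop body is ordinary C, and the compiler generates the GPU code and data movement:

// Vector addition with OpenACC: one directive parallelizes the loop
void vecAdd(const float *a, const float *b, float *c, int n) {
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}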
Programming Languages: Most Performance and Flexibility
• Performance: the programmer has the best control of parallelism and data movement
• Flexible: the computation does not need to fit into a limited set of library patterns or directive types
• Verbose: the programmer often needs to express more details
Programming Languages: Most Performance and Flexibility
• Numerical analytics: MATLAB, Mathematica, LabVIEW
• Python: PyCUDA, Numba
• Fortran: CUDA Fortran, OpenACC
• C: CUDA C, OpenACC
• C++: CUDA C++, Thrust
• C#: Hybridizer
Course topics:
• Introduction to CUDA; examples: vector addition, convolution, … (3 weeks)
• GPU parallel execution in CUDA; example: reduction, … (4 weeks)
• Types of GPU memories in CUDA; examples: reduction, convolution, … (3 weeks)
• Examples: scan, histogram, sort (4 weeks)
• Optimizing a CUDA program; additional topics in parallel programming (1 week)

After successfully completing the course, the student will be able to:
• Parallelize common tasks to run on the GPU using CUDA
• Apply knowledge of GPU parallel execution in CUDA to speed up a CUDA program
• Apply knowledge of GPU memories in CUDA to speed up a CUDA program
• Apply the optimization process to optimize a CUDA program
• Apply teamwork skills to complete the final project
Course assessment
• Individual exercises throughout the course: 50% of the grade
• Group final project: 50% of the grade, 2 students per group
Course assessment
Remember: the main goal is to learn, truly learn.
You can discuss ideas with others and consult Internet sources, but your writing and code must be your own, based on your own understanding.
If you violate this rule, you will get a score of 0 for the course.
Advice
• In this course, we will focus on parallel programming on the GPU (Graphics Processing Unit)
• Don't worry if you don't have a GPU ;-)
• We will use Google Colab for this course
Setting up the coding environment
• Where to find a machine with a CUDA-enabled GPU?
  • Google Colab: it's free and ready to run CUDA programs ☺
  • Even if you have your own GPU, you should use Google Colab, because the teacher will use it to run and grade your programs
• Code, compile, and run:
  • Write and save your code (a .cu file) on your local machine with your favorite editor (if your editor does not recognize .cu files and does not highlight the syntax, the simple fix is to set the language/syntax to C/C++)
  • Open a notebook in Colab (you must sign in with your Gmail account), select "Runtime, Change runtime type", set "Hardware accelerator" to GPU, and upload the .cu file
  • In a Colab cell, compile: !nvcc file-name.cu -o run-file-name
    • If you don't specify run-file-name, it defaults to a.out
  • In a Colab cell, run: !./run-file-name
• Demo … (a minimal test program is sketched below)
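As a quick sanity check of the setup, a minimal test program (the file name hello.cu is just an example) could be:

#include <stdio.h>

// Each GPU thread prints its own index
__global__ void helloFromGpu() {
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

int main() {
    helloFromGpu<<<1, 4>>>();  // launch 1 block of 4 threads
    cudaDeviceSynchronize();   // wait for the GPU and flush its printf output
    printf("Hello from CPU\n");
    return 0;
}

In a Colab cell: !nvcc hello.cu -o hello, then !./hello.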
RESOURCES
• Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2022.
• David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors. Morgan Kaufmann, 2016.
• John Cheng, Max Grossman, and Ty McKercher. Professional CUDA C Programming. John Wiley & Sons, 2014.
• Lê Hoài Bắc, Vũ Thanh Hưng, and Trần Trung Kiên. Lập trình song song trên GPU (Parallel Programming on GPUs). NXB KH & KT, 2015.
• NVIDIA. Intro to Parallel Programming. Udacity.
• NVIDIA. CUDA Toolkit Documentation.
References
• [1] Slides from the Illinois-NVIDIA GPU Teaching Kit
• [2] Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2022.
THE END