Parallel Programming
Course Introduction
Phạm Trọng Nghĩa
ptnghia@fit.hcmus.edu.vn
Why do we need parallelism?
• Many applications demand more execution speed and resources
• The rate of single-instruction-stream performance scaling has decreased
  • Frequency scaling is limited by power
  • ILP scaling has tapped out
• Architects now build faster processors by adding more execution units that run in parallel
• Software must be written to be parallel to see performance gains
CPU vs GPU

CPU (multicore):
• Has a few cores; each core is powerful and complex
• Focuses on execution speed

GPU (many-core):
• Has many, many cores; each core is weak and simple
• Focuses on throughput

Image source: http://www.nvidia.com/object/what-is-gpu-computing.html
CPU vs GPU

CPU:
• Has a few cores; each core is powerful and complex
• Focuses on optimizing latency; latency = the amount of time to complete one task

GPU:
• Has many, many cores; each core is weak and simple
• Focuses on optimizing throughput; throughput = the number of tasks completed per unit of time

Example: the task is transporting people from location A to location B; the distance from A to B is 4500 km.
• Car: carries 2 people, travels at 200 km/h. Latency = ? h; Throughput = ? people/h
• Bus: carries 40 people, travels at 50 km/h. Latency = ? h; Throughput = ? people/h
CPU vs GPU

CPU:
• Has a few cores; each core is powerful and complex
• Focuses on optimizing latency; latency = the amount of time to complete one task

GPU:
• Has many, many cores; each core is weak and simple
• Focuses on optimizing throughput; throughput = the number of tasks completed per unit of time

Example: the task is transporting people from location A to location B; the distance from A to B is 4500 km.
• Car: carries 2 people, travels at 200 km/h. Latency = 22.5 h; Throughput = 0.09 people/h
• Bus: carries 40 people, travels at 50 km/h. Latency = 90 h; Throughput = 0.44 people/h

So, is the car or the bus better?
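As a quick check of the arithmetic, using the definitions above:

$$\text{latency} = \frac{\text{distance}}{\text{speed}}, \qquad \text{throughput} = \frac{\text{passengers carried}}{\text{latency}}$$

$$\text{Car: } \frac{4500}{200} = 22.5\text{ h}, \quad \frac{2}{22.5} \approx 0.09\text{ people/h} \qquad \text{Bus: } \frac{4500}{50} = 90\text{ h}, \quad \frac{40}{90} \approx 0.44\text{ people/h}$$

The car is better at latency (one trip finishes sooner); the bus is better at throughput (more people delivered per hour). This is exactly the CPU vs GPU trade-off.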
CPU vs GPU

CPU: a 24-core Intel multicore server microprocessor
• 0.33 TFLOPS for double precision and 0.66 TFLOPS for single precision

GPU: NVIDIA Tesla A100
• 108 SMs, 6912 CUDA cores, and 432 Tensor cores
• 9.7 TFLOPS for 64-bit double precision, 156 TFLOPS for 32-bit single precision, and 312 TFLOPS for 16-bit half precision

FLOPS = FLoating-point Operations Per Second; TFLOPS = TeraFLOPS
CPU: Latency-oriented design
• Powerful ALUs
  • Reduce operation latency
  • At the cost of increased chip area and power
• Large caches
  • Convert long-latency memory accesses into short-latency cache accesses
• Sophisticated control
  • Branch prediction for reduced branch latency (the branch is predicted in advance so execution does not have to wait)
  • Data forwarding for reduced data latency

The CPU reduces the execution latency of each individual thread.
GPU: Throughput-oriented design
• Small caches
  • To boost memory throughput
• Simple control
  • No branch prediction
  • No data forwarding
• Energy-efficient ALUs
  • Many, with long latency, but heavily pipelined for high throughput
• Requires a massive number of threads to tolerate latencies
  • Threading logic
  • Thread state

The GPU maximizes the total execution throughput of all threads.
CPU + GPU
CUDA (Compute Unified Device Architecture) C/C++ is an extension of C/C++ that lets us write a single program taking advantage of both the CPU and an NVIDIA GPU: the sequential parts run on the CPU, and the massively parallel parts run on the GPU.
Image source: John Cheng et al. Professional CUDA C Programming. 2014
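As a first taste of this split (vector addition is the opening example of the course), here is a minimal CUDA C sketch; the kernel and variable names are illustrative, not a required style:

#include <stdio.h>

// Parallel part: each GPU thread adds one pair of elements
__global__ void addVec(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t nBytes = n * sizeof(float);

    // Sequential part on the CPU: allocate and initialize input data
    float *a = (float *)malloc(nBytes);
    float *b = (float *)malloc(nBytes);
    float *c = (float *)malloc(nBytes);
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0f * i; }

    // Copy the inputs to GPU memory
    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, nBytes);
    cudaMalloc((void **)&dB, nBytes);
    cudaMalloc((void **)&dC, nBytes);
    cudaMemcpy(dA, a, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, nBytes, cudaMemcpyHostToDevice);

    // Massively parallel part on the GPU: one thread per element
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
    addVec<<<gridSize, blockSize>>>(dA, dB, dC, n);

    // Copy the result back and spot-check it on the CPU
    cudaMemcpy(c, dC, nBytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %f (expected %f)\n", c[100], 100.0f + 200.0f);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(a); free(b); free(c);
    return 0;
}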
CPU + GPU
• Core ("pit") portions: sequential code
  • These portions are very hard to parallelize
  • CPUs tend to do a very good job on them
  • They take up a large portion of the code, but only a small portion of the execution time
• "Peach flesh" portions:
  • Easy to parallelize
  • Parallel programming on heterogeneous computing systems can drastically improve the speed of these applications
Applications of parallel programming on GPU
Image source: http://www.nvidia.com/object/gpu-
Challenges in parallel programming
• Question: Is parallel programming easy or hard?
• Answer:
  • Easy: if you do not care about performance and just want the program to run
  • Hard: when you want to optimize it and get higher performance
Challenges in parallel programming
• It is challenging to design parallel algorithms with the same level of algorithmic (computational) complexity as that of sequential algorithms
  • Some parallel algorithms do more work than their sequential counterparts (for example, a naive parallel prefix sum performs O(n log n) additions, versus the n - 1 additions of a sequential scan)
  • Parallelizing often requires non-intuitive ways of thinking about the problem and may require redundant work during execution
• The execution speed of many applications is limited by memory access latency and/or throughput
  • This requires methods for improving memory access speed
Challenges in parallel programming
• The execution speed of parallel programs is often more sensitive to input data characteristics than is the case for their sequential counterparts
  • Unpredictable data sizes and uneven data distributions
• Some applications require threads to collaborate with each other
  • Using synchronization operations such as barriers or atomic operations (a small CUDA sketch follows below)

Most of these challenges have been addressed by researchers.
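In CUDA (covered later in the course), this collaboration shows up as calls such as __syncthreads() (a barrier) and atomicAdd() (an atomic operation); a small illustrative kernel:

// Each block counts how many of its elements are positive;
// threads collaborate through an atomic update and barriers.
__global__ void countPositive(const float *data, int n, int *count) {
    __shared__ int blockCount;         // shared by all threads in the block
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();                   // barrier: wait until blockCount is initialized

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)
        atomicAdd(&blockCount, 1);     // atomic: safe concurrent increment
    __syncthreads();                   // barrier: wait until all threads have added

    if (threadIdx.x == 0)
        atomicAdd(count, blockCount);  // one thread publishes the block's total
}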
3 Ways to Accelerate Applications
Libraries: Easy, High-Quality
• Ease of use: enables GPU acceleration without in-depth knowledge of GPU programming
• "Drop-in": many GPU-accelerated libraries follow standard APIs, thus enabling acceleration with minimal code changes
• Quality: libraries offer high-quality implementations of functions encountered in a broad range of applications
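For example, replacing a hand-written SAXPY loop with the corresponding cuBLAS routine is close to a drop-in change. A minimal sketch, with illustrative data and sizes (compile with !nvcc file-name.cu -o run-file-name -lcublas):

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1024;
    float x[1024], y[1024];
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // Move the vectors to GPU memory
    float *dX, *dY;
    cudaMalloc((void **)&dX, n * sizeof(float));
    cudaMalloc((void **)&dY, n * sizeof(float));
    cudaMemcpy(dX, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dY, y, n * sizeof(float), cudaMemcpyHostToDevice);

    // y = alpha * x + y, computed on the GPU by the library;
    // no kernel code is written by hand
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 3.0f;
    cublasSaxpy(handle, n, &alpha, dX, 1, dY, 1);
    cublasDestroy(handle);

    cudaMemcpy(y, dY, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expected 5.0)\n", y[0]);

    cudaFree(dX); cudaFree(dY);
    return 0;
}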
NVIDIA GPU Accelerated Libraries
https://developer.nvidia.com/gpu-accelerated-libraries
Compiler Directives: Easy, Portable
• Ease of use: the compiler takes care of the details of parallelism management and data movement
• Portable: the code is generic, not specific to any type of hardware, and the directives are available in multiple languages (C, C++, Fortran)
• Uncertain: the performance of the code can vary across compiler versions
Compiler Directives: OpenACC
https://ulhpc-tutorials.readthedocs.io/en/latest/gpu/openacc/basics/
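A brief illustration of the directive style, assuming an OpenACC-capable compiler such as NVIDIA's nvc (compiled with nvc -acc); the loop body is ordinary C, and the compiler generates the GPU code and data movement:

// Vector addition with OpenACC: one directive parallelizes the loop
void vecAdd(const float *a, const float *b, float *c, int n) {
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}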
Programming Languages: Most Performance and Flexibility
• Performance: the programmer has the best control of parallelism and data movement
• Flexible: the computation does not need to fit into a limited set of library patterns or directive types
• Verbose: the programmer often needs to express more details
Programming Languages: Most Performance and Flexibility
• Numerical analytics: MATLAB, Mathematica, LabVIEW
• Python: PyCUDA, Numba
• Fortran: CUDA Fortran, OpenACC
• C: CUDA C, OpenACC
• C++: CUDA C++, Thrust
• C#: Hybridizer
Course topics:
• Introduction to CUDA; examples: vector addition, convolution, … (3 weeks)
• GPU parallel execution in CUDA; example: reduction, … (4 weeks)
• Types of GPU memories in CUDA; examples: reduction, convolution, … (3 weeks)
• Examples: scan, histogram, sort (4 weeks)
• Optimizing a CUDA program; additional topics in parallel programming (1 week)

After successfully completing the course, the student will be able to:
• Parallelize common tasks to run on the GPU using CUDA
• Apply knowledge of GPU parallel execution in CUDA to speed up a CUDA program
• Apply knowledge of GPU memories in CUDA to speed up a CUDA program
• Apply the optimization process to optimize a CUDA program
• Apply teamwork skills to complete the final project
Course assessment
• Individual exercises throughout the course: 50% of the grade
• Group final project: 50% of the grade, 2 students per group
Course assessment
Remember: the main goal is to learn, truly learn.
You can discuss ideas with others and consult Internet sources, but your writing and code must be your own, based on your own understanding.
If you violate this rule, you will get a score of 0 for the course.
Advice
• In this course, we will focus on parallel programming on the GPU (Graphics Processing Unit)
• Don't worry if you don't have a GPU ;-)
• We will use Google Colab for this course
Setting up the coding environment
• Where to find a machine with a CUDA-enabled GPU?
  • Google Colab: it's free and ready to run CUDA programs ☺
  • Even if you have your own GPU, you should use Google Colab, because the teacher will use it to run and grade your programs
• Code, compile, and run:
  • Write and save your code (a .cu file) on your local machine with your favorite editor (if your editor does not recognize .cu files and does not highlight the syntax, the simple fix is to set the language/syntax to C/C++)
  • Open a notebook in Colab (you must sign in with your Gmail account), select "Runtime, Change runtime type", set "Hardware accelerator" to GPU, and upload the .cu file
  • In a Colab cell, compile: !nvcc file-name.cu -o run-file-name
    • If you don't specify run-file-name, it defaults to a.out
  • In a Colab cell, run: !./run-file-name
• Demo … (a minimal test program is sketched below)
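As a quick sanity check of the setup, a minimal test program (the file name hello.cu is just an example) could be:

#include <stdio.h>

// Each GPU thread prints its own index
__global__ void helloFromGpu() {
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

int main() {
    helloFromGpu<<<1, 4>>>();  // launch 1 block of 4 threads
    cudaDeviceSynchronize();   // wait for the GPU and flush its printf output
    printf("Hello from CPU\n");
    return 0;
}

In a Colab cell: !nvcc hello.cu -o hello, then !./hello.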
RESOURCES
• Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2022.
• David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors. Morgan Kaufmann, 2016.
• John Cheng, Max Grossman, and Ty McKercher. Professional CUDA C Programming. John Wiley & Sons, 2014.
• Lê Hoài Bắc, Vũ Thanh Hưng, and Trần Trung Kiên. Lập trình song song trên GPU (Parallel Programming on GPUs). NXB KH & KT, 2015.
• NVIDIA. Intro to Parallel Programming. Udacity.
• NVIDIA. CUDA Toolkit Documentation.
References
• [1] Slides from the Illinois-NVIDIA GPU Teaching Kit
• [2] Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2022.
THE END