GPU Programming
Course Code: CSGG3018
Instructor: AMIT GURUNG
Email: amit.gurung@ddn.upes.ac.in
Jan – May, 2025
Course overview
Course Code: CSGG3018 | Course Name: GPU Programming | L-T-P-C: 3-0-0-3
Total Units to be Covered: 6 | Total Contact Hours: 34
Prerequisite(s): Basics of Programming; basic knowledge of Computer Architecture
Course Objectives
• The objective of this course is to provide deep knowledge of parallel programming with GPU architectures and APIs, along with their practical applications.
Course Outcomes
On completion of this course, the students will be able to
CO1: Describe GPU computer architecture, GPU programming environments, and data parallelism.
CO2: Explore the data-parallel execution model and CUDA memories.
CO3: Elaborate on data-parallelism concepts in OpenCL & OpenACC and compare OpenACC & CUDA.
CO4: Illustrate programs that solve problems and execute them on the GPU.
Recommendations
Textbooks
1. IBM ICE Publications
Modes of Evaluation
Quiz/Assignment/ presentation/ extempore/ Written
Examination
• Examination Scheme
Components: IA (Internal Assessment) | Mid Sem | End Sem | Total
Weightage (%): 30 | 20 | 50 | 100
What is Concurrent
Programming?
Concurrent programming is the practice of executing multiple tasks or processes at overlapping times, so that they appear to run simultaneously.
• Key Concepts:
• Tasks can interact or run independently.
• Focuses on tasks appearing to run at the same time (not on how the hardware is used).
• Applications:
• Real-time systems (e.g., operating systems).
• Gaming and multimedia.
• Data processing.
Concurrency vs. Parallelism
Concurrency: Tasks make progress simultaneously; may involve task switching on a single core. Example: multi-threading.
Parallelism: Tasks execute simultaneously; requires multiple processors/cores. Example: GPU computations.
Why is Concurrency Important?
1. Performance:
• Allows better utilization of resources.
2. Responsiveness:
• Keeps systems responsive during long-running tasks (thanks to task switching).
3. Scalability:
• Handles large datasets and complex computations efficiently.
4. Modern Hardware:
• Exploits multi-core CPUs and GPUs for better performance.
Real-World Examples of Concurrency
• Web Servers:
• Handle multiple client requests simultaneously.
• Video Games:
• Render graphics, play audio, and handle user input
concurrently.
• Data Analytics:
• Process data streams in real time.
• Autonomous Systems:
• Sensor data processing and decision-making in
parallel.
Key Concepts in Concurrency
• Threads:
• Lightweight processes that run independently and can
share resources.
• Synchronization:
• Mechanisms to ensure threads access shared data safely (see the sketch below).
• Example: Locks prevent simultaneous access (only a single thread at a time).
• Example: Semaphores control thread access to resources (admitting a bounded number of threads).
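As a concrete illustration of locking, here is a minimal host-side C++ sketch (an illustrative example, not taken from the course material): four threads increment a shared counter, and a mutex makes the read-modify-write safe.

#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int counter = 0;        // shared data
std::mutex m;           // the lock protecting it

void work() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> guard(m);  // only one thread may hold the lock
        ++counter;                             // protected critical section
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) threads.emplace_back(work);
    for (auto &t : threads) t.join();
    printf("counter = %d\n", counter);         // always 400000 with the lock
    return 0;
}

Removing the lock_guard reintroduces the race condition described on the next slide; a counting semaphore (e.g., C++20 std::counting_semaphore) would instead admit a bounded number of threads at once.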
Key Concepts in Concurrency
• Deadlocks:
• Happens when two or more tasks wait for each other indefinitely.
• Example: Task A locks Resource 1 and waits for Resource 2, while
Task B locks Resource 2 and waits for Resource 1.
• Prevention: Use a consistent order for locking resources or implement
timeout mechanisms.
• Race Conditions:
• Occurs when multiple threads modify shared data without proper
synchronization, leading to unpredictable outcomes.
• Example: Two threads incrementing a shared counter simultaneously.
• Solution: Use atomic operations or locks to synchronize access, as in the CUDA sketch below.
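The same race, on the GPU: in this minimal CUDA sketch (illustrative, not from the slides), 65,536 threads increment one counter. The plain read-modify-write loses updates; atomicAdd serializes them correctly.

#include <cstdio>

__global__ void racyInc(int *counter) { *counter = *counter + 1; }  // data race
__global__ void safeInc(int *counter) { atomicAdd(counter, 1); }    // atomic fix

int main() {
    int *d, h;
    cudaMalloc(&d, sizeof(int));

    cudaMemset(d, 0, sizeof(int));
    racyInc<<<256, 256>>>(d);                       // 256 blocks x 256 threads
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("racy:   %d (expected 65536, usually far less)\n", h);

    cudaMemset(d, 0, sizeof(int));
    safeInc<<<256, 256>>>(d);
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("atomic: %d\n", h);                      // always 65536

    cudaFree(d);
    return 0;
}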
Concurrency in GPU Programming
• GPUs are designed for:
• Massive parallelism.
• Executing thousands of threads
simultaneously.
• Why GPUs need concurrency:
• Efficiently handle compute-intensive tasks.
• Parallelize tasks like matrix operations, image processing, etc. (see the kernel sketch below).
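A minimal sketch of what "thousands of threads" looks like in CUDA (kernel and launch names are illustrative): one thread is created per array element, and a single launch spawns as many threads as there are elements.

__global__ void scale(float *a, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
    if (i < n) a[i] *= s;                           // each thread handles one element
}

// Launch example: ceil(n/256) blocks of 256 threads run concurrently.
// scale<<<(n + 255) / 256, 256>>>(d_a, 2.0f, n);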
Tools and Frameworks
1. CPU-based Concurrency:
• POSIX Threads (Pthreads, from the Portable Operating System Interface standard):
• Provides a standard API for creating and managing threads.
• Offers flexibility but requires careful handling of synchronization.
• OpenMP:
• Simplifies parallel programming with compiler directives.
• Ideal for loop-level parallelism in shared-memory systems (see the sketch below).
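For comparison, loop-level parallelism in OpenMP really is a single directive. A minimal C++ sketch (illustrative; compile with, e.g., g++ -fopenmp):

#include <cstdio>
#include <omp.h>
#include <vector>

int main() {
    const int n = 1000000;
    std::vector<double> a(n);

    #pragma omp parallel for        // iterations are divided among the cores
    for (int i = 0; i < n; ++i)
        a[i] = 2.0 * i;             // independent iterations: safe to parallelize

    printf("ran with up to %d threads\n", omp_get_max_threads());
    return 0;
}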
Tools and Frameworks
2. GPU-based Concurrency:
• CUDA (Compute Unified Device Architecture):
• NVIDIA's framework for parallel programming on GPUs (see the end-to-end example below).
• OpenCL (Open Computing Language):
• A portable framework for programming across GPUs and
other accelerators.
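To make the CUDA workflow concrete, here is a minimal end-to-end vector addition (an illustrative sketch): allocate device memory, copy inputs over, launch the kernel, and copy the result back.

#include <cstdio>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];               // one element per thread
}

int main() {
    const int n = 1024;
    float ha[n], hb[n], hc[n];
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    // 1. Allocate device memory.
    float *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));

    // 2. Copy inputs host -> device.
    cudaMemcpy(da, ha, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(float), cudaMemcpyHostToDevice);

    // 3. Launch the kernel.
    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);

    // 4. Copy the result device -> host.
    cudaMemcpy(hc, dc, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[10] = %.1f\n", hc[10]);            // expect 30.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}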
Summary on Concurrent Programming
• Concurrency enables multitasking and efficient use of
resources.
• Key concepts include threads, synchronization, deadlocks,
and race conditions.
• GPUs exploit concurrency to achieve massive parallelism.
• Understanding concurrency is essential for GPU
programming.
Parallel Programming
Parallel Programming
• Parallel programming allows multiple computations to run
simultaneously, improving speed and efficiency.
• Applications include scientific simulations, data analytics, and
machine learning.
• Key Concepts
• Task Parallelism: Dividing the problem into independent tasks processed concurrently (see the streams sketch below).
• Data Parallelism: Processing large datasets by distributing data across cores.
• Synchronization: Managing dependencies between tasks.
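On a GPU, task parallelism can be expressed with CUDA streams: independent kernels placed on different streams may execute concurrently. A minimal sketch (kernel names and sizes are illustrative):

// Two independent tasks (kernels) placed on separate CUDA streams so the
// GPU is free to overlap them.
__global__ void taskA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;            // task A: scale its own array
}
__global__ void taskB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;            // task B: offset a different array
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent tasks on different streams may run concurrently.
    taskA<<<(n + 255) / 256, 256, 0, s1>>>(x, n);
    taskB<<<(n + 255) / 256, 256, 0, s2>>>(y, n);

    cudaDeviceSynchronize();            // synchronization: wait for both tasks
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}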
Parallel programming workflow: e.g., the master-worker model (Figures: Model 1 and Model 2).
Parallel Programming
Parallel Architectures
• Classifications of parallel programming models can be divided broadly into two areas:
• Process interaction: shared memory (e.g., multicore CPUs) vs. distributed memory (e.g., clusters).
• Problem decomposition: data parallelism vs. task parallelism.
Parallel Programming
Process interaction
• Process interaction relates to the mechanisms by
which parallel processes are able to
communicate with each other.
• The most common forms of interaction are
shared memory and message passing, but
interaction can also be implicit (invisible to the
programmer).
Parallel Programming
Shared Memory
• Shared memory is an efficient means of passing data
between processes.
• Parallel processes share a global address space that
they read and write to asynchronously.
• Asynchronous concurrent access can lead to race conditions.
• Solution: mechanisms such as locks, semaphores, and monitors can be used to avoid these. CUDA's block-level shared memory follows the same model, as sketched below.
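CUDA's block-level __shared__ memory is a small-scale instance of this model: threads in a block read and write a common buffer, and the __syncthreads() barrier provides the synchronization that prevents races. A minimal reduction sketch (assumes 256-thread blocks, a power of two; names are illustrative):

__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float buf[256];               // one buffer shared by the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // all writes visible before any read

    // Tree reduction inside the block; each step needs a barrier,
    // otherwise threads would race on buf.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];            // one partial sum per block
}
// Launch with 256-thread blocks: blockSum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);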
Parallel Programming
Message Passing
• In a message-passing model, parallel processes exchange data by passing messages to one another (see the MPI sketch below).
• It can be asynchronous, where a message
can be sent before the receiver is ready, or
synchronous, where the receiver must be
ready.
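MPI (introduced on the next slide) is the standard realization of this model. A minimal sketch in C/C++ (illustrative; run with, e.g., mpirun -np 2): rank 0 sends one integer, and rank 1's receive blocks until the message arrives.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                 // sender
        int data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {          // receiver: blocks until the message arrives
        int data;
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", data);
    }

    MPI_Finalize();
    return 0;
}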
Parallel Programming
• Programming Models
• Thread-based (e.g., Pthreads, OpenMP).
• Message Passing (e.g., MPI, the Message Passing Interface for distributed memory): a standard interface for communication between processes in a parallel computing environment.
• Data-Parallel (e.g., CUDA, OpenCL).
• Libraries: BLAS, TensorFlow (support for parallelism).
History of Graphics Processors
Assignment uploaded in LMS
Graphics Processing Units (GPUs)
• Architecture
• Specialized for data-parallel tasks like matrix
multiplication, making them ideal for graphics and
deep learning.
• Thousands of smaller cores compared to CPUs.
• Applications Beyond Graphics
• AI/ML training, high-performance computing (HPC),
cryptocurrency mining.
General-Purpose GPUs (GPGPUs)
• Definition
• GPUs used for tasks beyond graphics, such as
scientific computing and simulations.
• Programming GPGPUs
• CUDA (NVIDIA), OpenCL (not specific to any vendor).
• Libraries: cuDNN for deep learning, Thrust for
parallel programming.
Comparison: CPU vs GPU
(Figure: simplified block diagrams of a CPU and a GPU.)
Comparison: CPU vs GPU
• CPU Characteristics
• Few powerful cores.
• Optimized for sequential processing.
• Suitable for task-switching and latency-sensitive tasks.
• GPU Characteristics
• Thousands of smaller cores.
• Designed for parallelism and throughput.
• Best for tasks like image rendering, simulations, and neural
network training.
• Example Use Cases
• CPUs: Operating systems, databases.
• GPUs: Graphics, AI model training.
Heterogeneous Computing
• Definition
• Combining CPUs, GPUs, and other processors for
optimal workload distribution.
• Examples
• Hybrid systems like NVIDIA DGX or Intel Xeon with
integrated GPUs.
• Benefits include higher efficiency, cost-effectiveness, and
power savings.
• NVIDIA DGX (Deep GPU Xceleration) is a series of servers and workstations designed by NVIDIA, geared primarily towards deep learning applications that use general-purpose computing on GPUs (GPGPU).
Programming GPUs using
CUDA/OpenCL/OpenACC
• CUDA
• Proprietary to NVIDIA GPUs.
• Features: Kernels, shared memory, warp-based parallelism.
• OpenCL
• Open standard supporting CPUs, GPUs, FPGAs.
• Portable, but typically less optimized than CUDA on NVIDIA hardware.
• OpenACC
• High-level directives for parallelism (see the sketch below).
• Ideal for researchers who need quick results without diving into low-level programming.
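An OpenACC sketch of the directive style (illustrative; compile with, e.g., nvc++ -acc): a single pragma asks the compiler to offload the loop and manage the data movement, with no explicit kernels or memory copies.

#include <cstdio>

int main() {
    const int n = 1 << 20;
    static float a[1 << 20], b[1 << 20];
    for (int i = 0; i < n; ++i) a[i] = (float)i;

    // One directive: the compiler generates the GPU kernel and the
    // host-device data movement.
    #pragma acc parallel loop copyin(a) copyout(b)
    for (int i = 0; i < n; ++i)
        b[i] = 2.0f * a[i];

    printf("b[10] = %f\n", b[10]);   // expect 20.0
    return 0;
}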