Chapter 6
GPU and CUDA programming
Dr. Aryaf Al-adwan
Autonomous Systems Department
Faculty of Artificial Intelligence
What is a GPU?
A GPU, or Graphics Processing Unit, is a specialized type of processor that is
designed to handle the complex calculations required for rendering images and
video. It is called a "graphics" processor because it was originally designed to
accelerate graphics rendering, but over time, the capabilities of GPUs have been
expanded to include other types of calculations, such as those needed for
machine learning and scientific simulations.
Computer Graphics
GPU VS CPU
Interconnection networks that can be used in a GPU
1. Bus-based interconnection networks: In this type of network, all the functional units of
the GPU are connected to a shared bus, which is used to transmit data between them.
2. Crossbar interconnection networks: In this type of network, each functional unit is
connected to a central crossbar switch, which routes data between the units.
3. Mesh interconnection networks: In this type of network, the functional units of the
GPU are organized into a two-dimensional mesh, with each unit connected to its
neighbors. Data is transmitted between units by sending it along the mesh.
4. Hierarchical interconnection networks: In this type of network, the functional units of
the GPU are organized into a hierarchy, with some units connected to others at
different levels. Data is transmitted between units by passing it up and down the
hierarchy.
GPU Architecture
• The main components of GPU architecture:
1. Cores: These are the computational units that perform the actual work of
executing instructions and performing calculations. The number and type of
cores can vary depending on the GPU and its intended use case.
2. Memory hierarchy: A GPU typically has multiple levels of memory, ranging
from the cache memory to the DRAM (Dynamic Random Access Memory).
3. Interconnects: These are the pathways that allow different parts of the GPU
to communicate with each other and with the rest of the system.
4. Input/output (I/O) interfaces: These are the interfaces that allow the GPU to
communicate with the rest of the computer system.
3D Block
Nvidia's GeForce RTX Graphics Cards
• The GeForce RTX 3090 is the most powerful GPU in the series, with 10496 CUDA cores and
24GB of GDDR6X memory. It is designed for use in high-end gaming PCs and workstations,
and is capable of delivering excellent performance in a wide range of graphics-intensive
tasks.
• RTX stands for Ray Tracing eXtreme
• The RTX 3070 is a more affordable option that is still capable of delivering excellent
performance in many games and applications. It has 5888 CUDA cores and 8GB of GDDR6
memory.
• The RTX 3060 is the most budget-friendly option in the series, with 3584 CUDA cores and
12GB of GDDR6 memory. It is designed for use in mid-range gaming PCs and is capable of
delivering good performance in many games and applications.
Block diagram of the GeForce RTX 3090
NVIDIA Ampere architecture
1. A large number of CUDA cores: The Ampere architecture features a large
number of CUDA cores, which are specialized processors that are
designed specifically for executing parallel computations.
2. RT cores: The Ampere architecture includes specialized RT cores that are
designed to accelerate ray tracing calculations. Ray tracing is a technique
for rendering realistic 3D graphics that involves simulating the path of
light as it bounces off of objects in a scene
3. Tensor cores: The Ampere architecture also includes specialized tensor
cores that are designed to accelerate machine learning workloads.
Tensor cores are capable of performing fast matrix multiplications, which
are a key building block of many machine learning algorithms.
Ampere is a microarchitecture developed by NVIDIA that is used in its GeForce line of GPUs.
PORTING TO CUDA
In the application code, the compute-intensive functions are offloaded to the GPU ("Use GPU to Parallelize"), while the rest of the sequential CPU code keeps running on the CPU.
SIMPLE PROCESSING FLOW
The CPU and GPU are connected by a PCIe or NVLink bus. The flow has three steps:
1. Copy input data from CPU memory to GPU memory.
2. Load the GPU program and execute it, caching data on chip for performance.
3. Copy results from GPU memory back to CPU memory.
CUDA Programming
• Compute Unified Device Architecture (CUDA) is a parallel computing
platform and application programming interface (API) created by Nvidia
in 2006, that gives direct access to the GPU’s virtual instruction set for
the execution of compute kernels.
• Kernels are functions that run on a GPU. When we launch a kernel, it is
executed as a set of Threads. Each thread is mapped to a single CUDA
core on a GPU and performs the same operation on a subset of data.
According to Flynn's taxonomy, it's a Single Instruction Multiple Data
(SIMD) computation.
Cont.
• You can use a number of programming languages to write
programs that will run on CUDA-enabled GPUs, including:
• C++
• Python
• Fortran
• C#
• Java
• Some of these languages have specialized libraries or interfaces
that are designed specifically for programming with CUDA, such
as PyCUDA for Python or JCuda for Java.
https://developer.nvidia.com/cuda-toolkit
To set up a CUDA program
1. Install a CUDA-enabled graphics card and the
CUDA Toolkit on your system.
2. Choose a programming language. You can write
CUDA programs in C, C++, or Python.
3. Choose an Integrated Development Environment
(IDE). There are several IDEs that you can use to
write and debug CUDA programs, such as
Microsoft Visual Studio, Eclipse, and Xcode.
CUDA toolkit
https://developer.nvidia.com/cuda-toolkit
https://www.youtube.com/watch?v=cL05xtTocmY
CUDA Keywords
• The CUDA programming model is a heterogeneous
model in which both the CPU and GPU are used.
• In CUDA, the host refers to the CPU and its memory,
while the device refers to the GPU and its memory.
• Code run on the host can manage memory on both the
host and device, and also launches kernels which are
functions executed on the device. These kernels are
executed by many GPU threads in parallel.
HETEROGENEOUS COMPUTING
Host: the CPU and its memory (host memory)
Device: the GPU and its memory (device memory)
CUDA Keywords and Functions
1. __global__: This keyword declares a global function, which is called from the CPU (host) and runs on the GPU. These functions are called "kernels," and they are the functions that execute the parallel work on the GPU.
2. __device__: This keyword declares a device function, which is called from the GPU and runs on the GPU.
3. __host__: This keyword declares a host function, which is executed on the CPU and can be called from the host (CPU) only.
4. __shared__: This keyword declares a variable in shared memory, which is a type of memory that is shared among the threads of a block.
5. __constant__: This keyword declares a variable in constant memory, which is a type of memory that is stored on the GPU and is accessed faster than global memory.
6. __syncthreads(): This function is used to synchronize the execution of threads within a thread block.
7. cudaMemcpy(): This function is used to copy data between the host and device memories.
8. cudaMalloc(): This function is used to allocate memory on the device.
9. cudaFree(): This function is used to free memory that was previously allocated on the device.
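To see how several of these keywords fit together, here is a minimal, hypothetical sketch (not taken from the lecture; the kernel and variable names are illustrative). It assumes a launch with 256 threads per block and that the constant variable scale has already been set from the host with cudaMemcpyToSymbol().

#include <cuda_runtime.h>

__constant__ float scale;                  // constant memory, set from the host

__device__ float square(float x)           // device function, callable from GPU code only
{
    return x * x;
}

__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float tile[256];            // shared among the threads of one block

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    tile[threadIdx.x] = (i < n) ? square(in[i]) * scale : 0.0f;

    __syncthreads();                       // wait until every thread has written its element

    // Tree reduction inside the block (assumes blockDim.x is a power of two, here 256)
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        out[blockIdx.x] = tile[0];         // one partial result per block
    }
}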
A typical sequence of operations for a
CUDA C program is:
1. Declare and allocate host and device memory.
2. Initialize host data.
3. Transfer data from the host to the device.
4. Execute one or more kernels.
5. Transfer results from the device to the host.
A quick comparison between CUDA and C
The __global__ specifier indicates a function that runs on the device (GPU).
Such a function can be called from host code, e.g. from the main() function in the example,
and is also known as a "kernel".
The triple angle brackets (<<<...>>>)
__global__ void mycode(float* data)
{
    // kernel code goes here
}

int main()
{
    float* data = nullptr;   // assume device memory allocated with cudaMalloc
    // launch the kernel with 256 threads per block, 1 block
    mycode<<<1, 256>>>(data);
    return 0;
}

In this example, the kernel function mycode is launched with a grid of 1 block and 256 threads per block. The total number of threads launched is 1 x 256 = 256.
Cont.
• CUDA makes four pieces of information available to
each thread:
1. The thread index (threadIdx)
2. The block index (blockIdx)
3. The size and shape of a block (blockDim)
4. The size and shape of a grid (gridDim)
How do the thread index and thread ID relate to each other in a 1-dimensional, 2-dimensional, and 3-dimensional block?
1-dimensional block: the thread index and the thread ID are the same.
2-dimensional block: the thread ID is computed from the thread index (x, y) and the block size (Dx, Dy) using the following formula:
thread ID = x + y * Dx
3-dimensional block: the thread ID is computed from the thread index (x, y, z) and the block size (Dx, Dy, Dz) using the following formula:
thread ID = x + y * Dx + z * Dx * Dy
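As an illustration, the same formulas can be computed inside a kernel from the built-in variables. The kernel below is a hypothetical sketch (the names are illustrative, not from the lecture) for a 2-dimensional grid of 2-dimensional blocks:

__global__ void indexDemo2D(float *data, int width, int height)
{
    // Global (x, y) coordinates of this thread within the whole grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Thread ID within its own block: x + y * Dx, where Dx = blockDim.x
    int threadIdInBlock = threadIdx.x + threadIdx.y * blockDim.x;

    // Flattened global index into a row-major 2-D array
    if (x < width && y < height) {
        data[y * width + x] = (float)threadIdInBlock;   // illustrative write
    }
}

// Possible launch: 16 x 16 threads per block, enough blocks to cover width x height
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// indexDemo2D<<<grid, block>>>(data, width, height);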
Thread Identification Example: x-direction
What is the global id for the third thread in the fourth block?
(With zero-based indexing, global id = blockIdx.x * blockDim.x + threadIdx.x; the answer depends on the block size shown in the figure.)
C++ program that utilizes CUDA for performing
vector addition on the GPU
1. This program performs vector addition of two input vectors "a" and "b" of
length N and stores the result in the output vector "c".
2. The kernel function "addVectors" is executed on the GPU using a large
number of threads, with each thread performing the addition on a different
element of the vectors.
#include <iostream>
#include <vector>
#include <cuda_runtime.h>

// Kernel: each thread adds one element of the input vectors
__global__ void addVectors(float *a, float *b, float *c, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int N = 100;

    // Initialize input vectors a and b on the host
    std::vector<float> h_a(N), h_b(N), h_c(N);
    for (int i = 0; i < N; i++) {
        h_a[i] = i;
        h_b[i] = 2 * i;
    }

    // Allocate device memory for a, b, and c
    float *a, *b, *c;
    cudaMalloc(&a, N * sizeof(float));
    cudaMalloc(&b, N * sizeof(float));
    cudaMalloc(&c, N * sizeof(float));

    // Copy input vectors from host to device memory
    cudaMemcpy(a, h_a.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b, h_b.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    // Launch the kernel
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    addVectors<<<numBlocks, blockSize>>>(a, b, c, N);

    // Copy result vector from device to host memory
    cudaMemcpy(h_c.data(), c, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Print the result
    for (int i = 0; i < N; i++) {
        std::cout << h_c[i] << std::endl;
    }

    // Free device memory
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}
1. __global__ void addVectors: This is the kernel function that performs the vector addition. The __global__
keyword specifies that this function is executed on the GPU. The function takes four arguments: the input
vectors "a" and "b", the output vector "c", and the size "n" of the vectors. Inside the function, the thread
index and block index are used to calculate which element of the output vector "c" the thread computes.
2. cudaMalloc: This allocates memory on the GPU for the arrays "a", "b", and "c". The function takes the
address of a device pointer and the size of the memory to be allocated in bytes.
3. The loop initializes the input vectors on the host: "h_a" is initialized with values 0 to N-1, and "h_b" is
initialized with values 0 to 2*(N-1).
4. cudaMemcpy: This copies the input vectors from host memory ("h_a", "h_b") to GPU memory ("a", "b").
The function takes four arguments: the destination memory location, the source memory location, the size
of the memory to be copied in bytes, and the direction of the copy.
5. int numBlocks = (N + blockSize - 1) / blockSize;: This line calculates the number of blocks needed to
cover all elements of the input arrays.
addVectors<<<numBlocks, blockSize>>>(a, b, c, N);: This line launches the kernel. The triple angle
brackets <<< >>> are used to specify the grid and block dimensions. The first argument is the number of
blocks, and the second argument is the number of threads per block.
6. cudaMemcpy(h_c.data(), c, N * sizeof(float), cudaMemcpyDeviceToHost);
This line copies the result array from device memory back to host memory.
7. for (int i = 0; i < N; i++) { std::cout << h_c[i] << std::endl; }
This loop prints the results on the host.
8. cudaFree(a); cudaFree(b); cudaFree(c);
These lines free the memory allocated for the input and output arrays on the GPU.
Neural Networks Are Embarrassingly Parallel
The term "embarrassingly parallel" is often used to describe problems or
computations that can be easily divided into smaller independent parts that can
be executed simultaneously, with little or no communication between the parts.
In the context of neural networks, it typically refers to the process of training a
model on multiple machines, or on multiple cores of a single machine.
An "embarrasingly parallel" problem can be easily parallelized without the need
for complex algorithms, and no communication is required during the
parallelization process.
Parallel Neural Network
• To accelerate the DNN training process on multiple GPUs.
• Two types of parallelization strategies:
1. data parallelism
2. model parallelism
In data parallelism (shown in Figure 1(a)), the input data is split into
mini-batches and the mini-batches are distributed across GPUs.
http://xzt102.github.io/publications/2021_WWW.pdf
Parameter averaging is the simplest approach to data parallelism. The core idea is to split the training
data between the available workers, each of which holds a copy of the same model and trains it
independently. After they have trained for a controlled amount of time, the models are collected and their
weights are averaged across the workers.
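As a rough sketch of the averaging step only (a hypothetical kernel, not code from the lecture; it assumes the weights of all workers have already been gathered into one device buffer, stored worker-major), each GPU thread can average one parameter across the workers:

// Each thread averages one parameter across numWorkers model copies.
__global__ void averageParameters(const float *workerWeights, float *avgWeights,
                                  int numParams, int numWorkers)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < numParams) {
        float sum = 0.0f;
        for (int w = 0; w < numWorkers; w++) {
            sum += workerWeights[w * numParams + i];   // parameter i of worker w
        }
        avgWeights[i] = sum / numWorkers;              // averaged weight
    }
}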
Convolution Example
• For each position on the blue input channel, the 3 x 3
filter does a computation that maps the shaded part of
the blue input channel to the corresponding shaded
part of the green output channel.
• In the animation, these computations are happening
sequentially one after the other. However, each
computation is independent from the others, meaning
that none of the computations depend on the results of
any of the other computations.
• As a result of this, all of these independent
computations can happen in parallel on a GPU and the
overall output channel can be produced.
• This allows us to see that the convolution operation can
be accelerated by using a parallel programming
approach and GPUs.
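As a rough illustration of this idea, the sketch below is a hypothetical CUDA kernel (not code from the lecture) in which each thread computes one output pixel of a 3 x 3 convolution independently of all the others:

// input and output are width x height, row-major; filter is 3 x 3; zero padding at the borders
__global__ void conv3x3(const float *input, const float *filter, float *output,
                        int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int fy = -1; fy <= 1; fy++) {
        for (int fx = -1; fx <= 1; fx++) {
            int ix = x + fx;
            int iy = y + fy;
            if (ix >= 0 && ix < width && iy >= 0 && iy < height) {
                sum += input[iy * width + ix] * filter[(fy + 1) * 3 + (fx + 1)];
            }
        }
    }
    output[y * width + x] = sum;   // independent of every other thread's result
}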
PyTorch Comes With CUDA
The previous code moves the tensor t from the CPU to the GPU (specifically, to the device
at index 0 of the GPU, e.g. t = t.to('cuda:0')), which allows for faster computations on the
tensor. But this alone is not enough for parallelism in a neural network.
To achieve parallelism, you would need to program the neural network to run
on multiple GPUs, using a library such as
1. torch.nn.DataParallel or
2. torch.nn.parallel.DistributedDataParallel.
import torch
import torch.nn as nn
from torch.nn.parallel import DataParallel

# Define a simple neural network
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc = nn.Linear(10, 10)

    def forward(self, x):
        x = self.fc(x)
        return x

torch.set_num_threads(4)

# Create an instance of the network
net = SimpleNet()

# Wrap the network in DataParallel
net = DataParallel(net)

# Generate some random input data
inputs = torch.randn(64, 10)

# Perform a forward pass
outputs = net(inputs)

This code applies data parallelism. torch.nn.DataParallel is a wrapper class that is used to parallelize the forward pass of a model across multiple GPUs on the current machine. It works by splitting the input data across the available GPUs, running a copy of the model on each GPU, and then concatenating the results before returning them.
Thank You!