
CS-402 Parallel and Distributed Systems

Spring 2025

Lecture No. 09
Graphics Processing Unit (GPU) overview
A Graphics Processing Unit (GPU) is a specialized electronic circuit designed to accelerate the
processing of images and videos. Originally developed for rendering graphics in video games, GPUs
have evolved to handle a wide range of parallel processing tasks.
Architecture and Functionality
Parallel Processing: GPUs consist of thousands of smaller, efficient cores designed for handling multiple
tasks simultaneously. This makes them ideal for tasks that can be broken down into smaller, parallel
operations.
SIMD Execution: GPUs use Single Instruction, Multiple Data (SIMD) execution, where the same operation
is performed on multiple data points simultaneously. This is particularly useful for graphics rendering and
scientific computations.
Types of GPUs
Integrated GPUs: Built into the CPU, these are common in laptops and lightweight desktops. They share
memory with the CPU and are suitable for basic tasks.
Discrete GPUs: Separate from the CPU, these are found in dedicated graphics cards. They have their
own memory (VRAM) and are used for more demanding tasks like gaming and professional graphics
work.
Professional GPUs: Optimized for applications requiring high accuracy and reliability, such as 3D
modeling, CAD, and scientific visualization.
Mobile GPUs: Designed for power efficiency and thermal management in mobile devices.
Server and Data Center GPUs: Used in data centers for tasks like AI training, deep learning, and
scientific simulations.
Applications of GPUs
• Gaming: GPUs render complex 3D graphics and provide smooth gameplay experiences.
• Scientific Computing: Used in simulations, weather modeling, and drug discovery due to their ability to
handle large-scale parallel computations.
• Machine Learning: Accelerate training processes in deep learning frameworks like TensorFlow and
PyTorch.
• Cryptocurrency Mining: Perform the complex mathematical calculations required for mining
cryptocurrencies.
• Video Editing and Rendering: Speed up tasks like video editing, rendering graphics, and effects
processing.
Programming Models, Challenges and Considerations
• CUDA and OpenCL: CUDA (Compute Unified Device Architecture) is a parallel computing platform
and programming model developed by NVIDIA. OpenCL (Open Computing Language) is an open
standard that supports a wide range of hardware, including GPUs from different vendors.
• Hierarchical Parallelism: GPUs expose hierarchical parallelism, allowing for coarse-grain task-level
parallelism and fine-grain data-level parallelism.
Challenges:
• Memory Bandwidth: Efficient memory management and optimization techniques are essential to fully
utilize GPU capabilities.
• Energy Efficiency: While GPUs offer high performance, they also consume significant power. Research
is ongoing to improve the energy efficiency of GPUs.
Graphics Processing Unit (GPU)
• GPU is the chip in computer video cards, the PS3, Xbox, etc.
o Designed to realize the 3D graphics pipeline:
Application → Geometry → Rasterizer → Image
• GPU development:
o Fixed graphics hardware
o Programmable vertex/pixel shaders
o GPGPU: general-purpose computation (beyond 3D graphics) using the GPU.
GPGPU can be treated as a co-processor for compute-intensive tasks,
given sufficiently large bandwidth between CPU and GPU.
CPU and GPU
• CPU is designed for efficient general-purpose computing. A large chip area is
dedicated to optimizing control for general computing.
o Out-of-order execution, multiple issue, etc., to improve single-thread performance.
• GPU is specialized for compute-intensive, highly data-parallel computation.
o More chip area is dedicated to processing.
o Good for high-arithmetic-intensity programs, i.e., those with a high ratio of arithmetic
operations to memory operations.
• Heterogeneous system and computing: CPU+GPU, with the CPU for general-purpose
processing and the GPU for accelerating compute-intensive, data-parallel regions.
CPU versus GPU
• CPU: the multicore trajectory seeks to maintain the execution speed of sequential
programs while moving to multiple cores.
• GPU: the many-thread trajectory focuses on the execution throughput of parallel
applications.
[Figure: CPU vs. GPU chip layout. The CPU devotes a large area to control logic and cache
alongside a few ALUs; the GPU devotes most of its area to ALUs. Both attach to DRAM.]
CPU versus GPU
[Figures: architecture of a modern GPU (a von Neumann architecture) and the organization
of a CPU-GPU system. The exact organization is an ever-changing, moving target.]

Compute Unified Device Architecture (CUDA)
• A parallel computing platform and programming model created by NVIDIA.
• A hardware/software architecture for NVIDIA GPUs to execute programs.
o Main concept: hardware support for a hierarchy of threads to exploit data parallelism.
Compute Unified Device Architecture (CUDA)
1. CUDA Architecture
Host and Device: In CUDA terminology, the CPU is referred to as the “host,” and the GPU is referred to
as the “device.” The host manages the overall execution, while the device performs the parallel
computations.
Memory Hierarchy: CUDA provides a memory hierarchy that includes registers, shared memory, L1
cache, L2 cache, and global memory. Efficient use of this hierarchy is crucial for optimizing performance.
2. Programming Model
Kernels: CUDA programs are written in C/C++ and include special functions called kernels. Kernels are
executed on the GPU and can run thousands of threads in parallel. Each thread executes the same kernel
code but operates on different data.
Thread Hierarchy: Threads are organized into blocks, and blocks are organized into grids. This hierarchy
allows for scalable parallelism. Each block runs on a single streaming multiprocessor (SM) and can be
synchronized using intrinsic functions like __syncthreads().
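As an illustration, here is a minimal sketch of threads in one block cooperating through
shared memory and synchronizing with __syncthreads(); the kernel name blockSum and its
parameters are hypothetical, and a block size of 256 is assumed:

#include <cuda_runtime.h>

// Hypothetical block-level sum: each block reduces 256 elements to one partial sum.
__global__ void blockSum(const float *in, float *blockTotals, int n) {
    __shared__ float buf[256];               // one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // all loads finish before any thread reads a neighbor

    // Tree reduction within the block (blockDim.x is assumed to be 256).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();                     // finish this round before the next begins
    }
    if (threadIdx.x == 0)
        blockTotals[blockIdx.x] = buf[0];    // thread 0 writes the block's partial sum
}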
Execution Configuration: When launching a kernel, you specify the number of blocks and the number of
threads per block using the <<<…>>> syntax. This configuration determines how the workload is
distributed across the GPU.
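A minimal sketch of an execution configuration (the kernel scaleArray and the sizes here
are illustrative, not from the lecture):

#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one array element.
__global__ void scaleArray(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the last, partial block
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(n / 256)
    scaleArray<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, n);  // <<<blocks, threads>>>
    cudaDeviceSynchronize();  // kernel launches are asynchronous with the host

    cudaFree(d_data);
    return 0;
}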
Compute Unified Device Architecture (CUDA)
Key Features
Unified Memory: CUDA supports unified memory, which allows the CPU and GPU to share a single
memory space. This simplifies memory management and data transfer between the host and device.
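A minimal unified-memory sketch (the kernel increment is made up for illustration;
cudaMallocManaged is the standard allocation call):

#include <cuda_runtime.h>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    int n = 1024, *data;
    cudaMallocManaged(&data, n * sizeof(int));  // one pointer, valid on host and device
    for (int i = 0; i < n; ++i) data[i] = i;    // host writes directly, no cudaMemcpy

    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                    // wait before the host reads the results

    cudaFree(data);
    return 0;
}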
Streams and Concurrency: CUDA streams enable concurrent execution of multiple kernels and
memory operations. This can significantly improve performance by overlapping computation with
data transfer.
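A sketch of two streams overlapping copies with computation (the kernel process and the
sizes are illustrative; true overlap also requires pinned host memory, allocated here with
cudaMallocHost):

#include <cuda_runtime.h>

__global__ void process(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a, *h_b, *d_a, *d_b;
    cudaMallocHost(&h_a, bytes);  // pinned host memory, needed for async copies to overlap
    cudaMallocHost(&h_b, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Work queued in different streams may run concurrently:
    // the copy in s1 can overlap the kernel in s0.
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);
    process<<<(n + 255) / 256, 256, 0, s0>>>(d_a, n);  // 4th config parameter = stream
    process<<<(n + 255) / 256, 256, 0, s1>>>(d_b, n);
    cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s0);
    cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, s1);

    cudaStreamSynchronize(s0);  // wait for each stream's queued work
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFree(d_a); cudaFree(d_b);
    cudaFreeHost(h_a); cudaFreeHost(h_b);
    return 0;
}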
Libraries and Tools: NVIDIA provides a rich set of libraries (like cuBLAS, cuFFT, and Thrust) and tools
(like Nsight and CUDA-GDB) to help developers optimize and debug their CUDA applications.
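As a taste of the library level, here is a small sketch using Thrust (the values are
illustrative; sort and reduce are standard Thrust algorithms that run on the device):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main() {
    // STL-like container whose storage lives in GPU memory.
    thrust::device_vector<int> v(4);
    v[0] = 3; v[1] = 1; v[2] = 4; v[3] = 2;

    thrust::sort(v.begin(), v.end());                // parallel sort on the device
    int total = thrust::reduce(v.begin(), v.end());  // parallel sum: 10
    return total == 10 ? 0 : 1;
}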
Wide Application: CUDA is used in various fields, including scientific research, machine learning,
data analytics, and real-time rendering.
Thread Block Organization: thread, block, grid
• Thread – distributed by the CUDA runtime (identified by threadIdx).
• Block – a user-defined group of 1 to 512 or 1024 threads (identified by blockIdx).
• Grid – a group of one or more blocks. A grid is created for each CUDA kernel function call.
• Thread organization is like an array of thread indices.
• Goal of NVIDIA GPUs: run such a massive number of threads as efficiently as possible.
Supporting thread execution
• GPU supports efficient execution of a massive number of threads (each thread ID often
maps to one array index).
o The threads to be executed are organized into kernels.
o Each kernel corresponds to one "grid" (of threads) – multiple kernels can be launched at the same time.
o Each grid has many "blocks" of threads (usually 512-1024 threads per block). Example:

dim3 dimBlock(4, 4, 4);     // 64 threads per block, arranged 4 x 4 x 4
dim3 dimGrid(100, 100, 1);  // 10,000 blocks, arranged 100 x 100 x 1
kernel<<<dimGrid, dimBlock>>>(...);

o There is a maximum number of threads per block (e.g., 1024, depending on the GPU).
o There is a maximum number of blocks per grid dimension (often 65535).
o You cannot go over these limits.
o Much of GPU programming is dealing with such hardware limitations; a query sketch follows below.
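Because the limits differ across GPUs, a common pattern is to query them at runtime. A
minimal sketch using the standard cudaGetDeviceProperties call:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}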
Supporting thread execution
• Each GPU has multiple streaming multiprocessors (SMs).
o Threads are executed in groups of 32 (called warps) in an SIMD manner: at a given
time, one instruction is executed for the whole warp.
o Each SM has many processors or CUDA cores (32 in Fermi, 128 in Ampere) and a
small number (1-4) of instruction units.
o Each SM also has limited resources – there is a limit on how many threads it can support,
which affects how efficient CUDA programs are written.
Streaming Multiprocessors
[Figure: block diagrams of an early-generation and a latest-generation streaming multiprocessor.]
SIMD and warp scheduler
• CUDA exploits data parallelism.
o SIMD is an effective way to exploit data parallelism with reduced control logic.
• CUDA uses a more flexible form (or an extension) of SIMD called SIMT (Single
Instruction, Multiple Threads) to execute threads.
o 32 threads are scheduled together as a group called a warp.
o All threads in a warp start at the same PC, but are free to branch and execute independently.
o A warp executes one common instruction at a time.
o It performs best when all threads share one instruction.
o If different threads in a warp run different instructions (SIMT allows this, SIMD does not),
it takes multiple cycles for the warp to complete one instruction – warp divergence.
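A sketch of two hypothetical kernels illustrating the effect: in the first, even and odd
lanes of every warp branch differently (divergence); in the second, the branch is aligned
to warp boundaries, so no warp diverges:

// Divergent: even and odd lanes of the same warp take different branches,
// so the warp must execute both paths serially.
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (threadIdx.x % 2 == 0) x[i] = x[i] * 2.0f;
        else                      x[i] = x[i] + 1.0f;
    }
}

// Warp-aligned: all 32 threads of any given warp take the same path,
// so no warp diverges even though different warps do different work.
__global__ void warpAligned(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if ((threadIdx.x / 32) % 2 == 0) x[i] = x[i] * 2.0f;
        else                             x[i] = x[i] + 1.0f;
    }
}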
Warp scheduling
[Figure: warp scheduling.]
Inside the GPU
• Run a massive number of threads effectively with minimum power.
• The NVIDIA GPU architecture evolves over time; raw computing power keeps increasing, and
new features continue to be added.
o Fermi (2010)
o Kepler (2012)
o Maxwell (2014)
o Pascal (2016)
o Turing (2018)
o Ampere (2020)
• There is a white paper for each of these architectures from NVIDIA. We will discuss
key features of some of the generations.
Fermi
• 16 streaming multiprocessors (SMs), each SM having 32 processing units (CUDA cores), for a
total of 512 CUDA cores.
• Each core executes one floating-point or integer instruction per clock for a thread.
• Each SM:
o 32 CUDA cores
o 64KB shared memory/L1 cache
o 32K 32-bit register file per SM
o Two warp schedulers, each with one instruction dispatch unit
o 16 load/store units
o Support for double-precision operations
• Connecting to the CPU: 1 work queue between CPU and GPU (at any one time, one kernel is active).
• No special support for multi-GPU systems. Communication from one GPU to another GPU:
GPU memory -> CPU memory -> GPU memory.
Fermi compute capability
• Fermi: CUDA compute capability 2.1
o Max threads / thread block: 1024
o Max warps / multiprocessor: 48
o Max threads / multiprocessor: 1536
o Max thread blocks / multiprocessor: 8
o 32-bit registers / multiprocessor: 32768
o Max registers / thread block: 32768
o Max registers / thread: 63
o Max shared memory / multiprocessor: 48KB
o Max shared memory / thread block: 48KB
o Max grid dimension: 2^16 - 1
Fermi warp scheduling in an SM
[Figure: Fermi warp scheduling in an SM.]
Kepler (GK110)
• 15 streaming multiprocessors (SMs) with 192 processing units (CUDA cores) each, for a
total of 2880 CUDA cores.
• Each SM:
o 192 CUDA cores
o 128KB shared memory/L1 cache (twice that of Fermi)
o 128K 32-bit register file per SM (4 times that of Fermi)
o 4 warp schedulers, each with two instruction dispatch units
o 32 load/store units
• Hyper-Q: 32 work queues connecting to the CPU, so up to 32 kernels can be active at once.
• NVIDIA GPUDirect: supports multiple GPUs in one server -- moving data between GPU
memories without going through CPU memory (see the peer-to-peer copy sketch after this list).
• Dynamic parallelism: new threads can be launched from within a kernel.
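A minimal sketch of a direct GPU-to-GPU copy through the standard CUDA peer-to-peer API
(GPUDirect P2P is used underneath when the hardware allows it); the sketch assumes a
machine with at least two GPUs:

#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can device 0 reach device 1's memory?

    size_t bytes = 1 << 20;
    float *d0, *d1;
    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    if (canAccess) cudaDeviceEnablePeerAccess(1, 0);  // flags argument must be 0
    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);

    // Copies between the two GPU memories; falls back to staging through
    // host memory if peer access is unavailable.
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}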
Dynamic parallelism
• The GPU can launch new threads from within a running kernel, as in the sketch below.
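A minimal sketch (the kernel names are illustrative; dynamic parallelism assumes compute
capability 3.5+ and compilation with relocatable device code, e.g.,
nvcc -rdc=true -arch=sm_35 dynpar.cu -lcudadevrt):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void child(int parentBlock) {
    printf("child thread %d launched by parent block %d\n", threadIdx.x, parentBlock);
}

__global__ void parent() {
    // A kernel launching another kernel: thread 0 of each parent block
    // starts a child grid of 4 threads.
    if (threadIdx.x == 0)
        child<<<1, 4>>>(blockIdx.x);
}

int main() {
    parent<<<2, 32>>>();
    cudaDeviceSynchronize();  // waits for the parent and all child grids
    return 0;
}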
Kepler (GK110) compute capability
• CUDA compute capability 3.7 (Fermi: CUDA 2.1)
o Threads / warp: 32
o Max threads / thread block: 1024
o Max warps / multiprocessor: 64 (Fermi 48)
o Max threads / multiprocessor: 2048 (Fermi 1536)
o Max thread blocks / multiprocessor: 16 (Fermi 8)
o 32-bit registers / multiprocessor: 131K (Fermi 32768)
o Max registers / thread block: 65536 (Fermi 32768)
o Max registers / thread: 255 (Fermi 63)
o Max shared memory / multiprocessor: 112KB (Fermi 48KB)
o Max shared memory / thread block: 48KB
o Max grid dimension: 2^31 - 1 (Fermi 2^16 - 1)
Maxwell
Maxwell is the codename for a GPU microarchitecture developed by NVIDIA, succeeding the Kepler
architecture and preceding Pascal. Introduced in 2014, Maxwell brought significant improvements in power
efficiency and performance, making it a notable advancement in GPU technology.
Key Features of Maxwell:
Improved Streaming Multiprocessor (SM) Design: Maxwell introduced a new SM design, called SMM, which increased
energy efficiency and performance. This design allowed for better workload balancing and control logic partitioning.
Enhanced Memory Architecture: Maxwell GPUs featured a larger L2 cache, reducing the need for higher memory
bandwidth and improving overall efficiency.
Dynamic Super Resolution (DSR): This technology enabled GPUs to render games at higher resolutions and then
downscale them to fit the display, providing better image quality.
Voxel Global Illumination (VXGI): Maxwell introduced VXGI, which allowed for real-time dynamic global illumination,
enhancing the realism of lighting in games.
CUDA Compute Capability: Maxwell GPUs supported CUDA Compute Capability 5.0 and 5.2, enabling more advanced
parallel computing features.
Notable Maxwell GPUs:
GeForce GTX 750 and GTX 750 Ti: The first GPUs to feature the Maxwell architecture, offering significant improvements
in power efficiency.
GeForce GTX 970 and GTX 980: Popular high-performance GPUs that showcased Maxwell's capabilities in gaming and
professional applications.
GeForce GTX 980 Ti and Titan X: High-end GPUs that pushed the limits of performance and efficiency.
Pascal (GP100)
• 56 streaming multiprocessors (SMs) with 64 processing units (CUDA cores) each, for a
total of 3584 FP32 CUDA cores (1792 FP64 cores).
• NVLink: more multi-GPU support.
• HBM2: high-capacity stacked memory architecture, 3x the memory bandwidth of earlier
GPUs.
• Unified memory: a hardware-software solution to provide a unified virtual address
space for CPU and GPU memory.
NVLink
• Further supports multi-GPU systems: 160GB/s, 5x the speed of PCIe Gen 3.

PCIe 1.0: 8GB/s
PCIe 2.0: 16GB/s
PCIe 3.0: 32GB/s
PCIe 4.0: 64GB/s
PCIe 5.0: 128GB/s
Pascal (GP100) compute capability
• CUDA compute capability 6.0 (Kepler GK110: CUDA 3.7)
o Threads / warp: 32 (32)
o Max warps / multiprocessor: 64 (64)
o Max threads / multiprocessor: 2048 (2048)
o Max thread blocks / multiprocessor: 32 (16)
o 32-bit registers / multiprocessor: 64K (64K)
o Max registers / thread block: 64K (64K)
o Max registers / thread: 255 (255)
o Max shared memory / multiprocessor: 64KB (16KB-48KB)
o Max shared memory / thread block: 48KB
Turing (TU102)
• 72 SM units with 64 CUDA cores each (4608 total), plus tensor cores (8 per SM, 576 total)
• Turing tensor cores
• Memory compression
• The latest generation covered here: Ampere.

Example Code
Here's a simple example of a CUDA program that adds two vectors:

#include <cuda_runtime.h>
#include <iostream>

// Kernel: each thread adds one pair of elements.
__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    int N = 1000;
    size_t size = N * sizeof(float);

    // Allocate and initialize the host arrays.
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    for (int i = 0; i < N; ++i) {
        h_A[i] = i;
        h_B[i] = i * 2;
    }

    // Allocate device arrays and copy the inputs to the device.
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch enough blocks to cover all N elements.
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy the result back (this cudaMemcpy waits for the kernel to finish).
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; ++i) {
        std::cout << h_C[i] << " ";
    }
    std::cout << std::endl;

    // Release device and host memory.
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}
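To build and run the example with NVIDIA's nvcc compiler (assuming the file is saved as,
say, vector_add.cu):

nvcc vector_add.cu -o vector_add
./vector_add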
