
CS-402 Parallel and Distributed Systems

Spring 2025

Lecture No. 09
Graphics Processing Unit (GPU) overview
A Graphics Processing Unit (GPU) is a specialized electronic circuit designed to accelerate the
processing of images and videos. Originally developed for rendering graphics in video games, GPUs
have evolved to handle a wide range of parallel processing tasks.
Architecture and Functionality
Parallel Processing: GPUs consist of thousands of smaller, efficient cores designed for handling multiple
tasks simultaneously. This makes them ideal for tasks that can be broken down into smaller, parallel
operations.
SIMD Execution: GPUs use Single Instruction, Multiple Data (SIMD) execution, where the same operation
is performed on multiple data points simultaneously. This is particularly useful for graphics rendering and
scientific computations.
Types of GPUs
Integrated GPUs: Built into the CPU, these are common in laptops and lightweight desktops. They share
memory with the CPU and are suitable for basic tasks.
Discrete GPUs: Separate from the CPU, these are found in dedicated graphics cards. They have their
own memory (VRAM) and are used for more demanding tasks like gaming and professional graphics
work.
Professional GPUs: Optimized for applications requiring high accuracy and reliability, such as 3D
modeling, CAD, and scientific visualization.
Mobile GPUs: Designed for power efficiency and thermal management in mobile devices.
Server and Data Center GPUs: Used in data centers for tasks like AI training, deep learning, and
scientific simulations.
Applications of GPUs
• Gaming: GPUs render complex 3D graphics and provide smooth gameplay experiences.
• Scientific Computing: Used in simulations, weather modeling, and drug discovery due to their ability to
handle large-scale parallel computations.
• Machine Learning: Accelerate training processes in deep learning frameworks like TensorFlow and
PyTorch.
• Cryptocurrency Mining: Perform the complex mathematical calculations required for mining
cryptocurrencies.
• Video Editing and Rendering: Speed up tasks like video editing, rendering graphics, and effects
processing.
Programming Models, Challenges and Considerations
• CUDA and OpenCL: CUDA (Compute Unified Device Architecture) is a parallel computing platform
and programming model developed by NVIDIA. OpenCL (Open Computing Language) is an open
standard that supports a wide range of hardware, including GPUs from different vendors.
• Hierarchical Parallelism: GPUs expose hierarchical parallelism, allowing for coarse-grain task-level
parallelism and fine-grain data-level parallelism.
Challenges:
• Memory Bandwidth: Efficient memory management and optimization techniques are essential to fully
utilize GPU capabilities.
• Energy Efficiency: While GPUs offer high performance, they also consume significant power. Research
is ongoing to improve the energy efficiency of GPUs.
Graphics Processing Unit (GPU)
• GPU is the chip in computer video cards, the PS3, Xbox, etc.
o Designed to realize the 3D graphics pipeline:
Application → Geometry → Rasterizer → Image
• GPU development:
o Fixed graphics hardware
o Programmable vertex/pixel shaders
o GPGPU: general-purpose computation (beyond 3D graphics) using the GPU.
GPGPU can be treated as a co-processor for compute-intensive tasks,
given sufficiently large bandwidth between CPU and GPU.
CPU and GPU
• CPU is designed for efficient general-purpose computing. A large chip area is
dedicated to optimizing control for general computing.
o Out-of-order execution, multiple issue, etc., to improve single-thread performance.
• GPU is specialized for compute-intensive, highly data-parallel computation.
o More chip area is dedicated to processing.
o Good for high-arithmetic-intensity programs, i.e., those with a high ratio of arithmetic
operations to memory operations.
• Heterogeneous system and computing: CPU+GPU, with the CPU for general-purpose
processing and the GPU for accelerating compute-intensive, data-parallel regions.
CPU versus GPU
• CPU: the multicore trajectory seeks to maintain the execution speed of sequential
programs while moving to multiple cores.
• GPU: the many-thread trajectory focuses on the execution throughput of parallel
applications.
[Figure: CPU vs. GPU chip layout. The CPU devotes a large area to control logic and cache
alongside a few ALUs; the GPU devotes most of its area to ALUs. Both attach to DRAM.]
CPU versus GPU
[Figures: architecture of a modern GPU (a von Neumann architecture) and the organization
of a CPU-GPU system. The exact organization is an ever-changing, moving target.]

Compute Unified Device Architecture (CUDA)
• A parallel computing platform and programming model created by NVIDIA.
• A hardware/software architecture for NVIDIA GPUs to execute programs.
o Main concept: hardware support for a hierarchy of threads to exploit data parallelism.
Compute Unified Device Architecture (CUDA)
1. CUDA Architecture
Host and Device: In CUDA terminology, the CPU is referred to as the “host,” and the GPU is referred to
as the “device.” The host manages the overall execution, while the device performs the parallel
computations.
Memory Hierarchy: CUDA provides a memory hierarchy that includes registers, shared memory, L1
cache, L2 cache, and global memory. Efficient use of this hierarchy is crucial for optimizing performance.
2. Programming Model
Kernels: CUDA programs are written in C/C++ and include special functions called kernels. Kernels are
executed on the GPU and can run thousands of threads in parallel. Each thread executes the same kernel
code but operates on different data.
Thread Hierarchy: Threads are organized into blocks, and blocks are organized into grids. This hierarchy
allows for scalable parallelism. Each block runs on a single streaming multiprocessor (SM) and can be
synchronized using intrinsic functions like __syncthreads().
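As an illustration, here is a minimal sketch of threads in one block cooperating through
shared memory and synchronizing with __syncthreads(); the kernel name blockSum and its
parameters are hypothetical, and a block size of 256 is assumed:

#include <cuda_runtime.h>

// Hypothetical block-level sum: each block reduces 256 elements to one partial sum.
__global__ void blockSum(const float *in, float *blockTotals, int n) {
    __shared__ float buf[256];               // one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // all loads finish before any thread reads a neighbor

    // Tree reduction within the block (blockDim.x is assumed to be 256).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();                     // finish this round before the next begins
    }
    if (threadIdx.x == 0)
        blockTotals[blockIdx.x] = buf[0];    // thread 0 writes the block's partial sum
}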
Execution Configuration: When launching a kernel, you specify the number of blocks and the number of
threads per block using the <<<…>>> syntax. This configuration determines how the workload is
distributed across the GPU.
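A minimal sketch of an execution configuration (the kernel scaleArray and the sizes here
are illustrative, not from the lecture):

#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one array element.
__global__ void scaleArray(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the last, partial block
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(n / 256)
    scaleArray<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, n);  // <<<blocks, threads>>>
    cudaDeviceSynchronize();  // kernel launches are asynchronous with the host

    cudaFree(d_data);
    return 0;
}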
Compute Unified Device Architecture (CUDA)
Key Features
Unified Memory: CUDA supports unified memory, which allows the CPU and GPU to share a single
memory space. This simplifies memory management and data transfer between the host and device.
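A minimal unified-memory sketch (the kernel increment is made up for illustration;
cudaMallocManaged is the standard allocation call):

#include <cuda_runtime.h>

__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    int n = 1024, *data;
    cudaMallocManaged(&data, n * sizeof(int));  // one pointer, valid on host and device
    for (int i = 0; i < n; ++i) data[i] = i;    // host writes directly, no cudaMemcpy

    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                    // wait before the host reads the results

    cudaFree(data);
    return 0;
}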
Streams and Concurrency: CUDA streams enable concurrent execution of multiple kernels and
memory operations. This can significantly improve performance by overlapping computation with
data transfer.
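A sketch of two streams overlapping copies with computation (the kernel process and the
sizes are illustrative; true overlap also requires pinned host memory, allocated here with
cudaMallocHost):

#include <cuda_runtime.h>

__global__ void process(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a, *h_b, *d_a, *d_b;
    cudaMallocHost(&h_a, bytes);  // pinned host memory, needed for async copies to overlap
    cudaMallocHost(&h_b, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Work queued in different streams may run concurrently:
    // the copy in s1 can overlap the kernel in s0.
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);
    process<<<(n + 255) / 256, 256, 0, s0>>>(d_a, n);  // 4th config parameter = stream
    process<<<(n + 255) / 256, 256, 0, s1>>>(d_b, n);
    cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, s0);
    cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, s1);

    cudaStreamSynchronize(s0);  // wait for each stream's queued work
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFree(d_a); cudaFree(d_b);
    cudaFreeHost(h_a); cudaFreeHost(h_b);
    return 0;
}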
Libraries and Tools: NVIDIA provides a rich set of libraries (like cuBLAS, cuFFT, and Thrust) and tools
(like Nsight and CUDA-GDB) to help developers optimize and debug their CUDA applications.
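As a taste of the library level, here is a small sketch using Thrust (the values are
illustrative; sort and reduce are standard Thrust algorithms that run on the device):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main() {
    // STL-like container whose storage lives in GPU memory.
    thrust::device_vector<int> v(4);
    v[0] = 3; v[1] = 1; v[2] = 4; v[3] = 2;

    thrust::sort(v.begin(), v.end());                // parallel sort on the device
    int total = thrust::reduce(v.begin(), v.end());  // parallel sum: 10
    return total == 10 ? 0 : 1;
}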
Wide Application: CUDA is used in various fields, including scientific research, machine learning,
data analytics, and real-time rendering.
Thread Block Organization: thread, block, grid
• Thread – distributed by the CUDA runtime (identified by threadIdx).
• Block – a user-defined group of 1 to 512 or 1024 threads (identified by blockIdx).
• Grid – a group of one or more blocks. A grid is created for each CUDA kernel function call.
• Thread organization is like an array of thread indices.
• Goal of NVIDIA GPUs: run such a massive number of threads as efficiently as possible.
Supporting thread execution
• GPU supports efficient execution of a massive number of threads (each thread ID often
maps to one array index).
o The threads to be executed are organized into kernels.
o Each kernel corresponds to one "grid" (of threads) – multiple kernels can be launched at the same time.
o Each grid has many "blocks" of threads (usually 512-1024 threads per block). Example:

dim3 dimBlock(4, 4, 4);     // 64 threads per block, arranged 4 x 4 x 4
dim3 dimGrid(100, 100, 1);  // 10,000 blocks, arranged 100 x 100 x 1
kernel<<<dimGrid, dimBlock>>>(...);

o There is a maximum number of threads per block (e.g., 1024, depending on the GPU).
o There is a maximum number of blocks per grid dimension (often 65535).
o You cannot go over these limits.
o Much of GPU programming is dealing with such hardware limitations; a query sketch follows below.
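Because the limits differ across GPUs, a common pattern is to query them at runtime. A
minimal sketch using the standard cudaGetDeviceProperties call:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}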
Supporting thread execution
• Each GPU has multiple streaming multiprocessors (SMs).
o Threads are executed in groups of 32 (called warps) in an SIMD manner: at a given
time, one instruction is executed for the whole warp.
o Each SM has many processors or CUDA cores (32 in Fermi, 128 in Ampere) and a
small number (1-4) of instruction units.
o Each SM also has limited resources – there is a limit on how many threads it can support,
which affects how efficient CUDA programs are written.
Streaming Multiprocessors
[Figure: block diagrams of an early-generation and a latest-generation streaming multiprocessor.]
SIMD and warp scheduler
• CUDA exploits data parallelism.
o SIMD is an effective way to exploit data parallelism with reduced control logic.
• CUDA uses a more flexible form (or an extension) of SIMD called SIMT (Single
Instruction, Multiple Threads) to execute threads.
o 32 threads are scheduled together as a group called a warp.
o All threads in a warp start at the same PC, but are free to branch and execute independently.
o A warp executes one common instruction at a time.
o It performs best when all threads share one instruction.
o If different threads in a warp run different instructions (SIMT allows this, SIMD does not),
it takes multiple cycles for the warp to complete one instruction – warp divergence.
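A sketch of two hypothetical kernels illustrating the effect: in the first, even and odd
lanes of every warp branch differently (divergence); in the second, the branch is aligned
to warp boundaries, so no warp diverges:

// Divergent: even and odd lanes of the same warp take different branches,
// so the warp must execute both paths serially.
__global__ void divergent(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (threadIdx.x % 2 == 0) x[i] = x[i] * 2.0f;
        else                      x[i] = x[i] + 1.0f;
    }
}

// Warp-aligned: all 32 threads of any given warp take the same path,
// so no warp diverges even though different warps do different work.
__global__ void warpAligned(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if ((threadIdx.x / 32) % 2 == 0) x[i] = x[i] * 2.0f;
        else                             x[i] = x[i] + 1.0f;
    }
}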
Warp scheduling
[Figure: warp scheduling.]
Inside the GPU
• Run a massive number of threads effectively with minimum power.
• The NVIDIA GPU architecture evolves over time; raw computing power keeps increasing, and
new features continue to be added.
o Fermi (2010)
o Kepler (2012)
o Maxwell (2014)
o Pascal (2016)
o Turing (2018)
o Ampere (2020)
• There is a white paper for each of these architectures from NVIDIA. We will discuss
key features of some of the generations.
Fermi
• 16 streaming multiprocessors (SMs), each SM having 32 processing units (CUDA cores), for a
total of 512 CUDA cores.
• Each core executes one floating-point or integer instruction per clock for a thread.
• Each SM:
o 32 CUDA cores
o 64KB shared memory/L1 cache
o 32K 32-bit register file per SM
o Two warp schedulers, each with one instruction dispatch unit
o 16 load/store units
o Support for double-precision operations
• Connecting to the CPU: 1 work queue between CPU and GPU (at any one time, one kernel is active).
• No special support for multi-GPU systems. Communication from one GPU to another GPU:
GPU memory -> CPU memory -> GPU memory.
Fermi compute capability
• Fermi: CUDA compute capability 2.1
o Max threads / thread block: 1024
o Max warps / multiprocessor: 48
o Max threads / multiprocessor: 1536
o Max thread blocks / multiprocessor: 8
o 32-bit registers / multiprocessor: 32768
o Max registers / thread block: 32768
o Max registers / thread: 63
o Max shared memory / multiprocessor: 48KB
o Max shared memory / thread block: 48KB
o Max grid dimension: 2^16 - 1
Fermi warp scheduling in an SM
[Figure: Fermi warp scheduling in an SM.]
Kepler (GK110)
• 15 streaming multiprocessors (SMs) with 192 processing units (CUDA cores) each, for a
total of 2880 CUDA cores.
• Each SM:
o 192 CUDA cores
o 128KB shared memory/L1 cache (twice that of Fermi)
o 128K 32-bit register file per SM (4 times that of Fermi)
o 4 warp schedulers, each with two instruction dispatch units
o 32 load/store units
• Hyper-Q: 32 work queues connecting to the CPU, so up to 32 kernels can be active at once.
• NVIDIA GPUDirect: supports multiple GPUs in one server -- moving data between GPU
memories without going through CPU memory (see the peer-to-peer copy sketch after this list).
• Dynamic parallelism: new threads can be launched from within a kernel.
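A minimal sketch of a direct GPU-to-GPU copy through the standard CUDA peer-to-peer API
(GPUDirect P2P is used underneath when the hardware allows it); the sketch assumes a
machine with at least two GPUs:

#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can device 0 reach device 1's memory?

    size_t bytes = 1 << 20;
    float *d0, *d1;
    cudaSetDevice(0);
    cudaMalloc(&d0, bytes);
    if (canAccess) cudaDeviceEnablePeerAccess(1, 0);  // flags argument must be 0
    cudaSetDevice(1);
    cudaMalloc(&d1, bytes);

    // Copies between the two GPU memories; falls back to staging through
    // host memory if peer access is unavailable.
    cudaMemcpyPeer(d1, 1, d0, 0, bytes);

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return 0;
}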
Dynamic parallelism
• The GPU can launch new threads from within a running kernel, as in the sketch below.
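A minimal sketch (the kernel names are illustrative; dynamic parallelism assumes compute
capability 3.5+ and compilation with relocatable device code, e.g.,
nvcc -rdc=true -arch=sm_35 dynpar.cu -lcudadevrt):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void child(int parentBlock) {
    printf("child thread %d launched by parent block %d\n", threadIdx.x, parentBlock);
}

__global__ void parent() {
    // A kernel launching another kernel: thread 0 of each parent block
    // starts a child grid of 4 threads.
    if (threadIdx.x == 0)
        child<<<1, 4>>>(blockIdx.x);
}

int main() {
    parent<<<2, 32>>>();
    cudaDeviceSynchronize();  // waits for the parent and all child grids
    return 0;
}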
Kepler (GK110) compute capability
• CUDA compute capability 3.7 (Fermi: CUDA 2.1)
o Threads / warp: 32
o Max threads / thread block: 1024
o Max warps / multiprocessor: 64 (Fermi 48)
o Max threads / multiprocessor: 2048 (Fermi 1536)
o Max thread blocks / multiprocessor: 16 (Fermi 8)
o 32-bit registers / multiprocessor: 131K (Fermi 32768)
o Max registers / thread block: 65536 (Fermi 32768)
o Max registers / thread: 255 (Fermi 63)
o Max shared memory / multiprocessor: 112KB (Fermi 48KB)
o Max shared memory / thread block: 48KB
o Max grid dimension: 2^31 - 1 (Fermi 2^16 - 1)
Maxwell
Maxwell is the codename for a GPU microarchitecture developed by NVIDIA, succeeding the Kepler
architecture and preceding Pascal. Introduced in 2014, Maxwell brought significant improvements in power
efficiency and performance, making it a notable advancement in GPU technology.
Key Features of Maxwell:
Improved Streaming Multiprocessor (SM) Design: Maxwell introduced a new SM design, called SMM, which increased
energy efficiency and performance. This design allowed for better workload balancing and control logic partitioning.
Enhanced Memory Architecture: Maxwell GPUs featured a larger L2 cache, reducing the need for higher memory
bandwidth and improving overall efficiency.
Dynamic Super Resolution (DSR): This technology enabled GPUs to render games at higher resolutions and then
downscale them to fit the display, providing better image quality.
Voxel Global Illumination (VXGI): Maxwell introduced VXGI, which allowed for real-time dynamic global illumination,
enhancing the realism of lighting in games.
CUDA Compute Capability: Maxwell GPUs supported CUDA Compute Capability 5.0 and 5.2, enabling more advanced
parallel computing features.
Notable Maxwell GPUs:
GeForce GTX 750 and GTX 750 Ti: The first GPUs to feature the Maxwell architecture, offering significant improvements
in power efficiency.
GeForce GTX 970 and GTX 980: Popular high-performance GPUs that showcased Maxwell's capabilities in gaming and
professional applications.
GeForce GTX 980 Ti and Titan X: High-end GPUs that pushed the limits of performance and efficiency.
Pascal (GP100)
• 56 streaming multiprocessors (SMs) with 64 processing units (CUDA cores) each, for a
total of 3584 FP32 CUDA cores (1792 FP64 cores).
• NVLink: more multi-GPU support.
• HBM2: high-capacity stacked memory architecture, 3x the memory bandwidth of earlier
GPUs.
• Unified memory: a hardware-software solution to provide a unified virtual address
space for CPU and GPU memory.
NVLink
• Further supports multi-GPU systems: 160GB/s, 5x the speed of PCIe Gen 3.

PCIe 1.0: 8GB/s
PCIe 2.0: 16GB/s
PCIe 3.0: 32GB/s
PCIe 4.0: 64GB/s
PCIe 5.0: 128GB/s
Pascal (GP100) compute capability
• CUDA compute capability 6.0 (Kepler GK110: CUDA 3.7)
o Threads / warp: 32 (32)
o Max warps / multiprocessor: 64 (64)
o Max threads / multiprocessor: 2048 (2048)
o Max thread blocks / multiprocessor: 32 (16)
o 32-bit registers / multiprocessor: 64K (64K)
o Max registers / thread block: 64K (64K)
o Max registers / thread: 255 (255)
o Max shared memory / multiprocessor: 64KB (16KB-48KB)
o Max shared memory / thread block: 48KB
Turing (TU102)
• 72 SM units with 64 CUDA cores each (4608 total), plus tensor cores (8 per SM, 576 total)
• Turing tensor cores
• Memory compression
• The latest generation covered here: Ampere.

Example Code
Here's a simple example of a CUDA program that adds two vectors:

#include <cuda_runtime.h>
#include <iostream>

// Kernel: each thread adds one pair of elements.
__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    int N = 1000;
    size_t size = N * sizeof(float);

    // Allocate and initialize the host arrays.
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    for (int i = 0; i < N; ++i) {
        h_A[i] = i;
        h_B[i] = i * 2;
    }

    // Allocate device arrays and copy the inputs to the device.
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch enough blocks to cover all N elements.
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy the result back (this cudaMemcpy waits for the kernel to finish).
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; ++i) {
        std::cout << h_C[i] << " ";
    }
    std::cout << std::endl;

    // Release device and host memory.
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}
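To build and run the example with NVIDIA's nvcc compiler (assuming the file is saved as,
say, vector_add.cu):

nvcc vector_add.cu -o vector_add
./vector_add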
