High Performance Computing Center
Hanoi University of Science & Technology
CUDA Programming Basics
Duong Nhat Tan (dn.nhattan@gmail.com)
2012
Outline
CUDA Installation
Kernel launches
Some specifics of GPU code
Design Goals
Scale to 100s of cores, 1000s of parallel threads
(Figure: GPU generations G80, GT200, Tesla, Fermi, + NVIDIA CUDA)
Let programmers focus on parallel algorithms
Enable heterogeneous systems (i.e., CPU+GPU)
CPU & GPU are separate devices with separate DRAM
GPU Computing with CUDA
CUDA: Compute Unified Device Architecture
Application development environment for NVIDIA GPUs
Compiler, debugger, profiler, high-level programming languages
Libraries (CUBLAS, CUFFT, ...) and code samples
CUDA C language
An extension of C/C++
Data-parallel programming
Executes thousands of threads in parallel on GPUs
Synchronization is inexpensive
CUDA Installation
CUDA development tools consist of three key components:
CUDA driver
CUDA toolkit
Compiler, Assembler
Libraries
Documentation
CUDA SDK
http://developer.nvidia.com/category/zone/cuda-zone
CUDA files
Routines that call the device must be in plain C, in a file with the .cu extension
Often two files:
1) The kernel: a .cu file containing the routines that run on the device
2) A .cu file that calls the kernel routines and includes the kernel
Optional additional .cpp or .c files with other routines can be linked in
Compilation
Any source file containing CUDA language
extensions must be compiled with NVCC
NVCC is a compiler driver
Works by invoking all the necessary tools and compilers like gcc,
g++, cl, ...
NVCC outputs:
C code (host CPU code)
PTX (device code)
Conceptual Foundations
Kernels: C functions that, when called, are executed in parallel by many CUDA threads
Threads
Each thread has a unique thread ID
Thread indices can be 1D, 2D, or 3D
The thread ID is accessible within the kernel through the threadIdx variable
Blocks
A group of threads (1D, 2D, or 3D)
The block ID is accessible within the kernel through the blockIdx variable
Grids
A group of blocks
Defines the total number of threads (N) executed in parallel
Threads in different blocks of the same grid cannot directly communicate with each other
CUDA C kernel
kernel<<<numBlocks,ThreadsPerBlock>>>(…parameter list …);
// Kernel definition (CUDA)
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
}

// Equivalent sequential version (plain C)
void VecAdd(float* A, float* B, float* C)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Function invocation
    VecAdd(A, B, C);
}
Data Parallelism
A CUDA kernel is executed by an array of
threads
All threads run the same code
Each thread has an ID that it uses to compute
memory addresses and make control decisions
CUDA kernel and thread
Parallel portions of an application are executed on
the device as kernels
One kernel is executed at a time
Many threads execute each kernel
Differences between CUDA and CPU threads
CUDA threads are extremely lightweight
Very little creation overhead
Instant switching
CUDA uses 1000s of threads to achieve efficiency
Multi-core CPUs can use only a few
Kernel Memory Access
Registers
Global Memory
Kernel input and output data reside
here
Off-chip, large
Uncached
Shared Memory
Shared among threads in a single
block
On-chip, small
As fast as registers
The host can read & write global
memory but not shared memory
Heterogeneous Programming
Execution Model
Single Instruction Multiple Thread (SIMT) Execution:
• Groups of 32 threads are formed into warps
o always executing the same instruction
o share instruction fetch/dispatch
o hardware automatically handles divergence
• Warps are the primitive unit of scheduling
GPU Memory Allocation / Release
cudaMalloc(void ** pointer, size_t nbytes)
cudaMemset(void * pointer, int value, size_t count)
cudaFree(void* pointer)
int n = 1024;
int nbytes = n * sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );   // allocate nbytes on the device
cudaMemset( d_a, 0, nbytes );         // initialize device memory to 0
cudaFree(d_a);                        // release device memory
Data copies
cudaMemcpy(void *dst, void *src, size_t nbytes,
enum cudaMemcpyKind direction);
Direction specifies locations (host or device) of src and dst
Blocks the CPU thread: returns only after the copy is complete
Doesn’t start copying until previous CUDA calls complete
enum cudaMemcpyKind
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
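For example, a round trip of an array between host and device might look like this sketch (pointer names h_a and d_a are illustrative):

int n = 1024;
int nbytes = n * sizeof(int);
int *h_a = (int*)malloc(nbytes);          // host buffer
int *d_a = 0;
cudaMalloc((void**)&d_a, nbytes);         // device buffer

cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);   // host -> device
cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);   // device -> host

cudaFree(d_a);
free(h_a);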
Copying between host and device
Part1: Allocate memory for pointers d_a and d_b on
the device.
Part2: Copy h_a on the host to d_a on the device.
Part3: Do a device to device copy from d_a to d_b.
Part4: Copy d_b on the device back to h_a on the
host.
Part5: Free d_a and d_b on the device
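A minimal sketch of one possible solution (pointer names and nbytes follow the exercise; error checking is omitted):

// Part1: allocate device memory
cudaMalloc((void**)&d_a, nbytes);
cudaMalloc((void**)&d_b, nbytes);
// Part2: host -> device
cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);
// Part3: device -> device
cudaMemcpy(d_b, d_a, nbytes, cudaMemcpyDeviceToDevice);
// Part4: device -> host
cudaMemcpy(h_a, d_b, nbytes, cudaMemcpyDeviceToHost);
// Part5: release device memory
cudaFree(d_a);
cudaFree(d_b);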
Outline
CUDA Installation
Kernel Launches
Hands-On
Executing Code on the GPU
Kernels are C functions with some restrictions
Can only access GPU memory
Must have void return type
Not recursive
No static variables
Function arguments automatically copied from CPU
to GPU memory
Function Qualifiers
__global__ :
invoked from within host (CPU) code,
cannot be called from device (GPU) code
must return void
__device__ :
called from other GPU functions,
cannot be called from host (CPU) code
__host__ :
can only be executed by CPU, called from host
__host__ and __device__ qualifiers can be combined
Sample use: overloading operators
Compiler will generate both CPU and GPU code
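A small illustration of the qualifiers (function names are made up for this sketch):

__device__ float square(float x)           // runs on the GPU, callable only from GPU code
{
    return x * x;
}

__host__ __device__ float twice(float x)   // compiled for both CPU and GPU
{
    return 2.0f * x;
}

__global__ void scale(float *a)            // kernel: launched from host code, returns void
{
    int i = threadIdx.x;
    a[i] = square(twice(a[i]));
}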
Variable Qualifiers
__device__
Stored in device memory (large, high latency, no cache)
Allocated with cudaMalloc (__device__ qualifier implied)
Accessible by all threads
Lifetime: application
__shared__
Stored in on-chip shared memory (very low latency)
Allocated by execution configuration or at compile time
Accessible by all threads in the same thread block
Lifetime: kernel execution
Unqualified variables:
Scalars and built-in vector types are stored in registers
What doesn't fit in registers spills to local memory
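A short sketch of the three cases (names are illustrative; assumes at most 256 threads per block):

__device__ int d_counter;                  // device memory: visible to all threads, lives for the application

__global__ void kernel(float *in)
{
    __shared__ float tile[256];            // shared memory: one copy per thread block
    int i = threadIdx.x;                   // unqualified scalar: lives in a register
    tile[i] = in[blockIdx.x * blockDim.x + i];
}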
Launching kernels
Modified C function call syntax:
kernel<<<dim3 grid, dim3 block>>>(…)
Execution Configuration (“<<< >>>”):
grid dimensions: x and y
thread-block dimensions: x, y, and z
dim3 grid(16, 16);
dim3 block(16,16);
kernel<<<grid, block>>>(...);
kernel<<<32, 512>>>(...);
CUDA Built-in Device Variables
All __global__ and __device__ functions have
access to these automatically defined variables
dim3 gridDim;
Dimensions of the grid in blocks (at most 2D)
dim3 blockDim;
Dimensions of the block in threads
dim3 blockIdx;
Block index within the grid
dim3 threadIdx;
Thread index within the block
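A common use is combining them into a global thread index, e.g.:

__global__ void fill(int *a, int N)
{
    // global index = block offset + thread offset within the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)                           // guard threads past the end of the array
        a[idx] = idx;
}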
Minimal Kernel
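A sketch of about the smallest useful kernel (d_a is assumed to be device memory allocated with cudaMalloc):

__global__ void minimal(int *d_a)
{
    *d_a = 13;                             // every launched thread writes the same value
}

// launch with a single thread
minimal<<<1, 1>>>(d_a);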
Example: Increment Array Elements
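A sketch of an increment kernel, consistent with the increment_gpu launch used later in these slides (assumes N is a multiple of blockSize):

__global__ void increment_gpu(float *a, float b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    a[idx] = a[idx] + b;
}

// host side
int blockSize = 256;
increment_gpu<<<N / blockSize, blockSize>>>(d_A, b);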
Host Synchronization
All kernel launches are asynchronous
control returns to CPU immediately
kernel executes after all previous CUDA calls have
completed
cudaMemcpy() is synchronous
control returns to CPU after copy completes
copy starts after all previous CUDA calls have completed
cudaThreadSynchronize()
blocks until all previous CUDA calls complete
Host Synchronization Example
// copy data from host to device
cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);
// execute the kernel
increment_gpu<<< N/blockSize, blockSize>>>(d_A, b);
//run independent CPU code
run_cpu_stuff();
// copy data from device back to host (waits until the kernel has finished)
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);
Using shared memory
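A typical pattern is to stage data in shared memory, synchronize, then read it back; a sketch (assumes at most 256 threads per block):

__global__ void staged(float *in, float *out)
{
    __shared__ float tile[256];            // visible to every thread in the block

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx];           // each thread loads one element
    __syncthreads();                       // wait until the whole block has loaded
    out[idx] = tile[blockDim.x - 1 - threadIdx.x];   // safely read another thread's element
}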
GPU Thread Synchronization
void __syncthreads();
Synchronizes all threads in a block
Generates barrier synchronization instruction
No thread can pass this barrier until all threads in the block reach
it
Allowed in conditional code only if the conditional is uniform
across the entire thread block
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (blockIdx.x == targetBlock) {   // uniform across the block, so __syncthreads() is allowed
    shared_data[blockDim.x - (threadIdx.x + 1)] = a[idx];
    __syncthreads();
    a[idx] = shared_data[threadIdx.x];
}
CUDA Error Reporting to CPU
All CUDA calls return error code:
Except for kernel launches
cudaError_t type
cudaError_t cudaGetLastError(void)
Returns the code for the last error ("no error" also has a code: cudaSuccess)
Can be used to get errors from kernel execution
const char* cudaGetErrorString(cudaError_t code)
Returns a null-terminated character string describing the error
printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
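A common pattern (sketch) is to check for errors right after a kernel launch:

kernel<<<grid, block>>>(d_a);
cudaError_t err = cudaGetLastError();      // reports launch / configuration errors
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));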
My first kernel
Part1: Allocate device memory for the result of the kernel
using pointer d_a.
Part2: Configure and launch the kernel using a 1-D grid of 1-D
thread blocks.
Part3: Have each thread set an element of d_a as follows:
idx = blockIdx.x*blockDim.x + threadIdx.x
d_a[idx] = 1000*blockIdx.x + threadIdx.x
Part4: Copy the result in d_a back to the host pointer h_a.
Part5: Verify that the result is correct
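A possible kernel for Part3 (the name firstKernel and the launch parameters are illustrative):

__global__ void firstKernel(int *d_a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_a[idx] = 1000 * blockIdx.x + threadIdx.x;   // encodes block and thread in the result
}

// Part2: 1-D grid of 1-D blocks
firstKernel<<<numBlocks, numThreadsPerBlock>>>(d_a);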
reverseArray_singleblock
Given an input array {a[0], a[1], ..., a[n-1]} in pointer d_a, store the
reversed array {a[n-1], a[n-2], ..., a[0]} in pointer d_b
Only one thread block launched, to reverse an array of size
N = numThreads = 256 elements
Part 1 (of 1): All you have to do is implement the body of the
kernel “reverseArrayBlock()”
Each thread moves a single element to reversed position
Read input from d_a pointer
Store output in reversed location in d_b pointer
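One possible kernel body (a sketch; the argument order is an assumption, and no bounds check is needed because N equals the thread count):

__global__ void reverseArrayBlock(int *d_b, int *d_a)
{
    int in  = threadIdx.x;                      // position read from d_a
    int out = blockDim.x - 1 - threadIdx.x;     // reversed position written in d_b
    d_b[out] = d_a[in];
}

// one block of N threads
reverseArrayBlock<<<1, N>>>(d_b, d_a);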
reverseArray_multiblock
Given an input array {a[0], a[1], ..., a[n-1]} in pointer d_a, store the
reversed array {a[n-1], a[n-2], ..., a[0]} in pointer d_b
Multiple 256-thread blocks launched
To reverse an array of size N, N/256 blocks are launched
Part 1: Compute the number of blocks to launch
Part 2: Implement the kernel reverseArrayBlock()
Note that now you must compute both
The reversed location within the block
The reversed offset to the start of the block
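A sketch of the multi-block version (assumes N is a multiple of the block size):

__global__ void reverseArrayBlock(int *d_b, int *d_a)
{
    int inOffset  = blockDim.x * blockIdx.x;                   // start of this block's input
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x); // where the reversed block starts
    int in  = inOffset  + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);      // reversed position within that block
    d_b[out] = d_a[in];
}

// Part 1: number of blocks for an array of size N
int numBlocks = N / 256;
reverseArrayBlock<<<numBlocks, 256>>>(d_b, d_a);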
THANK YOU