High Performance Computing Center
Hanoi University of Science & Technology
CUDA Programming Basics
Duong Nhat Tan (dn.nhattan@gmail.com)
2012
Outline
CUDA Installation
Kernel launches
Some specifics of GPU code
Design Goals
Scale to 100s of cores, 1000s of parallel threads
(Figure: GPU generations G80, GT200, Tesla, Fermi, + NVIDIA CUDA)
Let programmers focus on parallel algorithms
Enable heterogeneous systems (i.e., CPU+GPU)
CPU & GPU are separate devices with separate DRAM
GPU Computing with CUDA
CUDA: Compute Unified Device Architecture
Application development environment for NVIDIA GPUs
Compiler, debugger, profiler, high-level programming languages
Libraries (CUBLAS, CUFFT, ...) and code samples
CUDA C language
An extension of C/C++
Data-parallel programming
Executes thousands of threads in parallel on GPUs
Synchronization is inexpensive
CUDA Installation
CUDA development tools consist of three key components:
CUDA driver
CUDA toolkit
Compiler, Assembler
Libraries
Documentation
CUDA SDK
http://developer.nvidia.com/category/zone/cuda-zone
CUDA files
Routines that call the device must be in plain C, in a file with the .cu extension
Often two files:
1) The kernel: a .cu file containing the routines that run on the device
2) A .cu file that calls the kernel routines and includes the kernel
Optional additional .cpp or .c files with other routines can be linked in
Compilation
Any source file containing CUDA language
extensions must be compiled with NVCC
NVCC is a compiler driver
Works by invoking all the necessary tools and compilers like gcc,
g++, cl, ...
NVCC outputs:
C code (host CPU code)
PTX (device code)
Conceptual Foundations
Kernels: C functions that, when called, are executed in parallel by many CUDA threads
Threads
Each thread has a unique thread ID
Thread indices can be 1D, 2D, or 3D
The thread ID is accessible within the kernel through the threadIdx variable
Blocks
A group of threads (1D, 2D, or 3D)
The block ID is accessible within the kernel through the blockIdx variable
Grids
A group of blocks
Defines the total number of threads (N) executed in parallel
Threads in different blocks of the same grid cannot directly communicate with each other
CUDA C kernel
kernel<<<numBlocks,ThreadsPerBlock>>>(…parameter list …);
// Kernel definition (CUDA)
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
}

// Equivalent sequential version (plain C)
void VecAdd(float* A, float* B, float* C)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Function invocation
    VecAdd(A, B, C);
}
Data Parallelism
A CUDA kernel is executed by an array of
threads
All threads run the same code
Each thread has an ID that it uses to compute
memory addresses and make control decisions
CUDA kernel and thread
Parallel portions of an application are executed on
the device as kernels
One kernel is executed at a time
Many threads execute each kernel
Differences between CUDA and CPU threads
CUDA threads are extremely lightweight
Very little creation overhead
Instant switching
CUDA uses 1000s of threads to achieve efficiency
Multi-core CPUs can use only a few
Kernel Memory Access
Registers
Global Memory
Kernel input and output data reside
here
Off-chip, large
Uncached
Shared Memory
Shared among threads in a single
block
On-chip, small
As fast as registers
The host can read & write global
memory but not shared memory
Heterogeneous Programming
Execution Model
Single Instruction Multiple Thread (SIMT) Execution:
• Groups of 32 threads are formed into warps
o always executing the same instruction
o share instruction fetch/dispatch
o hardware automatically handles divergence
• Warps are the primitive unit of scheduling
GPU Memory Allocation / Release
cudaMalloc(void ** pointer, size_t nbytes)
cudaMemset(void * pointer, int value, size_t count)
cudaFree(void* pointer)
int n = 1024;
int nbytes = n * sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );   // allocate nbytes on the device
cudaMemset( d_a, 0, nbytes );         // initialize device memory to 0
cudaFree(d_a);                        // release device memory
Data copies
cudaMemcpy(void *dst, void *src, size_t nbytes,
enum cudaMemcpyKind direction);
Direction specifies locations (host or device) of src and dst
Blocks the CPU thread: returns only after the copy is complete
Doesn’t start copying until previous CUDA calls complete
enum cudaMemcpyKind
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
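For example, a round trip of an array between host and device might look like this sketch (pointer names h_a and d_a are illustrative):

int n = 1024;
int nbytes = n * sizeof(int);
int *h_a = (int*)malloc(nbytes);          // host buffer
int *d_a = 0;
cudaMalloc((void**)&d_a, nbytes);         // device buffer

cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);   // host -> device
cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);   // device -> host

cudaFree(d_a);
free(h_a);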
Copying between host and device
Part1: Allocate memory for pointers d_a and d_b on
the device.
Part2: Copy h_a on the host to d_a on the device.
Part3: Do a device to device copy from d_a to d_b.
Part4: Copy d_b on the device back to h_a on the
host.
Part5: Free d_a and d_b on the device
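A minimal sketch of one possible solution (pointer names and nbytes follow the exercise; error checking is omitted):

// Part1: allocate device memory
cudaMalloc((void**)&d_a, nbytes);
cudaMalloc((void**)&d_b, nbytes);
// Part2: host -> device
cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);
// Part3: device -> device
cudaMemcpy(d_b, d_a, nbytes, cudaMemcpyDeviceToDevice);
// Part4: device -> host
cudaMemcpy(h_a, d_b, nbytes, cudaMemcpyDeviceToHost);
// Part5: release device memory
cudaFree(d_a);
cudaFree(d_b);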
Outline
CUDA Installation
Kernel Launches
Hands-On
Executing Code on the GPU
Kernels are C functions with some restrictions
Can only access GPU memory
Must have void return type
Not recursive
No static variables
Function arguments automatically copied from CPU
to GPU memory
Function Qualifiers
__global__ :
invoked from within host (CPU) code,
cannot be called from device (GPU) code
must return void
__device__ :
called from other GPU functions,
cannot be called from host (CPU) code
__host__ :
can only be executed by CPU, called from host
__host__ and __device__ qualifiers can be combined
Sample use: overloading operators
Compiler will generate both CPU and GPU code
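A small illustration of the qualifiers (function names are made up for this sketch):

__device__ float square(float x)           // runs on the GPU, callable only from GPU code
{
    return x * x;
}

__host__ __device__ float twice(float x)   // compiled for both CPU and GPU
{
    return 2.0f * x;
}

__global__ void scale(float *a)            // kernel: launched from host code, returns void
{
    int i = threadIdx.x;
    a[i] = square(twice(a[i]));
}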
Variable Qualifiers
__device__
Stored in device memory (large, high latency, no cache)
Allocated with cudaMalloc (__device__ qualifier implied)
Accessible by all threads
Lifetime: application
__shared__
Stored in on-chip shared memory (very low latency)
Allocated by execution configuration or at compile time
Accessible by all threads in the same thread block
Lifetime: kernel execution
Unqualified variables:
Scalars and built-in vector types are stored in registers
What doesn't fit in registers spills to local memory
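A short sketch of the three cases (names are illustrative; assumes at most 256 threads per block):

__device__ int d_counter;                  // device memory: visible to all threads, lives for the application

__global__ void kernel(float *in)
{
    __shared__ float tile[256];            // shared memory: one copy per thread block
    int i = threadIdx.x;                   // unqualified scalar: lives in a register
    tile[i] = in[blockIdx.x * blockDim.x + i];
}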
Launching kernels
Modified C function call syntax:
kernel<<<dim3 grid, dim3 block>>>(…)
Execution Configuration (“<<< >>>”):
grid dimensions: x and y
thread-block dimensions: x, y, and z
dim3 grid(16, 16);
dim3 block(16,16);
kernel<<<grid, block>>>(...);
kernel<<<32, 512>>>(...);
CUDA Built-in Device Variables
All __global__ and __device__ functions have
access to these automatically defined variables
dim3 gridDim;
Dimensions of the grid in blocks (at most 2D)
dim3 blockDim;
Dimensions of the block in threads
dim3 blockIdx;
Block index within the grid
dim3 threadIdx;
Thread index within the block
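A common use is combining them into a global thread index, e.g.:

__global__ void fill(int *a, int N)
{
    // global index = block offset + thread offset within the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)                           // guard threads past the end of the array
        a[idx] = idx;
}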
Minimal Kernel
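A sketch of about the smallest useful kernel (d_a is assumed to be device memory allocated with cudaMalloc):

__global__ void minimal(int *d_a)
{
    *d_a = 13;                             // every launched thread writes the same value
}

// launch with a single thread
minimal<<<1, 1>>>(d_a);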
Example: Increment Array Elements
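A sketch of an increment kernel, consistent with the increment_gpu launch used later in these slides (assumes N is a multiple of blockSize):

__global__ void increment_gpu(float *a, float b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    a[idx] = a[idx] + b;
}

// host side
int blockSize = 256;
increment_gpu<<<N / blockSize, blockSize>>>(d_A, b);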
Host Synchronization
All kernel launches are asynchronous
control returns to CPU immediately
kernel executes after all previous CUDA calls have
completed
cudaMemcpy() is synchronous
control returns to CPU after copy completes
copy starts after all previous CUDA calls have completed
cudaThreadSynchronize()
blocks until all previous CUDA calls complete
Host Synchronization Example
// copy data from host to device
cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);
// execute the kernel
increment_gpu<<< N/blockSize, blockSize>>>(d_A, b);
//run independent CPU code
run_cpu_stuff();
// copy data from device back to host (waits until the kernel has finished)
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);
Using shared memory
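A typical pattern is to stage data in shared memory, synchronize, then read it back; a sketch (assumes at most 256 threads per block):

__global__ void staged(float *in, float *out)
{
    __shared__ float tile[256];            // visible to every thread in the block

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx];           // each thread loads one element
    __syncthreads();                       // wait until the whole block has loaded
    out[idx] = tile[blockDim.x - 1 - threadIdx.x];   // safely read another thread's element
}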
GPU Thread Synchronization
void __syncthreads();
Synchronizes all threads in a block
Generates barrier synchronization instruction
No thread can pass this barrier until all threads in the block reach
it
Allowed in conditional code only if the conditional is uniform
across the entire thread block
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (blockIdx.x == targetBlock) {   // uniform across the block, so __syncthreads() is allowed
    shared_data[blockDim.x - (threadIdx.x + 1)] = a[idx];
    __syncthreads();
    a[idx] = shared_data[threadIdx.x];
}
CUDA Error Reporting to CPU
All CUDA calls return error code:
Except for kernel launches
cudaError_t type
cudaError_t cudaGetLastError(void)
Returns the code for the last error ("no error" also has a code: cudaSuccess)
Can be used to get errors from kernel execution
const char* cudaGetErrorString(cudaError_t code)
Returns a null-terminated character string describing the error
printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
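A common pattern (sketch) is to check for errors right after a kernel launch:

kernel<<<grid, block>>>(d_a);
cudaError_t err = cudaGetLastError();      // reports launch / configuration errors
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));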
My first kernel
Part1: Allocate device memory for the result of the kernel
using pointer d_a.
Part2: Configure and launch the kernel using a 1-D grid of 1-D
thread blocks.
Part3: Have each thread set an element of d_a as follows:
idx = blockIdx.x*blockDim.x + threadIdx.x
d_a[idx] = 1000*blockIdx.x + threadIdx.x
Part4: Copy the result in d_a back to the host pointer h_a.
Part5: Verify that the result is correct
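A possible kernel for Part3 (the name firstKernel and the launch parameters are illustrative):

__global__ void firstKernel(int *d_a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_a[idx] = 1000 * blockIdx.x + threadIdx.x;   // encodes block and thread in the result
}

// Part2: 1-D grid of 1-D blocks
firstKernel<<<numBlocks, numThreadsPerBlock>>>(d_a);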
reverseArray_singleblock
Given an input array {a[0], a[1], ..., a[n-1]} in pointer d_a, store the
reversed array {a[n-1], a[n-2], ..., a[0]} in pointer d_b
Only one thread block launched, to reverse an array of size
N = numThreads = 256 elements
Part 1 (of 1): All you have to do is implement the body of the
kernel “reverseArrayBlock()”
Each thread moves a single element to reversed position
Read input from d_a pointer
Store output in reversed location in d_b pointer
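One possible kernel body (a sketch; the argument order is an assumption, and no bounds check is needed because N equals the thread count):

__global__ void reverseArrayBlock(int *d_b, int *d_a)
{
    int in  = threadIdx.x;                      // position read from d_a
    int out = blockDim.x - 1 - threadIdx.x;     // reversed position written in d_b
    d_b[out] = d_a[in];
}

// one block of N threads
reverseArrayBlock<<<1, N>>>(d_b, d_a);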
reverseArray_multiblock
Given an input array {a[0], a[1], ..., a[n-1]} in pointer d_a, store the
reversed array {a[n-1], a[n-2], ..., a[0]} in pointer d_b
Multiple 256-thread blocks launched
To reverse an array of size N, N/256 blocks are launched
Part 1: Compute the number of blocks to launch
Part 2: Implement the kernel reverseArrayBlock()
Note that now you must compute both
The reversed location within the block
The reversed offset to the start of the block
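A sketch of the multi-block version (assumes N is a multiple of the block size):

__global__ void reverseArrayBlock(int *d_b, int *d_a)
{
    int inOffset  = blockDim.x * blockIdx.x;                   // start of this block's input
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x); // where the reversed block starts
    int in  = inOffset  + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);      // reversed position within that block
    d_b[out] = d_a[in];
}

// Part 1: number of blocks for an array of size N
int numBlocks = N / 256;
reverseArrayBlock<<<numBlocks, 256>>>(d_b, d_a);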
THANK YOU