CUDA
Programming Model
These notes will introduce:
• Basic GPU programming model
• CUDA kernel
• Simple CUDA program to add two vectors together
• Compiling the code on a Linux system
ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 20, 2011
Programming Model
GPUs were historically designed for creating image data for
displays.
That application involves manipulating image pixels
(picture elements), often applying the same operation to each
pixel.
SIMD (single instruction multiple data) model -- an
efficient mode of operation in which the same operation
is done on each data element at the same time.
SIMD (Single Instruction Multiple Data)
model
Also known as data-parallel computation.
One instruction specifies the operation, e.g. a[] = a[] + k.
[Figure: one instruction feeding an array of ALUs, which operate on
a[0], a[1], ..., a[n-2], a[n-1] simultaneously]
Very efficient if this is what you want to do. One program.
Can design computers to operate this way.
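For reference, the sketch below (ordinary C, illustrative only; the array a and the constant k are assumed) writes the a[] = a[] + k operation as a sequential loop. SIMD hardware performs the same additions on many elements in a single instruction rather than one element per loop iteration.

#include <stdio.h>

#define N 8

int main(void) {
    int a[N] = {0, 1, 2, 3, 4, 5, 6, 7};
    int k = 10;

    /* Sequential version: one element per loop iteration.
       A SIMD machine applies the same "+ k" to many elements at once. */
    for (int i = 0; i < N; i++)
        a[i] = a[i] + k;

    for (int i = 0; i < N; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}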
Single Instruction Multiple Thread
Programming Model
A version of SIMD used in GPUs.
GPUs use a thread model to achieve very high parallel
performance and to hide memory latency.
Multiple threads each execute the same instruction sequence.
On a GPU, a very large number of threads (10,000's) is possible.
Threads are mapped onto the available processors on the GPU (100's of
processors, all executing the same program sequence).
Programming applications
using SIMT model
Matrix operations -- very amenable to SIMT
• Same operations done on different elements of matrices
Some "embarrassingly" parallel computations such as
Monte Carlo calculations
• Monte Carlo calculations use random selections;
  the random selections are independent of each other
Data manipulations
• Some sorting can be done quite efficiently
…
CUDA kernel routine
To write a SIMT program, one needs to write a code
sequence that all the threads on the GPU will execute.
In CUDA, this code sequence is called a kernel routine.
Kernel code is regular C, except that one typically needs
to use the thread ID in expressions to ensure each thread
accesses different data:
Example
…
All threads do this:
index = ThreadID;
A[index] = B[index] + C[index];
CPU and GPU memory
• Program, once compiled, has code executed on the CPU and (kernel)
  code executed on the GPU
• Separate memories on the CPU (CPU main memory) and the GPU (GPU
  global memory)
Need to:
• Explicitly transfer data from CPU to GPU for the GPU computation, and
• Explicitly transfer results in GPU memory back to CPU memory
[Figure: CPU with its main memory, GPU with its global memory, and
arrows "Copy from CPU to GPU" and "Copy from GPU to CPU"]
Basic CUDA program structure
int main (int argc, char **argv ) {
1. Allocate memory space in device (GPU) for data
2. Allocate memory space in host (CPU) for data
3. Copy data to GPU
4. Call “kernel” routine to execute on GPU
(with CUDA syntax that defines the number of threads and their physical structure)
5. Transfer results from GPU to CPU
6. Free memory space in device (GPU)
7. Free memory space in host (CPU)
return 0;
}
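A minimal, self-contained sketch of this structure in CUDA C follows (illustrative only: the kernel myKernel, the names a and devA, and N = 256 are placeholder choices, and error checking is omitted).

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

#define N 256

// Trivial kernel: each thread updates one element (illustrative only)
__global__ void myKernel(int *devA) {
    devA[threadIdx.x] += 1;
}

int main(int argc, char **argv) {
    int size = N * sizeof(int);
    int *devA;                                           // device pointer
    int *a = (int*)malloc(size);                         // 2. host memory
    for (int i = 0; i < N; i++) a[i] = i;                //    (fill with some data)
    cudaMalloc((void**)&devA, size);                     // 1. device memory
    cudaMemcpy(devA, a, size, cudaMemcpyHostToDevice);   // 3. copy data to GPU
    myKernel<<<1, N>>>(devA);                            // 4. launch kernel: 1 block, N threads
    cudaMemcpy(a, devA, size, cudaMemcpyDeviceToHost);   // 5. copy results back
    printf("a[0] = %d, a[N-1] = %d\n", a[0], a[N-1]);
    cudaFree(devA);                                      // 6. free device memory
    free(a);                                             // 7. free host memory
    return 0;
}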
1. Allocating memory space in
“device” (GPU) for data
Use CUDA malloc routines:
int size = N *sizeof( int); // space for N integers
int *devA, *devB, *devC; // devA, devB, devC ptrs
cudaMalloc( (void**)&devA, size );
cudaMalloc( (void**)&devB, size );
cudaMalloc( (void**)&devC, size );
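Not shown on the slide, but cudaMalloc (like most CUDA runtime calls) returns a status of type cudaError_t; a minimal sketch of checking one of these allocations (the error-handling style is just illustrative) is:

cudaError_t err = cudaMalloc( (void**)&devA, size );
if (err != cudaSuccess) {
    // cudaGetErrorString converts the status to readable text
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));  // needs <stdio.h>
    exit(1);                                                              // needs <stdlib.h>
}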
Derived from Jason Sanders, "Introduction to CUDA C," GPU Technology Conference, Sept. 20, 2010.
2. Allocating memory space in
“host” (CPU) for data
Use regular C malloc routines:
int *a, *b, *c;
…
a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(size);
or statically declare variables:
#define N 256
…
int a[N], b[N], c[N];
3. Transferring data from host
(CPU) to device (GPU)
Use CUDA routine cudaMemcpy
cudaMemcpy( devA, A, size, cudaMemcpyHostToDevice);   // arguments: (destination, source, size, direction)
cudaMemcpy( devB, B, size, cudaMemcpyHostToDevice);
where:
devA and devB are pointers to the destinations in the device
A and B are pointers to the host data
4. Declaring “kernel” routine to
execute on device (GPU)
CUDA introduces a syntax addition to C:
Triple angle brackets mark call from host code to device code.
Contains organization and number of threads in two parameters:
myKernel<<< n, m >>>(arg1, … );
n and m will define organization of thread blocks and threads in a
block.
For now, we will set n = 1, which says one block, and m = N, which
says N threads in this block.
arg1, … -- arguments to routine myKernel, typically pointers to
device memory obtained previously from cudaMalloc.
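As an aside (not on the original slide), when N is larger than the maximum number of threads allowed per block, the launch typically uses several blocks; a hedged sketch of such a launch and of the corresponding global thread index (threadsPerBlock = 256 is just an illustrative choice; the kernel and pointers are the ones used elsewhere in these notes) is:

int threadsPerBlock = 256;                                   // illustrative choice
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;    // round up so all N elements are covered

myKernel<<< blocks, threadsPerBlock >>>(devA, devB, devC);

// Inside the kernel, each thread then computes a global index:
//   int i = blockIdx.x * blockDim.x + threadIdx.x;
//   if (i < N) { ... }   // guard: the last block may have extra threads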
Declaring a Kernel Routine
A kernel is defined using the CUDA specifier __global__ (two underscores
on each side).

Example – Adding two vectors A and B

#define N 256

__global__ void vecAdd(int *A, int *B, int *C) {   // Kernel definition
    int i = threadIdx.x;    // CUDA built-in variable that provides the thread ID in the block
    C[i] = A[i] + B[i];
}

int main() {
    // allocate device memory &
    // copy data to device
    // device mem. ptrs devA, devB, devC
    vecAdd<<<1, N>>>(devA, devB, devC);   // Grid of one block, N threads in block
    …
}

Each of the N threads performs one pair-wise addition:
Thread 0:   devC[0] = devA[0] + devB[0];
Thread 1:   devC[1] = devA[1] + devB[1];
…
Thread N-1: devC[N-1] = devA[N-1] + devB[N-1];
Loosely derived from the CUDA C Programming Guide, v3.2, 2010, NVIDIA.
5. Transferring data from device
(GPU) to host (CPU)
Use CUDA routine cudaMemcpy
cudaMemcpy( C, devC, size, cudaMemcpyDeviceToHost);   // arguments: (destination, source, size, direction)
where:
devC is a pointer in the device and C is a pointer in the host.
6. Free memory space in “device”
(GPU)
Use CUDA cudaFree routine:
cudaFree( devA );
cudaFree( devB );
cudaFree( devC );
7. Free memory space in host (CPU)
(if CPU memory allocated with malloc)
Use regular C free routine to deallocate memory if
previously allocated with malloc:
free( a );
free( b );
free( c );
Complete CUDA program

Adding two vectors, A and B. N elements in A and B, and N threads
(without code to load the arrays with data).

#define N 256

__global__ void vecAdd(int *A, int *B, int *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main (int argc, char **argv ) {
    int size = N * sizeof(int);
    int a[N], b[N], c[N], *devA, *devB, *devC;

    cudaMalloc( (void**)&devA, size );
    cudaMalloc( (void**)&devB, size );
    cudaMalloc( (void**)&devC, size );

    cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy( devB, b, size, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(devA, devB, devC);

    cudaMemcpy( c, devC, size, cudaMemcpyDeviceToHost);

    cudaFree( devA );
    cudaFree( devB );
    cudaFree( devC );
    return 0;
}
Complete CUDA program, with keyboard input for blocks/threads
(without timing execution, see later)

#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>
#include <time.h>

#define N 4096                               // size of array

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}

int main(int argc, char *argv[]) {
    int T = 10, B = 1;                       // threads per block/blocks per grid
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    printf("Size of array = %d\n", N);
    do {
        printf("Enter number of threads per block: ");
        scanf("%d", &T);
        printf("\nEnter number of blocks per grid: ");
        scanf("%d", &B);
        if (T * B < N) printf("Error T x B < N, try again");
    } while (T * B < N);

    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));

    for (int i = 0; i < N; i++) {            // load arrays with some numbers
        a[i] = i;
        b[i] = i*1;
    }

    cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);

    add<<<B,T>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++) {
        printf("%d+%d=%d\n", a[i], b[i], c[i]);
    }

    cudaFree(dev_a);                         // clean up
    cudaFree(dev_b);
    cudaFree(dev_c);

    return 0;
}
Compiling CUDA programs
“nvcc”
NVIDIA provides nvcc -- the NVIDIA CUDA “compiler
driver”.
Will separate out code for host and for device
Regular C/C++ compiler used for host (needs to be
available)
Programmer simply uses nvcc instead of gcc/cc compiler
on a Linux system
Command line options include options for GPU features
Compiling code - Linux
Command line:

nvcc -O3 -o <exe> <source_file> -I/usr/local/cuda/include
     -L/usr/local/cuda/lib -lcuda -lcudart

where -O3 sets the optimization level (if you want optimized code),
-I gives the directories for #include files, -L gives the directories
for libraries, and -lcuda -lcudart are the libraries to be linked.

A CUDA source file that includes device code has the extension .cu
nvcc separates the code for the CPU and for the GPU and compiles the code.
A regular C compiler must be installed for the CPU.
A Makefile is convenient – see next.
See “The CUDA Compiler Driver NVCC” from NVIDIA for more details
Very simple sample Makefile

NVCC = /usr/local/cuda/bin/nvcc
CUDAPATH = /usr/local/cuda
NVCCFLAGS = -I$(CUDAPATH)/include
LFLAGS = -L$(CUDAPATH)/lib64 -lcuda -lcudart -lm

# A regular C program
prog1:
	cc -o prog1 prog1.c -lm

# A C program with X11 graphics
prog2:
	cc -I/usr/openwin/include -o prog2 prog2.c -L/usr/openwin/lib -L/usr/X11R6/lib -lX11 -lm

# A CUDA program
prog3:
	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o prog3 prog3.cu

# A CUDA program with X11 graphics
prog4:
	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -I/usr/openwin/include -o prog4 prog4.cu -L/usr/openwin/lib -L/usr/X11R6/lib -lX11 -lm
Compilation process
nvcc -o prog prog.cu -I/includepath -L/libpath

The nvcc "wrapper" divides the code into host and device parts.
The host part is compiled by the regular C compiler (gcc).
The device part is compiled by the NVIDIA "ptxas" assembler.
The two compiled parts are then combined into one executable.
[Figure: nvcc sends the device code to ptxas and the host code to gcc;
the resulting object files are combined into an executable file, a "fat"
binary with both host and device code]
Executing Program
Simply type the name of the executable created by nvcc:
./prog1
The file includes all the code for the host and for the device in a “fat binary” file.
Host code starts running.
When the first device kernel is encountered, the GPU code is physically sent to
the GPU and the function is launched on the GPU.
Hence the first launch will be slow!!
The run-time environment (cudart) controls memcpy timing and
synchronization.
Questions