CUDA: Introduction
Christian Trefftz / Greg Wolffe
Grand Valley State University
Supercomputing 2008
Education Program
(modifications by Jernej Barbic, 2008-2019)
Terms
Ø What is GPGPU?
l General-Purpose computing on a Graphics
Processing Unit
l Using graphics hardware for non-graphics
computations
Ø What is CUDA?
l Parallel computing platform and API by Nvidia
l Compute Unified Device Architecture
l Software architecture for managing data-parallel
programming
l Introduced in 2007; still actively updated
Motivation
CPU vs. GPU
Ø CPU
l Fast caches
l Branching adaptability
l High performance on sequential code
Ø GPU
l Multiple ALUs
l Fast onboard memory
l High throughput on parallel tasks
• Executes program on each fragment/vertex
Ø CPUs are great for task parallelism
Ø GPUs are great for data parallelism
CPU vs. GPU - Hardware
Ø More transistors devoted to data processing
Traditional Graphics Pipeline
Vertex processing
↓
Rasterizer
↓
Fragment processing
↓
Renderer (textures)
Pixel / Thread Processing
GPU Architecture
Processing Element
Ø Processing element = thread processor
GPU Memory Architecture
Uncached:
Ø Registers
Ø Shared Memory
Ø Local Memory
Ø Global Memory
Cached:
Ø Constant Memory
Ø Texture Memory
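A minimal sketch (an addition to these slides, not original code) of where each of these spaces shows up in CUDA C; texture memory is omitted because it needs extra setup, and per-thread local memory is used automatically when registers spill:

/* constant memory: cached, read-only inside kernels */
__constant__ float coeff[16];

__global__ void memorySpacesDemo(float *in, float *out)  /* in, out point into global memory */
{
    __shared__ float tile[256];    /* shared memory: one copy per block */
    float r = in[threadIdx.x];     /* r is held in a register */
    tile[threadIdx.x] = r * coeff[0];
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}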
Data-parallel Programming
Ø Think of the GPU as a massively-threaded
co-processor
Ø Write “kernel” functions that execute on
the device -- processing multiple data
elements in parallel
Ø Keep it busy! → massive threading
Ø Keep your data close! → local memory
Hardware Requirements
Ø CUDA-capable
video card
Ø Power supply
Ø Cooling
Ø PCI-Express
A Gentle Introduction to
CUDA Programming
Credits
Ø The code used in this presentation is based
on code available in:
l the Tutorial on CUDA in Dr. Dobb's Journal
l Andrew Bellenir’s code for matrix multiplication
l Igor Majdandzic’s code for Voronoi diagrams
l NVIDIA’s CUDA programming guide
Software Requirements/Tools
Ø CUDA device driver
Ø CUDA Toolkit (compiler, CUBLAS, CUFFT)
Ø CUDA Software Development Kit
l Emulator
Profiling:
Ø Occupancy calculator
Ø Visual profiler
To compute, we need to:
Ø Allocate memory for the computation
on the GPU (incl. variables)
Ø Provide input data
Ø Specify the computation to be performed
Ø Read the results from the GPU (output)
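Each of these steps maps onto one CUDA runtime call or a kernel launch; a minimal sketch (an addition to the slides; data, data_d, n, numBlocks, threadsPerBlock, and myKernel are hypothetical names):

cudaMalloc((void**)&data_d, n * sizeof(int));                      /* 1. allocate GPU memory */
cudaMemcpy(data_d, data, n * sizeof(int), cudaMemcpyHostToDevice); /* 2. provide the input data */
myKernel<<<numBlocks, threadsPerBlock>>>(data_d);                  /* 3. perform the computation */
cudaMemcpy(data, data_d, n * sizeof(int), cudaMemcpyDeviceToHost); /* 4. read the results back */
cudaFree(data_d);                                                  /* release the GPU memory */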
Initially:
the array exists only in the host's (CPU) memory; the GPU card's memory holds nothing yet.

Allocate memory on the GPU card:
array (host's memory), array_d (GPU card's memory).

Copy contents from the host's memory to the GPU card's memory:
array → array_d.

Execute code on the GPU:
the GPU multiprocessors operate on array_d in the GPU card's memory.

Copy results back to the host's memory:
array_d → array.
The Kernel
Ø The code to be executed in the
stream processors on the GPU
Ø Simultaneous execution in
several (perhaps all) stream
processors on the GPU
Ø How is every instance of the
kernel going to know which
piece of data it is working on?
Grid and Block Size
l Grid size: the number of blocks
• Can be a 1- or 2-dimensional array of blocks
l Each block is divided into threads
• Can be a 1-, 2-, or 3-dimensional array of threads
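A minimal sketch (an addition to the slides) of specifying these sizes with dim3 values; the kernel name myKernel, the argument array_d, and the dimensions are made up for illustration:

dim3 dimGrid(16, 16);   /* grid size: a 2-dimensional array of 16x16 blocks */
dim3 dimBlock(256);     /* block size: a 1-dimensional array of 256 threads */
myKernel<<<dimGrid, dimBlock>>>(array_d);

A plain int can also be passed for either value; it is treated as a 1-dimensional size.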
Let’s look at a very simple example
Ø The code has been divided into two files:
l simple.c
l simple.cu
Ø simple.c is ordinary code in C
Ø It allocates an array of integers, initializes
it to values corresponding to the indices in
the array and prints the array.
Ø It calls a function that modifies the array
Ø The array is printed again.
simple.c
#include <stdio.h>
#define SIZEOFARRAY 64
extern void fillArray(int *a,int size);
/* The main program */
int main(int argc,char *argv[])
{
/* Declare the array that will be modified by the GPU */
int a[SIZEOFARRAY];
int i;
/* Initialize the array to 0s */
for(i=0;i < SIZEOFARRAY;i++) {
a[i]=0;
}
/* Print the initial array */
printf("Initial state of the array:\n");
for(i = 0;i < SIZEOFARRAY;i++) {
printf("%d ",a[i]);
}
printf("\n");
/* Call the function that will in turn call the function in the GPU that will fill
the array */
fillArray(a,SIZEOFARRAY);
/* Now print the array after calling fillArray */
printf("Final state of the array:\n");
for(i = 0;i < SIZEOFARRAY;i++) {
printf("%d ",a[i]);
}
printf("\n");
return 0;
}
simple.cu
Ø simple.cu contains two functions
l fillArray(): A function that will be executed on
the host and which takes care of:
• Allocating variables in the global GPU memory
• Copying the array from the host to the GPU memory
• Setting the grid and block sizes
• Invoking the kernel that is executed on the GPU
• Copying the values back to the host memory
• Freeing the GPU memory
fillArray (part 1)
#define BLOCK_SIZE 32
extern "C" void fillArray(int *array, int arraySize)
{
int * array_d;
cudaError_t result;
/* cudaMalloc allocates space in GPU memory */
result =
cudaMalloc((void**)&array_d,sizeof(int)*arraySize);
/* copy the CPU array into the GPU array_d */
result = cudaMemcpy(array_d,array,sizeof(int)*arraySize,
cudaMemcpyHostToDevice);
fillArray (part 2)
/* Indicate block size */
dim3 dimblock(BLOCK_SIZE);
/* Indicate grid size */
dim3 dimgrid(arraySize / BLOCK_SIZE);
/* Call the kernel */
cu_fillArray<<<dimgrid,dimblock>>>(array_d);
/* Copy the results from GPU back to CPU memory */
result =
cudaMemcpy(array,array_d,sizeof(int)*arraySize,cudaMemcpyDeviceToHost);
/* Release the GPU memory */
cudaFree(array_d);
}
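The calls above store their return codes in result but never test them; a minimal sketch (an addition, not part of the original code, assuming <stdio.h> and <stdlib.h> are included) of how those codes could be checked:

/* hypothetical helper: print a message and abort if a CUDA call failed */
static void checkResult(cudaError_t result, const char *msg)
{
    if (result != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", msg, cudaGetErrorString(result));
        exit(EXIT_FAILURE);
    }
}

/* usage inside fillArray, for example:
   checkResult(cudaMalloc((void**)&array_d, sizeof(int)*arraySize), "cudaMalloc");
   checkResult(cudaMemcpy(array_d, array, sizeof(int)*arraySize,
                          cudaMemcpyHostToDevice), "cudaMemcpy to device");   */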
simple.cu (cont.)
Ø The other function in simple.cu is cu_fillArray():
l This is the GPU kernel
l Identified by the keyword: __global__
l Built-in variables:
• blockIdx.x : block index within the grid
• threadIdx.x: thread index within the block
cu_fillArray
__global__ void cu_fillArray(int * array_d)
{
int x;
x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
array_d[x] = x;
}
__global__ void cu_addIntegers(int * array_d1, int * array_d2)
{
int x;
x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
array_d1[x] += array_d2[x];
}
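cu_addIntegers is shown without a host-side caller; a minimal sketch (an addition to the slides, with hypothetical names a, b, a_d, b_d) of launching it, following the same pattern as fillArray:

extern "C" void addIntegers(int *a, int *b, int arraySize)
{
    int *a_d, *b_d;
    /* allocate both device arrays and copy the inputs over */
    cudaMalloc((void**)&a_d, sizeof(int)*arraySize);
    cudaMalloc((void**)&b_d, sizeof(int)*arraySize);
    cudaMemcpy(a_d, a, sizeof(int)*arraySize, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b, sizeof(int)*arraySize, cudaMemcpyHostToDevice);
    /* one thread per element; assumes arraySize is a multiple of BLOCK_SIZE */
    cu_addIntegers<<<arraySize / BLOCK_SIZE, BLOCK_SIZE>>>(a_d, b_d);
    /* a_d now holds the element-wise sums; copy it back and free both arrays */
    cudaMemcpy(a, a_d, sizeof(int)*arraySize, cudaMemcpyDeviceToHost);
    cudaFree(a_d);
    cudaFree(b_d);
}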
To compile:
Ø nvcc simple.c simple.cu -o simple
Ø The compiler generates the code for both
the host and the GPU
Ø Demo on cuda.littlefe.net …
In the GPU:
[Diagram: the processing elements run threads 0–3 of Block 0 and threads 0–3 of Block 1; each thread maps to one array element.]
Another Example: saxpy
Ø SAXPY (Scalar Alpha X Plus Y)
l A common operation in linear algebra
Ø CUDA: loop iteration → thread
Traditional Sequential Code
void saxpy_serial(int n,
float alpha,
float *x,
float *y)
{
for(int i = 0;i < n;i++)
y[i] = alpha*x[i] + y[i];
}
CUDA Code
__global__ void saxpy_parallel(int n,
float alpha,
float *x,
float *y) {
int i = blockIdx.x*blockDim.x+threadIdx.x;
if (i<n)
y[i] = alpha*x[i] + y[i];
}
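The slides do not show the host-side launch; a minimal sketch (an addition, assuming device pointers x_d and y_d have already been allocated and filled) that uses ceiling division so the last, partially full block still covers the tail of the arrays:

int blockSize = 256;                             /* threads per block */
int gridSize  = (n + blockSize - 1) / blockSize; /* rounds up so all n elements get a thread */
saxpy_parallel<<<gridSize, blockSize>>>(n, alpha, x_d, y_d);

The if (i < n) guard in the kernel makes the extra threads of the final block do nothing.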
“Warps”
Ø Each block is split into SIMD groups of threads
called "warps".
Ø Each warp contains the same number of threads,
called the "warp size" (32 on NVIDIA GPUs)
[Diagram: Blocks 1–4, each split into warps 1–3 of threads, all scheduled onto Multi-processor 1.]
Keeping multiprocessors in mind…
Ø Each multiprocessor can process multiple blocks at a
time.
Ø How many depends on the number of registers per
thread and how much shared memory per block is
required by a given kernel.
Ø If a block is too large, it will not fit into the resources of
an MP.
Performance Tip: Block Size
Ø Critical for performance
Ø Recommended value is 192 or 256
Ø Maximum value is 512
Ø Should be a multiple of 32 since this is the warp
size for Series 8 GPUs and thus the native
execution size for multiprocessors
Ø Limited by number of registers on the MP
Ø Series 8 GPU MPs have 8192 registers which
are shared between all the threads on an MP
Performance Tip:
Grid Size (number of blocks)
Ø Recommended value is at least 100, but 1000 would
scale for many generations of hardware
Ø Actual value depends on problem size
Ø It should be a multiple of the number of MPs for an even
distribution of work (not a requirement though)
Ø Example: 24 blocks
l Grid will work efficiently on Series 8 (12 MPs), but it will waste
resources on newer GPUs with 32 MPs
Example: Tesla P100
Ø Launched in 2016
Ø “Pascal” architecture (successors: Volta, Turing)
Ø Double-precision performance: 4.7 TeraFLOPS
Ø Single-precision performance: 9.3 TeraFLOPS
Ø GPU Memory: 16 GB
Example: Tesla P100
Ø Number of Multiprocessors (MPs): 56
Ø Number of Cuda Cores per MP: 64
Ø Total number of Cuda Cores: 3584
Ø #CUDA cores = number of floating-point
instructions that can be processed per cycle
Ø MPs can run multiple threads per core
simultaneously (similar to hyperthreading on CPU)
Ø Hence, #threads can be larger than #cores
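These figures can be queried at run time; a minimal sketch (an addition to the slides) using cudaGetDeviceProperties, compiled with nvcc:

#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* properties of device 0 */
    printf("Name: %s\n", prop.name);
    printf("Multiprocessors (MPs): %d\n", prop.multiProcessorCount);
    printf("Warp size: %d\n", prop.warpSize);
    printf("Global memory: %zu bytes\n", prop.totalGlobalMem);
    return 0;
}

The number of CUDA cores per MP is not reported directly; it depends on the architecture (64 for the P100's MPs, as above).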
Memory Alignment
Ø Memory access is faster if data is aligned at
64-byte boundaries
Ø Hence, allocate 2D arrays so that every
row starts at a 64-byte boundary
Ø Tedious for a programmer
Allocating 2D arrays with “pitch”
Ø CUDA offers special versions of:
l Memory allocation of 2D arrays so that every row
is padded (if necessary): cudaMallocPitch()
l Memory copy operations that take into account the
pitch: cudaMemcpy2D()
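A minimal sketch (an addition to the slides) of allocating a pitched 2D array of floats and copying a dense host array into it; rows, cols, and hostArray are hypothetical:

float *devPtr;
size_t pitch;   /* width in bytes of one padded row, returned by cudaMallocPitch */
/* allocate rows x cols floats; every row starts at a properly aligned address */
cudaMallocPitch((void**)&devPtr, &pitch, cols * sizeof(float), rows);
/* copy the dense host array into the padded device array, row by row */
cudaMemcpy2D(devPtr, pitch,
             hostArray, cols * sizeof(float),   /* source pitch = unpadded row width */
             cols * sizeof(float), rows,
             cudaMemcpyHostToDevice);

Inside a kernel, row r of the device array then starts at (char*)devPtr + r * pitch.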
Pitch
[Diagram: a 2D array of rows × columns; each row is followed by padding so that its allocated width (the pitch) keeps every row aligned.]
Dividing the work by blocks:
[Diagram: the rows of the pitched 2D array are assigned to Block 0, Block 1, and Block 2.]
Watchdog timer
Ø The OS may force programs using the GPU to time out
if they run too long
Ø Exceeding the limit can cause CUDA program
failure.
Ø Possible solution: run CUDA on a GPU that is NOT
attached to a display.
Resources on line
Ø http://www.acmqueue.org/modules.php?name=
Content&pa=showpage&pid=532
Ø http://www.ddj.com/hpc-high-performance-
computing/207200659
Ø http://www.nvidia.com/object/cuda_home.html#
Ø http://www.nvidia.com/object/cuda_learn.html
Ø “Computation of Voronoi diagrams using a
graphics processing unit” by Igor Majdandzic et
al. available through IEEE Digital Library, DOI:
10.1109/EIT.2008.4554342