GPU Teaching Kit
Accelerated Computing
Lecture 2.1 - Introduction to CUDA C
CUDA C vs. Thrust vs. CUDA Libraries
Objective
– To learn the main venues and developer resources
for GPU computing
– Where CUDA C fits in the big picture
2
3 Ways to Accelerate Applications
Applications
Compiler Programming
Libraries
Directives Languages
Easy to use Easy to use Most Performance
Most Performance Portable code Most Flexibility
3
Libraries: Easy, High-Quality Acceleration
Ease of use: Using libraries enables GPU acceleration without in-
depth knowledge of GPU programming
“Drop-in”: Many GPU-accelerated libraries follow standard APIs,
thus enabling acceleration with minimal code changes
Quality: Libraries offer high-quality implementations of functions
encountered in a broad range of applications
4
GPU Accelerated Libraries
Linear Algebra NVIDIA
cuFFT,
FFT, BLAS, cuBLAS,
SPARSE, Matrix cuSPARSE
Numerical & Math NVIDIA
Math
NVIDIA
cuRAND
RAND, Statistics Lib
Data Struct. & AI GPU AI –
Board
GPU AI –
Path
Sort, Scan, Zero Sum Games Finding
NVIDIA
Visual Processing Video
NVIDIA Encode
Image & Video NPP
5
Vector Addition in Thrust
thrust::device_vector<float> deviceInput1(inputLength);
thrust::device_vector<float> deviceInput2(inputLength);
thrust::device_vector<float> deviceOutput(inputLength);
thrust::copy(hostInput1, hostInput1 + inputLength,
deviceInput1.begin());
thrust::copy(hostInput2, hostInput2 + inputLength,
deviceInput2.begin());
thrust::transform(deviceInput1.begin(), deviceInput1.end(),
deviceInput2.begin(), deviceOutput.begin(),
thrust::plus<float>());
6
Compiler Directives: Easy, Portable
Acceleration
Ease of use: Compiler takes care of details of parallelism
management and data movement
Portable: The code is generic, not specific to any type of hardware
and can be deployed into multiple languages
Uncertain: Performance of code can vary across compiler versions
7
OpenACC
– Compiler directives for C, C++, and FORTRAN
#pragma acc parallel loop
copyin(input1[0:inputLength],input2[0:inputLength]),
copyout(output[0:inputLength])
for(i = 0; i < inputLength; ++i) {
output[i] = input1[i] + input2[i];
}
8
Programming Languages: Most Performance and
Flexible Acceleration
Performance: Programmer has best control of parallelism and
data movement
Flexible: The computation does not need to fit into a limited set of
library patterns or directive types
Verbose: The programmer often needs to express more details
9
GPU Programming Languages
Numerical analytics MATLAB, Mathematica, LabVIEW
Fortran CUDA Fortran
C CUDA C
C++ CUDA C++
Python PyCUDA, Copperhead, Numba
F# Alea.cuBase
10
CUDA - C
Applications
Compiler Programming
Libraries
Directives Languages
Easy to use Easy to use Most Performance
Most Performance Portable code Most Flexibility
11
GPU Teaching Kit
Accelerated Computing
Lecture 2.2 - Introduction to CUDA C
Memory Allocation and Data Movement API Functions
Objective
– To learn the basic API functions in CUDA host code
– Device Memory Allocation
– Host-Device Data Transfer
2
Vector Addition – Traditional C Code
// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
int i;
for (i = 0; i<n; i++) h_C[i] = h_A[i] + h_B[i];
}
int main()
{
// Memory allocation for h_A, h_B, and h_C
// I/O to read h_A and h_B, N elements
…
vecAdd(h_A, h_B, h_C, N);
}
4
4
Heterogeneous Computing vecAdd CUDA Host Code
Part 1
Device Memory #include <cuda.h>
Host Memory Part 2
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
CPU GPU
int size = n* sizeof(float);
float *d_A, *d_B, *d_C;
// Part 1
Part 3 // Allocate device memory for A, B, and C
// copy A and B to device memory
// Part 2
// Kernel launch code – the device performs the actual vector addition
// Part 3
// copy C from the device memory
// Free device vectors
}
5
Partial Overview of CUDA Memories
– Device code can:
(Device) Grid – R/W per-thread registers
Block (0, 0) Block (0, 1) – R/W all-shared global
Registers Registers Registers Registers
memory
Thread (0, 0) Thread (0, 1) Thread (0, 0) Thread (0, 1)
– Host code can
Host
Global
– Transfer data to/from per
Memory grid global memory
We will cover more memory types and more
sophisticated memory models later.
6
CUDA Device Memory Management API functions
– cudaMalloc()
(Device) Grid – Allocates an object in the device
global memory
Block (0, 0) Block (0, 1)
– Two parameters
Registers Registers Registers Registers – Address of a pointer to the
Thread (0, 0) Thread (0, 1) Thread (0, 0) Thread (0, 1)
allocated object
– Size of allocated object in terms
Host of bytes
Global
Memory – cudaFree()
– Frees object from device global
memory
– One parameter
– Pointer to freed object
7
Host-Device Data Transfer API functions
– cudaMemcpy()
(Device) Grid – memory data transfer
Block (0, 0) Block (0, 1) – Requires four parameters
– Pointer to destination
Registers Registers Registers Registers
– Pointer to source
Thread (0, 0) Thread (0, 1) Thread (0, 0) Thread (0, 1)
– Number of bytes copied
– Type/Direction of transfer
Host
Global
Memory
– Transfer to device is asynchronous
8
Vector Addition Host Code
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
int size = n * sizeof(float); float *d_A, *d_B, *d_C;
cudaMalloc((void **) &d_A, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMalloc((void **) &d_B, size);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
cudaMalloc((void **) &d_C, size);
// Kernel invocation code – to be shown later
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
cudaFree(d_A); cudaFree(d_B); cudaFree (d_C);
}
9
In Practice, Check for API Errors in Host Code
cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess) {
printf(“%s in %s at line %d\n”, cudaGetErrorString(err), __FILE__,
__LINE__);
exit(EXIT_FAILURE);
}
10
10
GPU Teaching Kit
Accelerated Computing
Lecture 2.3 – Introduction to CUDA C
Threads and Kernel Functions
Objective
– To learn about CUDA threads, the main mechanism for exploiting of
data parallelism
– Hierarchical thread organization
– Launching parallel execution
– Thread index to data index mapping
2
Data Parallelism - Vector Addition Example
vector A A[0] A[1] A[2] … A[N-1]
vector B B[0] B[1] B[2] … B[N-1]
+ + + +
vector C C[0] C[1] C[2] … C[N-1]
3
CUDA Execution Model
– Heterogeneous host (CPU) + device (GPU) application C program
– Serial parts in host C code
– Parallel parts in device SPMD kernel code
Serial Code (host)
Parallel Kernel (device)
KernelA<<< nBlk, nTid >>>(args); ...
Serial Code (host)
Parallel Kernel (device)
KernelB<<< nBlk, nTid >>>(args); ...
4
From Natural Language to Electrons
Natural Language (e.g, English)
Algorithm
High-Level Language (C/C++…)
Compiler
Instruction Set Architecture
Microarchitecture
Circuits
Electrons
©Yale Patt and Sanjay Patel, From bits and bytes to gates and beyond
5
A program at the ISA level
– A program is a set of instructions stored in memory that can be read,
interpreted, and executed by the hardware.
– Both CPUs and GPUs are designed based on (different) instruction sets
– Program instructions operate on data stored in memory and/or
registers.
6
A Thread as a Von-Neumann Processor
A thread is a “virtualized” or
“abstracted”
Von-Neumann Processor
Memory
I/O
Processing Unit
Reg
ALU File
Control Unit
PC IR
7
Arrays of Parallel Threads
• A CUDA kernel is executed by a grid (array) of threads
– All threads in a grid run the same kernel code (Single Program Multiple Data)
– Each thread has indexes that it uses to compute memory addresses and make
control decisions
0 1 2 254 255
…
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
8
Thread Blocks: Scalable Cooperation
Thread Block 0 Thread Block 1 Thread Block N-1
0 1 2 254 255 0 1 2 254 255 0 1 2 254 255
… … …
i = blockIdx.x * blockDim.x + i = blockIdx.x * blockDim.x + i = blockIdx.x * blockDim.x +
…
threadIdx.x; threadIdx.x; threadIdx.x;
C[i] = A[i] + B[i]; C[i] = A[i] + B[i]; C[i] = A[i] + B[i];
… … …
– Divide thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations and
barrier synchronization
– Threads in different blocks do not interact
9
blockIdx and threadIdx
• Each thread uses indices to decide what data to work
on
– blockIdx: 1D, 2D, or 3D (CUDA 4.0)
– threadIdx: 1D, 2D, or 3D
device
• Simplifies memory Grid Block (0, Block (0,
addressing when processing 0) 1)
multidimensional data Block (1, Block (1,
– Image processing 0) 1)
– Solving PDEs on volumes
– … Block (1,1)
(1,0,0) (1,0,1) (1,0,2) (1,0,3)
Thread Thread Thread 10 Thread
(0,0,0) (0,0,1) (0,0,2) (0,0,3)
Thread
Thread Thread Thread (0,0,0)
Thread
(0,1,0) (0,1,1) (0,1,2) (0,1,3)
10
GPU Teaching Kit
Accelerated Computing
Lecture 2.4 – Introduction to CUDA C
Introduction to the CUDA Toolkit
Objective
– To become familiar with some valuable tools and resources from the
CUDA Toolkit
– Compiler flags
– Debuggers
– Profilers
2
GPU Programming Languages
Numerical analytics MATLAB, Mathematica, LabVIEW
Fortran CUDA Fortran
C CUDA C
C++ CUDA C++
Python PyCUDA, Copperhead, Numba, NumbaPro
F# Alea.cuBase
3
CUDA - C
Applications
Compiler Programming
Libraries
Directives Languages
Easy to use Easy to use Most Performance
Most Performance Portable code Most Flexibility
4
NVCC Compiler
– NVIDIA provides a CUDA-C compiler
– nvcc
– NVCC compiles device code then forwards code on to the host
compiler (e.g. g++)
– Can be used to compile & link host only applications
5
Example 1: Hello World
int main() {
printf("Hello World!\n");
return 0;
}
Instructions:
1. Build and run the hello world code
2. Modify Makefile to use nvcc
instead of g++
3. Rebuild and run
6
CUDA Example 1: Hello World
__global__ void mykernel(void) {
}
int main(void) {
mykernel<<<1,1>>>();
printf("Hello World!\n");
return 0;
}
Instructions:
1. Add kernel and kernel launch to
main.cu
2. Try to build
7
CUDA Example 1: Build Considerations
– Build failed
– Nvcc only parses .cu files for CUDA
– Fixes:
– Rename main.cc to main.cu
OR
– nvcc –x cu
– Treat all input files as .cu files
Instructions:
1. Rename main.cc to main.cu
2. Rebuild and Run
8
Hello World! with Device Code
__global__ void mykernel(void) {
}
int main(void) {
mykernel<<<1,1>>>();
printf("Hello World!\n");
return 0;
}
Output:
$ nvcc main.cu
$ ./a.out
Hello World!
– mykernel(does nothing, somewhat anticlimactic!)
9
Developer Tools - Debuggers
NSIGHT CUDA-GDB CUDA MEMCHECK
NVIDIA Provided
3rd Party
https://developer.nvidia.com/debugging-solutions
10
Compiler Flags
– Remember there are two compilers being used
– NVCC: Device code
– Host Compiler: C/C++ code
– NVCC supports some host compiler flags
– If flag is unsupported, use –Xcompiler to forward to host
– e.g. –Xcompiler –fopenmp
– Debugging Flags
– -g: Include host debugging symbols
– -G: Include device debugging symbols
– -lineinfo: Include line information with symbols
11
CUDA-MEMCHECK
– Memory debugging tool
– No recompilation necessary
%> cuda-memcheck ./exe
– Can detect the following errors
– Memory leaks
– Memory errors (OOB, misaligned access, illegal instruction, etc)
– Race conditions
– Illegal Barriers
– Uninitialized Memory
– For line numbers use the following compiler flags:
– -Xcompiler -rdynamic -lineinfo
http://docs.nvidia.com/cuda/cuda-memcheck
12
Example 2: CUDA-MEMCHECK
Instructions:
1. Build & Run Example 2
Output should be the numbers 0-9
Do you get the correct results?
2. Run with cuda-memcheck
%> cuda-memcheck ./a.out
3. Add nvcc flags “–Xcompiler –
rdynamic –lineinfo”
4. Rebuild & Run with cuda-memcheck
5. Fix the illegal write
http://docs.nvidia.com/cuda/cuda-memcheck
13
CUDA-GDB
– cuda-gdb is an extension of GDB
– Provides seamless debugging of CUDA and CPU code
– Works on Linux and Macintosh
– For a Windows debugger use NSIGHT Visual Studio Edition
http://docs.nvidia.com/cuda/cuda-gdb
14
Example 3: cuda-gdb
Instructions:
1. Run exercise 3 in cuda-gdb
%> cuda-gdb --args ./a.out
2. Run a few cuda-gdb commands:
(cuda-gdb) b main //set break point at main
(cuda-gdb) r //run application
(cuda-gdb) l //print line context
(cuda-gdb) b foo //break at kernel foo
(cuda-gdb) c //continue
(cuda-gdb) cuda thread //print current thread
(cuda-gdb) cuda thread 10 //switch to thread 10
(cuda-gdb) cuda block //print current block
(cuda-gdb) cuda block 1 //switch to block 1
(cuda-gdb) d //delete all break points
(cuda-gdb) set cuda memcheck on //turn on cuda memcheck
(cuda-gdb) r //run from the beginning
3. Fix Bug
http://docs.nvidia.com/cuda/cuda-gdb
15
Developer Tools - Profilers
NSIGHT NVVP NVPROF
NVIDIA Provided
TAU VampirTrace
3rd Party
https://developer.nvidia.com/performance-analysis-tools
16
NVPROF
Command Line Profiler
– Compute time in each kernel
– Compute memory transfer time
– Collect metrics and events
– Support complex process hierarchy's
– Collect profiles for NVIDIA Visual Profiler
– No need to recompile
17
Example 4: nvprof
Instructions:
1. Collect profile information for the matrix add
example
%> nvprof ./a.out
2. How much faster is add_v2 than add_v1?
3. View available metrics
%> nvprof --query-metrics
4. View global load/store efficiency
%> nvprof --metrics
gld_efficiency,gst_efficiency ./a.out
5. Store a timeline to load in NVVP
%> nvprof –o profile.timeline ./a.out
6. Store analysis metrics to load in NVVP
%> nvprof –o profile.metrics --analysis-metrics
./a.out
18
NVIDIA’s Visual Profiler (NVVP)
Timeline
Guided
System Analysis
19
Example 4: NVVP
Instructions:
1. Import nvprof profile into NVVP
Launch nvvp
Click File/ Import/ Nvprof/ Next/ Single
process/ Next / Browse
Select profile.timeline
Add Metrics to timeline
Click on 2nd Browse
Select profile.metrics
Click Finish
2. Explore Timeline
Control + mouse drag in timeline to zoom in
Control + mouse drag in measure bar (on top)
to measure time
20
Example 4: NVVP
Instructions:
1. Click on a kernel
2. On Analysis tab click on the unguided analysis
2. Click Analyze All
Explore metrics and properties
What differences do you see between the two
kernels?
Note:
If kernel order is non-deterministic you can only load the timeline or the metrics
but not both.
If you load just metrics the timeline looks odd but metrics are correct.
21
Example 4: NVVP
Let’s now generate the same data within NVVP
Instructions:
1. Click File / New Session / Browse
Select Example 4/a.out
Click Next / Finish
2. Click on a kernel
Select Unguided Analysis
Click Analyze All
22
NVTX
– Our current tools only profile API calls on the host
– What if we want to understand better what the host is doing?
– The NVTX library allows us to annotate profiles with ranges
– Add: #include <nvToolsExt.h>
– Link with: -lnvToolsExt
– Mark the start of a range
– nvtxRangePushA(“description”);
– Mark the end of a range
– nvtxRangePop();
– Ranges are allowed to overlap
http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-generate-custom-application-profile-timelines-nvtx/
23
NVTX Profile
24
NSIGHT
– CUDA enabled Integrated Development Environment
– Source code editor: syntax highlighting, code refactoring, etc
– Build Manger
– Visual Debugger
– Visual Profiler
– Linux/Macintosh
– Editor = Eclipse
– Debugger = cuda-gdb with a visual wrapper
– Profiler = NVVP
– Windows
– Integrates directly into Visual Studio
– Profiler is NSIGHT VSE
25
Example 4: NSIGHT
Let’s import an existing Makefile project into NSIGHT
Instructions:
1. Run nsight
Select default workspace
2. Click File / New / Makefile Project With
Existing CodeTest
3. Enter Project Name and select the Example15
directory
4. Click Finish
5. Right Click On Project / Properties / Run
Settings / New / C++ Application
6. Browse for Example 4/a.out
7. In Project Explorer double click on main.cu and
explore source
8. Click on the build icon
9. Click on the run icon
10.Click on the profile icon
26
Profiler Summary
– Many profile tools are available
– NVIDIA Provided
– NVPROF: Command Line
– NVVP: Visual profiler
– NSIGHT: IDE (Visual Studio and Eclipse)
– 3rd Party
– TAU
– VAMPIR
27
Optimization
Assess
Deploy Parallelize
Optimize
28
Assess
HOTSPOTS
– Profile the code, find the hotspot(s)
– Focus your attention where it will give the most benefit
29
Parallelize
Applications
Compiler Programming
Libraries
Directives Languages
30
Optimize
Timeline
Guided
System Analysis
31
Bottleneck Analysis
– Don’t assume an optimization was wrong
– Verify if it was wrong with the profiler
129 GB/s 84 GB/s
32
Performance Analysis
84 GB/s 137 GB/s
33
GPU Teaching Kit
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under
the Creative Commons Attribution-NonCommercial 4.0 International License.