CUDA SHARED MEMORY
NVIDIA Corporation
REVIEW (1 OF 2)
Difference between host and device
Host CPU
Device GPU
Using __global__ to declare a function as device code
Executes on the device
Called from the host (or possibly from other device code)
Passing parameters from host code to a device function
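As a refresher, the pattern these bullets describe looks like the vector-add kernel from the earlier session (a minimal sketch; the pointer parameters are passed in from host code):

__global__ void add(int *a, int *b, int *c) {
    // Executes on the device; with add<<<N,1>>>, each block
    // handles one element, indexed by blockIdx.x
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}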
REVIEW (2 OF 2)
Basic device memory management
cudaMalloc()
cudaMemcpy()
cudaFree()
Launching parallel kernels
Launch N copies of add() with add<<<N,1>>>(…);
Use blockIdx.x to access block index
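A minimal host-side sketch combining these review items (assumes the add() kernel above; initialization and error checking omitted):

#include <stdlib.h>

int main(void) {
    int N = 512;
    int size = N * sizeof(int);
    int *a, *b, *c, *d_a, *d_b, *d_c;

    // Host allocations
    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);

    // Device allocations
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy inputs to the device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch N copies of add()
    add<<<N, 1>>>(d_a, d_b, d_c);

    // Copy the result back, then free
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}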
1D STENCIL
Consider applying a 1D stencil to a 1D array of elements
Each output element is the sum of input elements within a radius
If radius is 3, then each output element is the sum of 7 input elements:
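For reference, a serial CPU version of this stencil (a sketch; stencil_1d_cpu is an illustrative name, RADIUS is a compile-time constant such as 3, and in is assumed to point RADIUS elements into a padded array so the reads at the ends stay in bounds):

void stencil_1d_cpu(const int *in, int *out, int n) {
    for (int i = 0; i < n; i++) {
        // Sum the 2 * RADIUS + 1 input elements centered on i
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += in[i + offset];
        out[i] = result;
    }
}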
IMPLEMENTING WITHIN A BLOCK
Each thread processes one output element
blockDim.x elements per block
Input elements are read several times
With radius 3, each input element is read seven times
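A naive kernel that keeps everything in global memory makes the redundancy concrete (a sketch; stencil_1d_naive is an illustrative name): each thread issues 2 * RADIUS + 1 global loads, so neighboring threads fetch the same input element over and over.

__global__ void stencil_1d_naive(int *in, int *out) {
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int result = 0;
    // Every iteration reads global memory; adjacent threads
    // re-read the same input elements
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += in[gindex + offset];
    out[gindex] = result;
}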
SHARING DATA BETWEEN THREADS
Terminology: within a block, threads share data via shared memory
Extremely fast on-chip memory, user-managed
Declare using __shared__, allocated per block
Data is not visible to threads in other blocks
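Two ways to make the declaration (a sketch; the kernel names are illustrative). A statically sized array is declared inside the kernel; a dynamically sized one uses extern and takes its size from the third launch-configuration parameter.

__global__ void kernel_static(void) {
    // Size fixed at compile time, allocated per block
    __shared__ int s_data[256];
}

__global__ void kernel_dynamic(void) {
    // Size set at launch: kernel_dynamic<<<grid, block, nbytes>>>();
    extern __shared__ int s_data[];
}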
IMPLEMENTING WITH SHARED MEMORY
Cache data in shared memory
Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
Compute blockDim.x output elements
Write blockDim.x output elements to global memory
Each block needs a halo of radius elements at each boundary
[Figure: shared memory for one block; a halo of RADIUS elements on the left and right of the blockDim.x output elements]
STENCIL KERNEL
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
STENCIL KERNEL
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
DATA RACE!
The stencil example will not work…
Suppose thread 15 reads the halo before thread 0 has fetched it (for concreteness, assume BLOCK_SIZE = 16 and RADIUS = 3):

temp[lindex] = in[gindex];                        // Thread 15: store at temp[18]
if (threadIdx.x < RADIUS) {                       // Thread 15: skipped, since threadIdx.x >= RADIUS
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
result += temp[lindex + 1];                       // Thread 15: load from temp[19],
                                                  // which thread 0 may not have written yet
__SYNCTHREADS()
void __syncthreads();
Synchronizes all threads within a block
Used to prevent RAW / WAR / WAW hazards
All threads must reach the barrier
In conditional code, the condition must be uniform across the block
STENCIL KERNEL
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();
STENCIL KERNEL
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
REVIEW
Use __shared__ to declare a variable/array in shared memory
Data is shared between threads in a block
Not visible to threads in other blocks
Use __syncthreads() as a barrier
Use to prevent data hazards
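Putting the pieces together, one possible host-side driver for stencil_1d (a sketch, not the exercise solution; the in/out pointers are offset by RADIUS so the halo reads at the array ends stay in bounds):

#include <stdlib.h>

#define RADIUS 3
#define BLOCK_SIZE 16
#define N (30 * BLOCK_SIZE)

int main(void) {
    int *in, *out, *d_in, *d_out;
    int size = (N + 2 * RADIUS) * sizeof(int);

    // Host arrays, padded with a halo of RADIUS at each end
    in  = (int *)malloc(size);
    out = (int *)malloc(size);
    for (int i = 0; i < N + 2 * RADIUS; i++) in[i] = 1;

    cudaMalloc((void **)&d_in, size);
    cudaMalloc((void **)&d_out, size);
    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);

    // Offset by RADIUS so in[gindex - RADIUS] is always valid
    stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);
    // With all-ones input, each output element should be 2 * RADIUS + 1
    cudaFree(d_in); cudaFree(d_out);
    free(in); free(out);
    return 0;
}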
LOOKING FORWARD
Cooperative Groups: a flexible model for synchronization and communication within groups of threads.

At a glance:
Scalable cooperation among groups of threads
Flexible parallel decompositions
Composition across software boundaries
Deploy everywhere

Benefits all applications; examples include:
Persistent RNNs
Physics
Search algorithms
Sorting
FOR EXAMPLE: THREAD BLOCK
Implicit group of all the threads in the launched thread block
Implements the same interface as thread_group:
void sync(); // Synchronize the threads in the group
unsigned size(); // Total number of threads in the group
unsigned thread_rank(); // Rank of the calling thread within [0, size)
bool is_valid(); // True if the group has not violated any API constraints
And additional thread_block specific functions:
dim3 group_index(); // 3-dimensional block index within the grid
dim3 thread_index(); // 3-dimensional thread index within the block
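As an illustration, the stencil kernel's barrier can be written against this interface (a sketch assuming CUDA 9 or later; stencil_1d_cg is an illustrative name, and BLOCK_SIZE/RADIUS are defined as before):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void stencil_1d_cg(int *in, int *out) {
    cg::thread_block block = cg::this_thread_block();
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = block.group_index().x * block.size() + block.thread_rank();
    int lindex = block.thread_rank() + RADIUS;

    temp[lindex] = in[gindex];
    if (block.thread_rank() < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    block.sync();  // same effect as __syncthreads() for the whole block

    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    out[gindex] = result;
}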
NARROWING THE SHARED MEMORY GAP
with the GV100 L1 cache

Cache vs. shared:
Easier to use
90%+ as good

Shared vs. cache:
Faster atomics
More banks
More predictable

[Chart: directed testing, shared vs. global access; the L1 cache averages 70% of shared memory performance on Pascal vs. 93% on Volta]
FUTURE SESSIONS
CUDA GPU architecture and basic optimizations
Atomics, Reductions, Warp Shuffle
Using Managed Memory
Concurrency (streams, copy/compute overlap, multi-GPU)
Analysis Driven Optimization
Cooperative Groups
FURTHER STUDY
Shared memory:
https://devblogs.nvidia.com/using-shared-memory-cuda-cc/
CUDA Programming Guide:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory
CUDA Documentation:
https://docs.nvidia.com/cuda/index.html
https://docs.nvidia.com/cuda/cuda-runtime-api/index.html (runtime API)
HOMEWORK
Log into Summit (ssh username@home.ccs.ornl.gov -> ssh summit)
Clone GitHub repository:
git clone git@github.com:olcf/cuda-training-series.git
Follow the instructions in the readme.md file:
https://github.com/olcf/cuda-training-series/blob/master/exercises/hw2/readme.md
Prerequisites: basic Linux skills (e.g. ls, cd), knowledge of a text editor like vi/emacs, and some knowledge of C/C++ programming
QUESTIONS?