CUDA SHARED MEMORY
NVIDIA Corporation
REVIEW (1 OF 2)
Difference between host and device
Host CPU
Device GPU
Using __global__ to declare a function as device code
Executes on the device
Called from the host (or possibly from other device code)
Passing parameters from host code to a device function
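As a refresher, the pattern these bullets describe looks like the vector-add kernel from the earlier session (a minimal sketch; the pointer parameters are passed in from host code):

__global__ void add(int *a, int *b, int *c) {
    // Executes on the device; with add<<<N,1>>>, each block
    // handles one element, indexed by blockIdx.x
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}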
REVIEW (2 OF 2)
Basic device memory management
cudaMalloc()
cudaMemcpy()
cudaFree()
Launching parallel kernels
Launch N copies of add() with add<<<N,1>>>(…);
Use blockIdx.x to access block index
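A minimal host-side sketch combining these review items (assumes the add() kernel above; initialization and error checking omitted):

#include <stdlib.h>

int main(void) {
    int N = 512;
    int size = N * sizeof(int);
    int *a, *b, *c, *d_a, *d_b, *d_c;

    // Host allocations
    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);

    // Device allocations
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy inputs to the device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch N copies of add()
    add<<<N, 1>>>(d_a, d_b, d_c);

    // Copy the result back, then free
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}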
1D STENCIL
Consider applying a 1D stencil to a 1D array of elements
Each output element is the sum of input elements within a radius
If radius is 3, then each output element is the sum of 7 input elements:
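For reference, a serial CPU version of this stencil (a sketch; stencil_1d_cpu is an illustrative name, RADIUS is a compile-time constant such as 3, and in is assumed to point RADIUS elements into a padded array so the reads at the ends stay in bounds):

void stencil_1d_cpu(const int *in, int *out, int n) {
    for (int i = 0; i < n; i++) {
        // Sum the 2 * RADIUS + 1 input elements centered on i
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += in[i + offset];
        out[i] = result;
    }
}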
IMPLEMENTING WITHIN A BLOCK
Each thread processes one output element
blockDim.x elements per block
Input elements are read several times
With radius 3, each input element is read seven times
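A naive kernel that keeps everything in global memory makes the redundancy concrete (a sketch; stencil_1d_naive is an illustrative name): each thread issues 2 * RADIUS + 1 global loads, so neighboring threads fetch the same input element over and over.

__global__ void stencil_1d_naive(int *in, int *out) {
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int result = 0;
    // Every iteration reads global memory; adjacent threads
    // re-read the same input elements
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += in[gindex + offset];
    out[gindex] = result;
}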
SHARING DATA BETWEEN THREADS
Terminology: within a block, threads share data via shared memory
Extremely fast on-chip memory, user-managed
Declare using __shared__, allocated per block
Data is not visible to threads in other blocks
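Two ways to make the declaration (a sketch; the kernel names are illustrative). A statically sized array is declared inside the kernel; a dynamically sized one uses extern and takes its size from the third launch-configuration parameter.

__global__ void kernel_static(void) {
    // Size fixed at compile time, allocated per block
    __shared__ int s_data[256];
}

__global__ void kernel_dynamic(void) {
    // Size set at launch: kernel_dynamic<<<grid, block, nbytes>>>();
    extern __shared__ int s_data[];
}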
IMPLEMENTING WITH SHARED MEMORY
Cache data in shared memory
Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
Compute blockDim.x output elements
Write blockDim.x output elements to global memory
Each block needs a halo of radius elements at each boundary
[Figure: shared memory for one block; a halo of RADIUS elements on the left and right of the blockDim.x output elements]
STENCIL KERNEL
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
STENCIL KERNEL
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
DATA RACE!
The stencil example will not work…
Suppose thread 15 reads the halo before thread 0 has fetched it (for concreteness, assume BLOCK_SIZE = 16 and RADIUS = 3):

temp[lindex] = in[gindex];                        // Thread 15: store at temp[18]
if (threadIdx.x < RADIUS) {                       // Thread 15: skipped, since threadIdx.x >= RADIUS
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
result += temp[lindex + 1];                       // Thread 15: load from temp[19],
                                                  // which thread 0 may not have written yet
__SYNCTHREADS()
void __syncthreads();
Synchronizes all threads within a block
Used to prevent RAW / WAR / WAW hazards
All threads must reach the barrier
In conditional code, the condition must be uniform across the block
STENCIL KERNEL
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();
STENCIL KERNEL
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
REVIEW
Use __shared__ to declare a variable/array in shared memory
Data is shared between threads in a block
Not visible to threads in other blocks
Use __syncthreads() as a barrier
Use to prevent data hazards
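Putting the pieces together, one possible host-side driver for stencil_1d (a sketch, not the exercise solution; the in/out pointers are offset by RADIUS so the halo reads at the array ends stay in bounds):

#include <stdlib.h>

#define RADIUS 3
#define BLOCK_SIZE 16
#define N (30 * BLOCK_SIZE)

int main(void) {
    int *in, *out, *d_in, *d_out;
    int size = (N + 2 * RADIUS) * sizeof(int);

    // Host arrays, padded with a halo of RADIUS at each end
    in  = (int *)malloc(size);
    out = (int *)malloc(size);
    for (int i = 0; i < N + 2 * RADIUS; i++) in[i] = 1;

    cudaMalloc((void **)&d_in, size);
    cudaMalloc((void **)&d_out, size);
    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);

    // Offset by RADIUS so in[gindex - RADIUS] is always valid
    stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);
    // With all-ones input, each output element should be 2 * RADIUS + 1
    cudaFree(d_in); cudaFree(d_out);
    free(in); free(out);
    return 0;
}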
LOOKING FORWARD
Cooperative Groups: a flexible model for synchronization and communication within groups of threads.

At a glance:
Scalable cooperation among groups of threads
Flexible parallel decompositions
Composition across software boundaries
Deploy everywhere

Benefits all applications; examples include:
Persistent RNNs
Physics
Search algorithms
Sorting
FOR EXAMPLE: THREAD BLOCK
Implicit group of all the threads in the launched thread block
Implements the same interface as thread_group:
void sync(); // Synchronize the threads in the group
unsigned size(); // Total number of threads in the group
unsigned thread_rank(); // Rank of the calling thread within [0, size)
bool is_valid(); // True if the group has not violated any API constraints
And additional thread_block specific functions:
dim3 group_index(); // 3-dimensional block index within the grid
dim3 thread_index(); // 3-dimensional thread index within the block
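As an illustration, the stencil kernel's barrier can be written against this interface (a sketch assuming CUDA 9 or later; stencil_1d_cg is an illustrative name, and BLOCK_SIZE/RADIUS are defined as before):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void stencil_1d_cg(int *in, int *out) {
    cg::thread_block block = cg::this_thread_block();
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = block.group_index().x * block.size() + block.thread_rank();
    int lindex = block.thread_rank() + RADIUS;

    temp[lindex] = in[gindex];
    if (block.thread_rank() < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    block.sync();  // same effect as __syncthreads() for the whole block

    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    out[gindex] = result;
}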
NARROWING THE SHARED MEMORY GAP
with the GV100 L1 cache

Cache vs. shared:
Easier to use
90%+ as good

Shared vs. cache:
Faster atomics
More banks
More predictable

[Chart: directed testing, shared vs. global access; the L1 cache averages 70% of shared memory performance on Pascal vs. 93% on Volta]
FUTURE SESSIONS
CUDA GPU architecture and basic optimizations
Atomics, Reductions, Warp Shuffle
Using Managed Memory
Concurrency (streams, copy/compute overlap, multi-GPU)
Analysis Driven Optimization
Cooperative Groups
FURTHER STUDY
Shared memory:
https://devblogs.nvidia.com/using-shared-memory-cuda-cc/
CUDA Programming Guide:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory
CUDA Documentation:
https://docs.nvidia.com/cuda/index.html
https://docs.nvidia.com/cuda/cuda-runtime-api/index.html (runtime API)
HOMEWORK
Log into Summit (ssh username@home.ccs.ornl.gov -> ssh summit)
Clone GitHub repository:
git clone git@github.com:olcf/cuda-training-series.git
Follow the instructions in the readme.md file:
https://github.com/olcf/cuda-training-series/blob/master/exercises/hw2/readme.md
Prerequisites: basic Linux skills (e.g. ls, cd), knowledge of a text editor like vi/emacs, and some knowledge of C/C++ programming
QUESTIONS?