6 Computation

The document discusses CUDA programming concepts, focusing on kernel functions, thread divergence, and memory models. It includes code snippets illustrating thread execution and divergence issues, as well as an overview of different types of memory in CUDA, such as global, texture, and constant memory. The content emphasizes the importance of optimizing code to reduce thread divergence and improve performance in GPU computing.

Uploaded by

webbstu1

CUDA Programming

Recap
    __global__ void dkernel(unsigned *vector, unsigned vectorsize) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;   // S0
        if (id % 2) vector[id] = id;                       // S1
        else vector[id] = vectorsize * vectorsize;         // S2
        vector[id]++;                                      // S3
    }

    Thread:   0    1    2    3    4    5    6    7

    Time  |   S0   S0   S0   S0   S0   S0   S0   S0
          |   NOP  S1   NOP  S1   NOP  S1   NOP  S1
          |   S2   NOP  S2   NOP  S2   NOP  S2   NOP
          v   S3   S3   S3   S3   S3   S3   S3   S3

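The recap kernel can be run as-is to see what each thread writes. The following is a minimal sketch, assuming a single block and a CUDA-capable device; the host helper `run_recap` is a hypothetical name introduced here for illustration and is not part of the slides.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Recap kernel from the slide: odd ids execute S1, even ids execute S2.
__global__ void dkernel(unsigned *vector, unsigned vectorsize) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;   // S0
    if (id % 2) vector[id] = id;                       // S1
    else vector[id] = vectorsize * vectorsize;         // S2
    vector[id]++;                                      // S3
}

// Hypothetical host helper: launch one block of n threads and copy the result back.
// Returns an empty vector if a CUDA call fails.
std::vector<unsigned> run_recap(unsigned n) {
    unsigned *dvec = nullptr;
    std::vector<unsigned> host(n, 0);
    if (cudaMalloc(&dvec, n * sizeof(unsigned)) != cudaSuccess) return {};
    dkernel<<<1, n>>>(dvec, n);
    cudaError_t err = cudaMemcpy(host.data(), dvec, n * sizeof(unsigned),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dvec);
    if (err != cudaSuccess) return {};
    return host;
}
```

With `run_recap(8)`, even threads store 8 * 8 + 1 = 65 and odd threads store id + 1, so the result is 65, 2, 65, 4, 65, 6, 65, 8.
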
Classwork
● Rewrite the following program fragment to remove thread-divergence.

    assert(x == y || x == z || x == w);
    if (x == y) x = z + w;
    else if (x == z) x = w + y;
    else x = y + z;

Branch-free rewrite:

    assert(x == y || x == z || x == w);
    x = y + z + w - x;

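The rewrite works because x is known to equal one of y, z, or w, so subtracting x from the total leaves the sum of the other two. One way to check this on the device is sketched below; the function names, the `compare` kernel, and the host helper `count_mismatches` are hypothetical names introduced here, and the sketch assumes values small enough that `y + z + w` does not overflow.

```cuda
#include <cuda_runtime.h>

// Branching version from the slide.
__host__ __device__ int with_branches(int x, int y, int z, int w) {
    if (x == y) x = z + w;
    else if (x == z) x = w + y;
    else x = y + z;
    return x;
}

// Branch-free version from the slide; valid only when x equals y, z, or w.
__host__ __device__ int branch_free(int x, int y, int z, int w) {
    return y + z + w - x;
}

// Each thread picks x from {y, z, w} by its id and counts a mismatch
// if the two versions disagree.
__global__ void compare(int y, int z, int w, int *mismatches) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    int choices[3] = {y, z, w};
    int x = choices[id % 3];
    if (with_branches(x, y, z, w) != branch_free(x, y, z, w))
        atomicAdd(mismatches, 1);
}

// Hypothetical host helper: returns the mismatch count, or -1 on a CUDA error.
int count_mismatches(int y, int z, int w) {
    int *d = nullptr;
    int h = -1;
    if (cudaMalloc(&d, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d, 0, sizeof(int));
    compare<<<1, 32>>>(y, z, w, d);
    if (cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) h = -1;
    cudaFree(d);
    return h;
}
```

A call such as `count_mismatches(3, 5, 7)` should return 0, since the precondition of the assert holds for every thread.
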
Classwork
● How many steps do the warp threads take to execute?

    __global__ void dkernel(unsigned *vector, unsigned vectorsize) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id <= 0) {
            vector[id] = 0;
            for (int i = 1; i <= 100; i++) {
                vector[id] += i;
            }
        }
        else {
            vector[id] = 1;
        }
    }

Classwork
● How many steps do the warp threads take to execute?

    __global__ void dkernel(unsigned *vector, unsigned vectorsize) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id <= 0) {
            vector[id] = (101 * 100) / 2;
        }
        else {
            vector[id] = 1;
        }
    }

Classwork
● How many steps do the warp threads take to execute?

    __global__ void dkernel(unsigned *vector, unsigned vectorsize) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        vector[id] = (1 + ((-id) >> 31)) * (((101 * 100) / 2) - 1) + 1;
    }

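The three classwork kernels compute the same values: thread 0 stores the sum 1 + 2 + ... + 100 = 5050 and every other thread stores 1. The branch-free form relies on `(-id) >> 31` being 0 for id = 0 and -1 for positive id, which assumes a 32-bit `int` and an arithmetic right shift of negative values. A sketch that compares the three versions follows; the kernel names, the dropped `vectorsize` parameter, and the host helper `run_version` are choices made here for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Version 1: thread 0 loops 100 times while the other threads wait.
__global__ void k_loop(unsigned *vector) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id <= 0) {
        vector[id] = 0;
        for (int i = 1; i <= 100; i++) vector[id] += i;
    } else {
        vector[id] = 1;
    }
}

// Version 2: the loop is replaced by the closed form (101 * 100) / 2.
__global__ void k_closed(unsigned *vector) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id <= 0) vector[id] = (101 * 100) / 2;
    else vector[id] = 1;
}

// Version 3: no branch; (-id) >> 31 is 0 for id == 0 and -1 for id > 0
// (assumes 32-bit int and arithmetic right shift).
__global__ void k_branch_free(unsigned *vector) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    vector[id] = (1 + ((-id) >> 31)) * (((101 * 100) / 2) - 1) + 1;
}

// Hypothetical host helper: run version 1, 2, or 3 with one block of n threads.
// Returns an empty vector if a CUDA call fails.
std::vector<unsigned> run_version(int which, unsigned n) {
    unsigned *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(unsigned)) != cudaSuccess) return {};
    if (which == 1) k_loop<<<1, n>>>(d);
    else if (which == 2) k_closed<<<1, n>>>(d);
    else k_branch_free<<<1, n>>>(d);
    std::vector<unsigned> host(n);
    cudaError_t err = cudaMemcpy(host.data(), d, n * sizeof(unsigned),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return {};
    return host;
}
```

Calling `run_version(1, 32)`, `run_version(2, 32)`, and `run_version(3, 32)` should give identical vectors, with 5050 in element 0 and 1 elsewhere.
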
Thread-Divergence

    __global__ void dkernel(unsigned *vector, unsigned vectorsize) {
        unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
        switch (id) {
            case 0: vector[id] = 0; break;
            case 1: vector[id] = vector[id]; break;
            case 2: vector[id] = vector[id - 2]; break;
            case 3: vector[id] = vector[id + 3]; break;
            case 4: vector[id] = 4 + 4 + vector[id]; break;
            case 5: vector[id] = 5 - vector[id]; break;
            case 6: vector[id] = vector[6]; break;
            case 7: vector[id] = 7 + 7; break;
            case 8: vector[id] = vector[id] + 8; break;
            case 9: vector[id] = vector[id] * 9; break;
        }
    }

How many steps will the warp threads take?

Thread-Divergence

    __global__ void dkernel()
    {
        if (threadIdx.x < 16)
        {
            printf("Inside If");
            Global_Barrier();
        }
        else if (threadIdx.x >= 16)
        {
            printf("Inside else");
            Global_Barrier();
        }
    }

What is the Output?
Deadlock!!

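The deadlock arises because the two halves of the warp wait at different barrier calls. A common way to avoid this is to move the barrier out of the divergent branches so that every thread reaches the same call. The sketch below makes that change under one loud assumption: the slide's `Global_Barrier()` is not a CUDA API, so the block-level `__syncthreads()` is swapped in as a stand-in. The counter argument and the host helper `threads_past_barrier` are hypothetical additions for checking the result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dkernel(int *after_barrier) {
    if (threadIdx.x < 16) {
        printf("Inside If\n");
    } else {
        printf("Inside else\n");
    }
    // Every thread of the block reaches this one barrier call.
    // __syncthreads() is a stand-in for the slide's Global_Barrier().
    __syncthreads();
    atomicAdd(after_barrier, 1);   // count threads that got past the barrier
}

// Hypothetical host helper: returns how many threads passed the barrier,
// or -1 on a CUDA error.
int threads_past_barrier() {
    int *d = nullptr;
    int h = -1;
    if (cudaMalloc(&d, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d, 0, sizeof(int));
    dkernel<<<1, 32>>>(d);
    if (cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) h = -1;
    cudaFree(d);
    return h;
}
```

With one block of 32 threads, `threads_past_barrier()` should return 32. The order of the printed lines is not guaranteed.
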
Memory

Agenda
● Computation
● Memory
● Synchronization
● Functions
● Support
● Topics

CUDA Memory Model Overview
• Global / Video memory
– Main means of communicating data between host and device
– Contents visible to all GPU threads
– Long latency access (400-800 cycles)
– Throughput ~200 GB/s
• Texture Memory
– Read-only (12 KB)
– ~800 GB/s
– Optimized for 2D spatial locality
• Constant Memory
– Read-only (64 KB)

[Figure: Grid with Block (0, 0) and Block (1, 0); each block has Shared Memory and threads Thread (0, 0) and Thread (1, 0), each with Registers; Host and Global Memory.]

The numbers are typical values.
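
As an illustration of the constant memory listed above, the sketch below declares a small `__constant__` array, fills it from the host with `cudaMemcpyToSymbol`, and reads it from a kernel. The names `coeffs`, `apply`, and `run_apply` are hypothetical and chosen only for this example; the kernel sees `coeffs` as read-only.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical coefficient table placed in constant memory (read-only in kernels).
__constant__ float coeffs[4];

// Each thread scales its input element by one of the constant coefficients.
__global__ void apply(const float *in, float *out, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) out[id] = in[id] * coeffs[id % 4];
}

// Hypothetical host helper: scale n ones by the table {1, 2, 3, 4}.
// Returns an empty vector if a CUDA call fails.
std::vector<float> run_apply(int n) {
    const float table[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> host(n, 1.0f);
    float *din = nullptr, *dout = nullptr;
    if (cudaMemcpyToSymbol(coeffs, table, sizeof(table)) != cudaSuccess) return {};
    if (cudaMalloc(&din, n * sizeof(float)) != cudaSuccess) return {};
    if (cudaMalloc(&dout, n * sizeof(float)) != cudaSuccess) { cudaFree(din); return {}; }
    cudaMemcpy(din, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    apply<<<(n + 31) / 32, 32>>>(din, dout, n);
    cudaError_t err = cudaMemcpy(host.data(), dout, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(din);
    cudaFree(dout);
    if (err != cudaSuccess) return {};
    return host;
}
```

For `run_apply(8)` the output repeats the table: 1, 2, 3, 4, 1, 2, 3, 4.
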
