CUDA: Introduction
Christian Trefftz / Greg Wolffe
Grand Valley State University
Supercomputing 2008
Education Program
(modifications by Jernej Barbic, 2008-2019)
Terms
Ø What is GPGPU?
l General-Purpose computing on a Graphics
Processing Unit
l Using graphics hardware for non-graphics
computations
Ø What is CUDA?
l Parallel computing platform and API by Nvidia
l Compute Unified Device Architecture
l Software architecture for managing data-parallel
programming
l Introduced in 2007; still actively updated
Motivation
CPU vs. GPU
Ø CPU
l Fast caches
l Branching adaptability
l High performance on sequential code
Ø GPU
l Multiple ALUs
l Fast onboard memory
l High throughput on parallel tasks
• Executes program on each fragment/vertex
Ø CPUs are great for task parallelism
Ø GPUs are great for data parallelism
CPU vs. GPU - Hardware
Ø More transistors devoted to data processing
Traditional Graphics Pipeline
Vertex processing
↓
Rasterizer
↓
Fragment processing
↓
Renderer (textures)
Pixel / Thread Processing
GPU Architecture
Processing Element
Ø Processing element = thread processor
GPU Memory Architecture
Uncached:
Ø Registers
Ø Shared Memory
Ø Local Memory
Ø Global Memory
Cached:
Ø Constant Memory
Ø Texture Memory
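A minimal sketch (an addition to these slides, not original code) of where each of these spaces shows up in CUDA C; texture memory is omitted because it needs extra setup, and per-thread local memory is used automatically when registers spill:

/* constant memory: cached, read-only inside kernels */
__constant__ float coeff[16];

__global__ void memorySpacesDemo(float *in, float *out)  /* in, out point into global memory */
{
    __shared__ float tile[256];    /* shared memory: one copy per block */
    float r = in[threadIdx.x];     /* r is held in a register */
    tile[threadIdx.x] = r * coeff[0];
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}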
Data-parallel Programming
Ø Think of the GPU as a massively-threaded
co-processor
Ø Write “kernel” functions that execute on
the device -- processing multiple data
elements in parallel
Ø Keep it busy! → massive threading
Ø Keep your data close! → local memory
Hardware Requirements
Ø CUDA-capable
video card
Ø Power supply
Ø Cooling
Ø PCI-Express
A Gentle Introduction to
CUDA Programming
Credits
Ø The code used in this presentation is based
on code available in:
l the Tutorial on CUDA in Dr. Dobb's Journal
l Andrew Bellenir’s code for matrix multiplication
l Igor Majdandzic’s code for Voronoi diagrams
l NVIDIA’s CUDA programming guide
Software Requirements/Tools
Ø CUDA device driver
Ø CUDA Toolkit (compiler, CUBLAS, CUFFT)
Ø CUDA Software Development Kit
l Emulator
Profiling:
Ø Occupancy calculator
Ø Visual profiler
To compute, we need to:
Ø Allocate memory for the computation
on the GPU (incl. variables)
Ø Provide input data
Ø Specify the computation to be performed
Ø Read the results from the GPU (output)
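Each of these steps maps onto one CUDA runtime call or a kernel launch; a minimal sketch (an addition to the slides; data, data_d, n, numBlocks, threadsPerBlock, and myKernel are hypothetical names):

cudaMalloc((void**)&data_d, n * sizeof(int));                      /* 1. allocate GPU memory */
cudaMemcpy(data_d, data, n * sizeof(int), cudaMemcpyHostToDevice); /* 2. provide the input data */
myKernel<<<numBlocks, threadsPerBlock>>>(data_d);                  /* 3. perform the computation */
cudaMemcpy(data, data_d, n * sizeof(int), cudaMemcpyDeviceToHost); /* 4. read the results back */
cudaFree(data_d);                                                  /* release the GPU memory */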
Initially:
the array exists only in the host's (CPU) memory; the GPU card's memory holds nothing yet.

Allocate memory on the GPU card:
array (host's memory), array_d (GPU card's memory).

Copy contents from the host's memory to the GPU card's memory:
array → array_d.

Execute code on the GPU:
the GPU multiprocessors operate on array_d in the GPU card's memory.

Copy results back to the host's memory:
array_d → array.
The Kernel
Ø The code to be executed in the
stream processors on the GPU
Ø Simultaneous execution in
several (perhaps all) stream
processors on the GPU
Ø How is every instance of the
kernel going to know which
piece of data it is working on?
Grid and Block Size
l Grid size: the number of blocks
• Can be a 1- or 2-dimensional array of blocks
l Each block is divided into threads
• Can be a 1-, 2-, or 3-dimensional array of threads
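A minimal sketch (an addition to the slides) of specifying these sizes with dim3 values; the kernel name myKernel, the argument array_d, and the dimensions are made up for illustration:

dim3 dimGrid(16, 16);   /* grid size: a 2-dimensional array of 16x16 blocks */
dim3 dimBlock(256);     /* block size: a 1-dimensional array of 256 threads */
myKernel<<<dimGrid, dimBlock>>>(array_d);

A plain int can also be passed for either value; it is treated as a 1-dimensional size.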
Let’s look at a very simple example
Ø The code has been divided into two files:
l simple.c
l simple.cu
Ø simple.c is ordinary code in C
Ø It allocates an array of integers, initializes
it to values corresponding to the indices in
the array and prints the array.
Ø It calls a function that modifies the array
Ø The array is printed again.
simple.c
#include <stdio.h>
#define SIZEOFARRAY 64
extern void fillArray(int *a,int size);
/* The main program */
int main(int argc,char *argv[])
{
/* Declare the array that will be modified by the GPU */
int a[SIZEOFARRAY];
int i;
/* Initialize the array to 0s */
for(i=0;i < SIZEOFARRAY;i++) {
a[i]=0;
}
/* Print the initial array */
printf("Initial state of the array:\n");
for(i = 0;i < SIZEOFARRAY;i++) {
printf("%d ",a[i]);
}
printf("\n");
/* Call the function that will in turn call the function in the GPU that will fill
the array */
fillArray(a,SIZEOFARRAY);
/* Now print the array after calling fillArray */
printf("Final state of the array:\n");
for(i = 0;i < SIZEOFARRAY;i++) {
printf("%d ",a[i]);
}
printf("\n");
return 0;
}
simple.cu
Ø simple.cu contains two functions
l fillArray(): A function that will be executed on
the host and which takes care of:
• Allocating variables in the global GPU memory
• Copying the array from the host to the GPU memory
• Setting the grid and block sizes
• Invoking the kernel that is executed on the GPU
• Copying the values back to the host memory
• Freeing the GPU memory
fillArray (part 1)
#define BLOCK_SIZE 32
extern "C" void fillArray(int *array, int arraySize)
{
int * array_d;
cudaError_t result;
/* cudaMalloc allocates space in GPU memory */
result =
cudaMalloc((void**)&array_d,sizeof(int)*arraySize);
/* copy the CPU array into the GPU array_d */
result = cudaMemcpy(array_d,array,sizeof(int)*arraySize,
cudaMemcpyHostToDevice);
fillArray (part 2)
/* Indicate block size */
dim3 dimblock(BLOCK_SIZE);
/* Indicate grid size */
dim3 dimgrid(arraySize / BLOCK_SIZE);
/* Call the kernel */
cu_fillArray<<<dimgrid,dimblock>>>(array_d);
/* Copy the results from GPU back to CPU memory */
result =
cudaMemcpy(array,array_d,sizeof(int)*arraySize,cudaMemcpyDeviceToHost);
/* Release the GPU memory */
cudaFree(array_d);
}
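The calls above store their return codes in result but never test them; a minimal sketch (an addition, not part of the original code, assuming <stdio.h> and <stdlib.h> are included) of how those codes could be checked:

/* hypothetical helper: print a message and abort if a CUDA call failed */
static void checkResult(cudaError_t result, const char *msg)
{
    if (result != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", msg, cudaGetErrorString(result));
        exit(EXIT_FAILURE);
    }
}

/* usage inside fillArray, for example:
   checkResult(cudaMalloc((void**)&array_d, sizeof(int)*arraySize), "cudaMalloc");
   checkResult(cudaMemcpy(array_d, array, sizeof(int)*arraySize,
                          cudaMemcpyHostToDevice), "cudaMemcpy to device");   */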
simple.cu (cont.)
Ø The other function in simple.cu is cu_fillArray():
l This is the GPU kernel
l Identified by the keyword: __global__
l Built-in variables:
• blockIdx.x : block index within the grid
• threadIdx.x: thread index within the block
cu_fillArray
__global__ void cu_fillArray(int * array_d)
{
int x;
x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
array_d[x] = x;
}
__global__ void cu_addIntegers(int * array_d1, int * array_d2)
{
int x;
x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
array_d1[x] += array_d2[x];
}
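cu_addIntegers is shown without a host-side caller; a minimal sketch (an addition to the slides, with hypothetical names a, b, a_d, b_d) of launching it, following the same pattern as fillArray:

extern "C" void addIntegers(int *a, int *b, int arraySize)
{
    int *a_d, *b_d;
    /* allocate both device arrays and copy the inputs over */
    cudaMalloc((void**)&a_d, sizeof(int)*arraySize);
    cudaMalloc((void**)&b_d, sizeof(int)*arraySize);
    cudaMemcpy(a_d, a, sizeof(int)*arraySize, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b, sizeof(int)*arraySize, cudaMemcpyHostToDevice);
    /* one thread per element; assumes arraySize is a multiple of BLOCK_SIZE */
    cu_addIntegers<<<arraySize / BLOCK_SIZE, BLOCK_SIZE>>>(a_d, b_d);
    /* a_d now holds the element-wise sums; copy it back and free both arrays */
    cudaMemcpy(a, a_d, sizeof(int)*arraySize, cudaMemcpyDeviceToHost);
    cudaFree(a_d);
    cudaFree(b_d);
}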
To compile:
Ø nvcc simple.c simple.cu -o simple
Ø The compiler generates the code for both
the host and the GPU
Ø Demo on cuda.littlefe.net …
In the GPU:
[Diagram: the processing elements run threads 0–3 of Block 0 and threads 0–3 of Block 1; each thread maps to one array element.]
Another Example: saxpy
Ø SAXPY (Scalar Alpha X Plus Y)
l A common operation in linear algebra
Ø CUDA: loop iteration → thread
Traditional Sequential Code
void saxpy_serial(int n,
float alpha,
float *x,
float *y)
{
for(int i = 0;i < n;i++)
y[i] = alpha*x[i] + y[i];
}
CUDA Code
__global__ void saxpy_parallel(int n,
float alpha,
float *x,
float *y) {
int i = blockIdx.x*blockDim.x+threadIdx.x;
if (i<n)
y[i] = alpha*x[i] + y[i];
}
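The slides do not show the host-side launch; a minimal sketch (an addition, assuming device pointers x_d and y_d have already been allocated and filled) that uses ceiling division so the last, partially full block still covers the tail of the arrays:

int blockSize = 256;                             /* threads per block */
int gridSize  = (n + blockSize - 1) / blockSize; /* rounds up so all n elements get a thread */
saxpy_parallel<<<gridSize, blockSize>>>(n, alpha, x_d, y_d);

The if (i < n) guard in the kernel makes the extra threads of the final block do nothing.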
“Warps”
Ø Each block is split into SIMD groups of threads
called "warps".
Ø Each warp contains the same number of threads,
called the "warp size" (32 on NVIDIA GPUs)
[Diagram: Blocks 1–4, each split into warps 1–3 of threads, all scheduled onto Multi-processor 1.]
Keeping multiprocessors in mind…
Ø Each multiprocessor can process multiple blocks at a
time.
Ø How many depends on the number of registers per
thread and how much shared memory per block is
required by a given kernel.
Ø If a block is too large, it will not fit into the resources of
an MP.
Performance Tip: Block Size
Ø Critical for performance
Ø Recommended value is 192 or 256
Ø Maximum value is 512
Ø Should be a multiple of 32 since this is the warp
size for Series 8 GPUs and thus the native
execution size for multiprocessors
Ø Limited by number of registers on the MP
Ø Series 8 GPU MPs have 8192 registers which
are shared between all the threads on an MP
Performance Tip:
Grid Size (number of blocks)
Ø Recommended value is at least 100, but 1000 would
scale for many generations of hardware
Ø Actual value depends on problem size
Ø It should be a multiple of the number of MPs for an even
distribution of work (not a requirement though)
Ø Example: 24 blocks
l Grid will work efficiently on Series 8 (12 MPs), but it will waste
resources on newer GPUs with 32 MPs
Example: Tesla P100
Ø Launched in 2016
Ø “Pascal” architecture (successors: Volta, Turing)
Ø Double-precision performance: 4.7 TeraFLOPS
Ø Single-precision performance: 9.3 TeraFLOPS
Ø GPU Memory: 16 GB
Example: Tesla P100
Ø Number of Multiprocessors (MPs): 56
Ø Number of Cuda Cores per MP: 64
Ø Total number of Cuda Cores: 3584
Ø #CUDA cores = number of floating-point
instructions that can be processed per cycle
Ø MPs can run multiple threads per core
simultaneously (similar to hyperthreading on CPU)
Ø Hence, #threads can be larger than #cores
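These figures can be queried at run time; a minimal sketch (an addition to the slides) using cudaGetDeviceProperties, compiled with nvcc:

#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* properties of device 0 */
    printf("Name: %s\n", prop.name);
    printf("Multiprocessors (MPs): %d\n", prop.multiProcessorCount);
    printf("Warp size: %d\n", prop.warpSize);
    printf("Global memory: %zu bytes\n", prop.totalGlobalMem);
    return 0;
}

The number of CUDA cores per MP is not reported directly; it depends on the architecture (64 for the P100's MPs, as above).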
Memory Alignment
Ø Memory access is faster if data is aligned at
64-byte boundaries
Ø Hence, allocate 2D arrays so that every
row starts at a 64-byte boundary
Ø Tedious for a programmer
Allocating 2D arrays with “pitch”
Ø CUDA offers special versions of:
l Memory allocation of 2D arrays so that every row
is padded (if necessary): cudaMallocPitch()
l Memory copy operations that take into account the
pitch: cudaMemcpy2D()
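A minimal sketch (an addition to the slides) of allocating a pitched 2D array of floats and copying a dense host array into it; rows, cols, and hostArray are hypothetical:

float *devPtr;
size_t pitch;   /* width in bytes of one padded row, returned by cudaMallocPitch */
/* allocate rows x cols floats; every row starts at a properly aligned address */
cudaMallocPitch((void**)&devPtr, &pitch, cols * sizeof(float), rows);
/* copy the dense host array into the padded device array, row by row */
cudaMemcpy2D(devPtr, pitch,
             hostArray, cols * sizeof(float),   /* source pitch = unpadded row width */
             cols * sizeof(float), rows,
             cudaMemcpyHostToDevice);

Inside a kernel, row r of the device array then starts at (char*)devPtr + r * pitch.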
Pitch
[Diagram: a 2D array of rows × columns; each row is followed by padding so that its allocated width (the pitch) keeps every row aligned.]
Dividing the work by blocks:
[Diagram: the rows of the pitched 2D array are assigned to Block 0, Block 1, and Block 2.]
Watchdog timer
Ø The OS may force programs using the GPU to time out
if they run too long
Ø Exceeding the limit can cause CUDA program
failure.
Ø Possible solution: run CUDA on a GPU that is NOT
attached to a display.
Resources on line
Ø http://www.acmqueue.org/modules.php?name=
Content&pa=showpage&pid=532
Ø http://www.ddj.com/hpc-high-performance-
computing/207200659
Ø http://www.nvidia.com/object/cuda_home.html#
Ø http://www.nvidia.com/object/cuda_learn.html
Ø “Computation of Voronoi diagrams using a
graphics processing unit” by Igor Majdandzic et
al. available through IEEE Digital Library, DOI:
10.1109/EIT.2008.4554342