CUDA

CUDA, or Compute Unified Device Architecture, is a parallel computing platform and API developed by NVIDIA for general-purpose, high-speed computation on GPUs. It organizes computations into threads and blocks, enabling efficient processing of large datasets in applications such as deep learning and medical imaging. While CUDA offers significant performance advantages, it is limited to NVIDIA hardware and has interoperability constraints with graphics APIs such as OpenGL.

Uploaded by

kgtlaptop

COMPUTE UNIFIED DEVICE ARCHITECTURE - CUDA
INTRODUCTION TO CUDA PROGRAMMING

◼ CUDA stands for Compute Unified Device Architecture.
◼ It is an extension of C/C++ programming.
◼ CUDA is a platform and programming model, rather than a separate programming language, for running general-purpose code on the Graphics Processing Unit (GPU).
◼ It is a parallel computing platform and API (Application Programming Interface) model developed by NVIDIA.
◼ It allows computations to be performed in parallel, which can give large speed-ups on suitable workloads.
◼ Beyond graphics calculations, CUDA makes it possible to process matrices and carry out other linear algebra operations efficiently.
WHY DO WE NEED CUDA?

◼ GPUs are designed to perform high-speed parallel computations to display graphics, such as games.
◼ For some applications, a GPU provides a 30-100x speed-up over other microprocessors.
◼ GPUs have many small Arithmetic Logic Units (ALUs), which suits workloads made of many parallel calculations, such as computing the color of each pixel on the screen.
• The ARCHITECTURE OF CUDA diagram shows 16 Streaming Multiprocessors (SMs).
• Each Streaming Multiprocessor has 8 Streaming Processors (SPs), i.e., a total of 16 * 8 = 128 SPs.
• Each Streaming Processor has a MAD unit (multiply-add unit) and an additional MU (multiply unit).
NVIDIA GT200 ARCHITECTURE

The GT200 has 30 Streaming Multiprocessors (SMs), and each SM has 8 Streaming Processors (SPs), i.e., a total of 30 * 8 = 240 SPs.
PROCESSING FLOW ON CUDA

Example of CUDA processing flow:
1. Copy data from main memory to GPU memory.
2. The CPU initiates the GPU compute kernel.
3. The GPU's CUDA cores execute the kernel in parallel.
4. Copy the resulting data from GPU memory to main memory.
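The four steps of the processing flow can be sketched in plain Python. This is a CPU-only emulation under stated assumptions: no GPU is used, the "device" buffers are ordinary lists, and every name (copy_to_device, copy_to_host, launch, scale_kernel) is hypothetical rather than part of any CUDA API.

```python
# CPU-only emulation of the four-step flow. No GPU is involved, and all
# function names here are hypothetical stand-ins, not real CUDA calls.

def copy_to_device(host_data):
    """Step 1: stand-in for a host-to-device copy (makes a separate buffer)."""
    return list(host_data)

def copy_to_host(device_data):
    """Step 4: stand-in for a device-to-host copy."""
    return list(device_data)

def scale_kernel(thread_id, device_in, device_out, factor):
    """Kernel body: each emulated thread handles one element."""
    if thread_id < len(device_in):
        device_out[thread_id] = device_in[thread_id] * factor

def launch(kernel, num_threads, *args):
    """Steps 2-3: the host starts the kernel. A real GPU would run the
    threads in parallel; here they simply run one after another."""
    for thread_id in range(num_threads):
        kernel(thread_id, *args)

host_in = [1.0, 2.0, 3.0, 4.0]
dev_in = copy_to_device(host_in)                          # step 1
dev_out = [0.0] * len(dev_in)
launch(scale_kernel, len(dev_in), dev_in, dev_out, 2.0)   # steps 2 and 3
host_out = copy_to_host(dev_out)                          # step 4
```

The host list is never modified by the kernel; only the copied buffers are, which mirrors the separation between main memory and GPU memory.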
HOW DOES CUDA WORK?

◼ Threads and blocks: in CUDA, computations are divided into threads, which are organized into blocks.
◼ Threads are individual units of computation, while blocks group threads together.
◼ A large number of threads can work simultaneously within a block, and multiple blocks can run in parallel.
◼ In the simplest case, a GPU runs one kernel at a time. Each kernel consists of blocks, which are independent groups of threads. The threads in each block typically work together to calculate a value, and threads in the same block can share memory.
◼ Partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each
sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block.
◼ A kernel is executed in parallel by an array of threads:
• All threads run the same code.
• Each thread has an ID that it uses to compute memory addresses and make control decisions.

• Different kernels can have different grid/block configurations.
• Threads from the same block have access to shared memory, and their execution can be synchronized.
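As an illustration of blocks solving independent sub-problems while the threads inside a block cooperate through shared memory, here is a CPU-only Python sketch of a block-wise sum. It is an emulation, not GPU code: the function name block_sums is hypothetical, the per-block list stands in for shared memory, and the end of the inner loop plays the role of a synchronization barrier.

```python
def block_sums(data, threads_per_block):
    """Emulate a kernel in which each block sums its own slice of `data`."""
    # Enough blocks to cover every element (ceiling division).
    num_blocks = (len(data) + threads_per_block - 1) // threads_per_block
    partial = [0] * num_blocks

    for block_idx in range(num_blocks):       # blocks are independent
        shared = [0] * threads_per_block      # stand-in for per-block shared memory

        # Phase 1: every thread in the block loads one element.
        for thread_idx in range(threads_per_block):
            i = thread_idx + block_idx * threads_per_block
            if i < len(data):                 # threads past the end do nothing
                shared[thread_idx] = data[i]

        # The loop finishing acts as the barrier: all loads are complete here.
        # Phase 2: one thread combines the values the block loaded.
        partial[block_idx] = sum(shared)

    return partial

partial = block_sums(list(range(1, 11)), threads_per_block=4)
total = sum(partial)
```

With ten elements and four threads per block, three blocks are needed and the last block has two idle threads, which is why the bounds check matters.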
THREAD IDENTITY

◼ The index of a thread and its thread ID relate to each other as follows:
◼ For a 1-dimensional block, the thread index and thread ID are the same.
◼ For a 2-dimensional block of size (Dx, Dy), the thread with index (x, y) has thread ID = x + y*Dx.
◼ For a 3-dimensional block of size (Dx, Dy, Dz), the thread with index (x, y, z) has thread ID = x + y*Dx + z*Dx*Dy.
When a kernel is started, the number of blocks per grid (gridDim) and the number of threads per block (blockDim) are fixed.
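The index-to-ID formulas can be checked with a few lines of plain Python. The helper name thread_id is hypothetical; 1- and 2-dimensional blocks are treated as 3-dimensional blocks whose unused sizes are 1.

```python
def thread_id(index, block_size):
    """Flatten a thread index (x, y, z) into a thread ID for a block of
    size (Dx, Dy, Dz), using ID = x + y*Dx + z*Dx*Dy."""
    x, y, z = index
    dx, dy, dz = block_size
    if not (0 <= x < dx and 0 <= y < dy and 0 <= z < dz):
        raise ValueError("thread index lies outside the block")
    return x + y * dx + z * dx * dy

# 1-D block: the index and the ID coincide.
id_1d = thread_id((5, 0, 0), (8, 1, 1))

# 2-D block of size (8, 4): ID = x + y*Dx = 3 + 2*8.
id_2d = thread_id((3, 2, 0), (8, 4, 1))

# 3-D block of size (4, 5, 6): ID = 1 + 2*4 + 3*4*5.
id_3d = thread_id((1, 2, 3), (4, 5, 6))

# Every thread in a block gets a distinct ID in 0 .. Dx*Dy*Dz - 1.
all_ids = sorted(
    thread_id((x, y, z), (4, 5, 6))
    for z in range(6) for y in range(5) for x in range(4)
)
```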
CUDA makes four pieces of information available to each thread:
• The thread index (threadIdx)
• The block index (blockIdx)
• The size of a block (blockDim)
• The size of a grid (gridDim)

Typically, each thread in a kernel computes one element of an array.

For a 1-dimensional grid:
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    i = tx + bx * bw
    array[i] = compute(i)

For a 2-dimensional grid:
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    x = tx + bx * bw
    y = ty + by * bh
    array[x, y] = compute(x, y)
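To see the 1-dimensional indexing run end to end without a GPU, the sketch below emulates the launch in plain Python. The `cuda` object is a hypothetical stand-in built from SimpleNamespace that only mimics the attribute names used in the snippet; launch_1d and square_kernel are likewise illustrative names, and the bounds check is an addition needed whenever the array length is not a multiple of the block size.

```python
from types import SimpleNamespace

def square_kernel(cuda, array):
    """Kernel body: each thread computes one element from its global index."""
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    i = tx + bx * bw
    if i < len(array):          # guard: the last block may overshoot the array
        array[i] = i * i

def launch_1d(kernel, blocks_per_grid, threads_per_block, *args):
    """Run the kernel once per (block, thread) pair, sequentially."""
    for bx in range(blocks_per_grid):
        for tx in range(threads_per_block):
            cuda = SimpleNamespace(
                threadIdx=SimpleNamespace(x=tx),
                blockIdx=SimpleNamespace(x=bx),
                blockDim=SimpleNamespace(x=threads_per_block),
                gridDim=SimpleNamespace(x=blocks_per_grid),
            )
            kernel(cuda, *args)

n = 10
threads_per_block = 4
# Ceiling division so that blocks * threads covers all n elements.
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
result = [None] * n
launch_1d(square_kernel, blocks_per_grid, threads_per_block, result)
```

Here 3 blocks of 4 threads give 12 emulated threads for 10 elements; the two extra threads fail the bounds check and write nothing.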
MEMORY HIERARCHY

◼ The CPU and GPU have separate memory spaces.
◼ This means that data that is processed by the GPU must be moved from the CPU to the GPU before the computation starts, and the results of the computation must be moved back to the CPU once processing has completed.
Global memory
This memory is accessible to all threads as well as the host
(CPU).
• Global memory is allocated and deallocated by the host
• Used to initialize the data that the GPU will work on
L1/Shared memory
Each thread block has its own shared memory
• Accessible only by threads within the block
• Much faster than local or global memory
• Requires special handling to get maximum performance
• Only exists for the lifetime of the block
Local memory
Each thread has its own private local memory
• Only exists for the lifetime of the thread
• Generally handled automatically by the compiler
L2 cache
• Shared across all SMs, so every thread in every CUDA block can access this memory
Read-only memory
• Each SM has an instruction cache, constant memory, texture memory and a read-only (RO) cache, all of which are read-only to kernel code
Registers
• Private to each thread, which means that registers assigned to a thread are not visible to other threads
PROGRAMMING MODEL

◼ https://docs.nvidia.com/cuda/cuda-c-programming-guide/
◼ Two keywords are widely used in the CUDA programming model: host and device.
◼ The host is the CPU available in the system. The system memory associated with the CPU is called host memory.
◼ The GPU is called a device, and GPU memory is likewise called device memory.
To execute any CUDA program, there are three main steps:
• Copy the input data from host memory to device memory, also known as
host-to-device transfer.
• Load the GPU program and execute, caching data on-chip for
performance.
• Copy the results from device memory to host memory, also called
device-to-host transfer.
◼ CUDA kernel and thread hierarchy
◼ Figure 1 shows that the CUDA kernel is a function that gets executed on the GPU.
◼ Every CUDA kernel starts with a __global__ declaration specifier.
◼ Programmers give each thread a unique global ID by using built-in variables.
◼ A group of threads is called a CUDA block.
◼ CUDA blocks are grouped into a grid.
◼ A kernel is executed as a grid of blocks of threads (Figure 2).
◼ Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in the GPU.
◼ The CUDA architecture limits the number of threads per block (1024 threads per block).

◼ The dimension of the thread block is accessible within the kernel through the built-in blockDim variable.

◼ All threads within a block can be synchronized using the intrinsic function __syncthreads().

◼ The CUDA program for adding two matrices below shows multi-dimensional blockIdx and threadIdx
and other variables like blockDim.
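For readers without a GPU at hand, matrix addition with two-dimensional block and thread indices can be emulated in plain Python. This is a CPU-only sketch, not CUDA C: mat_add_kernel and launch_2d are hypothetical names, tuples stand in for the built-in blockIdx, threadIdx and blockDim variables, and the grid_dim and block_dim arguments play the role of the launch configuration.

```python
def mat_add_kernel(thread_idx, block_idx, block_dim, a, b, out):
    """Kernel body: each thread adds one matrix element."""
    col = block_idx[0] * block_dim[0] + thread_idx[0]
    row = block_idx[1] * block_dim[1] + thread_idx[1]
    if row < len(a) and col < len(a[0]):   # skip threads outside the matrix
        out[row][col] = a[row][col] + b[row][col]

def launch_2d(kernel, grid_dim, block_dim, *args):
    """Run the kernel for every thread of every block, sequentially."""
    assert block_dim[0] * block_dim[1] <= 1024, "threads-per-block limit"
    for by in range(grid_dim[1]):
        for bx in range(grid_dim[0]):
            for ty in range(block_dim[1]):
                for tx in range(block_dim[0]):
                    kernel((tx, ty), (bx, by), block_dim, *args)

rows, cols = 3, 5
a = [[r * cols + k for k in range(cols)] for r in range(rows)]
b = [[10] * cols for _ in range(rows)]
out = [[0] * cols for _ in range(rows)]

block_dim = (2, 2)                                   # threads per block (x, y)
grid_dim = ((cols + block_dim[0] - 1) // block_dim[0],
            (rows + block_dim[1] - 1) // block_dim[1])  # blocks per grid (x, y)
launch_2d(mat_add_kernel, grid_dim, block_dim, a, b, out)
```

The x dimension is mapped to columns and the y dimension to rows, which is one common convention; the opposite mapping works equally well as long as the kernel and the grid size agree.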

The number of threads per block and the number of blocks per grid are specified in the <<<…>>> launch syntax.
CUDA APPLICATIONS
◼ CUDA applications must run parallel operations on large sets of data.
Applications:
1. Climate, weather, and ocean modeling
2. Data science and analytics
3. Deep learning and machine learning
4. Defence and intelligence
5. Media and entertainment
6. Medical imaging
7. Safety and security
BENEFITS OF CUDA / LIMITATIONS OF CUDA
Benefits of CUDA
◼ Several advantages give CUDA an edge over traditional general-purpose GPU (GPGPU) computing with graphics APIs:
◼ Unified memory (CUDA 6.0 or later) and unified virtual memory (CUDA 4.0 or later).
◼ Shared memory: CUDA exposes a fast memory region that can be shared among threads. It can be used as a caching mechanism and provides more bandwidth than texture lookups.
◼ Scattered reads: code can read from arbitrary addresses in memory.
◼ CUDA has full support for integer and bitwise operations.
Limitations of CUDA
◼ CUDA source code, whether for the host machine or the GPU, is processed according to C++ syntax rules. Earlier versions of CUDA used C syntax rules, which means that older C-style CUDA source code may fail to compile or may not behave as originally intended.
◼ CUDA has one-way interoperability (the ability of computer systems or software to exchange and make use of information) with rendering APIs like OpenGL. OpenGL can access registered CUDA memory, but CUDA cannot access OpenGL memory.
◼ CUDA supports only NVIDIA hardware.
