Programming Massively Parallel Processors

These lecture notes introduce programming massively parallel processors. They begin by noting the large performance advantage of GPUs over CPUs in calculation throughput and memory bandwidth, and that GPUs are now programmable through languages such as CUDA rather than only through graphics APIs. They then explain that GPUs are designed for massively parallel throughput rather than for sequential performance, as CPUs are. The remaining material covers the CUDA programming model and API, application benchmarks showing large speedups over CPUs, a brief history of GPU development, and an introduction to computational thinking for parallel programming.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010
ECE 408, University of Illinois, Urbana-Champaign
Why Massively Parallel Processors?
• A quiet revolution and potential build-up
– Calculation: 367 GFLOPS (GPU) vs. 32 GFLOPS (CPU)
– Memory bandwidth: 86.4 GB/s (GPU) vs. 8.4 GB/s (CPU)
– Until last year, programmed only through graphics APIs
– GPU in every PC and workstation – massive volume and potential impact
CPUs and GPUs have fundamentally different design philosophies

Schematic: the CPU devotes much of its die area to control logic and a large cache in front of a few ALUs; the GPU devotes most of its area to a large array of ALUs. Each is backed by its own DRAM.
Architecture of a CUDA-capable GPU
Block diagram: the host feeds an input assembler and a thread execution manager, which dispatch threads to an array of processors; each processor group has its own parallel data cache and texture unit, and load/store units connect the array to global memory.
GT200 Characteristics
• 1 TFLOPS peak performance (25-50 times that of current high-end microprocessors)
• 265 GFLOPS sustained for apps such as VMD
• Massively parallel, 128 cores, 90W
• Massively threaded, sustains 1000s of threads per app
• 30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics

"I think they're right on the money, but the huge performance differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s) will invite close scrutiny so I have to be careful what I say publically until I triple check those numbers."
– John Stone, VMD group, Physics, UIUC
Future Apps Reflect a Concurrent World
• Exciting applications in the future mass computing market have traditionally been considered "supercomputing applications"
– Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
– These "super-apps" represent and model the physical, concurrent world
• Various granularities of parallelism exist, but...
– the programming model must not hinder parallel implementation
– data delivery needs careful management

Stretching Traditional Architectures
• Traditional parallel architectures cover some super-applications
– DSP, GPU, network apps, scientific computing
• The game is to grow mainstream architectures "out" or domain-specific architectures "in"
– CUDA is the latter

Diagram: traditional applications sit inside current architecture coverage; new applications lie between current coverage and domain-specific architecture coverage, separated by obstacles.
Samples of Previous Projects
Application | Description                                                                    | Source (lines) | Kernel (lines) | % time
H.264       | SPEC '06 version, change in guess vector                                       | 34,811         | 194            | 35%
LBM         | SPEC '06 version, change to single precision and print fewer reports           | 1,481          | 285            | >99%
RC5-72      | Distributed.net RC5-72 challenge client code                                   | 1,979          | 218            | >99%
FEM         | Finite element modeling, simulation of 3D graded materials                     | 1,874          | 146            | 99%
RPES        | Rye Polynomial Equation Solver, quantum chem, 2-electron repulsion             | 1,104          | 281            | 99%
PNS         | Petri Net simulation of a distributed system                                   | 322            | 160            | >99%
SAXPY       | Single-precision implementation of saxpy, used in Linpack's Gaussian elim. routine | 952        | 31             | >99%
TPACF       | Two Point Angular Correlation Function                                         | 536            | 98             | 96%
FDTD        | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation  | 1,365          | 93             | 16%
MRI-Q       | Computing a matrix Q, a scanner's configuration, in MRI reconstruction         | 490            | 33             | >99%
Speedup of Applications
Bar chart: GPU speedup relative to CPU, for the kernel alone and for the whole application, across H.264, LBM, RC5-72, FEM, RPES, PNS, SAXPY, TPACF, FDTD, MRI-Q, and MRI-FHD; bars above the 60× axis limit are labeled individually (79× to 457×).

• GeForce 8800 GTX vs. 2.2GHz Opteron 248
• 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
• 25× to 400× speedup if the function's data requirements and control flow suit the GPU and the application is optimized

GPU HISTORY

A Fixed-Function GPU Pipeline
Block diagram: the host CPU talks to the GPU through the host interface; the pipeline proceeds through vertex control (with a vertex cache), VS/T&L, triangle setup, raster, shader (with a texture cache), ROP, and the frame buffer interface (FBI) to frame buffer memory.
Texture Mapping Example

Texture mapping example: painting a world map texture image onto a globe object.

Anti-Aliasing Example

Figure: a triangle's geometry rendered aliased and anti-aliased.

Programmable Vertex and Pixel Processors

Block diagram: a 3D application or game issues 3D API commands through OpenGL or Direct3D on the CPU; across the CPU-GPU boundary, the GPU front end assembles primitives from the vertex index stream, a programmable vertex processor transforms pre-transformed vertices, rasterization and interpolation turn the assembled polygons, lines, and points into fragments, a programmable fragment processor transforms the fragments, and raster operations apply pixel updates to the framebuffer.

An example of separate vertex processor and fragment processor in a programmable graphics pipeline.
Unified Graphics Pipeline
Block diagram: the host feeds a data assembler and setup/raster/ZCull; vertex, geometry, and pixel thread issue units dispatch work through the thread processor to a unified array of streaming processors (SP) with texture fetch units (TF), L1 and L2 caches, and framebuffer (FB) partitions.

Diagram: a fragment program reads per-thread input registers, per-shader and per-context constants, and texture; it works in temporary registers and can write only to output registers, which go to framebuffer memory.
The restricted input and output capabilities of a shader programming model.
CUDA PROGRAMMING MODEL
What is (Historical) GPGPU ?
• General Purpose computation using GPU and graphics API
in applications other than 3D graphics
– GPU accelerates critical path of application

• Data parallel algorithms leverage GPU attributes


– Large data arrays, streaming throughput
– Fine-grain SIMD parallelism
– Low-latency floating point (FP) computation
• Applications – see //GPGPU.org
– Game effects (FX) physics, image processing
– Physical modeling, computational engineering, matrix algebra,
convolution, correlation, sorting
Previous GPGPU Constraints
• Dealing with graphics API
– Working with the corner cases of the graphics API
• Addressing modes
– Limited texture size/dimension
• Shader capabilities
– Limited outputs
• Instruction sets
– Lack of integer & bit ops
• Communication limited
– Between pixels
– Scatter a[i] = p
CUDA
• “Compute Unified Device Architecture”
• General purpose programming model
– User kicks off batches of threads on the GPU
– GPU = dedicated super-threaded, massively data parallel co-processor
• Targeted software stack
– Compute oriented drivers, language, and tools
• Driver for loading computation programs into GPU
– Standalone Driver - Optimized for computation
– Interface designed for compute – graphics-free API
– Data sharing with OpenGL buffer objects
– Guaranteed maximum download & readback speeds
– Explicit GPU memory management
An Example of Physical Reality Behind CUDA
Photo: a CPU (the host) paired with a GPU that has its own local DRAM (the device).

Parallel Computing on a GPU
• 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
– Available in laptops, desktops, and clusters (GeForce 8800, Tesla D870, Tesla S870)
• GPU parallelism is doubling every year
• Programming model scales transparently
• Programmable in C with CUDA tools
• Multithreaded SPMD model uses application data parallelism and thread parallelism
Overview
• CUDA programming model – basic concepts and data types

• CUDA application programming interface – basic

• Simple examples to illustrate basic concepts and functionalities
CUDA – C with no shader limitations!
• Integrated host+device app C program
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code

Serial Code (host)
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args); ...
Serial Code (host)
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args); ...
CUDA Devices and Threads
• A compute device
– Is a coprocessor to the CPU or host
– Has its own DRAM (device memory)
– Runs many threads in parallel
– Is typically a GPU but can also be another type of parallel processing
device
• Data-parallel portions of an application are expressed as device
kernels which run on many threads
• Differences between GPU and CPU threads
– GPU threads are extremely lightweight
• Very little creation overhead
– GPU needs 1000s of threads for full efficiency
• Multi-core CPU needs only a few
G80 – Graphics Mode
• The future of GPUs is programmable processing
• So – build the architecture around the processor
Block diagram (unified pipeline as above): host, input assembler, setup/raster/ZCull, and vertex/geometry/pixel thread issue feed the thread processor and a unified array of streaming processors (SP) with texture fetch units, L1/L2 caches, and framebuffer partitions.
G80 CUDA mode – A Device Example
• Processors execute computing threads
• New operating mode/HW interface for computing

Block diagram: the host feeds an input assembler and thread execution manager; processor groups each have a parallel data cache and texture unit, and load/store units access global memory.
Extended C
• Declspecs
– global, device, shared, local, constant
• Keywords
– threadIdx, blockIdx
• Intrinsics
– __syncthreads
• Runtime API
– Memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve (float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);

Extended C
Compilation flow: integrated source (foo.cu) goes through cudacc (EDG C/C++ frontend, Open64 global optimizer), which splits it into GPU assembly (foo.s) and CPU host code (foo.cpp); OCG turns the GPU assembly into G80 SASS (foo.sass), while gcc / cl compiles the host code.

Mark Murphy, "NVIDIA's Experience with Open64," www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc
Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
– All threads run the same code (SPMD)
– Each thread has an ID that it uses to compute memory addresses and make control decisions

threadID: 0 1 2 3 4 5 6 7

float x = input[threadID];
float y = func(x);
output[threadID] = y;
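As a minimal sketch of how this fragment becomes a full kernel (the name scale_kernel and the doubling stand-in for func(x) are illustrative, not from the slides), the thread ID comes from the built-in threadIdx variable:

__global__ void scale_kernel(const float *input, float *output)
{
    // One thread per element; threadIdx.x plays the role of threadID above.
    int threadID = threadIdx.x;
    float x = input[threadID];
    float y = 2.0f * x;              // stand-in for func(x)
    output[threadID] = y;
}

// Example launch: one block of 8 threads, matching the 8-thread array shown.
// scale_kernel<<<1, 8>>>(d_input, d_output);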

Thread Blocks: Scalable Cooperation
• Divide the monolithic thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations and barrier synchronization (see the sketch below)
– Threads in different blocks cannot cooperate

Thread Block 0, Thread Block 1, ..., Thread Block N-1: each block runs the same code over its own threadIDs 0..7:

float x = input[threadID];
float y = func(x);
output[threadID] = y;
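A small sketch of intra-block cooperation under stated assumptions (the kernel reverse_in_block and the 64-thread block size are illustrative): each block stages its elements in shared memory, waits at a barrier, and writes them back reversed.

#define BLOCK_SIZE 64

__global__ void reverse_in_block(float *data)
{
    __shared__ float buf[BLOCK_SIZE];        // visible only to threads of this block

    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x;      // first element owned by this block

    buf[tid] = data[base + tid];             // stage into shared memory
    __syncthreads();                         // barrier: the whole block has loaded

    data[base + tid] = buf[BLOCK_SIZE - 1 - tid];   // write back reversed
}

// Launch with blockDim.x == BLOCK_SIZE, e.g. reverse_in_block<<<numBlocks, BLOCK_SIZE>>>(d_data);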

Block IDs and Thread IDs
• Each thread uses IDs to decide what data to work on (a sketch follows)
– Block ID: 1D or 2D
– Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
– Image processing
– Solving PDEs on volumes
– ...

Figure: the host launches Kernel 1 on Grid 1 (a 2x2 array of blocks) and Kernel 2 on Grid 2; Block (1, 1) is expanded to show its threads indexed (x, y, z). (Courtesy: NVIDIA)
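A minimal sketch of turning block and thread IDs into a global index for 2D data (the kernel addOne2D and its parameters are illustrative, not from the slides):

__global__ void addOne2D(float *data, int width, int height)
{
    // Global 2D coordinates of this thread's element.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (col < width && row < height)          // guard for partial blocks at the edges
        data[row * width + col] += 1.0f;
}

// Example launch: 16x16 threads per block, enough blocks to cover the image.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// addOne2D<<<grid, block>>>(d_data, width, height);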
CUDA Memory Model Overview
• Global memory
– Main means of communicating R/W data between host and device
– Contents visible to all threads
– Long latency access
• We will focus on global memory for now
– Constant and texture memory will come later

Figure: a grid of thread blocks; each block has its own shared memory, each thread its own registers, and all threads share global memory, which the host also accesses.

CUDA API Highlights: Easy and Lightweight
• The API is an extension to the ANSI C programming language
– Low learning curve
• The hardware is designed to enable a lightweight runtime and driver
– High performance

CUDA Device Memory Allocation
• cudaMalloc()
– Allocates an object in the device global memory
– Requires two parameters
• Address of a pointer to the allocated object
• Size of the allocated object
• cudaFree()
– Frees an object from device global memory
• Pointer to freed object
CUDA Device Memory Allocation (cont.)
• Code example:
– Allocate a 64 * 64 single precision float array
– Attach the allocated storage to Md
– “d” is often used to indicate a device data structure

int TILE_WIDTH = 64;
float *Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

cudaMalloc((void**)&Md, size);
cudaFree(Md);
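A hedged variant with error checking (the check is not on the slide; cudaMalloc returns a cudaError_t, and cudaGetErrorString turns it into a readable message):

cudaError_t err = cudaMalloc((void**)&Md, size);
if (err != cudaSuccess) {
    // Report the failure; exact handling is up to the application.
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
}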
CUDA Host-Device Data Transfer
• cudaMemcpy()
– Memory data transfer
– Requires four parameters
• Pointer to destination
• Pointer to source
• Number of bytes copied
• Type of transfer
– Host to Host
– Host to Device
– Device to Host
– Device to Device
• Asynchronous transfer is also available (see the sketch below)
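A hedged sketch of an asynchronous copy (the stream variable is illustrative; for the copy to actually overlap with host work, the host buffer M would need to be page-locked, e.g. allocated with cudaMallocHost):

cudaStream_t stream;
cudaStreamCreate(&stream);

// Queue the copy; the call returns before the transfer completes.
cudaMemcpyAsync(Md, M, size, cudaMemcpyHostToDevice, stream);

// ... independent host work can proceed here ...

cudaStreamSynchronize(stream);   // block until the queued copy has finished
cudaStreamDestroy(stream);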
CUDA Host-Device Data Transfer
(cont.)
• Code example:
– Transfer a 64 * 64 single precision float array
– M is in host memory and Md is in device memory
– cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

CUDA Keywords

CUDA Function Declarations
                                 Executed on the:   Only callable from the:
__device__ float DeviceFunc()    device             device
__global__ void  KernelFunc()    device             host
__host__   float HostFunc()      host               host

• __global__ defines a kernel function
– Must return void
• __device__ and __host__ can be used together
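A small sketch of the three qualifiers together (the function names are illustrative):

__device__ __host__ float square(float x)    // compiled for both device and host
{
    return x * x;
}

__device__ float deviceHelper(float x)       // callable only from device code
{
    return square(x) + 1.0f;
}

__global__ void squareKernel(float *data)    // kernel: launched from the host, returns void
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = deviceHelper(data[i]);
}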
CUDA Function Declarations (cont.)

• __device__ functions cannot have their address taken
• For functions executed on the device:
– No recursion
– No static variable declarations inside the function
– No variable number of arguments

Calling a Kernel Function – Thread Creation
• A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);        // 5000 thread blocks
dim3 DimBlock(4, 8, 8);       // 256 threads per block
size_t SharedMemBytes = 64;   // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking (see the sketch below)
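A hedged sketch of explicit blocking after a launch (cudaDeviceSynchronize waits for all previously launched device work; in the CUDA versions of this era the equivalent call was cudaThreadSynchronize):

KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

// The launch returns immediately; wait here until the kernel has finished.
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));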

A Simple Running Example
Matrix Multiplication
• A simple matrix multiplication example that
illustrates the basic features of memory and
thread management in CUDA programs
– Leave shared memory usage until later
– Local, register usage
– Thread ID usage
– Memory data transfer API between host and device
– Assume square matrix for simplicity

Programming Model:
Square Matrix Multiplication Example
• P = M * N, each of size WIDTH x WIDTH
• Without tiling:
– One thread calculates one element of P
– M and N are loaded WIDTH times from global memory

Figure: matrices M, N, and P, each WIDTH x WIDTH.
Memory Layout of a Matrix in C
M0,0 M1,0 M2,0 M3,0

M0,1 M1,1 M2,1 M3,1

M0,2 M1,2 M2,2 M3,2

M0,3 M1,3 M2,3 M3,3

M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3

Step 1: Matrix Multiplication
A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
Step 2: Input Matrix Data Transfer
(Host-side Code)

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);
Step 3: Output Matrix Data Transfer
(Host-side Code)

    // 2. Kernel invocation code – to be shown later

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
Step 4: Kernel Function

// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
Step 4: Kernel Function (cont.)

    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }

    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}

Figure: thread (tx, ty) reads row ty of Md and column tx of Nd to compute one element of Pd.
Step 5: Kernel Invocation
(Host-side Code)

    // Setup the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
Only One Thread Block Used
• One block of threads computes matrix Pd
– Each thread computes one element of Pd
• Each thread
– Loads a row of matrix Md
– Loads a column of matrix Nd
– Performs one multiply and one addition for each pair of Md and Nd elements
– Compute-to-off-chip-memory-access ratio close to 1:1 (not very high)
• Size of matrix limited by the number of threads allowed in a thread block

Figure: Grid 1 contains a single block; thread (2, 2) reads a row of Md and a column of Nd to produce one element (48) of Pd.
Step 7: Handling Arbitrary Sized Square Matrices (will cover later)
• Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
– Each block has (TILE_WIDTH)^2 threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks (a hedged kernel sketch follows)

You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the max grid size (64K)!
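A hedged sketch of the multi-block version (not the exact code covered later; it simply combines blockIdx, blockDim, and threadIdx so every thread finds its own row and column of Pd; TILE_WIDTH is assumed to divide Width):

__global__ void MatrixMulMultiBlock(float* Md, float* Nd, float* Pd, int Width)
{
    // Global row and column of the Pd element owned by this thread.
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];

    Pd[Row * Width + Col] = Pvalue;
}

// Host side: TILE_WIDTH x TILE_WIDTH threads per block, (Width/TILE_WIDTH)^2 blocks.
// dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
// dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
// MatrixMulMultiBlock<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);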
Some Useful Information on Tools

Compiling a CUDA Program
• Parallel Thread eXecution (PTX)
– Virtual machine and ISA
– Programming model
– Execution resources and state

Compilation flow: a C/C++ CUDA application is compiled by NVCC into CPU code and PTX code; a PTX-to-target compiler then generates target code for a physical GPU such as G80.

Example translation, from CUDA C:
float4 me = gx[gtid];
me.x += me.y * me.z;
to PTX:
ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
mad.f32 $f1, $f5, $f3, $f1;
Compilation
• Any source file containing CUDA language
extensions must be compiled with NVCC
• NVCC is a compiler driver
– Works by invoking all the necessary tools and
compilers like cudacc, g++, cl, ...
• NVCC outputs:
– C code (host CPU Code)
• Must then be compiled with the rest of the application using another tool

– PTX
• Object code directly
• Or, PTX source, interpreted at runtime

Linking
• Any executable with CUDA code requires two
dynamic libraries:
– The CUDA runtime library (cudart)
– The CUDA core library (cuda)

Debugging Using the
Device Emulation Mode
• An executable compiled in device emulation
mode (nvcc -deviceemu) runs
completely on the host using the CUDA
runtime
– No need of any device and CUDA driver
– Each device thread is emulated with a host thread

• Running in device emulation mode, one can:


– Use host native debug support (breakpoints, inspection, etc.)
– Access any device-specific data from host code and vice-versa
– Call any host function from device code (e.g. printf) and vice-
versa
– Detect deadlock situations caused by improper usage of __syncthreads
Device Emulation Mode Pitfalls
• Emulated device threads execute sequentially,
so simultaneous accesses of the same memory
location by multiple threads could produce
different results.
• Dereferencing device pointers on the host or
host pointers on the device can produce correct
results in device emulation mode, but will
generate an error in device execution mode

Floating Point
• Results of floating-point computations will
slightly differ because of:
– Different compiler outputs, instruction sets
– Use of extended precision for intermediate results
• There are various options to force strict single precision
on the host

COMPUTATIONAL THINKING

Objective
• To provide you with a framework based on
the techniques and best practices used by
experienced parallel programmers for
– Thinking about the problem of parallel
programming
– Discussing your work with others
– Addressing performance and functionality issues
in your parallel program
– Using or building useful tools and environments
– understanding case studies and projects
Fundamentals of Parallel
Computing
• Parallel computing requires that
– The problem can be decomposed into sub-
problems that can be safely solved at the same
time
– The programmer structures the code and data to
solve these sub-problems concurrently
• The goals of parallel computing are
– To solve problems in less time, and/or
– To solve bigger problems, and/or
– To achieve better solutions

The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency.
A Recommended Reading
Mattson, Sanders, Massingill, Patterns for
Parallel Programming, Addison Wesley,
2005, ISBN 0-321-22811-1.

– We draw quite a bit from the book


– A good overview of challenges, best practices,
and common techniques in all aspects of parallel
programming

Key Parallel Programming Steps
1) To find the concurrency in the problem
2) To structure the algorithm so that
concurrency can be exploited
3) To implement the algorithm in a suitable
programming environment
4) To execute and tune the performance of the
code on a parallel system
Unfortunately, these have not been separated into levels of
abstractions that can be dealt with independently.
Challenges of Parallel
Programming
• Finding and exploiting concurrency often requires looking
at the problem from a non-obvious angle
– Computational thinking (J. Wing)
• Dependences need to be identified and managed
– The order of task execution may change the answers
• Obvious: One step feeds result to the next steps
• Subtle: numeric accuracy may be affected by ordering steps that are
logically parallel with each other
• Performance can be drastically reduced by many factors
– Overhead of parallel processing
– Load imbalance among processor elements
– Inefficient data sharing patterns
– Saturation of critical resources such as memory bandwidth

Shared Memory vs. Message
Passing
• We will focus on shared memory parallel
programming
– This is what CUDA is based on
– Future massively parallel microprocessors are expected
to support shared memory at the chip level
• The programming considerations of the message passing model are quite different!
– Look at MPI (Message Passing Interface) and its
relatives such as Charm++

Finding Concurrency in Problems
• Identify a decomposition of the problem into sub-
problems that can be solved simultaneously
– A task decomposition that identifies tasks for potential
concurrent execution
– A data decomposition that identifies data local to each task
– A way of grouping tasks and ordering the groups to satisfy
temporal constraints
– An analysis on the data sharing patterns among the
concurrent tasks
– A design evaluation that assesses the quality of the choices made in all the steps
Finding Concurrency – The Process

Diagram: Decomposition (task decomposition, data decomposition) feeds Dependence Analysis (group tasks, order tasks, data sharing), which feeds Design Evaluation.

This is typically an iterative process. Opportunities exist for dependence analysis to play an earlier role in decomposition.
Task Decomposition
• Many large problems can be naturally
decomposed into tasks – CUDA kernels are
largely tasks
– The number of tasks used should be adjustable to
the execution resources available.
– Each task must include sufficient work in order to
compensate for the overhead of managing their
parallel execution.
– Tasks should maximize reuse of sequential program code to minimize effort.

"In an ideal world, the compiler would find tasks for the programmer. Unfortunately, this almost never happens."
– Mattson, Sanders, Massingill
Task Decomposition Example –
Square Matrix Multiplication
• P = M * N of WIDTH x WIDTH
– One natural task (sub-problem) produces one element of P
– All tasks can execute in parallel in this example.

Figure: matrices M, N, and P, each WIDTH x WIDTH.
Task Decomposition Example –
Molecular Dynamics
• Simulation of motions of a large molecular system
• For each atom, there are natural tasks to calculate
– Vibrational forces
– Rotational forces
– Neighbors that must be considered in non-bonded forces
– Non-bonded forces
– Update position and velocity
– Misc physical properties based on motions
• Some of these can go in parallel for an atom
It is common that there are multiple ways to decompose any
given problem.
NAMD
Diagram: in SPEC_NAMD, a PatchList data structure drives the force & energy calculation inner loops; SelfComputes objects run 144 iterations per patch and PairComputes objects run 1,872 iterations per patch pair, across 6 different NAMD configurations, with all iterations independent.
Data Decomposition
• The most compute-intensive parts of many large problems manipulate a large data structure
– Similar operations are being applied to different parts of
the data structure, in a mostly independent manner.
– This is what CUDA is optimized for.
• The data decomposition should lead to
– Efficient data usage by tasks within the partition
– Few dependencies across the tasks that work on different
partitions
– Adjustable partitions that can be varied according to the
hardware characteristics

Data Decomposition Example –
Square Matrix Multiplication
• Row blocks
– Computing each partition requires access to the entire N array
• Square sub-blocks
– Only bands of M and N are needed

Figure: matrices M, N, and P, each WIDTH x WIDTH.

Task Grouping
• Sometimes natural tasks of a problem can be
grouped together to improve efficiency
– Reduced synchronization overhead – all tasks in the
group can use a barrier to wait for a common
dependence
– All tasks in the group efficiently share data loaded into a common on-chip, shared storage (Shared Memory)
– Grouping and merging dependent tasks into one task
reduces need for synchronization
– CUDA thread blocks are task grouping examples.

Task Grouping Example –
Square Matrix Multiplication
• Tasks calculating a P sub-block
– Extensive input data sharing; reduced memory bandwidth using Shared Memory
– All synched in execution

Figure: matrices M, N, and P, each WIDTH x WIDTH; a task group computes one sub-block of P.

Task Ordering
• Identify the data and resources required by a group of tasks before they can execute
– Find the task group that creates them
– Determine a temporal order that satisfies all data constraints

Task Ordering Example:
Molecular Dynamics

Flow: neighbor list → vibrational and rotational forces, non-bonded forces → update atomic positions and velocities → next time step.

Data Sharing
• Data sharing can be a double-edged sword
– Excessive data sharing can drastically reduce advantage of parallel
execution
– Localized sharing can improve memory bandwidth efficiency
• Efficient memory bandwidth usage can be achieved by
synchronizing the execution of task groups and coordinating
their usage of memory data
– Efficient use of on-chip, shared storage
• Read-only sharing can usually be done at much higher
efficiency than read-write sharing, which often requires
synchronization

Data Sharing Example –
Matrix Multiplication
• Each task group will finish usage of each sub-block
of N and M before moving on
– N and M sub-blocks loaded into Shared Memory for use
by all threads of a P sub-block
– Amount of on-chip Shared Memory strictly limits the
number of threads working on a P sub-block
• Read-only shared data can be more efficiently accessed as Constant or Texture data, as sketched below
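A minimal sketch of the constant-memory option (the coefficient table and its size are illustrative; cudaMemcpyToSymbol copies host data into device constant memory, which is cached and read-only for kernels):

#define NUM_COEFFS 16

__constant__ float d_coeffs[NUM_COEFFS];      // read-only for kernels, cached on chip

__global__ void applyCoeffs(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= d_coeffs[i % NUM_COEFFS];  // every thread reads the same small table
}

// Host side: fill h_coeffs, then copy it into constant memory before launching.
// cudaMemcpyToSymbol(d_coeffs, h_coeffs, NUM_COEFFS * sizeof(float));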

Data Sharing Example –
Molecular Dynamics
• The atomic coordinates
– Read-only access by the neighbor list, bonded force, and non-
bonded force task groups
– Read-write access for the position update task group
• The force array
– Read-only access by position update group
– Accumulate access by bonded and non-bonded task groups
• The neighbor list
– Read-only access by non-bonded force task groups
– Generated by the neighbor list task group

Key Parallel Programming Steps
1) To find the concurrency in the problem
2) To structure the algorithm to translate
concurrency into performance
3) To implement the algorithm in a suitable
programming environment
4) To execute and tune the performance of the code
on a parallel system

Unfortunately, these have not been separated into levels of abstractions that can be dealt with independently.
Algorithm
• A step by step procedure that is guaranteed to terminate, such
that each step is precisely stated and can be carried out by a
computer
– Definiteness – the notion that each step is precisely stated
– Effective computability – each step can be carried out by a computer
– Finiteness – the procedure terminates

• Multiple algorithms can be used to solve the same problem


– Some require fewer steps
– Some exhibit more parallelism
– Some have larger memory footprint than others

Choosing Algorithm Structure

Decision tree: Start → organize by task (linear → task parallelism; recursive → divide and conquer), organize by data (linear → geometric decomposition; recursive → recursive data), or organize by data flow (regular → pipeline; irregular → event-driven).
Mapping a Divide and Conquer Algorithm

Figure: a tree-style parallel sum over array elements 0..11, performed by threads 0, 2, 4, 6, 8, 10. Iteration 1 adds neighboring pairs (0+1, 2+3, ..., 10+11); iteration 2 forms partial sums 0..3, 4..7, 8..11; iteration 3 combines the halves (0..7, 8..15). Half of the active threads drop out in each iteration.
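A hedged sketch of this tree mapped onto one thread block (the kernel blockSum and the 512-element block size are illustrative; each iteration halves the number of active threads, mirroring the figure):

#define N 512                                   // elements per block (power of two)

__global__ void blockSum(float *data, float *result)
{
    __shared__ float partial[N];

    int tid = threadIdx.x;
    partial[tid] = data[blockIdx.x * N + tid];  // each thread loads one element
    __syncthreads();

    // Tree reduction: the stride doubles and the active threads halve each iteration.
    for (int stride = 1; stride < N; stride *= 2) {
        if (tid % (2 * stride) == 0)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        result[blockIdx.x] = partial[0];        // one partial sum per block
}

// Launch with blockDim.x == N, e.g. blockSum<<<numBlocks, N>>>(d_data, d_partialSums);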
Tiled (Stenciled) Algorithms are Important for Geometric Decomposition
• A framework for memory data sharing and reuse by increasing data access locality
– Tiled access patterns allow small cache/scratchpad memories to hold on to data for re-use
– For matrix multiplication, a 16x16 thread block performs 2*256 = 512 float loads from device memory for 256 * (2*16) = 8,192 mul/add operations
• A convenient framework for organizing threads (tasks); a hedged tiled-kernel sketch follows

Figure: M, N, and P are partitioned into BLOCK_WIDTH x BLOCK_WIDTH tiles; the thread block at (bx, by), with thread indices (tx, ty), computes one Psub tile of P.
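A hedged sketch of such a tiled kernel (a standard shared-memory formulation, not necessarily the exact course code; TILE_WIDTH is assumed to divide Width):

#define TILE_WIDTH 16

__global__ void MatrixMulTiled(float *Md, float *Nd, float *Pd, int Width)
{
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float Pvalue = 0;

    // March over the tiles of Md and Nd needed to compute this block's Pd tile.
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        Ms[threadIdx.y][threadIdx.x] = Md[Row * Width + m * TILE_WIDTH + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = Nd[(m * TILE_WIDTH + threadIdx.y) * Width + Col];
        __syncthreads();                       // tile fully loaded into shared memory

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();                       // done with this tile before overwriting
    }

    Pd[Row * Width + Col] = Pvalue;
}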
Increased Work per Thread for Even More Locality
• Each thread computes two elements of Pdsub
• Reduced loads from global memory (Md) to shared memory
• Reduced instruction overhead
– More work done in each iteration

Figure: each thread block covers two Pdsub tiles of Pd, reusing the same Md tile loaded into shared memory; Md, Nd, and Pd are partitioned into TILE_WIDTH tiles.
Double Buffering
– a frequently used algorithm pattern
• One could double buffer the computation, getting a better instruction mix within each thread
– This is classic software pipelining in ILP compilers

Original loop:
Loop {
  Load current tile to shared memory
  syncthreads()
  Compute current tile
  syncthreads()
}

Double-buffered loop:
Load next tile from global memory
Loop {
  Deposit current tile to shared memory
  syncthreads()
  Load next tile from global memory
  Compute current tile
  syncthreads()
}
Double Buffering
• Deposit blue tile from register into shared memory
• Syncthreads
• Load orange tile into register
• Compute blue tile
• Deposit orange tile into shared memory
• ...

Figure: Md, Nd, and Pd partitioned into TILE_WIDTH tiles; the blue and orange tiles alternate between registers and shared memory while the Pdsub tile is computed.
Figure 10.2: Cutoff summation algorithm.
(a) Direct summation: at each grid point, sum the electrostatic potential from all charges.
(b) Cutoff summation: electrostatic potential from nearby charges is summed; spatially sort charges first.
(c) Cutoff summation using the direct summation kernel: spatially sort charges into bins; adapt direct summation to process a bin.
Cut-Off Summation Restores Data Scalability

Chart: scalability and performance of different algorithms for calculating an electrostatic potential map; all cutoff implementations show the same scalability.
