Programming Massively Parallel Processors

These lecture notes introduce programming massively parallel processors. They begin by noting the large performance advantage of GPUs over CPUs in calculation throughput and memory bandwidth, and that GPUs are now programmable through languages such as CUDA rather than only through graphics APIs. They then explain that GPUs are designed for massively parallel throughput rather than for sequential performance, as CPUs are. The remaining material covers the CUDA programming model and API, application benchmarks showing large speedups over CPUs, a brief history of GPU development, and an introduction to computational thinking for parallel programming.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010
ECE 408, University of Illinois, Urbana-Champaign
Why Massively Parallel Processors?
• A quiet revolution and potential build-up
– Calculation: 367 GFLOPS (GPU) vs. 32 GFLOPS (CPU)
– Memory bandwidth: 86.4 GB/s (GPU) vs. 8.4 GB/s (CPU)
– Until last year, programmed only through graphics APIs
– GPU in every PC and workstation – massive volume and potential impact
CPUs and GPUs have fundamentally different design philosophies

Schematic: the CPU devotes much of its die area to control logic and a large cache in front of a few ALUs; the GPU devotes most of its area to a large array of ALUs. Each is backed by its own DRAM.
Architecture of a CUDA-capable GPU
Block diagram: the host feeds an input assembler and a thread execution manager, which dispatch threads to an array of processors; each processor group has its own parallel data cache and texture unit, and load/store units connect the array to global memory.
GT200 Characteristics
• 1 TFLOPS peak performance (25-50 times that of current high-end microprocessors)
• 265 GFLOPS sustained for apps such as VMD
• Massively parallel, 128 cores, 90W
• Massively threaded, sustains 1000s of threads per app
• 30-100 times speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics

"I think they're right on the money, but the huge performance differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s) will invite close scrutiny so I have to be careful what I say publically until I triple check those numbers."
– John Stone, VMD group, Physics, UIUC
Future Apps Reflect a Concurrent World
• Exciting applications in the future mass computing market have traditionally been considered "supercomputing applications"
– Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
– These "super-apps" represent and model the physical, concurrent world
• Various granularities of parallelism exist, but...
– the programming model must not hinder parallel implementation
– data delivery needs careful management

Stretching Traditional Architectures
• Traditional parallel architectures cover some super-applications
– DSP, GPU, network apps, scientific computing
• The game is to grow mainstream architectures "out" or domain-specific architectures "in"
– CUDA is the latter

Diagram: traditional applications sit inside current architecture coverage; new applications lie between current coverage and domain-specific architecture coverage, separated by obstacles.
Samples of Previous Projects
Application | Description                                                                    | Source (lines) | Kernel (lines) | % time
H.264       | SPEC '06 version, change in guess vector                                       | 34,811         | 194            | 35%
LBM         | SPEC '06 version, change to single precision and print fewer reports           | 1,481          | 285            | >99%
RC5-72      | Distributed.net RC5-72 challenge client code                                   | 1,979          | 218            | >99%
FEM         | Finite element modeling, simulation of 3D graded materials                     | 1,874          | 146            | 99%
RPES        | Rye Polynomial Equation Solver, quantum chem, 2-electron repulsion             | 1,104          | 281            | 99%
PNS         | Petri Net simulation of a distributed system                                   | 322            | 160            | >99%
SAXPY       | Single-precision implementation of saxpy, used in Linpack's Gaussian elim. routine | 952        | 31             | >99%
TPACF       | Two Point Angular Correlation Function                                         | 536            | 98             | 96%
FDTD        | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation  | 1,365          | 93             | 16%
MRI-Q       | Computing a matrix Q, a scanner's configuration, in MRI reconstruction         | 490            | 33             | >99%
Speedup of Applications
Bar chart: GPU speedup relative to CPU, for the kernel alone and for the whole application, across H.264, LBM, RC5-72, FEM, RPES, PNS, SAXPY, TPACF, FDTD, MRI-Q, and MRI-FHD; bars above the 60× axis limit are labeled individually (79× to 457×).

• GeForce 8800 GTX vs. 2.2GHz Opteron 248
• 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
• 25× to 400× speedup if the function's data requirements and control flow suit the GPU and the application is optimized

GPU HISTORY

A Fixed-Function GPU Pipeline
Block diagram: the host CPU talks to the GPU through the host interface; the pipeline proceeds through vertex control (with a vertex cache), VS/T&L, triangle setup, raster, shader (with a texture cache), ROP, and the frame buffer interface (FBI) to frame buffer memory.
Texture Mapping Example

Texture mapping example: painting a world map texture image onto a globe object.

Anti-Aliasing Example

Figure: a triangle's geometry rendered aliased and anti-aliased.

Programmable Vertex and Pixel Processors

Block diagram: a 3D application or game issues 3D API commands through OpenGL or Direct3D on the CPU; across the CPU-GPU boundary, the GPU front end assembles primitives from the vertex index stream, a programmable vertex processor transforms pre-transformed vertices, rasterization and interpolation turn the assembled polygons, lines, and points into fragments, a programmable fragment processor transforms the fragments, and raster operations apply pixel updates to the framebuffer.

An example of separate vertex processor and fragment processor in a programmable graphics pipeline.
Unified Graphics Pipeline
Block diagram: the host feeds a data assembler and setup/raster/ZCull; vertex, geometry, and pixel thread issue units dispatch work through the thread processor to a unified array of streaming processors (SP) with texture fetch units (TF), L1 and L2 caches, and framebuffer (FB) partitions.

Diagram: a fragment program reads per-thread input registers, per-shader and per-context constants, and texture; it works in temporary registers and can write only to output registers, which go to framebuffer memory.
The restricted input and output capabilities of a shader programming model.
CUDA PROGRAMMING MODEL
What is (Historical) GPGPU ?
• General Purpose computation using GPU and graphics API
in applications other than 3D graphics
– GPU accelerates critical path of application

• Data parallel algorithms leverage GPU attributes


– Large data arrays, streaming throughput
– Fine-grain SIMD parallelism
– Low-latency floating point (FP) computation
• Applications – see //GPGPU.org
– Game effects (FX) physics, image processing
– Physical modeling, computational engineering, matrix algebra,
convolution, correlation, sorting
Previous GPGPU Constraints
• Dealing with graphics API
– Working with the corner cases of the graphics API
• Addressing modes
– Limited texture size/dimension
• Shader capabilities
– Limited outputs
• Instruction sets
– Lack of integer & bit ops
• Communication limited
– Between pixels
– Scatter a[i] = p
CUDA
• “Compute Unified Device Architecture”
• General purpose programming model
– User kicks off batches of threads on the GPU
– GPU = dedicated super-threaded, massively data parallel co-processor
• Targeted software stack
– Compute oriented drivers, language, and tools
• Driver for loading computation programs into GPU
– Standalone Driver - Optimized for computation
– Interface designed for compute – graphics-free API
– Data sharing with OpenGL buffer objects
– Guaranteed maximum download & readback speeds
– Explicit GPU memory management
An Example of Physical Reality Behind CUDA
Photo: a CPU (the host) paired with a GPU that has its own local DRAM (the device).

Parallel Computing on a GPU
• 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
– Available in laptops, desktops, and clusters (GeForce 8800, Tesla D870, Tesla S870)
• GPU parallelism is doubling every year
• Programming model scales transparently
• Programmable in C with CUDA tools
• Multithreaded SPMD model uses application data parallelism and thread parallelism
Overview
• CUDA programming model – basic concepts and data types

• CUDA application programming interface – basic

• Simple examples to illustrate basic concepts and functionalities
CUDA – C with no shader limitations!
• Integrated host+device app C program
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code

Serial Code (host)
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args); ...
Serial Code (host)
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args); ...
CUDA Devices and Threads
• A compute device
– Is a coprocessor to the CPU or host
– Has its own DRAM (device memory)
– Runs many threads in parallel
– Is typically a GPU but can also be another type of parallel processing
device
• Data-parallel portions of an application are expressed as device
kernels which run on many threads
• Differences between GPU and CPU threads
– GPU threads are extremely lightweight
• Very little creation overhead
– GPU needs 1000s of threads for full efficiency
• Multi-core CPU needs only a few
G80 – Graphics Mode
• The future of GPUs is programmable processing
• So – build the architecture around the processor
Block diagram (unified pipeline as above): host, input assembler, setup/raster/ZCull, and vertex/geometry/pixel thread issue feed the thread processor and a unified array of streaming processors (SP) with texture fetch units, L1/L2 caches, and framebuffer partitions.
G80 CUDA mode – A Device Example
• Processors execute computing threads
• New operating mode/HW interface for computing

Block diagram: the host feeds an input assembler and thread execution manager; processor groups each have a parallel data cache and texture unit, and load/store units access global memory.
Extended C
• Declspecs
– global, device, shared, local, constant
• Keywords
– threadIdx, blockIdx
• Intrinsics
– __syncthreads
• Runtime API
– Memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve (float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);

Extended C
Compilation flow: integrated source (foo.cu) goes through cudacc (EDG C/C++ frontend, Open64 global optimizer), which splits it into GPU assembly (foo.s) and CPU host code (foo.cpp); OCG turns the GPU assembly into G80 SASS (foo.sass), while gcc / cl compiles the host code.

Mark Murphy, "NVIDIA's Experience with Open64," www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc
Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
– All threads run the same code (SPMD)
– Each thread has an ID that it uses to compute memory addresses and make control decisions

threadID: 0 1 2 3 4 5 6 7

float x = input[threadID];
float y = func(x);
output[threadID] = y;
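As a minimal sketch of how this fragment becomes a full kernel (the name scale_kernel and the doubling stand-in for func(x) are illustrative, not from the slides), the thread ID comes from the built-in threadIdx variable:

__global__ void scale_kernel(const float *input, float *output)
{
    // One thread per element; threadIdx.x plays the role of threadID above.
    int threadID = threadIdx.x;
    float x = input[threadID];
    float y = 2.0f * x;              // stand-in for func(x)
    output[threadID] = y;
}

// Example launch: one block of 8 threads, matching the 8-thread array shown.
// scale_kernel<<<1, 8>>>(d_input, d_output);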

Thread Blocks: Scalable Cooperation
• Divide the monolithic thread array into multiple blocks
– Threads within a block cooperate via shared memory, atomic operations and barrier synchronization (see the sketch below)
– Threads in different blocks cannot cooperate

Thread Block 0, Thread Block 1, ..., Thread Block N-1: each block runs the same code over its own threadIDs 0..7:

float x = input[threadID];
float y = func(x);
output[threadID] = y;
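A small sketch of intra-block cooperation under stated assumptions (the kernel reverse_in_block and the 64-thread block size are illustrative): each block stages its elements in shared memory, waits at a barrier, and writes them back reversed.

#define BLOCK_SIZE 64

__global__ void reverse_in_block(float *data)
{
    __shared__ float buf[BLOCK_SIZE];        // visible only to threads of this block

    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x;      // first element owned by this block

    buf[tid] = data[base + tid];             // stage into shared memory
    __syncthreads();                         // barrier: the whole block has loaded

    data[base + tid] = buf[BLOCK_SIZE - 1 - tid];   // write back reversed
}

// Launch with blockDim.x == BLOCK_SIZE, e.g. reverse_in_block<<<numBlocks, BLOCK_SIZE>>>(d_data);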

Block IDs and Thread IDs
• Each thread uses IDs to decide what data to work on (a sketch follows)
– Block ID: 1D or 2D
– Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
– Image processing
– Solving PDEs on volumes
– ...

Figure: the host launches Kernel 1 on Grid 1 (a 2x2 array of blocks) and Kernel 2 on Grid 2; Block (1, 1) is expanded to show its threads indexed (x, y, z). (Courtesy: NVIDIA)
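A minimal sketch of turning block and thread IDs into a global index for 2D data (the kernel addOne2D and its parameters are illustrative, not from the slides):

__global__ void addOne2D(float *data, int width, int height)
{
    // Global 2D coordinates of this thread's element.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (col < width && row < height)          // guard for partial blocks at the edges
        data[row * width + col] += 1.0f;
}

// Example launch: 16x16 threads per block, enough blocks to cover the image.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// addOne2D<<<grid, block>>>(d_data, width, height);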
CUDA Memory Model Overview
• Global memory
– Main means of communicating R/W data between host and device
– Contents visible to all threads
– Long latency access
• We will focus on global memory for now
– Constant and texture memory will come later

Figure: a grid of thread blocks; each block has its own shared memory, each thread its own registers, and all threads share global memory, which the host also accesses.

CUDA API Highlights: Easy and Lightweight
• The API is an extension to the ANSI C programming language
– Low learning curve
• The hardware is designed to enable a lightweight runtime and driver
– High performance

CUDA Device Memory Allocation
• cudaMalloc()
– Allocates an object in the device global memory
– Requires two parameters
• Address of a pointer to the allocated object
• Size of the allocated object
• cudaFree()
– Frees an object from device global memory
• Pointer to freed object
CUDA Device Memory Allocation (cont.)
• Code example:
– Allocate a 64 * 64 single precision float array
– Attach the allocated storage to Md
– “d” is often used to indicate a device data structure

int TILE_WIDTH = 64;
float *Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

cudaMalloc((void**)&Md, size);
cudaFree(Md);
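A hedged variant with error checking (the check is not on the slide; cudaMalloc returns a cudaError_t, and cudaGetErrorString turns it into a readable message):

cudaError_t err = cudaMalloc((void**)&Md, size);
if (err != cudaSuccess) {
    // Report the failure; exact handling is up to the application.
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
}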
CUDA Host-Device Data Transfer
• cudaMemcpy()
– Memory data transfer
– Requires four parameters
• Pointer to destination
• Pointer to source
• Number of bytes copied
• Type of transfer
– Host to Host
– Host to Device
– Device to Host
– Device to Device
• Asynchronous transfer is also available (see the sketch below)
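A hedged sketch of an asynchronous copy (the stream variable is illustrative; for the copy to actually overlap with host work, the host buffer M would need to be page-locked, e.g. allocated with cudaMallocHost):

cudaStream_t stream;
cudaStreamCreate(&stream);

// Queue the copy; the call returns before the transfer completes.
cudaMemcpyAsync(Md, M, size, cudaMemcpyHostToDevice, stream);

// ... independent host work can proceed here ...

cudaStreamSynchronize(stream);   // block until the queued copy has finished
cudaStreamDestroy(stream);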
CUDA Host-Device Data Transfer
(cont.)
• Code example:
– Transfer a 64 * 64 single precision float array
– M is in host memory and Md is in device memory
– cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

CUDA Keywords

CUDA Function Declarations
                                 Executed on the:   Only callable from the:
__device__ float DeviceFunc()    device             device
__global__ void  KernelFunc()    device             host
__host__   float HostFunc()      host               host

• __global__ defines a kernel function
– Must return void
• __device__ and __host__ can be used together
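A small sketch of the three qualifiers together (the function names are illustrative):

__device__ __host__ float square(float x)    // compiled for both device and host
{
    return x * x;
}

__device__ float deviceHelper(float x)       // callable only from device code
{
    return square(x) + 1.0f;
}

__global__ void squareKernel(float *data)    // kernel: launched from the host, returns void
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = deviceHelper(data[i]);
}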
CUDA Function Declarations (cont.)

• __device__ functions cannot have their address taken
• For functions executed on the device:
– No recursion
– No static variable declarations inside the function
– No variable number of arguments

Calling a Kernel Function – Thread Creation
• A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);        // 5000 thread blocks
dim3 DimBlock(4, 8, 8);       // 256 threads per block
size_t SharedMemBytes = 64;   // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking (see the sketch below)
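A hedged sketch of explicit blocking after a launch (cudaDeviceSynchronize waits for all previously launched device work; in the CUDA versions of this era the equivalent call was cudaThreadSynchronize):

KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

// The launch returns immediately; wait here until the kernel has finished.
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));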

A Simple Running Example
Matrix Multiplication
• A simple matrix multiplication example that
illustrates the basic features of memory and
thread management in CUDA programs
– Leave shared memory usage until later
– Local, register usage
– Thread ID usage
– Memory data transfer API between host and device
– Assume square matrix for simplicity

Programming Model:
Square Matrix Multiplication Example
• P = M * N, each of size WIDTH x WIDTH
• Without tiling:
– One thread calculates one element of P
– M and N are loaded WIDTH times from global memory

Figure: matrices M, N, and P, each WIDTH x WIDTH.
Memory Layout of a Matrix in C
M0,0 M1,0 M2,0 M3,0

M0,1 M1,1 M2,1 M3,1

M0,2 M1,2 M2,2 M3,2

M0,3 M1,3 M2,3 M3,3

M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3

Step 1: Matrix Multiplication
A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
Step 2: Input Matrix Data Transfer
(Host-side Code)

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);
Step 3: Output Matrix Data Transfer
(Host-side Code)

    // 2. Kernel invocation code – to be shown later

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
Step 4: Kernel Function

// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
Step 4: Kernel Function (cont.)

    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }

    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}

Figure: thread (tx, ty) reads row ty of Md and column tx of Nd to compute one element of Pd.
Step 5: Kernel Invocation
(Host-side Code)

    // Setup the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
Only One Thread Block Used
• One block of threads computes matrix Pd
– Each thread computes one element of Pd
• Each thread
– Loads a row of matrix Md
– Loads a column of matrix Nd
– Performs one multiply and one addition for each pair of Md and Nd elements
– Compute-to-off-chip-memory-access ratio close to 1:1 (not very high)
• Size of matrix limited by the number of threads allowed in a thread block

Figure: Grid 1 contains a single block; thread (2, 2) reads a row of Md and a column of Nd to produce one element (48) of Pd.
Step 7: Handling Arbitrary Sized Square Matrices (will cover later)
• Have each 2D thread block compute a (TILE_WIDTH)^2 sub-matrix (tile) of the result matrix
– Each block has (TILE_WIDTH)^2 threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)^2 blocks (a hedged kernel sketch follows)

You still need to put a loop around the kernel call for cases where WIDTH/TILE_WIDTH is greater than the max grid size (64K)!
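A hedged sketch of the multi-block version (not the exact code covered later; it simply combines blockIdx, blockDim, and threadIdx so every thread finds its own row and column of Pd; TILE_WIDTH is assumed to divide Width):

__global__ void MatrixMulMultiBlock(float* Md, float* Nd, float* Pd, int Width)
{
    // Global row and column of the Pd element owned by this thread.
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    float Pvalue = 0;
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];

    Pd[Row * Width + Col] = Pvalue;
}

// Host side: TILE_WIDTH x TILE_WIDTH threads per block, (Width/TILE_WIDTH)^2 blocks.
// dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
// dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
// MatrixMulMultiBlock<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);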
Some Useful Information on Tools

Compiling a CUDA Program
• Parallel Thread eXecution (PTX)
– Virtual machine and ISA
– Programming model
– Execution resources and state

Compilation flow: a C/C++ CUDA application is compiled by NVCC into CPU code and PTX code; a PTX-to-target compiler then generates target code for a physical GPU such as G80.

Example translation, from CUDA C:
float4 me = gx[gtid];
me.x += me.y * me.z;
to PTX:
ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
mad.f32 $f1, $f5, $f3, $f1;
Compilation
• Any source file containing CUDA language
extensions must be compiled with NVCC
• NVCC is a compiler driver
– Works by invoking all the necessary tools and
compilers like cudacc, g++, cl, ...
• NVCC outputs:
– C code (host CPU Code)
• Must then be compiled with the rest of the application using another tool

– PTX
• Object code directly
• Or, PTX source, interpreted at runtime

Linking
• Any executable with CUDA code requires two
dynamic libraries:
– The CUDA runtime library (cudart)
– The CUDA core library (cuda)

Debugging Using the
Device Emulation Mode
• An executable compiled in device emulation
mode (nvcc -deviceemu) runs
completely on the host using the CUDA
runtime
– No need of any device and CUDA driver
– Each device thread is emulated with a host thread

• Running in device emulation mode, one can:


– Use host native debug support (breakpoints, inspection, etc.)
– Access any device-specific data from host code and vice-versa
– Call any host function from device code (e.g. printf) and vice-
versa
– Detect deadlock situations caused by improper usage of __syncthreads
Device Emulation Mode Pitfalls
• Emulated device threads execute sequentially,
so simultaneous accesses of the same memory
location by multiple threads could produce
different results.
• Dereferencing device pointers on the host or
host pointers on the device can produce correct
results in device emulation mode, but will
generate an error in device execution mode

Floating Point
• Results of floating-point computations will
slightly differ because of:
– Different compiler outputs, instruction sets
– Use of extended precision for intermediate results
• There are various options to force strict single precision
on the host

COMPUTATIONAL THINKING

Objective
• To provide you with a framework based on
the techniques and best practices used by
experienced parallel programmers for
– Thinking about the problem of parallel
programming
– Discussing your work with others
– Addressing performance and functionality issues
in your parallel program
– Using or building useful tools and environments
– understanding case studies and projects
Fundamentals of Parallel
Computing
• Parallel computing requires that
– The problem can be decomposed into sub-
problems that can be safely solved at the same
time
– The programmer structures the code and data to
solve these sub-problems concurrently
• The goals of parallel computing are
– To solve problems in less time, and/or
– To solve bigger problems, and/or
– To achieve better solutions

The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency.
A Recommended Reading
Mattson, Sanders, Massingill, Patterns for
Parallel Programming, Addison Wesley,
2005, ISBN 0-321-22811-1.

– We draw quite a bit from the book


– A good overview of challenges, best practices,
and common techniques in all aspects of parallel
programming

Key Parallel Programming Steps
1) To find the concurrency in the problem
2) To structure the algorithm so that
concurrency can be exploited
3) To implement the algorithm in a suitable
programming environment
4) To execute and tune the performance of the
code on a parallel system
Unfortunately, these have not been separated into levels of
abstractions that can be dealt with independently.
Challenges of Parallel
Programming
• Finding and exploiting concurrency often requires looking
at the problem from a non-obvious angle
– Computational thinking (J. Wing)
• Dependences need to be identified and managed
– The order of task execution may change the answers
• Obvious: One step feeds result to the next steps
• Subtle: numeric accuracy may be affected by ordering steps that are
logically parallel with each other
• Performance can be drastically reduced by many factors
– Overhead of parallel processing
– Load imbalance among processor elements
– Inefficient data sharing patterns
– Saturation of critical resources such as memory bandwidth

Shared Memory vs. Message
Passing
• We will focus on shared memory parallel
programming
– This is what CUDA is based on
– Future massively parallel microprocessors are expected
to support shared memory at the chip level
• The programming considerations of the message passing model are quite different!
– Look at MPI (Message Passing Interface) and its
relatives such as Charm++

Finding Concurrency in Problems
• Identify a decomposition of the problem into sub-
problems that can be solved simultaneously
– A task decomposition that identifies tasks for potential
concurrent execution
– A data decomposition that identifies data local to each task
– A way of grouping tasks and ordering the groups to satisfy
temporal constraints
– An analysis on the data sharing patterns among the
concurrent tasks
– A design evaluation that assesses the quality of the choices made in all the steps
Finding Concurrency – The Process

Diagram: Decomposition (task decomposition, data decomposition) feeds Dependence Analysis (group tasks, order tasks, data sharing), which feeds Design Evaluation.

This is typically an iterative process. Opportunities exist for dependence analysis to play an earlier role in decomposition.
Task Decomposition
• Many large problems can be naturally
decomposed into tasks – CUDA kernels are
largely tasks
– The number of tasks used should be adjustable to
the execution resources available.
– Each task must include sufficient work in order to
compensate for the overhead of managing their
parallel execution.
– Tasks should maximize reuse of sequential program code to minimize effort.

"In an ideal world, the compiler would find tasks for the programmer. Unfortunately, this almost never happens."
– Mattson, Sanders, Massingill
Task Decomposition Example –
Square Matrix Multiplication
• P = M * N of WIDTH x WIDTH
– One natural task (sub-problem) produces one element of P
– All tasks can execute in parallel in this example.

Figure: matrices M, N, and P, each WIDTH x WIDTH.
Task Decomposition Example –
Molecular Dynamics
• Simulation of motions of a large molecular system
• For each atom, there are natural tasks to calculate
– Vibrational forces
– Rotational forces
– Neighbors that must be considered in non-bonded forces
– Non-bonded forces
– Update position and velocity
– Misc physical properties based on motions
• Some of these can go in parallel for an atom
It is common that there are multiple ways to decompose any
given problem.
NAMD
Diagram: in SPEC_NAMD, a PatchList data structure drives the force & energy calculation inner loops; SelfComputes objects run 144 iterations per patch and PairComputes objects run 1,872 iterations per patch pair, across 6 different NAMD configurations, with all iterations independent.
Data Decomposition
• The most compute-intensive parts of many large problems manipulate a large data structure
– Similar operations are being applied to different parts of
the data structure, in a mostly independent manner.
– This is what CUDA is optimized for.
• The data decomposition should lead to
– Efficient data usage by tasks within the partition
– Few dependencies across the tasks that work on different
partitions
– Adjustable partitions that can be varied according to the
hardware characteristics

Data Decomposition Example –
Square Matrix Multiplication
• Row blocks
– Computing each partition requires access to the entire N array
• Square sub-blocks
– Only bands of M and N are needed

Figure: matrices M, N, and P, each WIDTH x WIDTH.

Task Grouping
• Sometimes natural tasks of a problem can be
grouped together to improve efficiency
– Reduced synchronization overhead – all tasks in the
group can use a barrier to wait for a common
dependence
– All tasks in the group efficiently share data loaded into a common on-chip, shared storage (Shared Memory)
– Grouping and merging dependent tasks into one task
reduces need for synchronization
– CUDA thread blocks are task grouping examples.

Task Grouping Example –
Square Matrix Multiplication
• Tasks calculating a P sub-block
– Extensive input data sharing; reduced memory bandwidth using Shared Memory
– All synched in execution

Figure: matrices M, N, and P, each WIDTH x WIDTH; a task group computes one sub-block of P.

Task Ordering
• Identify the data and resources required by a group of tasks before they can execute
– Find the task group that creates them
– Determine a temporal order that satisfies all data constraints

Task Ordering Example:
Molecular Dynamics

Flow: neighbor list → vibrational and rotational forces, non-bonded forces → update atomic positions and velocities → next time step.

Data Sharing
• Data sharing can be a double-edged sword
– Excessive data sharing can drastically reduce advantage of parallel
execution
– Localized sharing can improve memory bandwidth efficiency
• Efficient memory bandwidth usage can be achieved by
synchronizing the execution of task groups and coordinating
their usage of memory data
– Efficient use of on-chip, shared storage
• Read-only sharing can usually be done at much higher
efficiency than read-write sharing, which often requires
synchronization

Data Sharing Example –
Matrix Multiplication
• Each task group will finish usage of each sub-block
of N and M before moving on
– N and M sub-blocks loaded into Shared Memory for use
by all threads of a P sub-block
– Amount of on-chip Shared Memory strictly limits the
number of threads working on a P sub-block
• Read-only shared data can be more efficiently accessed as Constant or Texture data, as sketched below
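A minimal sketch of the constant-memory option (the coefficient table and its size are illustrative; cudaMemcpyToSymbol copies host data into device constant memory, which is cached and read-only for kernels):

#define NUM_COEFFS 16

__constant__ float d_coeffs[NUM_COEFFS];      // read-only for kernels, cached on chip

__global__ void applyCoeffs(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= d_coeffs[i % NUM_COEFFS];  // every thread reads the same small table
}

// Host side: fill h_coeffs, then copy it into constant memory before launching.
// cudaMemcpyToSymbol(d_coeffs, h_coeffs, NUM_COEFFS * sizeof(float));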

Data Sharing Example –
Molecular Dynamics
• The atomic coordinates
– Read-only access by the neighbor list, bonded force, and non-
bonded force task groups
– Read-write access for the position update task group
• The force array
– Read-only access by position update group
– Accumulate access by bonded and non-bonded task groups
• The neighbor list
– Read-only access by non-bonded force task groups
– Generated by the neighbor list task group

Key Parallel Programming Steps
1) To find the concurrency in the problem
2) To structure the algorithm to translate
concurrency into performance
3) To implement the algorithm in a suitable
programming environment
4) To execute and tune the performance of the code
on a parallel system

Unfortunately, these have not been separated into levels of abstractions that can be dealt with independently.
Algorithm
• A step by step procedure that is guaranteed to terminate, such
that each step is precisely stated and can be carried out by a
computer
– Definiteness – the notion that each step is precisely stated
– Effective computability – each step can be carried out by a computer
– Finiteness – the procedure terminates

• Multiple algorithms can be used to solve the same problem


– Some require fewer steps
– Some exhibit more parallelism
– Some have larger memory footprint than others

Choosing Algorithm Structure

Decision tree: Start → organize by task (linear → task parallelism; recursive → divide and conquer), organize by data (linear → geometric decomposition; recursive → recursive data), or organize by data flow (regular → pipeline; irregular → event-driven).
Mapping a Divide and Conquer Algorithm

Figure: a tree-style parallel sum over array elements 0..11, performed by threads 0, 2, 4, 6, 8, 10. Iteration 1 adds neighboring pairs (0+1, 2+3, ..., 10+11); iteration 2 forms partial sums 0..3, 4..7, 8..11; iteration 3 combines the halves (0..7, 8..15). Half of the active threads drop out in each iteration.
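A hedged sketch of this tree mapped onto one thread block (the kernel blockSum and the 512-element block size are illustrative; each iteration halves the number of active threads, mirroring the figure):

#define N 512                                   // elements per block (power of two)

__global__ void blockSum(float *data, float *result)
{
    __shared__ float partial[N];

    int tid = threadIdx.x;
    partial[tid] = data[blockIdx.x * N + tid];  // each thread loads one element
    __syncthreads();

    // Tree reduction: the stride doubles and the active threads halve each iteration.
    for (int stride = 1; stride < N; stride *= 2) {
        if (tid % (2 * stride) == 0)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        result[blockIdx.x] = partial[0];        // one partial sum per block
}

// Launch with blockDim.x == N, e.g. blockSum<<<numBlocks, N>>>(d_data, d_partialSums);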
Tiled (Stenciled) Algorithms are Important for Geometric Decomposition
• A framework for memory data sharing and reuse by increasing data access locality
– Tiled access patterns allow small cache/scratchpad memories to hold on to data for re-use
– For matrix multiplication, a 16x16 thread block performs 2*256 = 512 float loads from device memory for 256 * (2*16) = 8,192 mul/add operations
• A convenient framework for organizing threads (tasks); a hedged tiled-kernel sketch follows

Figure: M, N, and P are partitioned into BLOCK_WIDTH x BLOCK_WIDTH tiles; the thread block at (bx, by), with thread indices (tx, ty), computes one Psub tile of P.
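A hedged sketch of such a tiled kernel (a standard shared-memory formulation, not necessarily the exact course code; TILE_WIDTH is assumed to divide Width):

#define TILE_WIDTH 16

__global__ void MatrixMulTiled(float *Md, float *Nd, float *Pd, int Width)
{
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float Pvalue = 0;

    // March over the tiles of Md and Nd needed to compute this block's Pd tile.
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        Ms[threadIdx.y][threadIdx.x] = Md[Row * Width + m * TILE_WIDTH + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = Nd[(m * TILE_WIDTH + threadIdx.y) * Width + Col];
        __syncthreads();                       // tile fully loaded into shared memory

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();                       // done with this tile before overwriting
    }

    Pd[Row * Width + Col] = Pvalue;
}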
Increased Work per Thread for Even More Locality
• Each thread computes two elements of Pdsub
• Reduced loads from global memory (Md) to shared memory
• Reduced instruction overhead
– More work done in each iteration

Figure: each thread block covers two Pdsub tiles of Pd, reusing the same Md tile loaded into shared memory; Md, Nd, and Pd are partitioned into TILE_WIDTH tiles.
Double Buffering
– a frequently used algorithm pattern
• One could double buffer the computation, getting a better instruction mix within each thread
– This is classic software pipelining in ILP compilers

Original loop:
Loop {
  Load current tile to shared memory
  syncthreads()
  Compute current tile
  syncthreads()
}

Double-buffered loop:
Load next tile from global memory
Loop {
  Deposit current tile to shared memory
  syncthreads()
  Load next tile from global memory
  Compute current tile
  syncthreads()
}
Double Buffering
• Deposit blue tile from register into shared memory
• Syncthreads
• Load orange tile into register
• Compute blue tile
• Deposit orange tile into shared memory
• ...

Figure: Md, Nd, and Pd partitioned into TILE_WIDTH tiles; the blue and orange tiles alternate between registers and shared memory while the Pdsub tile is computed.
Figure 10.2: Cutoff summation algorithm.
(a) Direct summation: at each grid point, sum the electrostatic potential from all charges.
(b) Cutoff summation: electrostatic potential from nearby charges is summed; spatially sort charges first.
(c) Cutoff summation using the direct summation kernel: spatially sort charges into bins; adapt direct summation to process a bin.
Cut-Off Summation Restores Data Scalability

Chart: scalability and performance of different algorithms for calculating an electrostatic potential map; all cutoff implementations show the same scalability.
