Kirk+Hwu GPU Processors
[Figure: CPU vs. GPU — the CPU devotes its die area to control logic, a few large ALUs, a big cache, and DRAM; the GPU devotes it to many small ALUs backed by DRAM]
[Figure: G80 streaming architecture — an input assembler feeds an array of processor clusters, each with a parallel data cache and texture unit, all sharing global memory]
[Figure: application coverage — traditional applications fall under current architecture coverage; new applications fall under domain-specific architecture coverage]
[Figure: GPU speedup over CPU for H.264, LBM, RC5-72, FEM, RPES, PNS, SAXPY, TPACF, FDTD, MRI-Q, and MRI-FHD kernels]
• GeForce 8800 GTX vs. 2.2 GHz Opteron 248
• 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
• 25× to 400× speedup if the function’s data requirements and control flow suit the GPU and the application is optimized
[Figure: fixed-function GPU pipeline — host interface, vertex control, vertex cache, VS/T&L (vertex shading, transform & lighting), triangle setup, raster, shader, texture cache, ROP, frame buffer interface and memory]
[Figure: triangle geometry rendered aliased vs. anti-aliased]
[Figure: programmable graphics pipeline — the CPU issues commands and data through the 3D API (OpenGL or Direct3D); across the CPU–GPU boundary, the GPU front end, primitive assembly, rasterization & interpolation, and raster operation stages turn assembled polygons, lines, and points into framebuffer pixel updates; a programmable vertex processor transforms pre-transformed vertices into transformed vertices, and a programmable fragment processor turns rasterized pre-transformed fragments into transformed fragments]
[Figure: G80 thread processor array — streaming processors (SP), texture filter units (TF), L1 and L2 caches, framebuffer (FB) partitions]
[Figure: shader programming model — a fragment program reads textures, constants, and temp registers, but can write only to output registers that feed FB memory]
The restricted input and output capabilities of a shader programming model.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010. ECE408 / ECE 498AL, University of Illinois, Urbana-Champaign.
CUDA PROGRAMMING MODEL
What is (Historical) GPGPU?
• General Purpose computation using GPU and graphics API in applications other than 3D graphics
  – GPU accelerates critical path of application
• Constraints of the legacy shader model:
  – Shader capabilities: limited outputs (results leave only through output registers into FB memory)
  – Instruction sets: lack of integer & bit ops
  – Communication limited: between pixels; no scatter (cannot write a[i] = p)
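For contrast with the CUDA model introduced below, a scatter write is a one-line kernel in CUDA. This is an illustrative sketch; the kernel and parameter names are not from the slides:

  // Scatter: each thread writes to an arbitrary computed index,
  // which the legacy shader model could not express (a[i] = p).
  __global__ void scatter(float* a, const int* idx, const float* p, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          a[idx[i]] = p[i];
  }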
[Figure: G80 thread processor array — streaming processors (SP), texture filter units (TF), L1/L2 caches, parallel data caches, and texture units]
• Intrinsics
  – __syncthreads(), …
      …
      image[j] = result;
    }
• Runtime API
  – Memory, symbol, execution management
      // Allocate GPU memory
      void *myimage;
      cudaMalloc(&myimage, bytes);
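The fragments above come from a larger kernel and host snippet that is mostly elided on the slide. A hedged, self-contained sketch of the same two points — __syncthreads() inside a kernel, and runtime-API memory management — using illustrative names (reverseBlock, d_image) that are not from the slides:

  // Stage a block's worth of data in shared memory, barrier, then write results.
  __global__ void reverseBlock(float* image)
  {
      __shared__ float region[256];                  // assumes blockDim.x == 256
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      region[threadIdx.x] = image[i];
      __syncthreads();                               // all reads done before any write
      int j = blockIdx.x * blockDim.x + (blockDim.x - 1 - threadIdx.x);
      image[j] = region[threadIdx.x];                // image[j] = result;
  }

  // Runtime API: allocate, launch, free.
  // float* d_image;  cudaMalloc((void**)&d_image, bytes);
  // reverseBlock<<<numBlocks, 256>>>(d_image);
  // cudaFree(d_image);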
[Figure: CUDA compilation flow — cudacc (EDG C/C++ frontend, Open64 global optimizer, OCG) for device code; gcc / cl for host code]
Each thread executes the same kernel code, using its thread ID to select the data it works on:

  …
  float x = input[threadID];
  float y = func(x);
  output[threadID] = y;
  …

[Figure: an array of parallel threads (0, 1, 2, …), each running the code above with its own threadID]
Courtesy: NVIDIA
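A hedged, runnable version of this per-thread code; the kernel name apply_func and the placeholder body of func are illustrative, not from the slides:

  // Placeholder for the slides' func(x).
  __device__ float func(float x) { return 2.0f * x; }

  __global__ void apply_func(const float* input, float* output)
  {
      // Each thread derives its own threadID and handles one element.
      int threadID = blockIdx.x * blockDim.x + threadIdx.x;
      float x = input[threadID];
      float y = func(x);
      output[threadID] = y;
  }

Launched as apply_func<<<numBlocks, threadsPerBlock>>>(input, output), with numBlocks * threadsPerBlock equal to the number of elements.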
CUDA Memory Model Overview
• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
[Figure: a grid with Block (0, 0) and Block (1, 0); host and device global memory]

Code example — allocate and free a TILE_WIDTH × TILE_WIDTH single-precision array (Md) on the device:

  int TILE_WIDTH = 64;
  float* Md;
  int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
  cudaMalloc((void**)&Md, size);
  cudaFree(Md);
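cudaMalloc and cudaFree return a cudaError_t; the slides omit error checking, but testing the result is a common habit. A small sketch continuing the example above (host code, requires <stdio.h>):

  cudaError_t err = cudaMalloc((void**)&Md, size);
  if (err != cudaSuccess) {
      // cudaGetErrorString turns the error code into a readable message.
      printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
  }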
CUDA Host-Device Data Transfer
• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type of transfer
      – Host to Host
      – Host to Device
      – Device to Host
      – Device to Device
• Asynchronous transfer
[Figure: a grid with Block (0, 0) and Block (1, 0), each with shared memory and per-thread registers for Thread (0, 0) and Thread (1, 0); host and device global memory]
CUDA Host-Device Data Transfer (cont.)
• Code example:
  – Transfer a 64 × 64 single-precision float array
  – M is in host memory and Md is in device memory
  – cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

  cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
  cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
Square matrix multiplication, P = M * N of size WIDTH × WIDTH:
  – One thread calculates one element of P
  – M and N are loaded WIDTH times from global memory
[Figure: matrices M, N, and P, each WIDTH × WIDTH]
Memory Layout of a Matrix in C
[Figure: a 4 × 4 matrix and its row-major linearization]
Row-major order in memory: M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3
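In this row-major layout, element (row, col) of a Width-wide matrix sits at offset row * Width + col, which is exactly the M[i * Width + k] indexing used in the code below. A small illustrative helper; idx2d is not part of the course code:

  // Index of element (row, col) in a row-major Width x Width matrix.
  __host__ __device__ inline int idx2d(int row, int col, int Width)
  {
      return row * Width + col;   // e.g. M[idx2d(i, k, Width)] == M[i * Width + k]
  }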
A simple host version in C, sketched as a complete function:

  void MatrixMulOnHost(float* M, float* N, float* P, int Width)
  {
      for (int i = 0; i < Width; ++i) {
          for (int j = 0; j < Width; ++j) {
              double sum = 0;
              for (int k = 0; k < Width; ++k) {
                  double a = M[i * Width + k];
                  double b = N[k * Width + j];
                  sum += a * b;
              }
              P[i * Width + j] = sum;
          }
      }
  }

[Figure: matrices M, N, and P; row i of M and column j of N combine over index k to produce P(i, j)]
Step 2: Input Matrix Data Transfer (Host-side Code)

  void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
  {
      int size = Width * Width * sizeof(float);
      float *Md, *Nd, *Pd;
      …
      // 1. Allocate and load M, N to device memory
      cudaMalloc((void**)&Md, size);
      cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
      cudaMalloc((void**)&Nd, size);
      cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
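The rest of MatrixMulOnDevice is elided on this slide; a sketch of the remaining steps, assuming the kernel and launch configuration shown below (a reconstruction, not verbatim slide code):

      // 2. Allocate P on the device
      cudaMalloc((void**)&Pd, size);
      // 3. Kernel invocation (execution configuration shown in Step 5)
      // 4. Read P from the device and free device matrices
      cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
      cudaFree(Md);  cudaFree(Nd);  cudaFree(Pd);
  }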
Step 4: Kernel Function

  __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
  {
      // Pvalue accumulates the Pd element computed by this thread.
      float Pvalue = 0;
      for (int k = 0; k < Width; ++k) {
          float Melement = Md[threadIdx.y * Width + k];
          float Nelement = Nd[k * Width + threadIdx.x];
          Pvalue += Melement * Nelement;
      }
      Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
  }

[Figure: thread (tx, ty) walks row ty of Md and column tx of Nd over index k to produce one element of Pd]
Step 5: Kernel Invocation (Host-side Code)
• Use one 2D thread block per TILE_WIDTH × TILE_WIDTH tile of Pd
  – Each has (TILE_WIDTH)² threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks
  – A loop around the kernel call is still needed when WIDTH/TILE_WIDTH is greater than the max grid size (64K)!
[Figure: Md and Pd tiled into TILE_WIDTH × TILE_WIDTH blocks; block indices (bx, by), thread indices (tx, ty)]
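For the simple single-block version used in Steps 2–4, the invocation can be sketched as follows (a reconstruction consistent with the kernel's use of threadIdx; the tiled multi-block grid described above replaces this for larger WIDTH):

      // Set up the execution configuration: one block of Width x Width threads.
      dim3 dimGrid(1, 1);
      dim3 dimBlock(Width, Width);
      // Launch the device computation threads!
      MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

A single block caps Width at the per-block thread limit, which is why the (WIDTH/TILE_WIDTH)² grid of blocks above is needed for arbitrary sizes.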
Some Useful Information on Tools
[Figure: compilation flow — device code is translated into G80 … GPU target code]
Compilation
• Any source file containing CUDA language
extensions must be compiled with NVCC
• NVCC is a compiler driver
– Works by invoking all the necessary tools and
compilers like cudacc, g++, cl, ...
• NVCC outputs:
  – C code (host CPU code)
    • Must then be compiled with the rest of the application using another tool
  – PTX
    • Object code directly
    • Or, PTX source, interpreted at runtime
[Figure: parallel algorithm design flow — task decomposition, data decomposition, data sharing, and task ordering feed the design evaluation]
• P = M * N of size WIDTH × WIDTH
  – One natural task (sub-problem) produces one element of P
  – All tasks can execute in parallel
[Figure: matrices M, N, and P, each WIDTH × WIDTH]
[Figure: NAMD task decomposition — PairCompute objects; 6 different NAMD configurations (all independent); 1872 independent iterations (per patch pair)]
• Row blocks
  – Computing each partition requires access to the entire N array
• Square sub-blocks
  – Only bands of M and N are needed
[Figure: M, N, and P partitioned into row blocks and into square sub-blocks]
• Reduced memory bandwidth using shared memory
  – All synched in execution
[Figure: matrices M, N, and P, each WIDTH × WIDTH]
[Figure: molecular dynamics task ordering — start, neighbor list, vibrational and rotational forces, non-bonded force]
[Figure: iterations 0, 1, 2, … grouped into independent blocks such as 0..7 and 8..15]
• Decomposition: a framework for memory data sharing and reuse by increasing data access locality
  – Small cache/scratchpad memories hold on to data for re-use
  – For matrix multiplication, a 16×16 thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations
• A convenient framework for organizing threads (tasks)
[Figure: M, N, and P tiled into BLOCK_WIDTH × BLOCK_WIDTH sub-blocks; block (bx, by) and thread (tx, ty), with tx and ty in 0..bsize-1, compute the Psub tile]
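A sketch of a tiled kernel consistent with this scheme (a reconstruction under the Md/Nd/Pd naming used earlier; TILE_WIDTH = 16 and Width being a multiple of TILE_WIDTH are assumptions):

  #define TILE_WIDTH 16

  __global__ void MatrixMulTiled(float* Md, float* Nd, float* Pd, int Width)
  {
      // Per-block scratchpad tiles of Md and Nd, reused by every thread in the block.
      __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
      __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

      int tx = threadIdx.x, ty = threadIdx.y;
      int Row = blockIdx.y * TILE_WIDTH + ty;
      int Col = blockIdx.x * TILE_WIDTH + tx;

      float Pvalue = 0;
      for (int m = 0; m < Width / TILE_WIDTH; ++m) {
          // Collaborative load: each thread brings in one element of each tile.
          Mds[ty][tx] = Md[Row * Width + m * TILE_WIDTH + tx];
          Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
          __syncthreads();                 // tile fully loaded before use

          for (int k = 0; k < TILE_WIDTH; ++k)
              Pvalue += Mds[ty][k] * Nds[k][tx];
          __syncthreads();                 // done with the tile before it is overwritten
      }
      Pd[Row * Width + Col] = Pvalue;
  }

Each 16×16 block performs the 512 global loads counted above per tile pair, while the 8,192 mul/add operations read from shared memory.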
Increased Work per Thread for Even More Locality
• Each thread computes two elements of Pdsub
• Reduced loads from global memory (Md) to shared memory
• Reduced instruction overhead
  – More work done in each iteration
[Figure: Md, Nd, Pd tiled into TILE_WIDTH × TILE_WIDTH blocks; one block produces two adjacent Pdsub tiles]
Double Buffering – a frequently used algorithm pattern
• One could double buffer the computation, getting a better instruction mix within each thread
  – This is classic software pipelining in ILP compilers

  Loop {
    • Deposit blue tile from registers into shared memory
    • Syncthreads
    • Load next (orange) tile from global memory into registers
    • Compute blue tile
    • Deposit orange tile into shared memory
    • ….
  }

[Figure: Md, Nd, Pd tiled into TILE_WIDTH × TILE_WIDTH blocks; Pdsub computed per block]
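A sketch of this pattern applied to the tiled kernel above (a reconstruction, not the course's code; it reuses TILE_WIDTH and the Md/Nd/Pd indexing from the earlier sketch and assumes Width is a multiple of TILE_WIDTH):

  __global__ void MatrixMulDoubleBuffered(float* Md, float* Nd, float* Pd, int Width)
  {
      __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
      __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
      int tx = threadIdx.x, ty = threadIdx.y;
      int Row = blockIdx.y * TILE_WIDTH + ty;
      int Col = blockIdx.x * TILE_WIDTH + tx;

      // Preload the first ("blue") tile into registers.
      float mReg = Md[Row * Width + tx];
      float nReg = Nd[ty * Width + Col];

      float Pvalue = 0;
      for (int m = 0; m < Width / TILE_WIDTH; ++m) {
          // Deposit the current (blue) tile from registers into shared memory.
          Mds[ty][tx] = mReg;
          Nds[ty][tx] = nReg;
          __syncthreads();

          // Load the next (orange) tile into registers while the blue tile is computed.
          if (m + 1 < Width / TILE_WIDTH) {
              mReg = Md[Row * Width + (m + 1) * TILE_WIDTH + tx];
              nReg = Nd[((m + 1) * TILE_WIDTH + ty) * Width + Col];
          }

          // Compute on the blue tile held in shared memory.
          for (int k = 0; k < TILE_WIDTH; ++k)
              Pvalue += Mds[ty][k] * Nds[k][tx];
          __syncthreads();   // finish before overwriting shared memory next iteration
      }
      Pd[Row * Width + Col] = Pvalue;
  }

The global loads for the orange tile are issued before the compute loop, so they can overlap with arithmetic on the blue tile, giving the better instruction mix described above.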
[Figure: (a) Direct summation — at each grid point, sum the electrostatic potential from all charges]
[Figure caption: same scalability among all cutoff implementations]