GPU Architecture
Patrick Cozzi
CIS 371 Guest Lecture
Spring 2012

Who is this guy?
  Analytical Graphics, Inc.
  University of Pennsylvania
  Developer, lecturer, author, editor
  See http://www.seas.upenn.edu/~pcozzi/
How did this happen?
Graphics Workloads
  Triangles/vertices and pixels/fragments
Images from http://proteneer.com/blog/?p=263 (left) and http://http.developer.nvidia.com/GPUGems3/gpugems3_ch14.html (right)
Early 90s – Pre GPU
Why GPUs?
  Graphics workloads are embarrassingly parallel
    Data-parallel
    Pipeline-parallel
  CPU and GPU execute in parallel
  Hardware: texture filtering, rasterization, etc.
Slide from http://s09.idav.ucdavis.edu/talks/01-BPS-SIGGRAPH09-mhouston.pdf
Data Parallel
  Beyond Graphics
    Cloth simulation
    Particle system
    Matrix multiply

NVIDIA GeForce 6 (2004)
  6 vertex shader processors
  16 fragment shader processors

Images from https://plus.google.com/u/0/photos/100838748547881402137/albums/5407605084626995217/5581900335460078306 and http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter30.html
NVIDIA G80 Architecture
Why Unify Shader Processors?
Slides from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
Why Unify Shader Processors?
GPU Architecture Big Ideas
GPUs are specialized for
  Compute-intensive, highly parallel computation
Graphics is just the beginning.
Transistors are devoted to:
  Processing
Not:
  Data caching
  Flow control
Slide from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
Slides from http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
NVIDIA G80
Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
NVIDIA G80
  Streaming Processor (SP)
  Streaming Multi-Processor (SM)
NVIDIA G80
  16 SMs, each with 8 SPs
  128 total SPs
  Each SM hosts up to 768 threads
  Up to 12,288 threads in flight

NVIDIA GT200
  30 SMs, each with 8 SPs
  240 total SPs
  Each SM hosts up to 1024 threads
  Up to 30,720 threads in flight

Slides from David Luebke: http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
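These per-SM and per-device limits can be queried at runtime. A minimal sketch (not part of the original slides) using the CUDA runtime API; it assumes device 0 and prints a few standard cudaDeviceProp fields:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        // Query the properties of device 0 (assumed to exist).
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        printf("SMs:                   %d\n", prop.multiProcessorCount);
        printf("Warp size:             %d\n", prop.warpSize);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
        return 0;
    }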
GPU Computing History
  2001/2002 – researchers see the GPU as a data-parallel coprocessor ("Let's program this thing!")
    The GPGPU field is born
  2007 – NVIDIA releases CUDA
    CUDA – Compute Unified Device Architecture
    GPGPU shifts to GPU Computing
  2008 – Khronos releases the OpenCL specification
CUDA Abstractions
  A hierarchy of thread groups
  Shared memories
  Barrier synchronization

CUDA Kernels
  Executed N times in parallel by N different CUDA threads
  Thread ID
  Declaration specifier
  Execution configuration
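A minimal sketch tying these pieces together; the kernel name, block size, and device pointers (d_a, d_b, d_c) are hypothetical:

    // __global__ is the declaration specifier: this function runs on the GPU.
    __global__ void vecAdd(const float* a, const float* b, float* c, int n)
    {
        // Each of the N threads computes its own thread ID.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // The execution configuration <<<blocks, threadsPerBlock>>> launches the
    // kernel N times in parallel, e.g. blocks of 256 threads covering n elements:
    //     vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);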
CUDA Program Execution
Thread Hierarchies
  Grid – one or more thread blocks
    1D or 2D
  Block – array of threads
    1D, 2D, or 3D
  Each block in a grid has the same number of threads
  Each thread in a block can
    Synchronize
    Access shared memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
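As a sketch of a 2D grid of 2D blocks (the image-scaling kernel and sizes here are made up for illustration, not from the slides):

    __global__ void scale(float* img, int width, int height, float s)
    {
        // 2D thread coordinates: block index * block size + thread index
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            img[y * width + x] *= s;
    }

    // Every block in the grid has the same shape:
    //     dim3 block(16, 16);                                // 256 threads per block
    //     dim3 grid((width + 15) / 16, (height + 15) / 16);  // enough blocks to cover the image
    //     scale<<<grid, block>>>(d_img, width, height, 2.0f);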
Thread Hierarchies
  Thread Block
    Group of threads
    G80 and GT200: up to 512 threads
    Fermi: up to 1024 threads
    Reside on the same processor core
    Share the memory of that core
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
  Threads in a block
    Share (limited) low-latency memory
    Synchronize execution
      To coordinate memory accesses
    __syncthreads()
      Barrier – threads in the block wait until all threads reach it
      Lightweight
Images from http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf and http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Scheduling Threads
  Warp – threads from a block
    G80 / GT200 – 32 threads
    Run on the same SM
    Unit of thread scheduling
    Consecutive threadIdx values
    An implementation detail – in theory, use warpSize
  Figure: warps for three blocks scheduled on the same SM
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter3-CudaThreadingModel.pdf
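Since warps are built from consecutive threadIdx values, a thread in a 1D block can compute which warp and lane it occupies from the built-in warpSize; a small illustrative sketch (the kernel name is made up):

    __global__ void whereAmI()
    {
        int warpInBlock = threadIdx.x / warpSize;   // which warp within this block
        int lane        = threadIdx.x % warpSize;   // position within the warp (0..31 on G80/GT200)
        // Threads with the same warpInBlock value are scheduled together on one SM.
        (void)warpInBlock;
        (void)lane;
    }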
Scheduling Threads
Remember this:
Image from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf Slide from: http://courses.engr.illinois.edu/ece498/al/Syllabus.html
Scheduling Threads
  What happens if branches in a warp diverge?
  Remember this:
Image from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
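To make the question concrete, here is a hypothetical kernel where the branch splits each warp in half: lanes 0–15 take path A and lanes 16–31 take path B, so the warp executes the two paths one after the other with the inactive lanes masked off (results stay correct, only throughput suffers):

    __global__ void diverge(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (threadIdx.x % 32 < 16)       // first half of each warp
            data[i] = data[i] * 2.0f;    // path A
        else                             // second half of each warp
            data[i] = data[i] + 1.0f;    // path B
    }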
Scheduling Threads
  32 threads per warp but 8 SPs per SM. What gives?
  When an SM schedules a warp:
    Its instruction is ready
    8 threads enter the SPs on the 1st cycle
    8 more on the 2nd, 3rd, and 4th cycles
    Therefore, 4 cycles are required to dispatch a warp
Scheduling Threads
  Question
    A kernel has
      1 global memory read (200 cycles)
      4 non-dependent multiplies/adds
    How many warps are required to hide the memory latency?
  Solution
    Each warp has 4 multiplies/adds, each taking 4 cycles to dispatch: 16 cycles of work
    We need to cover 200 cycles
    200 / 16 = 12.5
    ceil(12.5) = 13
    13 warps are required
Memory Model
Thread Synchronization
  Recall: threads in a block can synchronize
    Call __syncthreads() to create a barrier
    A thread waits at this call until all threads in the block reach it, then all threads continue

    Mds[i] = Md[j];
    __syncthreads();
    func(Mds[i], Mds[i + 1]);
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
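A self-contained sketch of this pattern (Md/Mds follow the slide's names; the neighbor sum stands in for func(), and the block size is an arbitrary choice): each thread stages one element into shared memory, the barrier guarantees every element has been loaded, and only then does each thread read its neighbor's element.

    #define BLOCK_SIZE 256

    __global__ void neighborOp(const float* Md, float* out, int n)
    {
        __shared__ float Mds[BLOCK_SIZE];

        int i = threadIdx.x;                            // index within the block
        int j = blockIdx.x * blockDim.x + threadIdx.x;  // index within Md

        if (j < n)
            Mds[i] = Md[j];          // each thread loads one element into shared memory
        __syncthreads();             // barrier: all loads finish before any thread reads

        // Now it is safe to read the element loaded by the neighboring thread.
        if (j + 1 < n && i + 1 < BLOCK_SIZE)
            out[j] = Mds[i] + Mds[i + 1];
    }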
Thread Synchronization
  Timeline: four threads (0–3) in one block each execute

    Mds[i] = Md[j];
    __syncthreads();
    func(Mds[i], Mds[i + 1]);

  Time 0–2: the threads reach the barrier at different times; threads 0 and 1 arrive first and are blocked at the barrier
  Time 3: all threads in the block have reached the barrier, so any thread can continue
  Time 4–5: the threads execute func() and finish at different times
Thread Synchronization
  Why is it important that execution time be similar among threads?
  Why does it only synchronize within a block?
Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter3-CudaThreadingModel.pdf
Thread Synchronization
  Can __syncthreads() cause a thread to hang?

    if (someFunc())
    {
        __syncthreads();
    }
    // ...
Thread Synchronization

    if (someFunc())
    {
        __syncthreads();
    }
    else
    {
        __syncthreads();
    }