GPU Architecture
  Patrick Cozzi
  University of Pennsylvania
  CIS 371 Guest Lecture
  Spring 2012

Who is this guy?
  Analytical Graphics, Inc.
  developer, lecturer, author, editor
  See http://www.seas.upenn.edu/~pcozzi/
How did this happen?
  Image from http://proteneer.com/blog/?p=263

Graphics Workloads
  Triangles/vertices and pixels/fragments
  Right image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch14.html
Early 90s – Pre GPU
  Slide from http://s09.idav.ucdavis.edu/talks/01-BPS-SIGGRAPH09-mhouston.pdf

Why GPUs?
  Graphics workloads are embarrassingly parallel
    Data-parallel
    Pipeline-parallel
  CPU and GPU execute in parallel
  Hardware: texture filtering, rasterization, etc.
Data Parallel
  Beyond Graphics
    Cloth simulation
    Particle system
    Matrix multiply
  Image from: https://plus.google.com/u/0/photos/100838748547881402137/albums/5407605084626995217/5581900335460078306

NVIDIA GeForce 6 (2004)
  6 vertex shader processors
  16 fragment shader processors
  Image from http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter30.html
NVIDIA G80 Architecture
  Slide from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

Why Unify Shader Processors?
  Slide from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

GPU Architecture Big Ideas
  GPUs are specialized for
    Compute-intensive, highly parallel computation
    Graphics is just the beginning.
  Transistors are devoted to:
    Processing
    Not:
      Data caching
      Flow control
Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
NVIDIA G80
  Streaming Processor (SP)
  Streaming Multiprocessor (SM)
  Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
NVIDIA G80
  16 SMs
  Each with 8 SPs
    128 total SPs
  Each SM hosts up to 768 threads
  Up to 12,288 threads in flight
  Slide from David Luebke: http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

NVIDIA GT200
  30 SMs
  Each with 8 SPs
    240 total SPs
  Each SM hosts up to 1024 threads
  In flight, up to 30,720 threads
  Slide from David Luebke: http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
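The "threads in flight" figures on the G80 and GT200 slides are the SM count multiplied by the per-SM thread limit. A minimal sketch of that arithmetic follows; the helper name threadsInFlight is hypothetical, and the use of device 0 is an assumption. The fields multiProcessorCount and maxThreadsPerMultiProcessor come from the CUDA runtime's cudaDeviceProp structure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: maximum resident threads = SMs x max threads per SM.
int threadsInFlight(int smCount, int maxThreadsPerSM)
{
    return smCount * maxThreadsPerSM;
}

int main()
{
    // Figures taken from the slides.
    printf("G80:   %d threads in flight\n", threadsInFlight(16, 768));
    printf("GT200: %d threads in flight\n", threadsInFlight(30, 1024));

    // The same calculation for the installed GPU (device 0 is an assumption).
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
        printf("%s: %d SMs x %d threads = %d threads in flight\n",
               prop.name, prop.multiProcessorCount,
               prop.maxThreadsPerMultiProcessor,
               threadsInFlight(prop.multiProcessorCount,
                               prop.maxThreadsPerMultiProcessor));
    }
    return 0;
}
```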
Let’s program this thing!

GPU Computing History
  2001/2002 – researchers see GPU as data-parallel coprocessor
    The GPGPU field is born
  2007 – NVIDIA releases CUDA
    CUDA – Compute Unified Device Architecture
    GPGPU shifts to GPU Computing
  2008 – Khronos releases OpenCL specification
CUDA Abstractions
  A hierarchy of thread groups
  Shared memories
  Barrier synchronization

CUDA Kernels
  Executed N times in parallel by N different CUDA threads
  [Code figure with callouts: Thread ID, Declaration Specifier, Execution Configuration]
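The three callouts on the CUDA Kernels slide (declaration specifier, thread ID, execution configuration) can be illustrated with a vector add in the style of the CUDA C Programming Guide. This is a sketch, not the code from the original figure. The names vecAdd and runVecAdd are illustrative, and it assumes n fits in one block (at most 512 threads on G80 and GT200).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Declaration specifier: __global__ marks a kernel that runs on the GPU
// and is launched from the host.
__global__ void vecAdd(const float* a, const float* b, float* c)
{
    int i = threadIdx.x;  // Thread ID: each of the N threads handles one element
    c[i] = a[i] + b[i];
}

// Hypothetical host helper: copies inputs to the device, launches n threads
// in a single block, and copies the result back. Returns false on a CUDA error.
bool runVecAdd(const float* a, const float* b, float* c, int n)
{
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    size_t bytes = n * sizeof(float);
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess
           && cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        // Execution configuration: <<<number of blocks, threads per block>>>
        vecAdd<<<1, n>>>(dA, dB, dC);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return ok;
}

int main()
{
    const int N = 8;
    float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = 1.0f * i; b[i] = 10.0f * i; }
    if (!runVecAdd(a, b, c, N)) { printf("CUDA error\n"); return 1; }
    for (int i = 0; i < N; ++i) printf("%g ", c[i]);
    printf("\n");
    return 0;
}
```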
CUDA Program Execution
  Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Thread Hierarchies
  Grid – one or more thread blocks
    1D or 2D
  Block – array of threads
    1D, 2D, or 3D
    Each block in a grid has the same number of threads
    Each thread in a block can
      Synchronize
      Access shared memory
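To make the grid/block hierarchy concrete, here is a sketch of a 2D grid of 2D blocks covering a matrix. The kernel matAdd, the helper runMatAdd, and the 16 x 16 block shape are assumptions chosen for illustration. The input vectors are assumed to hold width * height elements each.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread computes one element. Its global (x, y) position combines the
// block's index in the grid with the thread's index inside the block.
__global__ void matAdd(const float* a, const float* b, float* c, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {  // the grid may overhang the matrix edges
        int i = y * width + x;
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host helper; returns an empty vector on any CUDA error.
std::vector<float> runMatAdd(const std::vector<float>& a, const std::vector<float>& b,
                             int width, int height)
{
    size_t bytes = a.size() * sizeof(float);
    std::vector<float> c(a.size());
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess
           && cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 block(16, 16);  // 2D block: every block in the grid has this shape
        dim3 grid((width + block.x - 1) / block.x,   // 2D grid, rounded up so
                  (height + block.y - 1) / block.y); // every element is covered
        matAdd<<<grid, block>>>(dA, dB, dC, width, height);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (!ok) c.clear();
    return c;
}

int main()
{
    const int W = 40, H = 24;  // deliberately not multiples of 16
    std::vector<float> a(W * H, 1.0f), b(W * H, 2.0f);
    std::vector<float> c = runMatAdd(a, b, W, H);
    if (c.empty()) { printf("CUDA error\n"); return 1; }
    printf("c[0] = %g, c[last] = %g\n", c.front(), c.back());
    return 0;
}
```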
Thread Hierarchies
  Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Thread Hierarchies
  Thread Block
    Group of threads
      G80 and GT200: Up to 512 threads
      Fermi: Up to 1024 threads
    Reside on same processor core
    Share memory of that core
  Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
  Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf

Thread Hierarchies
  Threads in a block
    Share (limited) low-latency memory
    Synchronize execution
      To coordinate memory accesses
      __syncthreads()
        Barrier – threads in block wait until all threads reach this
        Lightweight
  Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Scheduling Threads
  Warp – threads from a block
    G80 / GT200 – 32 threads
    Run on the same SM
    Unit of thread scheduling
    Consecutive threadIdx values
    An implementation detail – in theory
      warpSize

Scheduling Threads
  Warps for three blocks scheduled on the same SM.
  Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter3-CudaThreadingModel.pdf
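Because a warp is made of consecutive threadIdx values, a thread in a 1D block can work out its warp and its position within that warp from the built-in warpSize variable. The sketch below shows this; warpInfo and runWarpInfo are illustrative names, and the "lane" terminology is an addition not used on the slides.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// For each thread of a 1D block, record which warp it belongs to and its
// position (lane) within that warp.
__global__ void warpInfo(int* warpId, int* laneId)
{
    int t = threadIdx.x;
    warpId[t] = t / warpSize;
    laneId[t] = t % warpSize;
}

// Hypothetical host helper: launches one block of n threads and copies the
// per-thread results back. Returns false on any CUDA error.
bool runWarpInfo(int n, int* warpId, int* laneId)
{
    int *dWarp = nullptr, *dLane = nullptr;
    size_t bytes = n * sizeof(int);
    bool ok = cudaMalloc(&dWarp, bytes) == cudaSuccess
           && cudaMalloc(&dLane, bytes) == cudaSuccess;
    if (ok) {
        warpInfo<<<1, n>>>(dWarp, dLane);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(warpId, dWarp, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(laneId, dLane, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dWarp); cudaFree(dLane);
    return ok;
}

int main()
{
    const int N = 96;  // three warps if warpSize is 32
    int warpId[N], laneId[N];
    if (!runWarpInfo(N, warpId, laneId)) { printf("CUDA error\n"); return 1; }
    const int samples[4] = {0, 31, 32, 95};
    for (int s = 0; s < 4; ++s) {
        int t = samples[s];
        printf("thread %2d -> warp %d, lane %d\n", t, warpId[t], laneId[t]);
    }
    return 0;
}
```

Reading warpSize at run time, rather than hard-coding 32, follows the slide's point that the warp width is an implementation detail in theory.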
Scheduling Threads
  Remember this:
  Image from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf

Scheduling Threads
  Slide from: http://courses.engr.illinois.edu/ece498/al/Syllabus.html

Scheduling Threads
  What happens if branches in a warp diverge?

Scheduling Threads
  Remember this:
  Image from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
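The divergence question can be made concrete with two kernels that do the same kind of work but branch differently. In the first, threads of the same warp disagree on the condition, so the warp has to run both paths one after the other with some threads masked off. In the second, the condition is the same for every thread in a warp. This is a sketch with illustrative names, assuming a single 1D block. For branches this short a compiler may use predication instead of real branches, so the sketch shows the pattern and is not a benchmark.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Divergent: even and odd threads of the same warp take different paths.
__global__ void divergentBranch(float* out)
{
    int t = threadIdx.x;
    if (t % 2 == 0) out[t] = 2.0f * t;
    else            out[t] = t + 1.0f;
}

// Warp-uniform: all threads of a warp evaluate the condition the same way,
// so no warp has to execute both paths.
__global__ void warpUniformBranch(float* out)
{
    int t = threadIdx.x;
    if ((t / warpSize) % 2 == 0) out[t] = 2.0f * t;
    else                         out[t] = t + 1.0f;
}

// Hypothetical host helper: runs one of the two kernels on one block of n
// threads and copies the result back. Returns false on any CUDA error.
bool runBranch(bool divergent, float* out, int n)
{
    float* dOut = nullptr;
    size_t bytes = n * sizeof(float);
    bool ok = cudaMalloc(&dOut, bytes) == cudaSuccess;
    if (ok) {
        if (divergent) divergentBranch<<<1, n>>>(dOut);
        else           warpUniformBranch<<<1, n>>>(dOut);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(out, dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dOut);
    return ok;
}

int main()
{
    const int N = 64;
    float a[N], b[N];
    if (!runBranch(true, a, N) || !runBranch(false, b, N)) {
        printf("CUDA error\n");
        return 1;
    }
    printf("divergent:    out[1] = %g, out[2] = %g\n", a[1], a[2]);
    printf("warp-uniform: out[1] = %g, out[32] = %g\n", b[1], b[32]);
    return 0;
}
```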
Scheduling Threads
  32 threads per warp but 8 SPs per SM. What gives?
  When an SM schedules a warp:
    Its instruction is ready
    8 threads enter the SPs on the 1st cycle
    8 more on each of the 2nd, 3rd, and 4th cycles
    Therefore, 4 cycles are required to dispatch a warp
Scheduling Threads
  Question
    A kernel has
      1 global memory read (200 cycles)
      4 non-dependent multiplies/adds
    How many warps are required to hide the memory latency?

Scheduling Threads
  Solution
    Each warp has 4 multiplies/adds
      4 instructions x 4 cycles to dispatch each = 16 cycles
    We need to cover 200 cycles
      200 / 16 = 12.5
      ceil(12.5) = 13
    13 warps are required
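The latency-hiding calculation can be written as a small host-side function (plain C++ that nvcc compiles without any device code). The helper name warpsToHideLatency and its parameters are hypothetical. It assumes the simple model from the slides, in which each warp instruction occupies the SM for a fixed number of dispatch cycles.

```cuda
#include <cstdio>

// Hypothetical helper: number of warps needed so that arithmetic from other
// warps covers one warp's memory latency.
int warpsToHideLatency(int memLatencyCycles, int independentInstrs, int cyclesPerInstr)
{
    int busyCyclesPerWarp = independentInstrs * cyclesPerInstr;  // 4 x 4 = 16
    // Integer ceiling of memLatencyCycles / busyCyclesPerWarp.
    return (memLatencyCycles + busyCyclesPerWarp - 1) / busyCyclesPerWarp;
}

int main()
{
    // 200-cycle read, 4 independent multiplies/adds, 4 cycles to dispatch each.
    printf("%d warps\n", warpsToHideLatency(200, 4, 4));  // prints "13 warps"
    return 0;
}
```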
Memory Model
  Recall:
  Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Thread Synchronization
  Threads in a block can synchronize
    Call __syncthreads() to create a barrier
    A thread waits at this call until all threads in the block reach it, then all threads continue
      Mds[i] = Md[j];
      __syncthreads();
      func(Mds[i], Mds[i + 1]);
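The three-line snippet on the Thread Synchronization slide can be expanded into a complete kernel. In the sketch below, Md and Mds keep their names from the slide. Everything else is an assumption made for illustration: a sum stands in for func, and neighborSum, runNeighborSum, and the block size of 64 are illustrative choices.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 64  // threads per block; an assumption for this sketch

// Each thread stages one element of Md (global memory) into Mds (shared
// memory), waits at the barrier, then reads its own element and the one
// written by its right-hand neighbor.
__global__ void neighborSum(const float* Md, float* out, int n)
{
    __shared__ float Mds[BLOCK + 1];
    int i = threadIdx.x;
    int j = blockIdx.x * blockDim.x + i;

    Mds[i] = (j < n) ? Md[j] : 0.0f;
    if (i == blockDim.x - 1)  // last thread also loads the element past the block
        Mds[i + 1] = (j + 1 < n) ? Md[j + 1] : 0.0f;

    __syncthreads();  // without this, Mds[i + 1] might not be written yet

    if (j < n) out[j] = Mds[i] + Mds[i + 1];  // stands in for func(Mds[i], Mds[i + 1])
}

// Hypothetical host helper. Returns false on any CUDA error.
bool runNeighborSum(const float* in, float* out, int n)
{
    float *dIn = nullptr, *dOut = nullptr;
    size_t bytes = n * sizeof(float);
    bool ok = cudaMalloc(&dIn, bytes) == cudaSuccess
           && cudaMalloc(&dOut, bytes) == cudaSuccess
           && cudaMemcpy(dIn, in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int blocks = (n + BLOCK - 1) / BLOCK;
        neighborSum<<<blocks, BLOCK>>>(dIn, dOut, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(out, dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn); cudaFree(dOut);
    return ok;
}

int main()
{
    const int N = 100;
    float in[N], out[N];
    for (int k = 0; k < N; ++k) in[k] = 1.0f * k;
    if (!runNeighborSum(in, out, N)) { printf("CUDA error\n"); return 1; }
    printf("out[0] = %g, out[63] = %g, out[99] = %g\n", out[0], out[63], out[99]);
    return 0;
}
```

Note that the barrier sits outside every conditional, so all threads in the block reach it.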
Thread Synchronization
  [Animation, Time: 0 through Time: 5. Threads 0, 1, 2, and 3 of a block each execute:]
      Mds[i] = Md[j];
      __syncthreads();
      func(Mds[i], Mds[i+1]);
  Time: 1 – Threads 0 and 1 are blocked at barrier
  Time: 3 – All threads in block have reached barrier, any thread can continue
Thread Synchronization
  Why is it important that execution time be similar among threads?
  Why does it only synchronize within a block?

Thread Synchronization
  Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter3-CudaThreadingModel.pdf
Thread Synchronization
  Can __syncthreads() cause a thread to hang?

Thread Synchronization
  if (someFunc())
  {
    __syncthreads();
  }
  // ...
Thread Synchronization
  if (someFunc())
  {
    __syncthreads();
  }
  else
  {
    __syncthreads();
  }
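On the hang question: the CUDA C Programming Guide allows __syncthreads() in conditional code only if the condition evaluates identically across the entire thread block. Otherwise execution is likely to hang or produce unintended side effects. Putting a barrier in both the if and the else branch does not remove the problem in the programming model, because the two calls are distinct barriers. A common way around it is to keep the divergent work inside the branch and place a single barrier outside it. The sketch below shows that pattern; safeBarrier, runSafeBarrier, and the block size of 64 are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 64  // threads per block; an assumption for this sketch

// Reverses the input within each block. Threads past the end of the input
// take a different path when loading, but every thread reaches the barrier.
__global__ void safeBarrier(const int* in, int* out, int n)
{
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;
    int j = blockIdx.x * blockDim.x + t;

    // Divergent work is fine here...
    if (j < n) tile[t] = in[j];
    else       tile[t] = 0;

    // ...because the barrier is outside the branch.
    __syncthreads();

    if (j < n) out[j] = tile[blockDim.x - 1 - t];
}

// Hypothetical host helper. Returns false on any CUDA error.
bool runSafeBarrier(const int* in, int* out, int n)
{
    int *dIn = nullptr, *dOut = nullptr;
    size_t bytes = n * sizeof(int);
    bool ok = cudaMalloc(&dIn, bytes) == cudaSuccess
           && cudaMalloc(&dOut, bytes) == cudaSuccess
           && cudaMemcpy(dIn, in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int blocks = (n + BLOCK - 1) / BLOCK;
        safeBarrier<<<blocks, BLOCK>>>(dIn, dOut, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(out, dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn); cudaFree(dOut);
    return ok;
}

int main()
{
    const int N = 100;  // the second block is only partly filled
    int in[N], out[N];
    for (int k = 0; k < N; ++k) in[k] = k + 1;
    if (!runSafeBarrier(in, out, N)) { printf("CUDA error\n"); return 1; }
    printf("out[0] = %d, out[63] = %d, out[99] = %d\n", out[0], out[63], out[99]);
    return 0;
}
```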