14 Parallel Algorithms CUDA Basics s20
14 Parallel Algorithms CUDA Basics s20
Parallel Algorithms
         CUDA
       Chris Rossbach
          cs378h
Outline for Today
• Questions?
• Administrivia
    • Eldar-* machines should be available
• Agenda
    • Parallel Algorithms
    • CUDA
• Acknowledgements:
  http://developer.download.nvidia.com/compute/develo
  pertrainingmaterials/presentations/cuda_language/Intro
  duction_to_CUDA_C.pptx
                                                           2
Faux Quiz Questions
• What is a reduction? A prefix sum? Why are they hard to parallelize and what basic techniques
  can be used to parallelize them?
• Define flow dependence, output dependence, and anti-dependence: give an example of each.
  Why/how do compilers use them to detect loop-independent vs loop-carried dependences?
• What is the difference between a thread-block and a warp?
• How/Why must programmers copy data back and forth to a GPU?
• What is “shared memory” in CUDA? Describe a setting in which it might be useful.
• CUDA kernels have implicit barrier synchronization. Why is __syncthreads() necessary in light of
  this fact?
• How might one implement locks on a GPU?
• What ordering guarantees does a GPU provide across different hardware threads’ access to a
  single memory location? To two disjoint locations?
• When is it safe for one GPU thread to wait (e.g. by spinning) for another?
                                                                                                     3
      Review: what is a vector processor?
Dont decode same instruction
      over and over…
                                Implementation:
                                • Instruction fetch control logic shared
                                • Same instruction stream executed on
                                • Multiple pipelines
                                • Multiple different operands in parallel
                                                                      4
When does vector processing help?
                                  Pros? Cons?
Fine-grained multithreading
• Threads interleave instructions
    • Round-robin
    • Skip stalled threads
• Hardware support required
    • Separate PC and register file per thread
    • Hardware to control alternating pattern
• Naturally hides delays
    • Data hazards, Cache misses
    • Pipeline runs with rare stalls
          Pros? Cons?
Simultaneous Multithreading (SMT)
• Instructions from multiple threads
  issued on same cycle
   • Uses register renaming
   • dynamic scheduling facility of multi-   Skip C
     issue architecture
• Hardware support:
   • Register files, PCs per thread          Skip A
   • Temporary result registers pre commit
   • Support to sort out which threads get
     results from which instructions
            Pros? Cons?
Why Vector and Multithreading Background?
    GPU:
    • A very wide vector machine
    • Massively multi-threaded to hide memory latency
    • Originally designed for graphics pipelines…
                   Graphics ~= Rendering
Inputs
• 3D world model(objects, materials)
    •   Geometry modeled w triangle meshes, surface normals
    •   GPUs subdivide triangles into “fragments” (rasterization)
    •   Materials modeled with “textures”
    •   Texture coordinates, sampling “map” textures →
        geometry
• Light locations and properties
    • Attempt to model surtface/light interactions with
      modeled objects/materials
• View point
Output
• 2D projection seen from the view-point
  3/8/2020                                                          12
Grossly over-simplified rendering algorithm
foreach(vertex v in model)
      map vmodel → vview
fragment[] frags = {};
foreach triangle t (v0, v1, v2)
      frags.add(rasterize(t));
foreach fragment f in frags
      choose_color(f);
display(visible_fragments(frags));
3/8/2020                             Dandelion   13
Algorithm → Graphics Pipeline
           foreach(vertex v in model)
                 map vmodel → vview
           fragment[] frags = {};
           foreach triangle t (v0, v1, v2)
                 frags.add(rasterize(t));
           foreach fragment f in frags
                 choose_color(f);
           display(visible_fragments(frags));
                                                          OpenGL pipeline
                                                   To first order, DirectX looks the same!
3/8/2020                               Dandelion                                             14
Graphics pipeline → GPU architecture
3/8/2020                                       Dandelion                                                   20
Programming Model
• GPUs are I/O devices, managed by user-code
• “kernels” == “shader programs”
• 1000s of HW-scheduled threads per kernel
• Threads grouped into independent blocks.
      • Threads in a block can synchronize (barrier)
      • This is the *only* synchronization
• “Grid” == “launch” == “invocation” of a kernel
      • a group of blocks (or warps)
                                                 Need codes that are 1000s-X
                                                         parallel….
3/8/2020                                                                       21
Parallel Algorithms
• Sequential algorithms often do not permit easy parallelization
   •   Does not mean there work has no parallelism
   •   A different approach can yield parallelism
   •   but often changes the algorithm
   •   Parallelizing != just adding locks to a sequential algorithm
• Parallel Patterns
   •   Map
   •   Scatter, Gather
   •   Reduction                                If you can express your
                                            algorithm using these patterns,
   •   Scan                                  an apparently fundamentally
   •   Search, Sort                           sequential algorithm can be
                                                     made parallel
   Map
   • Inputs
      • Array A
      • Function f(x)
   • map(A, f) → apply f(x) on all elements in A
   • Parallelism trivially exposed
      • f(x) can be applied in parallel to all elements, in principle
• Gather:
   • Read multiple items to single /packed location
• Scatter:
   • Write single/packed data item to multiple locations
• Inputs: x, y, indeces, N
                               foreach(T
                                foreach(Telem
                                          elemininPF(ints))
                                                   ints)
     10 30 20 10 20 30 10      {{
                                 key
                                  key = =KeyLambda(elem);
                                          KeyLambda(elem);
    10 10 10   30 30   20 20     group
                                  group= =GetGroup(key);
                                           GetGroup(key);
                                 group.Add(elem);
                                  group.Add(elem);
                               }}
                                                              29
GroupBy using parallel primitives
                     10    30    20   10   20   30   10
                                                                                        30
Sort
• OK, let’s build a parallel sort
                                    31
Summary
Re-expressing apparently sequential algorithms as combinations of
parallel patterns is a common technique when targeting GPUs
• Reductions
• Scans
• Re-orderings (scatter/gather)
• Sort
• Map
                                                                    32
What is CUDA?
• CUDA Architecture
  • Expose GPU parallelism for general-purpose computing
  • Retain performance
• CUDA C/C++
  • Based on industry-standard C/C++
  • Small set of extensions to enable heterogeneous programming
  • Straightforward APIs to manage devices, memory etc.
                                                                  33
           Heterogeneous Computing
Blocks
Threads
           Indexing
CONCEPTS
           Shared memory
__syncthreads()
Asynchronous operation
Handling errors
           Managing devices
                                     34
               CONCEPTS   Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
HELLO WORLD!
                          Handling errors
                          Managing devices
Heterogeneous Computing
    ▪ Terminology:
       ▪ Host   The CPU and its memory (host memory)
       ▪ Device The GPU and its memory (device memory)
                Host                             Device
                                                          36
Heterogeneous Computing
      #include <iostream>
      #include <algorithm>
      #define N     1024
      #define RADIUS 3
      #define BLOCK_SIZE 16
                                                                               parallel fn
               // Synchronize (ensure all the data is available)
               __syncthreads();
      int main(void) {
              int *in, *out;      // host copies of a, b, c
              int *d_in, *d_out;      // device copies of a, b, c
              int size = (N + 2*RADIUS) * sizeof(int);
                                                                               serial code
               cudaMalloc((void **)&d_out, size);
               // Copy to device
               cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
               cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);
                                                                               parallel code
               cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);
               // Cleanup
               free(in); free(out);
               cudaFree(d_in); cudaFree(d_out);
               return 0;
                                                                               serial code
      }
                                                                                               37
Simple Processing Flow
PCI Bus
                                             38
Simple Processing Flow
PCI Bus
                                               39
Simple Processing Flow
PCI Bus
                                               40
Hello World!
         int main(void) {
               printf("Hello World!\n");
               return 0;
         }
                                           Output:
      Standard C that runs on the host
                                           $ nvcc
                                           hello_world.
      NVIDIA compiler (nvcc) can be used   cu
      to compile programs with no device   $ a.out
      code                                 Hello World!
                                           $
                                                          41
Hello World! with Device Code
         __global__ void mykernel(void) {
         }
         int main(void) {
               mykernel<<<1,1>>>();
               printf("Hello World!\n");
               return 0;
         }
                                            42
Hello World! with Device Code
      __global__ void mykernel(void) {
      }
                                                                        43
Hello World! with Device COde
      mykernel<<<1,1>>>();
• Triple angle brackets mark a call from host code to device code
   • Also called a “kernel launch”
   • We’ll return to the parameters (1,1) in a moment
                                                                    44
Hello World! with Device Code
         __global__ void mykernel(void){
         }
                                           Output:
         int main(void) {
               mykernel<<<1,1>>>();
                                           $ nvcc
               printf("Hello World!\n");
                                           hello.cu
               return 0;
                                           $ a.out
         }
                                           Hello World!
                                           $
    • mykernel() does nothing,
      somewhat anticlimactic!
                                                          45
Parallel Programming in CUDA C/C++
    • But wait… GPU computing is about
      massive parallelism!
a b c
                                                           46
Addition on the Device
• A simple kernel to add two integers
                                                         47
Addition on the Device
• Note that we use pointers for the variables
                                                                          48
Memory Management
   • Host and device memory are separate entities
      • Device pointers point to GPU memory
          May be passed to/from host code
          May not be dereferenced in host code
      • Host pointers point to CPU memory
          May be passed to/from device code
          May not be dereferenced in device code
                                                                  49
Addition on the Device: add()
• Returning to our add() kernel
                                                      50
Addition on the Device: main()
       int main(void) {
              int a, b, c;              // host copies of a, b, c
              int *d_a, *d_b, *d_c;     // device copies of a, b, c
              int size = sizeof(int);
                                                                      51
Addition on the Device: main()
           // Copy inputs to device
           cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
           cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
           // Cleanup
           cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
           return 0;
       }
                                                                52
             CONCEPTS   Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
RUNNING IN
                        Handling errors
Managing devices
PARALLEL
                                                  53
Moving to Parallel
• GPU computing is about massive parallelism
   • So how do we run code in parallel on the device?
add<<< 1, 1 >>>();
add<<< N, 1 >>>();
                                                                 54
Vector Addition on the Device
   • With add() running in parallel we can do vector addition
                                                                         55
Vector Addition on the Device
          __global__ void add(int *a, int *b, int *c) {
                     c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
          }
                                                                                                           56
Vector Addition on the Device: add()
• Returning to our parallelized add() kernel
                                                              57
Vector Addition on the Device: main()
     #define N 512
     int main(void) {
        int *a, *b, *c;                // host copies of a, b, c
        int *d_a, *d_b, *d_c;          // device copies of a, b, c
        int size = N * sizeof(int);
                                                                             58
Vector Addition on the Device: main()
         // Copy inputs to device
         cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
         cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
         // Cleanup
         free(a); free(b); free(c);
         cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
         return 0;
     }
                                                             59
Review
• Difference between host and device   • Basic device memory
   • Host      CPU                       management
   • Device    GPU                        • cudaMalloc()
                                          • cudaMemcpy()
                                          • cudaFree()
• __global__   declares device code
   • Executes on the device
   • Called from the host              • Launching parallel kernels
                                          • Launch N copies of add() with
                                           add<<<N,1>>>(…);
• Passing parameters from host code       • Use blockIdx.x to access
  to a device function                      block index
                                                                        60
              CONCEPTS   Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
INTRODUCING
                         Handling errors
Managing devices
THREADS
                                                   61
CUDA Threads
 • Terminology: a block can be split into parallel threads
         // Cleanup
         free(a); free(b); free(c);
         cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
         return 0;
     }
                                                             64
            CONCEPTS   Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
                       Handling errors
COMBINING THREADS      Managing devices
AND BLOCKS
                                                 65
Combining Blocks and Threads
• We’ve seen parallel vector addition using:
   • Many blocks with one thread each
   • One block with many threads
0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
                                                                              67
Indexing Arrays: Example
• Which thread will operate on the red element?
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
M = 8 threadIdx.x = 5
0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
blockIdx.x = 2
                                                                   69
Addition with Blocks and Threads:
main()
 #define N (2048*2048)
 #define THREADS_PER_BLOCK 512
 int main(void) {
     int *a, *b, *c;                       // host copies of a, b, c
     int *d_a, *d_b, *d_c;          // device copies of a, b, c
     int size = N * sizeof(int);
                                                                          70
Addition with Blocks and Threads:
main()
     // Copy inputs to device
     cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
     cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
     // Cleanup
     free(a); free(b); free(c);
     cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
     return 0;
 }
                                                                       71
Handling Arbitrary Vector Sizes
      • Typical problems are not friendly multiples of
       blockDim.x
                                                                  72
Why Bother with Threads?
• Threads seem unnecessary
   • They add a level of complexity
   • What do we gain?
                                                        73
              CONCEPTS   Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
COOPERATING
                         Handling errors
Managing devices
THREADS
                                                   75
Stencils
• Each pixel → function of neighbors
• Edge detection:
• Blur:
                                       76
1D Stencil
 • Consider 1D stencil over 1D array of elements
    • Each output element is the sum of input elements within a radius
radius radius
                                                                         77
  Implementation within a block
• Each thread: process 1 output element          __global__ void stencil_1d(int *in, int *out) {
                                                   // note: idx comp & edge conditions omitted…
   • blockDim.x elements per block                 int result = 0;
                                                   for (int offset = -R; offset <= R; offset++)
                                                     result += in[idx + offset];
• Input elements read many times
   • With radius 3, each input element is read       // Store the result
     seven times                                     out[idx] = result;
                                                 }
                                                                                          78
  Implementation within a block
• Each thread: process 1 output element          __global__ void stencil_1d(int *in, int *out) {
                                                   // note: idx comp & edge conditions omitted…
   • blockDim.x elements per block                 int result = 0;
                                                   for (int offset = -R; offset <= R; offset++)
                                                     result += in[idx + offset];
• Input elements read many times
   • With radius 3, each input element is read       // Store the result
     seven times                                     out[idx] = result;
                                                 }
                                                                                Why is this a
                                                                                 problem?
                                                                                           79
Sharing Data Between Threads
• Terminology: within a block, threads share data via shared memory
                                                                      80
Stencil with Shared Memory
• Cache data in shared memory
  – Read (blockDim.x + 2 * radius) elements from memory to shared
  – Compute blockDim.x output elements
  – Write blockDim.x output elements to global memory
▪ Suppose thread 15 reads the halo before thread 0 has fetched it…
    int result = 0;
    result += temp[lindex + 1];       Load from temp[19]
                                                                                    83
__syncthreads()
• void __syncthreads();
                                                                           84
    Correct Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
  __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
  int gindex = threadIdx.x + blockIdx.x * blockDim.x;
  int lindex = threadIdx.x + RADIUS;
                                                87
Recap
   • Launching parallel threads
      • Launch N blocks with M threads per block with kernel<<<N,M>>>(…);
      • Use blockIdx.x to access block index within grid
      • Use threadIdx.x to access thread index within block
   • Allocate elements to threads:
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
                     Handling errors
MANAGING THE         Managing devices
DEVICE
                                               89
Coordinating Host & Device
   • Kernel launches are asynchronous
      • Control returns to the CPU immediately
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
                                                                 91
Device Management
    • Application can query and select GPUs
          cudaGetDeviceCount(int *count)
          cudaSetDevice(int device)
          cudaGetDevice(int *device)
          cudaGetDeviceProperties(cudaDeviceProp *prop, int
       device)
93