How to Compute This Fast?
•  Performing the same operations on many data items
                                                                                                      •  Example: SAXPY
                                                                                                                                               L1: ldf [X+r1]->f1 // I is in r1
                                                                                                for (I = 0; I < 1024; I++) {                       mulf f0,f1->f2 // A is in f0
           CIS 501: Computer Architecture                                                       }
                                                                                                  Z[I] = A*X[I] + Y[I];                            ldf [Y+r1]->f3
                                                                                                                                                   addf f2,f3->f4
                                                                                                                                                   stf f4->[Z+r1}
                                                                                                                                                   addi r1,4->r1
       Unit 11: Data-Level Parallelism:                                                                                                            blti r1,4096,L1
                Vectors & GPUs                                                                   •  Instruction-level parallelism (ILP) - fine grained
                                                                                                      •  Loop unrolling with static scheduling –or– dynamic scheduling
           Slides'developed'by'Milo'Mar0n'&'Amir'Roth'at'the'University'of'Pennsylvania''             •  Wide-issue superscalar (non-)scaling limits benefits
                      with'sources'that'included'University'of'Wisconsin'slides'
                         by'Mark'Hill,'Guri'Sohi,'Jim'Smith,'and'David'Wood'                     •  Thread-level parallelism (TLP) - coarse grained
                                                                                                      •  Multicore
                                                                                                 •  Can we do some “medium grained” parallelism?
CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs                                   1    CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs                 2
Data-Level Parallelism                                                                           Example Vector ISA Extensions (SIMD)
•  Data-level parallelism (DLP)                                                                  •  Extend ISA with floating point (FP) vector storage …
     •  Single operation repeated on multiple data elements                                           •  Vector register: fixed-size array of 32- or 64- bit FP elements
          •  SIMD (Single-Instruction, Multiple-Data)                                                 •  Vector length: For example: 4, 8, 16, 64, …
     •  Less general than ILP: parallel insns are all same operation                             •  … and example operations for vector length of 4
     •  Exploit with vectors                                                                          •  Load vector: ldf.v [X+r1]->v1
•  Old idea: Cray-1 supercomputer from late 1970s                                                         ldf [X+r1+0]->v10
     •  Eight 64-entry x 64-bit floating point “vector registers”                                         ldf [X+r1+1]->v11
         •  4096 bits (0.5KB) in each register! 4KB for vector register file                              ldf [X+r1+2]->v12
     •  Special vector instructions to perform vector operations                                          ldf [X+r1+3]->v13
         •  Load vector, store vector (wide memory operation)                                         •  Add two vectors: addf.vv v1,v2->v3
         •  Vector+Vector or Vector+Scalar                                                                addf v1i,v2i->v3i (where i is 0,1,2,3)
              •  addition, subtraction, multiply, etc.                                                •  Add vector to scalar: addf.vs v1,f2,v3
         •  In Cray-1, each instruction specifies 64 operations!                                          addf v1i,f2->v3i (where i is 0,1,2,3)
     •  ALUs were expensive, so one operation per cycle (not parallel)                           •  Today’s vectors: short (128 or 256 bits), but fully parallel
CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs                                   3    CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs                 4
Example Use of Vectors – 4-wide                                                           Vector Datapath & Implementatoin
             ldf [X+r1]->f1                                 ldf.v [X+r1]->v1
             mulf f0,f1->f2                                 mulf.vs v1,f0->v2             •  Vector insn. are just like normal insn… only “wider”
             ldf [Y+r1]->f3                                 ldf.v [Y+r1]->v3                   •    Single instruction fetch (no extra N2 checks)
             addf f2,f3->f4                                 addf.vv v2,v3->v4
             stf f4->[Z+r1]                                 stf.v v4,[Z+r1]                    •    Wide register read & write (not multiple ports)
             addi r1,4->r1                                  addi r1,16->r1
             blti r1,4096,L1                                blti r1,4096,L1                    •    Wide execute: replicate floating point unit (same as superscalar)
              7x1024 instructions                             7x256 instructions               •    Wide bypass (avoid N2 bypass problem)
•  Operations                                               (4x fewer instructions)            •    Wide cache read & write (single cache tag check)
     •    Load vector: ldf.v [X+r1]->v1
     •    Multiply vector to scalar: mulf.vs v1,f2->v3                                    •  Execution width (implementation) vs vector width (ISA)
     •    Add two vectors: addf.vv v1,v2->v3                                                   •  Example: Pentium 4 and “Core 1” executes vector ops at half width
     •    Store vector: stf.v v1->[X+r1]                                                       •  “Core 2” executes them at full width
•  Performance?
                                                                                          •  Because they are just instructions…
     •  Best case: 4x speedup
                                                                                               •  …superscalar execution of vector instructions
     •  But, vector instructions don’t always have single-cycle throughput
                                                                                               •  Multiple n-wide vector instructions per cycle
         •  Execution width (implementation) vs vector width (ISA)
CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs                             5   CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs                 6
Intel’s SSE2/SSE3/SSE4/AVX…                                                               Other Vector Instructions
•  Intel SSE2 (Streaming SIMD Extensions 2) - 2001                                        •  These target specific domains: e.g., image processing, crypto
     •  16 128bit floating point registers (xmm0–xmm15)                                        •    Vector reduction (sum all elements of a vector)
     •  Each can be treated as 2x64b FP or 4x32b FP (“packed FP”)                              •    Geometry processing: 4x4 translation/rotation matrices
         •  Or 2x64b or 4x32b or 8x16b or 16x8b ints (“packed integer”)                        •    Saturating (non-overflowing) subword add/sub: image processing
         •  Or 1x64b or 1x32b FP (just normal scalar floating point)                           •    Byte asymmetric operations: blending and composition in graphics
     •  Original SSE: only 8 registers, no packed integer support                              •    Byte shuffle/permute: crypto
                                                                                               •    Population (bit) count: crypto
•  Other vector extensions                                                                     •    Max/min/argmax/argmin: video codec
     •  AMD 3DNow!: 64b (2x32b)                                                                •    Absolute differences: video codec
     •  PowerPC AltiVEC/VMX: 128b (2x64b or 4x32b)                                             •    Multiply-accumulate: digital-signal processing
                                                                                               •    Special instructions for AES encryption
•  Looking forward for x86                                                                •  More advanced (but in Intel’s Xeon Phi)
     •  Intel’s “Sandy Bridge” brings 256-bit vectors to x86                                   •  Scatter/gather loads: indirect store (or load) from a vector of pointers
     •  Intel’s “Xeon Phi” multicore will bring 512-bit vectors to x86                         •  Vector mask: predication (conditional execution) of specific elements
CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs                             7   CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs                 8
Using Vectors in Your Code                                                      Recap: Vectors for Exploiting DLP
•  Write in assembly                                                            •  Vectors are an efficient way of capturing parallelism
     •  Ugh                                                                          •    Data-level parallelism
                                                                                     •    Avoid the N2 problems of superscalar
•  Use “intrinsic” functions and data types
     •  For example: _mm_mul_ps() and “__m128” datatype
                                                                                     •    Avoid the difficult fetch problem of superscalar
                                                                                     •    Area efficient, power efficient
•  Use vector data types
     •  typedef double v2df __attribute__ ((vector_size (16)));                 •  The catch?
•  Use a library someone else wrote                                                  •  Need code that is “vector-izable”
     •  Let them do the hard work                                                    •  Need to modify program (unlike dynamic-scheduled superscalar)
     •  Matrix and linear algebra packages                                           •  Requires some help from the programmer
•  Let the compiler do it (automatic vectorization, with feedback)              •  Looking forward: Intel “Xeon Phi” (aka Larrabee) vectors
     •  GCC’s “-ftree-vectorize” option, -ftree-vectorizer-verbose=n
                                                                                     •  More flexible (vector “masks”, scatter, gather) and wider
     •  Limited impact for C/C++ code (old, hard problem)
                                                                                     •  Should be easier to exploit, more bang for the buck
CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs                  9    CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs               10
Graphics Processing Units (GPU)                                                 GPUs and SIMD/Vector Data Parallelism
•  Killer app for parallelism: graphics (3D games)
                                                                                •  How do GPUs have such high peak FLOPS & FLOPS/Joule?
                                                                                     •  Exploit massive data parallelism – focus on total throughput
                                                                                     •  Remove hardware structures that accelerate single threads
                                                                                     •  Specialized for graphs: e.g., data-types & dedicated texture units
                                                                                •  “SIMT” execution model
                                                                                     •  Single instruction multiple threads
                                                                                     •  Similar to both “vectors” and “SIMD”
                                                                  Tesla S870!        •  A key difference: better support for conditional control flow
                                                                                •  Program it with CUDA or OpenCL
                                                                                     •  Extensions to C
                                                                                     •  Perform a “shader task” (a snippet of scalar computation) over
                                                                                        many elements
                                                                                     •  Internally, GPU uses scatter/gather and vector mask operations
CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs                 11    CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs               12
Slide by Kayvon Fatahalian - http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf   Slide by Kayvon Fatahalian - http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
                                                                                                           13                                                                                                                    14
Slide by Kayvon Fatahalian - http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf   Slide by Kayvon Fatahalian - http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
                                                                                                           15                                                                                                                    16
Slide by Kayvon Fatahalian - http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf   Slide by Kayvon Fatahalian - http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
                                                                                                           17                                                                                                                    18
Slide by Kayvon Fatahalian - http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf   Slide by Kayvon Fatahalian - http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
                                                                                                           19                                                                                                                    20
Data Parallelism Summary
•  Data Level Parallelism
     •  “medium-grained” parallelism between ILP and TLP
     •  Still one flow of execution (unlike TLP)
     •  Compiler/programmer must explicitly expresses it (unlike ILP)
•  Hardware support: new “wide” instructions (SIMD)
     •  Wide registers, perform multiple operations in parallel
•  Trends
     •  Wider: 64-bit (MMX, 1996), 128-bit (SSE2, 2000),
        256-bit (AVX, 2011), 512-bit (Xeon Phi, 2012?)
     •  More advanced and specialized instructions
•  GPUs
     •  Embrace data parallelism via “SIMT” execution model
     •  Becoming more programmable all the time
•  Today’s chips exploit parallelism at all levels: ILP, DLP, TLP
CIS 501: Comp. Arch. | Prof. Milo Martin | Vectors & GPUs               21