Digital Design & Computer Arch.
Lecture 20: Graphics Processing Units
Prof. Onur Mutlu
ETH Zürich
Spring 2020
8 May 2020
We Are Almost Done With This…
◼ Single-cycle Microarchitectures
◼ Pipelining
◼ Out-of-Order Execution
3
Readings for this Week
◼ Required
❑ Lindholm et al., "NVIDIA Tesla: A Unified Graphics and
Computing Architecture," IEEE Micro 2008.
◼ Recommended
❑ Peleg and Weiser, “MMX Technology Extension to the Intel
Architecture,” IEEE Micro 1996.
4
SIMD Processing:
Exploiting Regular (Data) Parallelism
Recall: Flynn’s Taxonomy of Computers
◼ Mike Flynn, “Very High-Speed Computing Systems,” Proc.
of IEEE, 1966
◼ Time-space duality
7
Recall: Array vs. Vector Processors
[Figure: Array processor vs. vector processor execution over time and space]
8
Recall: Memory Banking
◼ Memory is divided into banks that can be accessed independently;
banks share address and data buses (to minimize pin cost)
◼ Can start and complete one bank access per cycle
◼ Can sustain N parallel accesses if all N go to different banks
[Figure: CPU connected to multiple memory banks over shared address and data buses]
Picture credit: Derek Chiou 9
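A minimal sketch of the rule above, assuming a hypothetical 16-bank memory with low-order address interleaving: N accesses can be sustained in parallel only if they map to N different banks.

#include <stdio.h>

#define NUM_BANKS 16                        /* assumed bank count (hypothetical) */

int bank_of(unsigned addr) { return addr % NUM_BANKS; }   /* low-order interleaving */

/* Returns 1 if all n accesses go to different banks (can proceed in parallel),
   0 if at least two accesses conflict on the same bank (must be serialized). */
int conflict_free(const unsigned *addrs, int n) {
    int used[NUM_BANKS] = {0};
    for (int i = 0; i < n; i++) {
        int b = bank_of(addrs[i]);
        if (used[b]) return 0;
        used[b] = 1;
    }
    return 1;
}

int main(void) {
    unsigned stride1[4]  = {0, 1, 2, 3};     /* different banks -> parallel       */
    unsigned stride16[4] = {0, 16, 32, 48};  /* all map to bank 0 -> serialized   */
    printf("%d %d\n", conflict_free(stride1, 4), conflict_free(stride16, 4));
    return 0;
}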
Recall: Vector Instruction Execution
VADD A,B → C
[Figure: Execution of VADD A,B → C over time and space: a single-lane vector unit produces one result (C[0], C[1], C[2], …) per cycle, while a four-lane unit produces four results (e.g., C[0]–C[3]) per cycle]
Slide credit: Krste Asanovic 10
Recall: Vector Unit Structure
[Figure: Vector unit with four lanes; partitioned vector registers hold elements 0, 4, 8, … / 1, 5, 9, … / 2, 6, 10, … / 3, 7, 11, … per lane; each lane has its own functional unit and connection to the memory subsystem; instructions are issued to all lanes]
Recall: Automatic Code Vectorization
[Figure: Scalar code for loop iterations 1 and 2 (load, load, add, store) mapped onto a vector instruction sequence]
◼ Vectorization is a compile-time reordering of operation sequencing
❑ Requires extensive loop dependence analysis
Slide credit: Krste Asanovic 13
Vector/SIMD Processing Summary
◼ Vector/SIMD machines are good at exploiting regular data-
level parallelism
❑ Same operation performed on many data elements
❑ Improve performance, simplify design (no intra-vector
dependencies)
[Figure: Packed SIMD addition: (a3, a2, a1, a0) in $s0 plus (b3, b2, b1, b0) in $s1 gives (a3+b3, a2+b2, a1+b1, a0+b0) in $s2]
16
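The packed add above can also be written directly in C, for example with GCC/Clang vector extensions (a toolchain assumption; MMX/SSE intrinsics or an auto-vectorizing compiler achieve the same effect): one add operates on all four lanes at once.

#include <stdio.h>

typedef int v4si __attribute__((vector_size(16)));   /* four 32-bit lanes */

int main(void) {
    v4si s0 = {  1,  2,  3,  4 };   /* a0..a3 */
    v4si s1 = { 10, 20, 30, 40 };   /* b0..b3 */
    v4si s2 = s0 + s1;              /* one SIMD add: all four lanes in parallel */
    for (int i = 0; i < 4; i++)
        printf("%d ", s2[i]);       /* prints: 11 22 33 44 */
    printf("\n");
    return 0;
}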
Intel Pentium MMX Operations
◼ Idea: One instruction operates on multiple data elements
simultaneously
❑ À la array processing (yet much more limited)
Peleg and Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Micro, 1996. 18
MMX Example: Image Overlaying (II)
Y = Blossom image X = Woman’s image
Peleg and Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Micro, 1996. 19
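The overlay works by building a mask from a compare and blending the two images with it. A scalar C sketch of that select-by-mask idea (MMX does the same with PCMPEQ/PAND/PANDN/POR on eight pixels per instruction; the key value 0xC0 is just an assumed "blue screen" color):

#include <stdint.h>
#include <stdio.h>

/* Wherever x (foreground) equals the blue-screen key, take the pixel from y
   (background); otherwise keep the pixel from x. */
void overlay(const uint8_t *x, const uint8_t *y, uint8_t *out, int n, uint8_t key) {
    for (int i = 0; i < n; i++) {
        uint8_t mask = (x[i] == key) ? 0xFF : 0x00;          /* compare -> all-ones/all-zeros mask */
        out[i] = (uint8_t)((y[i] & mask) | (x[i] & ~mask));  /* branchless blend */
    }
}

int main(void) {
    uint8_t x[4] = { 7, 0xC0, 9, 0xC0 };   /* woman's image; 0xC0 = assumed key color */
    uint8_t y[4] = { 1, 2, 3, 4 };         /* blossom image */
    uint8_t out[4];
    overlay(x, y, out, 4, 0xC0);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* 7 2 9 4 */
    return 0;
}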
Fine-Grained Multithreading
20
Recall: Fine-Grained Multithreading
◼ Idea: Hardware has multiple thread contexts (PC+registers).
Each cycle, fetch engine fetches from a different thread.
❑ By the time the fetched branch/instruction resolves, no
instruction is fetched from the same thread
❑ Branch/instruction resolution latency overlapped with execution
of other threads’ instructions
22
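A minimal sketch of this fetch policy, assuming hypothetical hardware-context structures and a fixed 4-byte instruction size: the fetch engine rotates over thread contexts, so consecutive fetches never come from the same thread.

#include <stdio.h>

#define NUM_THREADS 4

typedef struct { unsigned pc; } HWContext;   /* one PC (+ registers) per hardware thread context */

/* Called once per cycle: fetch from the selected thread, then rotate to the
   next thread so the fetched branch/instruction can resolve before we return. */
unsigned fetch_next_pc(HWContext ctx[NUM_THREADS], int *next_thread) {
    unsigned pc = ctx[*next_thread].pc;
    ctx[*next_thread].pc += 4;                        /* assumed fixed-length instructions */
    *next_thread = (*next_thread + 1) % NUM_THREADS;  /* a different thread every cycle */
    return pc;
}

int main(void) {
    HWContext ctx[NUM_THREADS] = { {0x100}, {0x200}, {0x300}, {0x400} };
    int next = 0;
    for (int cycle = 0; cycle < 8; cycle++)
        printf("cycle %d: fetch from PC 0x%x\n", cycle, fetch_next_pc(ctx, &next));
    return 0;
}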
Recall: Multithreaded Pipeline Example
◼ Disadvantages
- Extra hardware complexity: multiple hardware contexts (PCs, register
files, …), thread selection logic
- Reduced single thread performance (one instruction fetched every N
cycles from the same thread)
- Resource contention between threads in caches and memory
- Some dependency checking logic between threads remains (load/store)
24
GPUs (Graphics Processing Units)
GPUs are SIMD Engines Underneath
◼ The instruction pipeline operates like a SIMD pipeline (e.g., an array processor)
◼ However, the programming is done using threads, NOT SIMD instructions
[Figure: Sequential code for two loop iterations; Iter. 1 and Iter. 2 each execute load, load, add, store]
◼ Let's examine three programming options to exploit instruction-level parallelism present in this sequential code:
1. Sequential (SISD)
2. Data Parallel (SIMD)
3. Multithreaded (MIMD/SPMD)
28
Prog. Model 1: Sequential (SISD)
for (i=0; i < N; i++)
  C[i] = A[i] + B[i];
[Figure: Scalar sequential code; Iter. 1 and Iter. 2 each execute load, load, add, store]
◼ Pipelined processor
◼ Out-of-order execution processor
❑ Independent instructions executed when ready
❑ Different iterations are present in the instruction window and can execute in parallel in multiple functional units
❑ In other words, the loop is dynamically unrolled by the hardware
◼ Superscalar or VLIW processor
❑ Can fetch and execute multiple instructions per cycle
29
Prog. Model 2: Data Parallel (SIMD)
for (i=0; i < N; i++)
  C[i] = A[i] + B[i];
◼ Realization: Each iteration is independent
[Figure: Scalar code for iterations 1 and 2 (load, load, add, store) mapped to vector instructions: VLD A → V1, VLD B → V2, VADD V1 + V2 → V3, VST V3 → C]
load
load load
Iter. 1 load
load load
add
add add
store store
load
Iter. Iter.
Realization: Each iteration is independent
Iter. 2 1 load 2
load load
load load
add add
store store
Iter. Iter.
1 2 Realization: Each iteration is independent
Idea:This
Programmer
particularormodel
compiler generates
is also a thread
called:
to execute each iteration. Each thread does the
same thing (but
SPMD: on Program
Single different data)
Multiple Data
CanCan be
be executed
be executed
Can on
on a
a SIMT
on a MIMD
executed machine
machine
SIMD machine
Single Instruction Multiple Thread 32
A GPU is a SIMD (SIMT) Machine
◼ Except it is not programmed using SIMD instructions
◼ It is programmed using threads (SPMD programming model)
❑ Each thread executes the same code but operates on a different piece of data
❑ The hardware dynamically groups threads that execute the same instruction into warps
33
SPMD on SIMT Machine
for (i=0; i < N; i++)
  C[i] = A[i] + B[i];
◼ Realization: Each iteration is independent
◼ Idea: Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data)
❑ This particular model is also called SPMD: Single Program Multiple Data
◼ Can be executed on a MIMD machine, a SIMD machine, or a SIMT machine
◼ Warp: A set of threads that execute the same instruction (i.e., at the same PC)
◼ A GPU executes it using the SIMT model: Single Instruction Multiple Thread
34
Graphics Processing Units
SIMD not Exposed to Programmer (SIMT)
SIMD vs. SIMT Execution Model
◼ SIMD: A single sequential instruction stream of SIMD
instructions → each instruction specifies multiple data inputs
❑ [VLD, VLD, VADD, VST], VLEN
◼ SIMT: Multiple instruction streams of scalar instructions → threads are dynamically grouped into warps
❑ [LD, LD, ADD, ST], NumThreads
[Figure: Loop iterations grouped into warps, each executing in lockstep at a common PC: Warp 0 at PC X (iterations 1, 2, …), Warp 1 (iterations 33, 34, …), Warp 20 (iterations 20*32+1, 20*32+2, …); every warp executes the same load, load, add, store sequence]
37
Warps and Warp-Level FGMT
◼ Warp: A set of threads that execute the same instruction
(on different data elements) → SIMT (Nvidia-speak)
◼ All threads run the same code
◼ Warp: The threads that run lengthwise in a woven fabric …
[Figure: Thread Warps 3, 7, 8, … share a SIMD pipeline; within a warp, scalar threads W, X, Y, Z execute at a common PC]
38
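For reference, the warp a CUDA thread belongs to is determined by its position within its thread block. A small sketch (hypothetical kernel name, assumes 1-D blocks):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void warp_info() {
    int t = threadIdx.x;               /* position within the (1-D) thread block */
    int warp_in_block = t / warpSize;  /* threads of a block are grouped into warps of 32 */
    int lane = t % warpSize;           /* lane: position within the warp */
    if (lane == 0)                     /* print once per warp */
        printf("block %d, warp %d starts at thread %d\n", blockIdx.x, warp_in_block, t);
}

int main() {
    warp_info<<<2, 128>>>();           /* 2 blocks x 128 threads = 4 warps per block */
    cudaDeviceSynchronize();
    return 0;
}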
High-Level View of a GPU
39
Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro 2008.
Latency Hiding via Warp-Level FGMT
◼ Warp: A set of threads that
execute the same instruction
Warps available
(on different data elements) Thread Warp 3
Thread Warp 8 for scheduling
Thread Warp 7
SIMD Pipeline
◼ Fine-grained multithreading
I-Fetch
❑ One instruction per thread in
pipeline at a time (No Decode
interlocking)
RF
RF
RF
❑ Interleave warp execution to Warps accessing
ALU
ALU
ALU
memory hierarchy
hide latencies Miss?
◼ Register values of all threads stay D-Cache Thread Warp 1
in register file All Hit? Data Thread Warp 2
[Figure (recall): warp/vector execution of an add over time and space, one element per lane per cycle, as in the earlier vector instruction execution slide]
Slide credit: Krste Asanovic 41
SIMD Execution Unit Structure
[Figure: SIMD execution unit with four lanes; each lane has its own functional unit and registers for its threads (thread IDs 0, 4, 8, … / 1, 5, 9, … / 2, 6, 10, … / 3, 7, 11, …); warp instructions are issued to all lanes, which share the memory subsystem]
45
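The thread-ID-to-lane mapping in the figure is simple modulo arithmetic; an illustrative helper (hypothetical, assuming a 4-lane unit as drawn):

/* Thread IDs 0, 4, 8, ... map to lane 0; 1, 5, 9, ... to lane 1; and so on.
   A thread's registers live in the lane that executes it. */
int lane_of_thread(int thread_id, int num_lanes /* e.g., 4 */) {
    return thread_id % num_lanes;
}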
Sample GPU SIMT Code (Simplified)
CPU code
for (ii = 0; ii < 100000; ++ii) {
C[ii] = A[ii] + B[ii];
}
CUDA code
// there are 100000 threads
__global__ void KernelFunction(…) {
int tid = blockDim.x * blockIdx.x + threadIdx.x;
int varA = aa[tid];
int varB = bb[tid];
C[tid] = varA + varB;
}
[Figure: The threads are grouped into sets of 32 (t0, t1, t2, …, t31), i.e., warps, within thread blocks]
50
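A self-contained version of the kernel above that compiles and runs as-is (the array names, managed memory, and 256-thread block size are choices of this sketch, not of the slide):

#include <cstdio>
#include <cuda_runtime.h>

/* One thread per element: the global thread ID selects the element. */
__global__ void vecAdd(const int *A, const int *B, int *C, int N) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid < N)                  /* guard: the last block may be only partially full */
        C[tid] = A[tid] + B[tid];
}

int main() {
    const int N = 100000;
    int *A, *B, *C;
    /* Managed (unified) memory keeps the sketch short; explicit cudaMalloc +
       cudaMemcpy would work just as well. */
    cudaMallocManaged(&A, N * sizeof(int));
    cudaMallocManaged(&B, N * sizeof(int));
    cudaMallocManaged(&C, N * sizeof(int));
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }

    int threadsPerBlock = 256;                              /* 8 warps of 32 threads */
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(A, B, C, N);
    cudaDeviceSynchronize();

    printf("C[12345] = %d\n", C[12345]);                    /* expect 3 * 12345 = 37035 */
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}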
SIMD vs. SIMT Execution Model
◼ SIMD: A single sequential instruction stream of SIMD
instructions → each instruction specifies multiple data inputs
❑ [VLD, VLD, VADD, VST], VLEN
◼ Branch divergence occurs when threads inside warps branch to different execution paths
[Figure: Warps X, Y, Z: some threads of a warp take Path A while others take Path B]
This is the same as conditional/predicated/masked execution.
Recall the Vector Mask and Masked Vector Operations?
55
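A hedged CUDA sketch of such divergence (hypothetical kernel; launch it like the vector-add sketch earlier): even and odd lanes of each warp take different paths, so the hardware executes Path A and Path B one after the other under the warp's active mask.

__global__ void divergent(int *out) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid % 2 == 0)
        out[tid] = 2 * tid;      /* Path A: even lanes of the warp                  */
    else
        out[tid] = tid + 100;    /* Path B: odd lanes; A and B are serialized,      */
}                                /* so only half the lanes are useful on each path  */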
Dynamic Warp Formation/Merging
◼ Idea: Dynamically merge threads executing the same
instruction (after branch divergence)
[Figure: Control-flow graph with a branch into Path A and Path B that reconverge at block G (full masks x/1111 and y/1111). Baseline execution issues A A B B C C D D E E F F G G A A; dynamic warp formation merges threads on the same path and issues A A B B C D E E F G G A A over time]
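A minimal software sketch of the merging idea, with hypothetical structures (not the actual hardware mechanism): after divergence, active threads that sit at the same PC are regrouped into new warps so each issue fills more lanes.

#include <stdio.h>

#define WARP_SIZE   4
#define NUM_THREADS 8

typedef struct { int pc; int active; } ThreadState;

/* Pack up to WARP_SIZE active threads whose PC equals target_pc into one warp;
   returns how many lanes were filled. */
int form_warp(ThreadState t[NUM_THREADS], int target_pc, int warp_out[WARP_SIZE]) {
    int n = 0;
    for (int i = 0; i < NUM_THREADS && n < WARP_SIZE; i++)
        if (t[i].active && t[i].pc == target_pc)
            warp_out[n++] = i;            /* thread i joins the dynamically formed warp */
    return n;
}

int main(void) {
    /* Even threads diverged to PC 0x40 (Path A), odd threads to PC 0x80 (Path B). */
    ThreadState t[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) { t[i].active = 1; t[i].pc = (i % 2) ? 0x80 : 0x40; }
    int warp[WARP_SIZE];
    int n = form_warp(t, 0x40, warp);     /* merge the Path-A threads into one full warp */
    printf("formed a %d-thread warp at PC 0x40: %d %d %d %d\n", n, warp[0], warp[1], warp[2], warp[3]);
    return 0;
}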
[Figure: SIMD execution unit structure (as shown earlier): per-lane registers for thread IDs 0, 4, 8, … / 1, 5, 9, … / 2, 6, 10, … / 3, 7, 11, … and a shared memory subsystem]
❑ Branch divergence
[Figure: Round Robin Scheduling, 16 total warps: memory requests from all warps (… Req Warp 15) reach the memory system at about the same time]
Narasiman et al., “Improving GPU Performance via Large Warps and Two-Level Warp
Scheduling,” MICRO 2011. 59
Large Warp Microarchitecture Example
◼ Reduce branch divergence by having large warps
◼ Dynamically break down a large warp into sub-warps
[Figure: Decode stage of the large warp microarchitecture: the large warp's two-dimensional active mask (rows of 0s and 1s) is dynamically repacked into fully populated sub-warp masks (Sub-warp 0 mask, Sub-warp 1 mask, Sub-warp 2 mask, …)]
Narasiman et al., “Improving GPU Performance via Large Warps and Two-Level Warp
Scheduling,” MICRO 2011.
Two-Level Round Robin
◼ Scheduling in two levels to deal with long latency operations
[Figure: Round Robin Scheduling, 16 total warps: the core computes with all warps, then all warps issue their memory requests (Req Warp 0, Req Warp 1, …, Req Warp 15) back to back, leaving a long gap before all warps compute again.
Two-Level Round Robin Scheduling, 2 fetch groups, 8 warps each: the first fetch group's requests (Req Warp 0 … Req Warp 7) overlap with the second group's computation, and the second group's requests (Req Warp 8, Req Warp 9, …, Req Warp 15) overlap with the first group's computation, hiding memory latency]
Narasiman et al., “Improving GPU Performance via Large Warps and Two-Level Warp
Scheduling,” MICRO 2011.
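A minimal software sketch of the selection policy, with hypothetical structures and parameters (not Narasiman et al.'s actual implementation): warps are fetched round-robin within the active fetch group only, and the scheduler moves to the next fetch group once every warp in the current group is stalled on memory.

#include <stdio.h>

#define NUM_WARPS    16
#define FETCH_GROUPS 2
#define GROUP_SIZE   (NUM_WARPS / FETCH_GROUPS)

typedef struct { int ready; } Warp;   /* ready = not waiting on a memory request */

/* Pick the next warp to issue: round-robin inside the active fetch group;
   switch groups only when the whole group is stalled. Returns -1 if all stall. */
int pick_warp(Warp warps[NUM_WARPS], int *active_group, int *rr_ptr) {
    for (int g = 0; g < FETCH_GROUPS; g++) {
        int base = *active_group * GROUP_SIZE;
        for (int i = 0; i < GROUP_SIZE; i++) {          /* round-robin within the group */
            int w = base + (*rr_ptr + i) % GROUP_SIZE;
            if (warps[w].ready) {
                *rr_ptr = (w - base + 1) % GROUP_SIZE;  /* continue after the chosen warp */
                return w;
            }
        }
        *active_group = (*active_group + 1) % FETCH_GROUPS;  /* group fully stalled: switch */
        *rr_ptr = 0;
    }
    return -1;
}

int main(void) {
    Warp warps[NUM_WARPS];
    for (int i = 0; i < NUM_WARPS; i++) warps[i].ready = 1;
    int group = 0, rr = 0;
    for (int n = 0; n < NUM_WARPS; n++) {
        int sel = pick_warp(warps, &group, &rr);
        printf("issue warp %d (fetch group %d)\n", sel, group);
        warps[sel].ready = 0;            /* pretend it immediately stalls on memory */
    }
    return 0;
}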
An Example GPU
NVIDIA GeForce GTX 285
◼ NVIDIA-speak:
❑ 240 stream processors
❑ “SIMT execution”
◼ Generic speak:
❑ 30 cores
[Figures: GTX 285 core: 64 KB of storage for thread contexts (registers); full chip: 30 such cores plus texture (Tex) units]
[Chart: Number of stream processors and GFLOPS across NVIDIA GPU generations: GTX 285 (2009), GTX 480 (2010), GTX 780 (2013), GTX 980 (2014), P100 (2016), V100 (2017)]
67
NVIDIA V100
◼ NVIDIA-speak:
❑ 5120 stream processors
❑ “SIMT execution”
◼ Generic speak:
❑ 80 cores
68
NVIDIA V100 Block Diagram
https://devblogs.nvidia.com/inside-volta/
70
Food for Thought
◼ Compare and contrast GPUs vs Systolic Arrays
71
Digital Design & Computer Arch.
Lecture 20: Graphics Processing Units
ETH Zürich
Spring 2020
8 May 2020
Clarification of some GPU Terms
Generic Term | NVIDIA Term | AMD Term | Comments
Vector length | Warp size | Wavefront size | Number of threads that run in parallel (lock-step) on a SIMD functional unit
Pipelined functional unit / Scalar pipeline | Streaming processor / CUDA core | - | Functional unit that executes instructions for one GPU thread
SIMD functional unit / SIMD pipeline | Group of N streaming processors (e.g., N=8 in GTX 285, N=16 in Fermi) | Vector ALU | SIMD functional unit that executes instructions for an entire warp
GPU core | Streaming multiprocessor | Compute unit | It contains one or more warp schedulers and one or several SIMD pipelines
73