
Digital Design & Computer Arch.

Lecture 20: Graphics Processing Units

Prof. Onur Mutlu

ETH Zürich
Spring 2020
8 May 2020
We Are Almost Done With This…
◼ Single-cycle Microarchitectures
◼ Multi-cycle and Microprogrammed Microarchitectures
◼ Pipelining
◼ Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, …
◼ Out-of-Order Execution
◼ Other Execution Paradigms
2
Approaches to (Instruction-Level) Concurrency
◼ Pipelining
◼ Out-of-order execution
◼ Dataflow (at the ISA level)
◼ Superscalar Execution
◼ VLIW
◼ Systolic Arrays
◼ Decoupled Access Execute
◼ Fine-Grained Multithreading
◼ SIMD Processing (Vector and array processors, GPUs)

3
Readings for this Week
◼ Required
❑ Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro 2008.

◼ Recommended
❑ Peleg and Weiser, “MMX Technology Extension to the Intel
Architecture,” IEEE Micro 1996.

4
SIMD Processing:
Exploiting Regular (Data) Parallelism
Recall: Flynn’s Taxonomy of Computers
◼ Mike Flynn, “Very High-Speed Computing Systems,” Proc.
of IEEE, 1966

◼ SISD: Single instruction operates on single data element


◼ SIMD: Single instruction operates on multiple data elements
❑ Array processor
❑ Vector processor
◼ MISD: Multiple instructions operate on single data element
❑ Closest form: systolic array processor, streaming processor
◼ MIMD: Multiple instructions operate on multiple data
elements (multiple instruction streams)
❑ Multiprocessor
❑ Multithreaded processor
6
Recall: SIMD Processing
◼ Single instruction operates on multiple data elements
❑ In time or in space
◼ Multiple processing elements

◼ Time-space duality
❑ Array processor: Instruction operates on multiple data elements at the same time using different spaces
❑ Vector processor: Instruction operates on multiple data elements in consecutive time steps using the same space
7
Recall: Array vs. Vector Processors
[Figure: execution of the sequence LD VR ← A[3:0]; ADD VR ← VR, 1; MUL VR ← VR, 2; ST A[3:0] ← VR on an array processor vs. a vector processor. The array processor performs the same operation on all elements at the same time (different ops over time, same op across space); the vector processor performs the operations on the elements in consecutive time steps using the same space (different ops in flight at the same time).]
8
Recall: Memory Banking
◼ Memory is divided into banks that can be accessed independently;
banks share address and data buses (to minimize pin cost)
◼ Can start and complete one bank access per cycle
◼ Can sustain N parallel accesses if all N go to different banks

[Figure: 16 memory banks (Bank 0 through Bank 15), each with its own MDR and MAR, sharing the address and data buses connected to the CPU.]
Picture credit: Derek Chiou 9
Recall: Vector Instruction Execution
VADD A,B → C
[Figure: execution of the vector add using one pipelined functional unit (one element pair per cycle) vs. four pipelined functional units (four element pairs per cycle, elements spread across lanes in space).]
Slide credit: Krste Asanovic 10
Recall: Vector Unit Structure
[Figure: a vector unit with four lanes. Each lane contains pipelined functional units and a partition of the vector registers (lane 0 holds elements 0, 4, 8, …; lane 1 holds elements 1, 5, 9, …; lane 2 holds elements 2, 6, 10, …; lane 3 holds elements 3, 7, 11, …). All lanes connect to the memory subsystem.]
Slide credit: Krste Asanovic 11


Recall: Vector Instruction Level Parallelism
Can overlap execution of multiple vector instructions
❑ Example machine has 32 elements per vector register and 8 lanes
❑ Completes 24 operations/cycle while issuing 1 vector instruction/cycle
[Figure: load, multiply, and add vector instructions overlapped on the Load Unit, Multiply Unit, and Add Unit over time, with one vector instruction issued per cycle.]
Slide credit: Krste Asanovic 12


Automatic Code Vectorization
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
[Figure: scalar sequential code (load, load, add, store repeated per iteration) vs. vectorized code, in which one vector instruction performs the corresponding operation for all iterations.]
Vectorization is a compile-time reordering of operation sequencing
❑ Requires extensive loop dependence analysis
Slide credit: Krste Asanovic 13
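Illustration (not from the slides): the loop above is vectorizable because its iterations are independent; a minimal sketch in C, assuming the arrays do not overlap, shows the kind of code a vectorizing compiler can turn into vector instructions.

/* Hypothetical sketch: the same loop, written so the compiler can prove
 * the iterations independent. "restrict" asserts that A, B, and C do not
 * alias, which the loop dependence analysis mentioned above must
 * otherwise establish. */
void vec_add(float *restrict C, const float *restrict A,
             const float *restrict B, int N)
{
    for (int i = 0; i < N; i++)   /* independent iterations -> SIMD */
        C[i] = A[i] + B[i];
}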
Vector/SIMD Processing Summary
◼ Vector/SIMD machines are good at exploiting regular data-
level parallelism
❑ Same operation performed on many data elements
❑ Improve performance, simplify design (no intra-vector
dependencies)

◼ Performance improvement limited by vectorizability of code


❑ Scalar operations limit vector machine performance
❑ Remember Amdahl’s Law
❑ CRAY-1 was the fastest SCALAR machine at its time!

◼ Many existing ISAs include (vector-like) SIMD operations


❑ Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD
14
SIMD Operations in Modern ISAs
SIMD ISA Extensions
◼ Single Instruction Multiple Data (SIMD) extension
instructions
❑ Single instruction acts on multiple pieces of data at once
❑ Common application: graphics
❑ Perform short arithmetic operations (also called packed
arithmetic)
◼ For example: add four 8-bit numbers
◼ Must modify ALU to eliminate carries between 8-bit values
padd8 $s2, $s0, $s1
  Bit position:  31..24 | 23..16 | 15..8  | 7..0
  $s0:             a3   |   a2   |   a1   |   a0
+ $s1:             b3   |   b2   |   b1   |   b0
  $s2:           a3+b3  |  a2+b2 |  a1+b1 |  a0+b0
16
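Illustration (not from the slides): a scalar C emulation of the packed add above; the per-lane masking models the eliminated carries between the 8-bit values.

/* Hypothetical sketch of padd8 semantics in scalar C. Each 8-bit lane is
 * added independently; masking to 8 bits models the suppressed carries
 * between lanes. */
#include <stdint.h>

uint32_t padd8(uint32_t s0, uint32_t s1)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (s0 >> (8 * lane)) & 0xFF;
        uint32_t b = (s1 >> (8 * lane)) & 0xFF;
        result |= ((a + b) & 0xFF) << (8 * lane);  /* no carry into the next lane */
    }
    return result;
}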
Intel Pentium MMX Operations
◼ Idea: One instruction operates on multiple data elements
simultaneously
❑ À la array processing (yet much more limited)

❑ Designed with multimedia (graphics) operations in mind


No VLEN register
Opcode determines data type:
8 8-bit bytes
4 16-bit words
2 32-bit doublewords
1 64-bit quadword

Stride is always equal to 1.

Peleg and Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Micro, 1996.
17
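For a concrete flavor (not on the slide): the packed-add idea survives in MMX’s SSE2 successor; a minimal sketch using the standard _mm_add_epi8 intrinsic, which adds sixteen 8-bit lanes at once (the surrounding function is illustrative only).

/* Sketch using SSE2 intrinsics (the successor to MMX). */
#include <emmintrin.h>
#include <stdint.h>

void add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* 16 bytes of a        */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);  /* 16 bytes of b        */
    __m128i vc = _mm_add_epi8(va, vb);                 /* 16 lane-wise 8-bit adds */
    _mm_storeu_si128((__m128i *)dst, vc);
}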
MMX Example: Image Overlaying (I)
◼ Goal: Overlay the human in image 1 on top of the background in image 2

Peleg and Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Micro, 1996. 18
MMX Example: Image Overlaying (II)
Y = Blossom image X = Woman’s image

Peleg and Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Micro, 1996. 19
Fine-Grained Multithreading

20
Recall: Fine-Grained Multithreading
◼ Idea: Hardware has multiple thread contexts (PC+registers).
Each cycle, fetch engine fetches from a different thread.
❑ By the time the fetched branch/instruction resolves, no
instruction is fetched from the same thread
❑ Branch/instruction resolution latency overlapped with execution
of other threads’ instructions

+ No logic needed for handling control and data dependences within a thread
-- Single thread performance suffers
-- Extra logic for keeping thread contexts
-- Does not overlap latency if not enough threads to cover the whole pipeline
21
Recall: Fine-Grained Multithreading (II)
◼ Idea: Switch to another thread every cycle such that no two
instructions from a thread are in the pipeline concurrently

◼ Tolerates the control and data dependency latencies by overlapping the latency with useful work from other threads
◼ Improves pipeline utilization by taking advantage of multiple threads

◼ Thornton, “Parallel Operation in the Control Data 6600,” AFIPS 1964.
◼ Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978.
22
Recall: Multithreaded Pipeline Example

Slide credit: Joel Emer 23


Recall: Fine-grained Multithreading
◼ Advantages
+ No need for dependency checking between instructions
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions from
different threads
+ Improved system throughput, latency tolerance, utilization

◼ Disadvantages
- Extra hardware complexity: multiple hardware contexts (PCs, register
files, …), thread selection logic
- Reduced single thread performance (one instruction fetched every N
cycles from the same thread)
- Resource contention between threads in caches and memory
- Some dependency checking logic between threads remains (load/store)
24
GPUs (Graphics Processing Units)
GPUs are SIMD Engines Underneath
◼ The instruction pipeline operates like a SIMD pipeline (e.g., an array processor)

◼ However, the programming is done using threads, NOT SIMD instructions

◼ To understand this, let’s go back to our parallelizable code example

◼ But, before that, let’s distinguish between
❑ Programming Model (Software)
vs.
❑ Execution Model (Hardware)
26
Programming Model vs. Hardware Execution Model
◼ Programming Model refers to how the programmer expresses
the code
❑ E.g., Sequential (von Neumann), Data Parallel (SIMD), Dataflow,
Multi-threaded (MIMD, SPMD), …

◼ Execution Model refers to how the hardware executes the code underneath
❑ E.g., Out-of-order execution, Vector processor, Array processor, Dataflow processor, Multiprocessor, Multithreaded processor, …

◼ Execution Model can be very different from the Programming Model
❑ E.g., von Neumann model implemented by an OoO processor
❑ E.g., SPMD model implemented by a SIMD processor (a GPU)
27
How Can You Exploit Parallelism Here?
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
[Figure: scalar sequential code for the loop; each iteration performs load, load, add, store.]
Let’s examine three programming options to exploit instruction-level parallelism present in this sequential code:
1. Sequential (SISD)
2. Data-Parallel (SIMD)
3. Multithreaded (MIMD/SPMD)
28
Prog. Model 1: Sequential (SISD)
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
[Figure: scalar sequential code; each iteration performs load, load, add, store.]
◼ Can be executed on a:
◼ Pipelined processor
◼ Out-of-order execution processor
❑ Independent instructions executed when ready
❑ Different iterations are present in the instruction window and can execute in parallel in multiple functional units
❑ In other words, the loop is dynamically unrolled by the hardware
◼ Superscalar or VLIW processor
❑ Can fetch and execute multiple instructions per cycle
29
Prog. Model 2: Data Parallel (SIMD)
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
[Figure: scalar sequential code (two iterations shown) next to the vectorized code; each scalar load/load/add/store maps to one vector instruction covering all iterations.]
Vectorized Code:
    VLD A → V1
    VLD B → V2
    VADD V1 + V2 → V3
    VST V3 → C
Realization: Each iteration is independent
Idea: Programmer or compiler generates a SIMD instruction to execute the same instruction from all iterations across different data
Best executed by a SIMD processor (vector, array)
30
Prog. Model 3: Multithreaded
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
[Figure: scalar sequential code next to two threads, each executing load, load, add, store for one iteration.]
Realization: Each iteration is independent
Idea: Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data)
Can be executed on a MIMD machine
31
Prog. Model 3: Multithreaded
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
[Figure: two threads, each executing load, load, add, store for one iteration.]
Realization: Each iteration is independent
Idea: Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data)
This particular model is also called:
SPMD: Single Program Multiple Data
Can be executed on a MIMD machine
Can be executed on a SIMD machine
Can be executed on a SIMT machine (Single Instruction Multiple Thread)
32
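Illustration (not from the slides): a minimal SPMD sketch with POSIX threads, where every thread runs the same function on a different slice of the data; the thread count, array size, and slicing are assumptions.

/* Hypothetical sketch: SPMD with one POSIX thread per chunk of iterations. */
#include <pthread.h>

#define NUM_THREADS 4
static float A[1024], B[1024], C[1024];

static void *worker(void *arg)
{
    long tid = (long)arg;                  /* which thread am I?        */
    long chunk = 1024 / NUM_THREADS;       /* iterations per thread     */
    for (long i = tid * chunk; i < (tid + 1) * chunk; i++)
        C[i] = A[i] + B[i];                /* same code, different data */
    return 0;
}

int main(void)
{
    pthread_t t[NUM_THREADS];
    for (long tid = 0; tid < NUM_THREADS; tid++)
        pthread_create(&t[tid], 0, worker, (void *)tid);
    for (long tid = 0; tid < NUM_THREADS; tid++)
        pthread_join(t[tid], 0);
    return 0;
}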
A GPU is a SIMD (SIMT) Machine
◼ Except it is not programmed using SIMD instructions

◼ It is programmed using threads (SPMD programming model)
❑ Each thread executes the same code but operates on a different piece of data
❑ Each thread has its own context (i.e., can be treated/restarted/executed independently)

◼ A set of threads executing the same instruction are dynamically grouped into a warp (wavefront) by the hardware
❑ A warp is essentially a SIMD operation formed by hardware!
33
SPMD on SIMT Machine
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
[Figure: two threads (iterations 1 and 2), each executing load, load, add, store. The corresponding loads execute as Warp 0 at PC X and PC X+1, the adds as Warp 0 at PC X+2, and the stores as Warp 0 at PC X+3.]
Warp: A set of threads that execute the same instruction (i.e., at the same PC)
Realization: Each iteration is independent
Idea: Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data)
This particular model is also called SPMD: Single Program Multiple Data
A GPU executes it using the SIMT model: Single Instruction Multiple Thread
34
Graphics Processing Units
SIMD not Exposed to Programmer (SIMT)
SIMD vs. SIMT Execution Model
◼ SIMD: A single sequential instruction stream of SIMD instructions → each instruction specifies multiple data inputs
❑ [VLD, VLD, VADD, VST], VLEN

◼ SIMT: Multiple instruction streams of scalar instructions → threads grouped dynamically into warps
❑ [LD, LD, ADD, ST], NumThreads

◼ Two Major SIMT Advantages:
❑ Can treat each thread separately → i.e., can execute each thread independently (on any type of scalar pipeline) → MIMD processing
❑ Can group threads into warps flexibly → i.e., can group threads that are supposed to truly execute the same instruction → dynamically obtain and maximize benefits of SIMD processing
36
Fine-Grained Multithreading of Warps
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
◼ Assume a warp consists of 32 threads
◼ If you have 32K iterations, and 1 iteration/thread → 1K warps
◼ Warps can be interleaved on the same pipeline → Fine grained multithreading of warps
[Figure: warps interleaved on the same pipeline, e.g., one warp executing its loads at PC X on iterations 1 and 2, another on iterations 33 and 34, and Warp 20 executing its add at PC X+2 on iterations 20*32+1 and 20*32+2.]
37
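Illustration (not from the slides): a CUDA sketch of the arithmetic above; kernel name, block size, and array types are assumptions. Launching 32K threads with 32 threads per warp yields 32K / 32 = 1K warps, which the hardware interleaves on its SIMD pipelines.

/* Hypothetical sketch: one thread per loop iteration. */
__global__ void vadd_kernel(float *C, const float *A, const float *B, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one iteration per thread */
    if (i < N)
        C[i] = A[i] + B[i];
}

void launch(float *C, const float *A, const float *B)
{
    int N = 32 * 1024;                    /* 32K iterations                   */
    int threadsPerBlock = 256;            /* 8 warps per block                */
    int blocks = N / threadsPerBlock;     /* 128 blocks -> 1K warps in total  */
    vadd_kernel<<<blocks, threadsPerBlock>>>(C, A, B, N);
}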
Warps and Warp-Level FGMT
◼ Warp: A set of threads that execute the same instruction (on different data elements) → SIMT (Nvidia-speak)
◼ All threads run the same code
◼ Warp: The threads that run lengthwise in a woven fabric …
[Figure: several thread warps (e.g., Thread Warps 3, 7, 8) waiting to issue; one thread warp of scalar threads W, X, Y, Z shares a common PC and feeds a SIMD pipeline.]
38
Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro 2008.
High-Level View of a GPU

39
Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro 2008.
Latency Hiding via Warp-Level FGMT
◼ Warp: A set of threads that execute the same instruction (on different data elements)
◼ Fine-grained multithreading
❑ One instruction per thread in pipeline at a time (No interlocking)
❑ Interleave warp execution to hide latencies
◼ Register values of all threads stay in register file
◼ FGMT enables long latency tolerance
❑ Millions of pixels
[Figure: SIMD pipeline (I-Fetch, Decode, per-lane RF and ALU, D-Cache, Writeback). Warps available for scheduling (e.g., Thread Warps 3, 7, 8) are fetched in turn; warps whose accesses miss in the D-Cache (e.g., Thread Warps 1, 2, 6) wait on the memory hierarchy while other warps execute.]
Slide credit: Tor Aamodt 40
Warp Execution (Recall the Slide)
32-thread warp executing ADD A[tid],B[tid] → C[tid]
[Figure: execution using one pipelined functional unit (one element pair per cycle) vs. four pipelined functional units (four element pairs per cycle, spread across lanes in space).]
Slide credit: Krste Asanovic 41
SIMD Execution Unit Structure
[Figure: a SIMD execution unit with four lanes. Each lane contains pipelined functional units and the registers for a subset of thread IDs (lane 0: threads 0, 4, 8, …; lane 1: threads 1, 5, 9, …; lane 2: threads 2, 6, 10, …; lane 3: threads 3, 7, 11, …). All lanes connect to the memory subsystem.]
Slide credit: Krste Asanovic 42


Warp Instruction Level Parallelism
Can overlap execution of multiple instructions
❑ Example machine has 32 threads per warp and 8 lanes
❑ Completes 24 operations/cycle while issuing 1 warp/cycle
[Figure: warps W0–W5 issued one per cycle; their load, multiply, and add instructions overlap on the Load Unit, Multiply Unit, and Add Unit over time.]
Slide credit: Krste Asanovic 43


SIMT Memory Access
◼ Same instruction in different threads uses thread id to index and access different data elements

Let’s assume N=16, 4 threads per warp → 4 warps
[Figure: threads 0–15 each add one of data elements 0–15; threads 0–3 form Warp 0, threads 4–7 Warp 1, threads 8–11 Warp 2, threads 12–15 Warp 3.]
Slide credit: Hyesoon Kim 44


Warps not Exposed to GPU Programmers
◼ CPU threads and GPU kernels
❑ Sequential or modestly parallel sections on CPU
❑ Massively parallel sections on GPU: Blocks of threads

Serial Code (host)
Parallel Kernel (device): KernelA<<< nBlk, nThr >>>(args); ...
Serial Code (host)
Parallel Kernel (device): KernelB<<< nBlk, nThr >>>(args); ...

Slide credit: Hwu & Kirk
45
Sample GPU SIMT Code (Simplified)

CPU code
for (ii = 0; ii < 100000; ++ii) {
    C[ii] = A[ii] + B[ii];
}

CUDA code
// there are 100000 threads; each thread handles one iteration
__global__ void KernelFunction(…) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;  // global thread id
    int varA = aa[tid];   // aa, bb correspond to A, B in the CPU code
    int varB = bb[tid];
    C[tid] = varA + varB;
}

Slide credit: Hyesoon Kim 46
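For completeness (not on the slide): a host-side sketch that could launch such a kernel. The slide elides the kernel’s parameter list, so the signature, array types, and block size below are purely assumptions.

/* Hypothetical host code: copy inputs to the GPU, launch one thread per
 * iteration, copy the result back. Assumes the kernel takes the two input
 * arrays and the output array as device pointers. */
__global__ void KernelFunction(const int *aa, const int *bb, int *C);  /* assumed signature */

void run_on_gpu(const int *A_host, const int *B_host, int *C_host)
{
    const int N = 100000;                 /* one thread per iteration */
    size_t bytes = N * sizeof(int);
    int *aa, *bb, *C;                     /* device copies            */
    cudaMalloc(&aa, bytes);  cudaMalloc(&bb, bytes);  cudaMalloc(&C, bytes);
    cudaMemcpy(aa, A_host, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(bb, B_host, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;  /* round up */
    KernelFunction<<<blocks, threadsPerBlock>>>(aa, bb, C);

    cudaMemcpy(C_host, C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(aa);  cudaFree(bb);  cudaFree(C);
}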


Sample GPU Program (Less Simplified)

Slide credit: Hyesoon Kim 47


From Blocks to Warps
◼ GPU cores: SIMD pipelines
❑ Streaming Multiprocessors (SM)
❑ Streaming Processors (SP)

◼ Blocks are divided into warps
❑ SIMD unit (32 threads)
[Figure: NVIDIA Fermi architecture. Block 0’s, Block 1’s, and Block 2’s warps, each warp consisting of threads t0 t1 t2 … t31.]
48
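Illustration (not from the slides): how a block’s threads land in warps, sketched in CUDA; the kernel name and output arrays are assumptions, while warpSize is the standard CUDA built-in (32 on NVIDIA GPUs).

/* Hypothetical sketch: each thread computes which warp and SIMD lane of
 * its block it falls into. Assumes a 1-D block. */
__global__ void where_am_i(int *warp_of, int *lane_of)
{
    int tid_in_block = threadIdx.x;
    int warp_id = tid_in_block / warpSize;   /* which warp within the block */
    int lane_id = tid_in_block % warpSize;   /* which SIMD lane             */
    int gid = blockIdx.x * blockDim.x + tid_in_block;
    warp_of[gid] = warp_id;
    lane_of[gid] = lane_id;
}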
Warp-based SIMD vs. Traditional SIMD
◼ Traditional SIMD contains a single thread
❑ Sequential instruction execution; lock-step operations in a SIMD instruction
❑ Programming model is SIMD (no extra threads) → SW needs to know vector length
❑ ISA contains vector/SIMD instructions

◼ Warp-based SIMD consists of multiple scalar threads executing in a SIMD manner (i.e., same instruction executed by all threads)
❑ Does not have to be lock step
❑ Each thread can be treated individually (i.e., placed in a different warp) → programming model not SIMD
◼ SW does not need to know vector length
◼ Enables multithreading and flexible dynamic grouping of threads
❑ ISA is scalar → SIMD operations can be formed dynamically
❑ Essentially, it is SPMD programming model implemented on SIMD hardware
49
SPMD
◼ Single procedure/program, multiple data
❑ This is a programming model rather than computer organization

◼ Each processing element executes the same procedure, except on different data elements
❑ Procedures can synchronize at certain points in program, e.g. barriers

◼ Essentially, multiple instruction streams execute the same program
❑ Each program/procedure 1) works on different data, 2) can execute a different control-flow path, at run-time
❑ Many scientific applications are programmed this way and run on MIMD hardware (multiprocessors)
❑ Modern GPUs are programmed in a similar way on SIMD hardware
50
SIMD vs. SIMT Execution Model
◼ SIMD: A single sequential instruction stream of SIMD instructions → each instruction specifies multiple data inputs
❑ [VLD, VLD, VADD, VST], VLEN

◼ SIMT: Multiple instruction streams of scalar instructions → threads grouped dynamically into warps
❑ [LD, LD, ADD, ST], NumThreads

◼ Two Major SIMT Advantages:
❑ Can treat each thread separately → i.e., can execute each thread independently on any type of scalar pipeline → MIMD processing
❑ Can group threads into warps flexibly → i.e., can group threads that are supposed to truly execute the same instruction → dynamically obtain and maximize benefits of SIMD processing
51
Threads Can Take Different Paths in Warp-based SIMD
◼ Each thread can have conditional control flow instructions
◼ Threads can execute different control flow paths

[Figure: a thread warp of threads 1–4 with a common PC; after basic block B, the threads branch to different basic blocks (C, D, F) of the control flow graph.]
Slide credit: Tor Aamodt 52

Control Flow Problem in GPUs/SIMT
◼ A GPU uses a SIMD pipeline to save area on control logic
❑ Groups scalar threads into warps

◼ Branch divergence occurs when threads inside warps branch to different execution paths
[Figure: a warp reaching a branch; some threads follow Path A and the others follow Path B, so part of the lanes are masked off on each path.]
This is the same as conditional/predicated/masked execution.
Recall the Vector Mask and Masked Vector Operations?

Slide credit: Tor Aamodt 53

Remember: Each Thread Is Independent
◼ Two Major SIMT Advantages:
❑ Can treat each thread separately → i.e., can execute each thread independently on any type of scalar pipeline → MIMD processing
❑ Can group threads into warps flexibly → i.e., can group threads that are supposed to truly execute the same instruction → dynamically obtain and maximize benefits of SIMD processing

◼ If we have many threads
◼ We can find individual threads that are at the same PC
◼ And, group them together into a single warp dynamically
◼ This reduces “divergence” → improves SIMD utilization
❑ SIMD utilization: fraction of SIMD lanes executing a useful operation (i.e., executing an active thread)
54
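Illustration (not from the slides): SIMD utilization can be computed from a warp’s active mask; the 32-bit mask representation below is an assumption for the sketch. For example, a mask with 20 of 32 bits set gives 20/32 = 62.5% utilization for that instruction.

/* Hypothetical sketch: utilization of one warp-issue, given an active
 * mask (bit i set = lane i executes an active thread). */
#include <stdint.h>

double simd_utilization(uint32_t active_mask, int warp_size /* e.g., 32 */)
{
    int active = 0;
    for (int lane = 0; lane < warp_size; lane++)
        active += (active_mask >> lane) & 1;   /* count active lanes */
    return (double)active / warp_size;
}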
Dynamic Warp Formation/Merging
◼ Idea: Dynamically merge threads executing the same instruction (after branch divergence)
◼ Form new warps from warps that are waiting
❑ Enough threads branching to each path enables the creation of full new warps

[Figure: threads from waiting Warps X and Y are combined to form a new Warp Z.]
55
Dynamic Warp Formation/Merging
◼ Idea: Dynamically merge threads executing the same instruction (after branch divergence)

[Figure: after a branch, threads from different warps that take the same path (Path A or Path B) are regrouped into new warps.]

◼ Fung et al., “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” MICRO 2007.
56
Dynamic Warp Formation Example
[Figure: control flow graph with basic blocks A–G and per-block active masks for Warp x and Warp y (e.g., x/1111 and y/1111 at A; x/1110 and y/0011 at B; further divergence at C, D, and F; reconvergence at E and G). A new warp is created from scalar threads of both Warp x and Warp y executing at Basic Block D. The baseline executes each warp’s blocks separately, while dynamic warp formation merges compatible warps and finishes earlier.]
Slide credit: Tor Aamodt 57


Hardware Constraints Limit Flexibility of Warp Grouping
[Figure: the SIMD execution unit from before, with four lanes; each lane holds the registers for thread IDs 0, 4, 8, … / 1, 5, 9, … / 2, 6, 10, … / 3, 7, 11, … and connects to the memory subsystem.]
Can you move any thread flexibly to any lane?
Slide credit: Krste Asanovic 58


Large Warps and Two-Level Warp Scheduling
◼ Two main reasons for GPU resources to be underutilized
❑ Branch divergence
❑ Long latency operations

[Figure: with round-robin scheduling of 16 warps, all warps compute, then all warps issue their memory requests (Warp 0 through Warp 15) around the same time, and the core idles until the requests return.]
Narasiman et al., “Improving GPU Performance via Large Warps and Two-Level Warp Scheduling,” MICRO 2011. 59
Large Warp Microarchitecture Example
◼ Reduce branch divergence by having large warps
◼ Dynamically break down a large warp into sub-warps
[Figure: at the decode stage, the active mask of a large warp is dynamically packed into sub-warp masks (Sub-warp 0, Sub-warp 1, …) so that each sub-warp fills the SIMD lanes with active threads.]
Narasiman et al., “Improving GPU Performance via Large Warps and Two-Level Warp Scheduling,” MICRO 2011.
Two-Level Round Robin
◼ Scheduling in two levels to deal with long latency operations
[Figure: with single-level round-robin scheduling of 16 warps, all warps compute and then all issue their memory requests (Warp 0 … Warp 15) together, leaving the core idle. With two-level round-robin scheduling (2 fetch groups of 8 warps each), Group 0 issues its requests (Warps 0–7) while Group 1 computes, and vice versa (Warps 8–15), overlapping memory latency with compute and saving cycles.]
Narasiman et al., “Improving GPU Performance via Large Warps and Two-Level Warp Scheduling,” MICRO 2011.
An Example GPU
NVIDIA GeForce GTX 285
◼ NVIDIA-speak:
❑ 240 stream processors

❑ “SIMT execution”

◼ Generic speak:
❑ 30 cores

❑ 8 SIMD functional units per core

Slide credit: Kayvon Fatahalian 63


NVIDIA GeForce GTX 285 “core”
[Figure: one GTX 285 core with 64 KB of storage for thread contexts (registers). Legend: SIMD functional units with control / instruction stream decode shared across 8 units; multiply-add units; multiply units; execution context storage.]
Slide credit: Kayvon Fatahalian 64


NVIDIA GeForce GTX 285 “core”
[Figure: the same core, with 64 KB of storage for thread contexts (registers).]
◼ Groups of 32 threads share an instruction stream (each group is a Warp)
◼ Up to 32 warps are simultaneously interleaved
◼ Up to 1024 thread contexts can be stored
Slide credit: Kayvon Fatahalian 65
NVIDIA GeForce GTX 285
[Figure: full-chip view of the GTX 285, showing its cores and texture (Tex) units.]
30 cores on the GTX 285: 30,720 threads
Slide credit: Kayvon Fatahalian 66
Evolution of NVIDIA GPUs
[Chart: number of stream processors (left axis, up to ~6000) and GFLOPS (right axis, up to ~16000) for GTX 285 (2009), GTX 480 (2010), GTX 780 (2013), GTX 980 (2014), P100 (2016), and V100 (2017); both grow steeply across generations.]
67
NVIDIA V100
◼ NVIDIA-speak:
❑ 5120 stream processors

❑ “SIMT execution”

◼ Generic speak:
❑ 80 cores

❑ 64 SIMD functional units per core

❑ Tensor cores for Machine Learning

◼ NVIDIA, “NVIDIA Tesla V100 GPU Architecture. White Paper,” 2017.

68
NVIDIA V100 Block Diagram

https://devblogs.nvidia.com/inside-volta/

80 cores on the V100


69
NVIDIA V100 Core

15.7 TFLOPS Single Precision


7.8 TFLOPS Double Precision
125 TFLOPS for Deep Learning (Tensor cores)

https://devblogs.nvidia.com/inside-volta/

70
Food for Thought
◼ Compare and contrast GPUs vs Systolic Arrays

❑ Which one is better for machine learning?

❑ Which one is better for image/vision processing?

❑ What types of parallelism does each one exploit?

❑ What are the tradeoffs?

◼ If you are interested in such questions and more…


❑ Bachelor’s Seminar in Computer Architecture (HS2019,
FS2020)
❑ Computer Architecture Master’s Course (HS2019)

71
Digital Design & Computer Arch.
Lecture 20: Graphics Processing Units

Prof. Onur Mutlu

ETH Zürich
Spring 2020
8 May 2020
Clarification of some GPU Terms
Generic Term | NVIDIA Term | AMD Term | Comments
Vector length | Warp size | Wavefront size | Number of threads that run in parallel (lock-step) on a SIMD functional unit
Pipelined functional unit / Scalar pipeline | Streaming processor / CUDA core | - | Functional unit that executes instructions for one GPU thread
SIMD functional unit / SIMD pipeline | Group of N streaming processors (e.g., N=8 in GTX 285, N=16 in Fermi) | Vector ALU | SIMD functional unit that executes instructions for an entire warp
GPU core | Streaming multiprocessor | Compute unit | It contains one or more warp schedulers and one or several SIMD pipelines
73
