GPU Architecture
Patrick Cozzi
CIS 371 Guest Lecture
Spring 2012

Who is this guy?
  Analytical Graphics, Inc.
  University of Pennsylvania
  Developer, lecturer, author, editor
  See http://www.seas.upenn.edu/~pcozzi/
How did this happen?
Graphics Workloads
  Triangles/vertices and pixels/fragments
Images from http://proteneer.com/blog/?p=263 (left) and http://http.developer.nvidia.com/GPUGems3/gpugems3_ch14.html (right)
Early 90s – Pre GPU
Why GPUs?
  Graphics workloads are embarrassingly parallel
    Data-parallel
    Pipeline-parallel
  CPU and GPU execute in parallel
  Hardware: texture filtering, rasterization, etc.
Slide from http://s09.idav.ucdavis.edu/talks/01-BPS-SIGGRAPH09-mhouston.pdf
Data Parallel
  Beyond Graphics
    Cloth simulation
    Particle system
    Matrix multiply

NVIDIA GeForce 6 (2004)
  6 vertex shader processors
  16 fragment shader processors

Images from https://plus.google.com/u/0/photos/100838748547881402137/albums/5407605084626995217/5581900335460078306 and http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter30.html
NVIDIA G80 Architecture
Why Unify Shader Processors?
Slides from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
Why Unify Shader Processors?
GPU Architecture Big Ideas
GPUs are specialized for
  Compute-intensive, highly parallel computation
Graphics is just the beginning.
Transistors are devoted to:
  Processing
Not:
  Data caching
  Flow control
Slide from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
Slides from http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
NVIDIA G80
Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
NVIDIA G80
  Streaming Processor (SP)
  Streaming Multi-Processor (SM)
NVIDIA G80
  16 SMs, each with 8 SPs
  128 total SPs
  Each SM hosts up to 768 threads
  Up to 12,288 threads in flight

NVIDIA GT200
  30 SMs, each with 8 SPs
  240 total SPs
  Each SM hosts up to 1024 threads
  Up to 30,720 threads in flight

Slides from David Luebke: http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
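These per-SM and per-device limits can be queried at runtime. A minimal sketch (not part of the original slides) using the CUDA runtime API; it assumes device 0 and prints a few standard cudaDeviceProp fields:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        // Query the properties of device 0 (assumed to exist).
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        printf("SMs:                   %d\n", prop.multiProcessorCount);
        printf("Warp size:             %d\n", prop.warpSize);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
        return 0;
    }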
GPU Computing History
  2001/2002 – researchers see the GPU as a data-parallel coprocessor ("Let's program this thing!")
    The GPGPU field is born
  2007 – NVIDIA releases CUDA
    CUDA – Compute Unified Device Architecture
    GPGPU shifts to GPU Computing
  2008 – Khronos releases the OpenCL specification
CUDA Abstractions
  A hierarchy of thread groups
  Shared memories
  Barrier synchronization

CUDA Kernels
  Executed N times in parallel by N different CUDA threads
  Thread ID
  Declaration specifier
  Execution configuration
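A minimal sketch tying these pieces together; the kernel name, block size, and device pointers (d_a, d_b, d_c) are hypothetical:

    // __global__ is the declaration specifier: this function runs on the GPU.
    __global__ void vecAdd(const float* a, const float* b, float* c, int n)
    {
        // Each of the N threads computes its own thread ID.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // The execution configuration <<<blocks, threadsPerBlock>>> launches the
    // kernel N times in parallel, e.g. blocks of 256 threads covering n elements:
    //     vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);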
CUDA Program Execution
Thread Hierarchies
  Grid – one or more thread blocks
    1D or 2D
  Block – array of threads
    1D, 2D, or 3D
  Each block in a grid has the same number of threads
  Each thread in a block can
    Synchronize
    Access shared memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
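As a sketch of a 2D grid of 2D blocks (the image-scaling kernel and sizes here are made up for illustration, not from the slides):

    __global__ void scale(float* img, int width, int height, float s)
    {
        // 2D thread coordinates: block index * block size + thread index
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            img[y * width + x] *= s;
    }

    // Every block in the grid has the same shape:
    //     dim3 block(16, 16);                                // 256 threads per block
    //     dim3 grid((width + 15) / 16, (height + 15) / 16);  // enough blocks to cover the image
    //     scale<<<grid, block>>>(d_img, width, height, 2.0f);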
Thread Hierarchies
  Thread Block
    Group of threads
    G80 and GT200: up to 512 threads
    Fermi: up to 1024 threads
    Reside on the same processor core
    Share the memory of that core
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
  Threads in a block
    Share (limited) low-latency memory
    Synchronize execution
      To coordinate memory accesses
    __syncthreads()
      Barrier – threads in the block wait until all threads reach it
      Lightweight
Images from http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf and http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Scheduling Threads
  Warp – threads from a block
    G80 / GT200 – 32 threads
    Run on the same SM
    Unit of thread scheduling
    Consecutive threadIdx values
    An implementation detail – in theory, use warpSize
  Figure: warps for three blocks scheduled on the same SM
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter3-CudaThreadingModel.pdf
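Since warps are built from consecutive threadIdx values, a thread in a 1D block can compute which warp and lane it occupies from the built-in warpSize; a small illustrative sketch (the kernel name is made up):

    __global__ void whereAmI()
    {
        int warpInBlock = threadIdx.x / warpSize;   // which warp within this block
        int lane        = threadIdx.x % warpSize;   // position within the warp (0..31 on G80/GT200)
        // Threads with the same warpInBlock value are scheduled together on one SM.
        (void)warpInBlock;
        (void)lane;
    }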
Scheduling Threads
Remember this:
Image from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf Slide from: http://courses.engr.illinois.edu/ece498/al/Syllabus.html
Scheduling Threads
  What happens if branches in a warp diverge?
  Remember this:
Image from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
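To make the question concrete, here is a hypothetical kernel where the branch splits each warp in half: lanes 0–15 take path A and lanes 16–31 take path B, so the warp executes the two paths one after the other with the inactive lanes masked off (results stay correct, only throughput suffers):

    __global__ void diverge(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (threadIdx.x % 32 < 16)       // first half of each warp
            data[i] = data[i] * 2.0f;    // path A
        else                             // second half of each warp
            data[i] = data[i] + 1.0f;    // path B
    }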
Scheduling Threads
  32 threads per warp but 8 SPs per SM. What gives?
  When an SM schedules a warp:
    Its instruction is ready
    8 threads enter the SPs on the 1st cycle
    8 more on the 2nd, 3rd, and 4th cycles
    Therefore, 4 cycles are required to dispatch a warp
Scheduling Threads
  Question
    A kernel has
      1 global memory read (200 cycles)
      4 non-dependent multiplies/adds
    How many warps are required to hide the memory latency?
  Solution
    Each warp has 4 multiplies/adds, each taking 4 cycles to dispatch: 16 cycles of work
    We need to cover 200 cycles
    200 / 16 = 12.5
    ceil(12.5) = 13
    13 warps are required
Memory Model
Thread Synchronization
  Recall: threads in a block can synchronize
    Call __syncthreads() to create a barrier
    A thread waits at this call until all threads in the block reach it, then all threads continue

    Mds[i] = Md[j];
    __syncthreads();
    func(Mds[i], Mds[i + 1]);
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
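A self-contained sketch of this pattern (Md/Mds follow the slide's names; the neighbor sum stands in for func(), and the block size is an arbitrary choice): each thread stages one element into shared memory, the barrier guarantees every element has been loaded, and only then does each thread read its neighbor's element.

    #define BLOCK_SIZE 256

    __global__ void neighborOp(const float* Md, float* out, int n)
    {
        __shared__ float Mds[BLOCK_SIZE];

        int i = threadIdx.x;                            // index within the block
        int j = blockIdx.x * blockDim.x + threadIdx.x;  // index within Md

        if (j < n)
            Mds[i] = Md[j];          // each thread loads one element into shared memory
        __syncthreads();             // barrier: all loads finish before any thread reads

        // Now it is safe to read the element loaded by the neighboring thread.
        if (j + 1 < n && i + 1 < BLOCK_SIZE)
            out[j] = Mds[i] + Mds[i + 1];
    }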
Thread Synchronization
  Timeline: four threads (0–3) in one block each execute

    Mds[i] = Md[j];
    __syncthreads();
    func(Mds[i], Mds[i + 1]);

  Time 0–2: the threads reach the barrier at different times; threads 0 and 1 arrive first and are blocked at the barrier
  Time 3: all threads in the block have reached the barrier, so any thread can continue
  Time 4–5: the threads execute func() and finish at different times
Thread Synchronization
  Why is it important that execution time be similar among threads?
  Why does it only synchronize within a block?
Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter3-CudaThreadingModel.pdf
Thread Synchronization
  Can __syncthreads() cause a thread to hang?

    if (someFunc())
    {
        __syncthreads();
    }
    // ...
Thread Synchronization

    if (someFunc())
    {
        __syncthreads();
    }
    else
    {
        __syncthreads();
    }