CS 295: Modern Systems
GPU Computing Introduction
Sang-Woo Jun
Spring 2019
Graphics Processing – Some History
1990s: Real-time 3D rendering for video games was becoming common
o Doom, Quake, Descent, … (Nostalgia!)
3D graphics processing is immensely computation-intensive
[Figures: texture mapping (Warren Moore, “Textures and Samplers in Metal,” Metal by Example, 2014) and shading (Gray Olsen, “CSE 470 Assignment 3 Part 2 - Gourad/Phong Shading,” grayolsen.com, 2018)]
Graphics Processing – Some History
Before 3D accelerators (GPUs) were common
CPUs had to do all graphics computation, while maintaining framerate!
o Many tricks were played
Doom (1993): “Affine texture mapping”
• Linearly maps textures to screen location, disregarding depth
• Doom levels did not have slanted walls or ramps, to hide this
Graphics Processing – Some History
Before 3D accelerators (GPUs) were common
CPUs had to do all graphics computation, while maintaining framerate!
o Many tricks were played
Quake III Arena (1999): “Fast inverse square root”
[Code figure: the routine’s source, with its magic constant annotated “magic!”]
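The trick approximates 1/sqrt(x) (used heavily to normalize vectors for lighting) without a divide or sqrt call. A lightly cleaned-up version of the routine from the released Quake III Arena source, with our own comments:

float Q_rsqrt(float number)
{
    const float threehalfs = 1.5F;
    float x2 = number * 0.5F;
    float y  = number;
    long  i  = *(long *) &y;              // reinterpret the float's bits as an integer
    i = 0x5f3759df - (i >> 1);            // the "magic" constant gives a good first guess
    y = *(float *) &i;                    // back to float
    y = y * (threehalfs - (x2 * y * y));  // one Newton-Raphson iteration refines the guess
    return y;
}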
Introduction of 3D Accelerator Cards
Much of 3D processing consists of short algorithms repeated over a lot of data
o pixels, polygons, textures, …
Dedicated accelerators with simple, massively parallel computation
[Photo: a Diamond Monster 3D card using the Voodoo chipset (1997); Konstantin Lanzet, Wikipedia]
[Die shot: NVIDIA Volta-based GV100 architecture (2018); many, many cores, not a lot of cache/control]
Peak Performance vs. CPU
               Throughput                Power        Throughput/Power
Intel Skylake  128 SP GFLOPS / 4 cores   100+ Watts   ~1 GFLOPS/Watt
NVIDIA V100    15 TFLOPS                 200+ Watts   ~75 GFLOPS/Watt
System Architecture Snapshot With a GPU (2019)
[Diagram: the CPU connects to host memory (DDR4 2666 MHz, 128 GB/s, 100s of GB), to other CPUs over QPI/UPI (12.8 GB/s QPI, 20.8 GB/s UPI), and to an I/O Hub (IOH) serving NVMe storage and the network interface; the GPU, with its own GPU memory (GDDR5: 100s GB/s, HBM2: ~1 TB/s, 10s of GB), attaches over 16-lane PCIe Gen3 at 16 GB/s]
Lots of moving parts!
High-Performance Graphics Memory
Modern GPUs even employ 3D-stacked memory via a silicon interposer
o Very wide bus, very high bandwidth
o e.g., HBM2 in Volta
Graphics Card Hub, “GDDR5 vs GDDR5X vs HBM vs HBM2 vs GDDR6 Memory Comparison,” 2019
Massively Parallel Architecture For Massively Parallel Workloads!
NVIDIA CUDA (Compute Unified Device Architecture) – 2007
o A way to run custom programs on the massively parallel architecture!
OpenCL specification released – 2008
Both platforms expose synchronous execution of a massive number of threads
[Diagram: a CPU thread copies data over PCIe to the GPU, a massive number of GPU threads execute, and results are copied back over PCIe]
CUDA Execution Abstraction
Block: Multi-dimensional array of threads
o 1D, 2D, or 3D
o Threads in a block can synchronize among themselves
o Threads in a block can access shared memory
o CUDA (Thread, Block) ~= OpenCL (Work item, Work group)
Grid: Multi-dimensional array of blocks
o 1D, 2D, or 3D
o Blocks in a grid can run in parallel, or sequentially
Kernel execution issued in grid units
Limited recursion (depth limit of 24 as of now)
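A minimal sketch of how the grid/block hierarchy is expressed in code (the kernel name, data, and sizes are made up for illustration):

#include <cuda_runtime.h>

// Hypothetical per-pixel kernel over a 2D image
__global__ void processImage(float *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column = block offset + thread offset
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        img[y * width + x] *= 2.0f;                  // some per-pixel work
}

int main() {
    const int W = 1024, H = 1024;
    float *d_img;
    cudaMalloc(&d_img, W * H * sizeof(float));
    dim3 threadsPerBlock(16, 16);                    // 2D block: 256 threads
    dim3 numBlocks((W + 15) / 16, (H + 15) / 16);    // 2D grid of blocks covering the image
    processImage<<<numBlocks, threadsPerBlock>>>(d_img, W, H);  // execution issued in grid units
    cudaDeviceSynchronize();
    cudaFree(d_img);
    return 0;
}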
Simple CUDA Example
[Diagram: the host issues an asynchronous kernel call; the CPU side and the GPU side then proceed in parallel]
[Diagram: C/C++ + CUDA code goes into the NVCC compiler, which drives a host compiler (CPU side) and a device compiler (GPU side) to produce combined CPU+GPU software]
Simple CUDA Example
o Kernel launched with 1 block of N threads per block: N instances of VecAdd spawned in GPU
o Host should wait for the kernel to finish before using results
o __global__: in GPU, called from host/GPU
o __device__: in GPU, called from GPU
o __host__: in host, called from host
o One function can be both __host__ and __device__
o threadIdx: which of N threads am I? (see also: blockIdx)
o Kernels may only return void
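The code these annotations refer to is the canonical VecAdd example; a runnable sketch along those lines (the allocation/copy boilerplate is our assumption, not shown on the slide):

#include <cuda_runtime.h>
#include <cstdio>

// __global__: runs in the GPU, callable from the host
__global__ void VecAdd(float *A, float *B, float *C) {
    int i = threadIdx.x;                 // which of the N threads am I?
    C[i] = A[i] + B[i];
}

int main() {
    const int N = 256;
    size_t size = N * sizeof(float);
    float h_A[N], h_B[N], h_C[N];
    for (int i = 0; i < N; i++) { h_A[i] = i; h_B[i] = 2 * i; }

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size); cudaMalloc(&d_B, size); cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);   // copy over PCIe
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    VecAdd<<<1, N>>>(d_A, d_B, d_C);     // 1 block, N threads: asynchronous call
    cudaDeviceSynchronize();             // wait for the kernel to finish

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);   // copy results back
    printf("C[10] = %f\n", h_C[10]);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}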
More Complex Example:
Picture Blurring
Slides from NVIDIA/UIUC Accelerated Computing Teaching Kit
Another end-to-end example
https://devblogs.nvidia.com/even-easier-introduction-cuda/
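A kernel-only sketch of the blur in the style of the teaching-kit example (the 3x3 window size and edge handling here are our assumptions):

#define BLUR_SIZE 1   // 3x3 averaging window

__global__ void blurKernel(const unsigned char *in, unsigned char *out, int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < w && row < h) {
        int pixVal = 0, pixels = 0;
        // Average the surrounding window, clipped at the image edges
        for (int dr = -BLUR_SIZE; dr <= BLUR_SIZE; ++dr) {
            for (int dc = -BLUR_SIZE; dc <= BLUR_SIZE; ++dc) {
                int r = row + dr, c = col + dc;
                if (r >= 0 && r < h && c >= 0 && c < w) {
                    pixVal += in[r * w + c];
                    ++pixels;
                }
            }
        }
        out[row * w + col] = (unsigned char)(pixVal / pixels);
    }
}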
Great! Now we know how to use GPUs – Bye?
Matrix Multiplication
Performance Engineering
[Plot: a straightforward GPU matrix multiply is no faster than the CPU; results from an NVIDIA P100 (Coleman et al., “Efficient CUDA,” 2017)]
Architecture knowledge is needed (again)
NVIDIA Volta-based GV100 Architecture (2018)
A single Streaming Multiprocessor (SM) has 64 INT32 cores and 64 FP32 cores (+ 8 Tensor Cores…)
GV100 has 84 SMs
Volta Execution Architecture
64 INT32 Cores, 64 FP32 Cores, 8 Tensor Cores, ray-tracing cores (on Turing)…
o Specialization to make use of chip space…?
Not much on-chip memory per thread
o 96 KB Shared memory
o 1024 Registers per FP32 core
Hard limit on compute management
o 32 blocks AND 2048 threads AND 1024 threads/block
o e.g., 2 blocks with 1024 threads, or 4 blocks with 512 threads
o Enough registers/shared memory for all threads must be available (all context is resident during execution)
More threads than cores – Threads interleaved to hide memory latency
Resource Balancing Details
How many threads in a block?
Too small: 4x4 window == 16 threads
o 128 blocks needed to fill 2048 threads/SM
o SM only supports 32 blocks -> only 512 threads used
• SM has only 64 cores… does it matter? Sometimes!
Too large: 32x48 window == 1536 threads
o Threads do not fit in a block!
Too large: 1024 threads each using more than 64 registers
o 1024 threads × 64 registers already fills the SM's 65,536-entry register file
Limitations vary across platforms (Fermi, Pascal, Volta, …)
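The CUDA runtime can help with this balancing act; a sketch using the occupancy API (myKernel is a placeholder):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float *data) {       // placeholder kernel
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Suggests a block size that maximizes occupancy for this kernel
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

    int numBlocksPerSM = 0;
    // How many blocks of that size can be resident per SM,
    // given the kernel's register and shared-memory usage
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM, myKernel, blockSize, 0);

    printf("suggested block size: %d, resident blocks/SM: %d\n", blockSize, numBlocksPerSM);
    return 0;
}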
Warp Scheduling Unit
Threads in a block are executed in 32-thread “warp” units
o Not part of language specs, just architecture specifics
o A warp is SIMD – Same PC, same instructions executed on every core
What happens when there is a conditional statement?
o Predicated operations, or control divergence
o More on this later!
Warps have been 32-threads so far, but may change in the future
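A kernel-only sketch of the difference (both kernels are our own illustrations): in the first, lanes of the same warp disagree on the branch, so both paths are executed with lanes masked off; in the second, the condition is uniform within each 32-thread warp.

// Divergent: even and odd lanes in every warp take different paths
__global__ void divergentBranch(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;   // half the warp idles while the other half runs this
    else
        data[i] += 1.0f;   // ...and vice versa
}

// Uniform per warp: each 32-thread warp takes exactly one path
__global__ void uniformBranch(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}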
Memory Architecture Caveats
Shared memory peculiarities
o Small amount (e.g., 96 KB/SM for Volta) shared across all threads
o Organized into banks to distribute access
o Bank conflicts can drastically lower performance
Relatively slow global memory
o Blocking, caching becomes important (again)
o If not for performance, for power consumption…
[Figure: an 8-way bank conflict reduces effective shared-memory bandwidth to 1/8]
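A common workaround is to pad shared-memory tiles so that column-wise accesses spread across banks; a kernel-only sketch of a tiled transpose (the names, tile size, and launch shape of dim3(32, 32) blocks with n a multiple of 32 are our assumptions):

#define TILE 32

// Column-wise reads of tile[TILE][TILE] stride by 32 floats, so all 32 lanes
// of a warp hit the same bank: a 32-way conflict on the shared-memory read.
__global__ void transposeConflict(float *out, const float *in, int n) {
    __shared__ float tile[TILE][TILE];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced global read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;              // transposed tile position
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // column-wise shared read: conflicts
}

// Padding each row by one float shifts successive rows to different banks,
// so the same column-wise read becomes conflict-free.
__global__ void transposePadded(float *out, const float *in, int n) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // now spread across banks
}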