CS8803SC Software and Hardware Cooperative Computing GPGPU
Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology
Why GPU?
A quiet revolution and potential build-up
Calculation: 367 GFLOPS vs. 32 GFLOPS
Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
Until recently, programmed through graphics API
[Chart: peak GFLOPS over time for NVIDIA GPUs. G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]
GPUs in every PC and workstation: massive volume and potential impact
David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC
Computational Power
Why are GPUs getting faster so fast?
Arithmetic intensity: the specialized nature of GPUs makes it easier to use additional transistors for computation, not cache
Economics: the multi-billion-dollar video game market is a pressure cooker that drives innovation
Architecture design decisions:
General-purpose CPU: caches, branch handling units, out-of-order (OOO) support, etc.
Graphics processor: most transistors are ALUs
www.gpgpu.org/s2004/slides/luebke.Introduction.ppt
GPGPU?
http://www.gpgpu.org
GPGPU stands for General-Purpose computation on GPUs: using the GPU in applications other than 3D graphics
GPU accelerates critical paths of applications
Data parallel algorithms leverage GPU attributes
Large data arrays, streaming throughput
Fine-grain SIMD parallelism
Low-latency floating-point (FP) computation
Applications
Game effects, physics, image processing
Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
Background on Graphics
Describing an Object
GPU Fundamentals: The Graphics Pipeline
Graphics state flows from the application (CPU) through the GPU:
Application → Transform → Rasterizer → Shade → Final pixels
The application sends vertices (3D) to the GPU
Transform produces transformed, lit vertices (2D)
The rasterizer produces fragments (pre-pixels)
Shade produces final pixels (color, depth)
Video memory holds textures; render-to-texture feeds shaded output back into the pipeline
A simplified graphics pipeline
Note that pipe widths vary
Many caches, FIFOs, and so on not shown
GPU Fundamentals: The Modern Graphics Pipeline
Graphics state flows from the application (CPU) through the GPU:
Application → Vertex Processor → Rasterizer → Pixel (Fragment) Processor → Final pixels
The application sends vertices (3D) to the GPU
Vertex processors (programmable!) produce transformed, lit vertices (2D)
The rasterizer produces fragments (pre-pixels)
Pixel (fragment) processors (programmable!) produce final pixels (color, depth)
Video memory holds textures; render-to-texture feeds shaded output back into the pipeline
GPU Pipeline: Transform
Vertex Processor (multiple operate in parallel)
Transform from world space to image space
Compute per-vertex lighting
Rotate, translate, and scale the entire scene to correctly place it relative to the camera's position, view direction, and field of view.
GPU Pipeline: Rasterizer
Rasterizer
Convert the geometric representation (vertices) to an image representation (fragments)
Fragment = image fragment: pixel + associated data (color, depth, stencil, etc.)
Interpolate per-vertex quantities across pixels
GPU Pipeline: Shade
Fragment Processors (multiple in parallel)
Compute a color for each pixel
Optionally read colors from textures (images)
A fragment is the computer-graphics term for all of the data needed to generate a pixel in the frame buffer. This may include, but is not limited to: raster position, depth, and interpolated attributes (color, texture coordinates, etc.)
NVIDIA GeForce 7800 Pipeline
GeForce 8800
16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU
[Block diagram: Host → Input Assembler → Thread Execution Manager; 8 processing clusters, each with a Parallel Data Cache and Texture unit; Load/Store units; Global Memory]
GPGPU Programming
Traditional GPGPU
Uses the pixel processors, vertex processors, texture cache, etc.
Copies from the frame buffer to a texture
Uses a texture as the frame buffer
With CUDA
Highly parallel threads
SIMD-style programming, with many threads running one kernel in an MPI-like style
What Kinds of Computation Map Well to GPUs?
Computations that resemble graphics, with two key attributes:
Data parallelism
Independence
Arithmetic Intensity
Arithmetic intensity = operations / words transferred
Data Streams & Kernels
Streams
Collection of records requiring similar computation
Vertex positions, Voxels, FEM cells, etc.
Provide data parallelism
Kernels
Functions applied to each element in stream
transforms, PDE solvers, etc.
No dependencies between stream elements
Encourage high Arithmetic Intensity
CPU-GPU Analogies
CPU                   GPU
Inner loops         = Kernels
Stream / data array = Texture
Memory read         = Texture sample
Importance of Data Parallelism
GPUs are designed for graphics
Highly parallel tasks
GPUs process independent vertices & fragments
Temporary registers are zeroed
No shared or static data
No read-modify-write buffers
Data-parallel processing
The GPU architecture is ALU-heavy
Multiple vertex & pixel pipelines, multiple ALUs per pipe
Hide memory latency (with more computation)
Example: Simulation Grid
Common GPGPU computation style
Textures represent computational grids = streams
Many computations map to grids
Matrix algebra Image & Volume processing Physical simulation Global Illumination
ray tracing, photon mapping, radiosity
Non-grid streams can be mapped to grids
Programming a GPU for Graphics
Application specifies geometry, which is rasterized
Each fragment is shaded with a SIMD program
Shading can use values from texture memory
The image can be used as a texture on future passes
Owens & Luebke
Programming a GPU for GP Programs
Draw a screen-sized quad (the stream)
Run a SIMD kernel over each fragment
Gather is permitted from texture memory
The resulting buffer can be treated as a texture on the next pass
Kernels
CPU GPU
Kernel / loop body / algorithm step = Fragment Program
CUDA Programming Model: A Highly Multithreaded Coprocessor
The GPU is viewed as a compute device that:
Is a coprocessor to the CPU or host Has its own DRAM (device memory) Runs many threads in parallel
Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads
Differences between GPU and CPU threads:
GPU threads are extremely lightweight
Very little creation overhead
A multi-core CPU needs only a few threads
GPU needs 1000s of threads for full efficiency
CUDA: Matrix Multiplication
__global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
{
    // 2D thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue stores the element of the matrix computed by this thread
    float Pvalue = 0;
    for (int k = 0; k < M.width; ++k) {
        float Melement = M.elements[ty * M.pitch + k];
        float Nelement = N.elements[k * N.pitch + tx];
        Pvalue += Melement * Nelement;
    }

    // Write the result to device memory; each thread writes one element
    P.elements[ty * P.pitch + tx] = Pvalue;
}
Limitations in GPGPU
High latency between CPU and GPU
Handling control flow
I/O access
Bit operations
Limited data structures (e.g., no linked lists)
But then why are we looking at this?
Relatively short development time
Relatively cheap devices
Is GPU always Good?
[Chart: execution time of OpenMP vs. CUDA versions as problem size varies (x: 0-1200, y: 0-50)]
With not enough data parallelism, the GPU overhead is higher than the benefit
Parallel programming is difficult. GPGPU could be one solution to utilize parallel processors.
The Future of GPGPU?
Architecture is a moving target. The programming environment is evolving. E.g., Intel's Larrabee (expected 2009)
[Diagram: many cores, each with its own cache ($)]
MIMD style. Can it provide enough performance to beat NVIDIA?