Lecture 1: Introduction. Graphics Processing Units (GPUs): Architecture and Programming

This document provides an overview of a course on Graphics Processing Units (GPUs) taught by Mohamed Zahran. The course covers GPU architecture, GPU-CPU interaction, GPU programming models, and solving problems using GPUs. It lists formal goals of understanding why GPUs are used and how to program them, as well as informal goals of learning, applying knowledge in different contexts, and enjoying the course material. The course webpage, textbook, assignments, and grading are also outlined.

Uploaded by

Viry Hernadez


CSCI-GA.3033-012
Graphics Processing Units (GPUs):
Architecture and Programming

Lecture 1: Introduction
Mohamed Zahran (aka Z)
mzahran@cs.nyu.edu
http://www.mzahran.com

Who Am I?
Mohamed Zahran (aka Z)
Computer architecture/OS/compilers interaction
http://www.mzahran.com
Office hours: Wed 5:00-7:00 pm
Room: WWH 328
Course web page:
http://cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/index.html

Formal Goals of This Course

Why GPUs
GPU Architecture
GPU-CPU Interaction
GPU programming model
Solving real-life problems using GPUs

Informal Goals of This Course

To get more than an A
To learn GPUs and enjoy it
To use what you have learned in MANY different contexts
To have a feeling about how hardware and software evolve

The Course Web Page

Lecture slides
Info about the mailing list, labs, etc.
Useful links (manuals, tools, book errata, etc.)

The Textbook
Programming Massively Parallel Processors: A Hands-on Approach
By David B. Kirk & Wen-mei W. Hwu

Grading

Homework assignments: 10%
Project: 20%
Programming assignments: 30%
Final: 40%

Computer History
Eckert and Mauchly

1st working electronic computer (1946)
18,000 vacuum tubes
1,800 instructions/sec
3,000 ft³

Computer History
Maurice Wilkes

EDSAC 1 (1949)
http://www.cl.cam.ac.uk/UoCCL/misc/EDSAC99/

1st stored program computer
650 instructions/sec
1,400 ft³

Intel 4004 Die Photo

Introduced in 1971
First microprocessor

2,250 transistors
12 mm²
108 kHz

Intel 8086 Die Scan

29,000 transistors
33 mm²
5 MHz
Introduced in 1978
Basic architecture of the IA32 PC

Intel 80486 Die Scan

1,200,000 transistors
81 mm²
25 MHz
Introduced in 1989
1st pipelined implementation of IA32

Pentium Die Photo

3,100,000 transistors
296 mm²
60 MHz
Introduced in 1993
1st superscalar implementation of IA32

Pentium III
9,500,000 transistors
125 mm²
450 MHz
Introduced in 1999

http://www.intel.com/intel/museum/25anniv/hof/hof_main.htm

Pentium 4
55,000,000 transistors
146 mm²
3 GHz
Introduced in 2000

http://www.chip-architect.com

[Die photos: Core 2 Duo (Merom), Pentium 4, IBM Power 7, Montecito, Intel Core i7 (Nehalem), Cell Processor, Niagara (SUN UltraSparc T2)]

The Famous Moore's Law

Positive Cycle of Computer Industry

Hardware improvement -> Better software -> People get used to the software -> People ask for more improvements -> Hardware improvement

The Status-Quo
We moved from single core to multicore for technological reasons

Free lunch is over for software folks
  The software will not become faster with every new generation of processors

Not enough experience in parallel programming
  Parallel programs of old days were restricted to some elite applications -> very few programmers
  Now we need parallel programs for many different applications

Two Main Goals

Maintain execution speed of old sequential programs -> CPU
Increase throughput of parallel programs -> GPU

Figure 1.1. Enlarging Performance Gap between GPUs and CPUs (many-core GPU vs. multi-core CPU). Courtesy: John Owens

[Figure: CPU vs. GPU chip layout, showing ALU, Control, Cache, and DRAM for each]

CPU is optimized for sequential code performance

GPU: almost 10x the bandwidth of multicore (relaxed memory model)

How to Choose A Processor for Your Application?

Performance
Very large installation base
Practical form-factor and easy accessibility
Support for IEEE floating point standard

A Glimpse at A Modern GPU: GeForce 8800 (2007)

16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU

[Figure: GeForce 8800 block diagram: Host, Input Assembler, Thread Execution Manager, SMs with Parallel Data Cache and Texture units, Load/store units, Global Memory]

A Glimpse at A Modern GPU

Streaming Multiprocessor (SM)

A Glimpse at A Modern GPU

Streaming Processor (SP)
SPs within an SM share control logic and instruction cache

A Glimpse at A Modern GPU

GPU global memory:
Much higher bandwidth than typical system memory
A bit slower (higher latency) than typical system memory
Communication between GPU memory and system memory is slow

Amdahl's Law
Execution Time After Improvement =
Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)

Example:

"Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"
How about making it 5 times faster?

Improvement in your application speed depends on the portion that is parallelized
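
The example can be checked numerically. The sketch below is only an illustration of the formula on this slide; the function names are made up for this note and are not from the course or the textbook. It solves the formula for the improvement factor needed to hit a target overall speedup.

```python
import math

def time_after_improvement(unaffected, affected, improvement):
    """Amdahl's Law: unaffected time plus affected time divided by the improvement."""
    return unaffected + affected / improvement

def required_improvement(total, affected, target_speedup):
    """Improvement factor needed on the affected part for a given overall speedup.

    Returns math.inf when the affected time would have to drop to zero,
    and None when even that would not be enough.
    """
    unaffected = total - affected
    target_time = total / target_speedup
    budget = target_time - unaffected  # time left over for the improved part
    if budget < 0:
        return None
    if budget == 0:
        return math.inf
    return affected / budget

print(required_improvement(100, 80, 4))  # 16.0
print(required_improvement(100, 80, 5))  # inf
```

For 4 times faster, the target is 25 seconds: 20 seconds are unaffected, so multiply must fit in 5 seconds, a 16x improvement. For 5 times faster, the target is 20 seconds, which the unaffected part already uses up, so no finite improvement of multiply gets there.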

Two Main Things to Keep in Mind

Try to increase the portion of your program that can be parallelized
Figure out how to get around limited bandwidth of system memory
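
To see why bandwidth matters, here is a rough back-of-the-envelope sketch using the GeForce 8800 figures quoted earlier in the lecture (367 GFLOPS, 86.4 GB/s). It assumes 4-byte single-precision operands and a simple "whichever limit is lower" model; the names are hypothetical, and real kernels behave less cleanly.

```python
PEAK_FLOPS = 367e9      # 367 GFLOPS, from the GeForce 8800 slide
MEM_BW = 86.4e9         # 86.4 GB/s GPU memory bandwidth, same slide
BYTES_PER_FLOAT = 4     # assumption: single-precision operands

# How many floats per second the memory system can deliver.
floats_per_s = MEM_BW / BYTES_PER_FLOAT

def attainable_flops(flops_per_float_loaded):
    """Upper bound on FLOPS: limited by either compute or memory traffic."""
    return min(PEAK_FLOPS, floats_per_s * flops_per_float_loaded)

# Operations needed per float loaded before compute becomes the limit.
break_even = PEAK_FLOPS / floats_per_s

print(attainable_flops(1) / 1e9)  # GFLOPS with one operation per float loaded
print(break_even)                 # operations per float needed to reach peak
```

Under these assumptions, a kernel that does one operation per float it loads is capped at about 21.6 GFLOPS, far below peak, and roughly 17 operations per loaded float are needed before the arithmetic units become the limit. Reusing data once it is on the chip is one way around the limit.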

Enough for Today

We are done with Chapter 1
Some applications are better run on CPU while others on GPU
The two main limitations
  The parallelizable portion of the code
  The communication overhead between CPU and GPU
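
The two limitations can be combined in one simple estimate. The sketch below is a toy model, not a measurement: the program (2 s serial, 8 s parallelizable, 10x GPU speedup) and the data sizes are invented for illustration, and only the 4 GB/s link figure comes from the GeForce 8800 slide.

```python
def offload_time(serial_s, parallel_s, gpu_speedup, bytes_moved, link_bytes_per_s):
    """Estimated run time when the parallel part is moved to the GPU.

    The serial part stays on the CPU, the parallel part shrinks by
    gpu_speedup, and every byte copied between system memory and GPU
    memory is charged at the link bandwidth.
    """
    transfer_s = bytes_moved / link_bytes_per_s
    return serial_s + parallel_s / gpu_speedup + transfer_s

LINK = 4e9  # 4 GB/s bandwidth to the CPU, as quoted on the GeForce 8800 slide

# Hypothetical program: 2 s serial, 8 s parallelizable, 10x faster on the GPU.
cpu_only = 2.0 + 8.0
small_copy = offload_time(2.0, 8.0, 10.0, 1e9, LINK)   # 1 GB moved in total
large_copy = offload_time(2.0, 8.0, 10.0, 40e9, LINK)  # 40 GB moved in total

print(cpu_only, small_copy, large_copy)
```

With 1 GB of traffic the offloaded version takes roughly 3 seconds instead of 10. With 40 GB of traffic the copies alone take 10 seconds and the GPU version ends up slower than the CPU-only run, which is the sense in which some applications are better left on the CPU.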
