Introduction to Programming Massively Parallel Graphics Processors: Introduction to CUDA Programming
Andreas Moshovos moshovos@eecg.toronto.edu ECE, Univ. of Toronto Summer 2010
Some slides/material from: UIUC course by Wen-Mei Hwu and David Kirk UCSB course by Andrea Di Blas Universitat Jena by Waqar Saleem NVIDIA by Simon Green and others as noted on slides
How to Get High Performance
Computation = calculations + data communication/storage
Ideally: tons of compute engines, tons of storage, unlimited bandwidth, zero/low latency
Calculation capabilities
How many calculation units can be built? Today's silicon chips:
About 1B transistors; ~30K transistors for a 52-bit multiplier
~30K multipliers
Tons of Compute Engines
260 mm^2 die area (mid-range); 112 μm^2 per FP unit (overestimated)
~2K FP units
Frequency: ~3 GHz is common today
TFLOPs possible
Disclaimer: back-of-the-envelope calculations; take with a grain of salt
Can build lots of calculation units (ALUs)
How about Communication/Storage
Need a data feed and storage. The larger the storage, the slower it is.
It takes time to get there and back: multiple cycles, even on the same die
Reality: tons of compute engines paired with tons of slow storage; bandwidth is not unlimited and latency is not low
Is there enough parallelism?
Can we keep all this hardware busy? That needs lots of independent calculations
Parallelism/Concurrency
Much of what we do is sequential
First do 1, then do 2, then if X do 3 else do 4
Today's High-End General-Purpose Processors
Localize communication and computation; try to automatically extract parallelism
(Figure: memory hierarchy over time: faster cache close to the processor, slower cache, then tons of slow storage.)
Automatically extract instruction level parallelism Large on-die caches to tolerate off-chip memory latency
Some things are naturally parallel
Sequential Execution Model
int a[N];   // N is large
for (i = 0; i < N; i++)
   a[i] = a[i] * fade;
Flow of control / thread: one instruction at a time; optimizations are possible at the machine level
Data Parallel Execution Model / SIMD
int a[N];   // N is large
for all elements do in parallel
   a[index] = a[index] * fade;
This has been tried before: ILLIAC III, UIUC, 1966
Single Program Multiple Data / SPMD
int a[N];   // N is large
for all elements do in parallel
   if (a[i] > threshold) a[i] *= fade;
The model used in today's Graphics Processors
CPU vs. GPU overview
CPU:
Handles sequential code well
Can't take advantage of massively parallel code
Lower off-chip bandwidth
Lower peak computation capability
GPU:
Requires massively parallel computation
Handles some control flow
Higher off-chip bandwidth
Higher peak computation capability
Programmer's view
GPU as a co-processor (2008)
(Figure: the CPU connects to the GPU over the peripheral bus at roughly 3-8 GB/s; the GPU reaches its own GPU memory, 1GB on our systems, at 141 GB/sec; the CPU reaches system memory at 6.4-31.92 GB/sec, 8B per transfer.)
Target Applications
int a[N];   // N is large
for all elements of a compute
   a[i] = a[i] * fade
Lots of independent computations
CUDA threads need not be independent
Programmer's View of the GPU
GPU: a compute device that:
Is a coprocessor to the CPU (host)
Has its own DRAM (device memory)
Runs many threads in parallel
Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads
Why are threads useful? Parallelism
Concurrency:
Do multiple things in parallel
Needs more functional units
Uses more hardware; gets higher performance
Why are threads useful #2 Tolerating stalls
Often a thread stalls, e.g., memory access
Multiplex the same functional unit; get more performance at a fraction of the cost
GPU vs. CPU Threads
GPU threads are extremely lightweight
Very little creation overhead, on the order of microseconds; all done in hardware
GPU needs 1000s of threads for full efficiency
Multi-core CPU needs only a few
Execution Timeline
CPU / Host:
1. Copy data to GPU memory
2. Launch the GPU kernel
3. Synchronize with the GPU
4. Copy results from GPU memory
GPU / Device: executes the kernel between steps 2 and 3
Programmer's view
First create data in CPU memory
Programmer's view
Then copy the data to GPU memory
Programmer's view
The GPU starts computation: it runs a kernel; the CPU can also continue
Programmer's view
CPU and GPU synchronize
Programmer's view
Copy results back to CPU memory
Computation partitioning:
At the highest level:
Think of computation as a series of loops:
for (i = 0; i < big_number; i++)
   a[i] = some function
for (i = 0; i < big_number; i++)
   a[i] = some other function
for (i = 0; i < big_number; i++)
   a[i] = some other function
Kernels
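As a rough sketch of this mapping (not from the original slides; the kernel names and the scale/offset parameters are made up for illustration), each loop body becomes a kernel and each kernel is launched over the whole index range:

__global__ void scale_kernel (float *a, float factor, int N)   // "a[i] = some function"
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < N) a[i] = a[i] * factor;
}

__global__ void offset_kernel (float *a, float offset, int N)  // "a[i] = some other function"
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < N) a[i] = a[i] + offset;
}

// Host side: one launch per original loop, executed back to back
// scale_kernel  <<<blocks, threads_block>>> (da, 0.5f, N);
// offset_kernel <<<blocks, threads_block>>> (da, 1.0f, N);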
Computation Partitioning -- Kernel
CUDA exposes the hardware to the programmer; the programmer must manually partition the work appropriately
The programmer's view is hierarchical:
Think of data as an array
Per Kernel Computation Partitioning
Computation Grid: 2D Case
(Figure: a 2D grid of blocks; each block is a 2D array of threads.)
Threads within a block can communicate/synchronize
Run on the same multiprocessor
Threads across blocks can't communicate
They shouldn't touch each other's data; the behavior is undefined
Thread Coordination Overview
Race-free access to data
GBT: Grids of Blocks of Threads
The programmer's view of data and computation partitioning
Why? Realities of integrated circuits: need to cluster computation and storage to achieve high speeds
Block and Thread IDs
Threads and blocks have IDs
So each thread can decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
(Figure: a device executing Grid 1, a 3x2 arrangement of blocks, Block (0,0) through Block (2,1).)
Simplifies memory addressing when processing multidimensional data
Convenience not necessity
(Figure: Block (1,1) expanded into a 5x3 arrangement of threads, Thread (0,0) through Thread (4,2).)
IDs and dimensions are accessible through predefined variables, e.g., blockDim.x and threadIdx.x
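A minimal sketch (not from the slides) of how these variables combine to index 2D data; the kernel name and the width/height parameters are illustrative:

__global__ void touch2d (float *a, int width, int height)
{
   int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
   int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
   if (x < width && y < height)
      a[y * width + x] = 0.0f;                      // row-major element (y, x)
}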
Execution Model: Ordering
Execution order is undefined
Do not assume and use:
That block 0 executes before block 1
That thread 10 executes before thread 20
Or any other ordering, even if you can observe it
Future implementations may break this ordering; it's not part of the CUDA definition. Why? More flexible hardware options
Programmer's view: Memory Model
Different memories with different uses and performance
Some are managed by the compiler; some must be managed by the programmer
(Figure: the memory spaces; arrows show whether each can be read and/or written.)
Execution Model Summary (for your reference)
Grid of blocks of threads
1D/2D grid of blocks; 1D/2D/3D blocks of threads
All blocks are identical:
same structure and # of threads
Block execution order is undefined
Threads in the same block:
can synchronize and share data fast (shared memory)
Threads from different blocks:
cannot cooperate; communication only through global memory
Threads and blocks have IDs
Simplifies data indexing; can be 1D, 2D, or 3D (threads)
Blocks do not migrate: each executes on the same multiprocessor; several blocks may run on the same multiprocessor
CUDA Software Architecture
(Figure: the software stack: CUDA libraries, e.g., fft(); the CUDA runtime API, cuda...(); and the CUDA driver API, cu...().)
Reasoning about CUDA call ordering
GPU communication happens via cuda...() calls and kernel invocations
e.g., cudaMalloc, cudaMemcpy
Asynchronous from the CPU's perspective
The CPU places a request in a CUDA queue; requests are handled in-order
Streams allow for multiple queues
Order within each queue is honored; there is no ordering across queues
More on this much later on
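For reference only, since streams are covered much later: a minimal sketch of two queues, assuming pinned host buffers h_a and h_b, device buffers d_a and d_b, and placeholder kernels kernelA/kernelB (all names are assumptions):

cudaStream_t s0, s1;
cudaStreamCreate (&s0);
cudaStreamCreate (&s1);

// Requests in the same stream execute in order;
// requests in different streams may overlap.
cudaMemcpyAsync (d_a, h_a, SIZE, cudaMemcpyHostToDevice, s0);
kernelA <<<blocks, threads_block, 0, s0>>> (d_a, N);

cudaMemcpyAsync (d_b, h_b, SIZE, cudaMemcpyHostToDevice, s1);
kernelB <<<blocks, threads_block, 0, s1>>> (d_b, N);

cudaThreadSynchronize ();   // wait for both queues to drain
cudaStreamDestroy (s0);
cudaStreamDestroy (s1);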
My first CUDA Program
__global__ void arradd (float *a, float f, int N)      // runs on the GPU
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < N) a[i] = a[i] + f;
}

int main ()                                            // runs on the CPU
{
   float h_a[N];
   float *d_a;

   cudaMalloc ((void **) &d_a, SIZE);
   cudaThreadSynchronize ();
   cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);
   arradd <<< n_blocks, block_size >>> (d_a, 10.0, N);
   cudaThreadSynchronize ();
   cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
   CUDA_SAFE_CALL (cudaFree (d_a));
}
CUDA API: Example
int a[N];
for (i = 0; i < N; i++)
   a[i] = a[i] + x;
1. Allocate CPU Data Structure
2. Initialize Data on CPU
3. Allocate GPU Data Structure
4. Copy Data from CPU to GPU
5. Define Execution Configuration
6. Run Kernel
7. CPU synchronizes with GPU
8. Copy Data from GPU to CPU
9. De-allocate GPU and CPU memory
1. Allocate CPU Data

float *ha;

int main (int argc, char *argv[])
{
   int N = atoi (argv[1]);
   ha = (float *) malloc (sizeof (float) * N);
   ...
}

No memory is allocated on the GPU side yet
Pinned memory allocation results in faster CPU to/from GPU copies, but pinned memory cannot be paged out. More on this later: cudaMallocHost ()
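A small sketch of the pinned variant, assuming the same N as in the rest of this example:

float *ha;
// Pinned (page-locked) allocation: faster CPU <-> GPU copies,
// but this memory cannot be paged out.
cudaMallocHost ((void **) &ha, sizeof (float) * N);
...
cudaFreeHost (ha);   // pinned memory is released with cudaFreeHost(), not free()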
2. Initialize CPU Data (dummy)

float *ha;
int i;

for (i = 0; i < N; i++)
   ha[i] = i;
3. Allocate GPU Data

float *da;
cudaMalloc ((void **) &da, sizeof (float) * N);

Notice: no assignment to da
NOT: da = cudaMalloc ()
The assignment is done internally: that's why we pass &da
Space is allocated in Global Memory on the GPU
GPU Memory Allocation
The host manages GPU memory allocation:
cudaMalloc (void **ptr, size_t nbytes) Must explicitly cast to (void **)
cudaMalloc ((void **) &da, sizeof (float) * N);
cudaFree (void *ptr);
cudaFree (da);
cudaMemset (void *ptr, int value, size_t nbytes);
cudaMemset (da, 0, N * sizeof (float));
Check the CUDA Reference Manual
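Note that these calls return a cudaError_t; a minimal error-checking sketch (assuming <stdio.h> and <stdlib.h> are included; not part of the original slides):

cudaError_t err = cudaMalloc ((void **) &da, sizeof (float) * N);
if (err != cudaSuccess) {
   fprintf (stderr, "cudaMalloc failed: %s\n", cudaGetErrorString (err));
   exit (1);
}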
4. Copy Initialized CPU data to GPU

float *da;
float *ha;

cudaMemcpy ((void *) da,              // DESTINATION
            (void *) ha,              // SOURCE
            sizeof (float) * N,       // #bytes
            cudaMemcpyHostToDevice);  // DIRECTION
Host/Device Data Transfers
The host initiates all transfers:
cudaMemcpy (void *dst, const void *src, size_t nbytes, enum cudaMemcpyKind direction)
Asynchronous from the CPU's perspective
CPU thread continues
In-order processing with other CUDA requests

enum cudaMemcpyKind:
cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice
5. Define Execution Configuration
How many blocks and threads/block
int threads_block = 64;
int blocks = N / threads_block;
if (N % threads_block != 0)
   blocks += 1;
Alternatively:
blocks = (N + threads_block - 1) / threads_block;
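For 2D data, the same configuration can be expressed with dim3; a small sketch assuming an N_x-by-N_y array (the names N_x, N_y, and kernel2d are illustrative):

dim3 threads_block (16, 16);   // 256 threads per block
dim3 blocks ((N_x + threads_block.x - 1) / threads_block.x,
             (N_y + threads_block.y - 1) / threads_block.y);
// kernel2d <<<blocks, threads_block>>> (da, N_x, N_y);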
6. Launch Kernel & 7. CPU/GPU Synchronization
Instructs the GPU to launch blocks x threads_block threads:

darradd <<<blocks, threads_block>>> (da, 10.0f, N);
cudaThreadSynchronize ();   // forces the CPU to wait
darradd: the kernel name
<<< >>>: the execution configuration (more on this soon)
(da, x, N): the arguments
256-byte limit on the arguments; no variable argument lists
CPU/GPU Synchronization
The CPU does not block on cuda...() calls
Kernels/requests are queued and processed in-order; control returns to the CPU immediately
Good if there is other work to be done
e.g., preparing for the next kernel invocation
Eventually, the CPU must know when the GPU is done; then it can safely copy the GPU results
cudaThreadSynchronize ()
Blocks the CPU until all preceding cuda...() and kernel requests have completed
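A small sketch of overlapping CPU work with a running kernel, using the arrays from this example (prepare_next_input() and hb are hypothetical CPU-side names):

darradd <<<blocks, threads_block>>> (da, 10.0f, N);   // returns immediately
prepare_next_input (hb);                              // CPU work overlaps with the GPU
cudaThreadSynchronize ();                             // now wait for the kernel to finish
cudaMemcpy (ha, da, sizeof (float) * N, cudaMemcpyDeviceToHost);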
8. Copy data from GPU to CPU & 9. De-allocate Memory

float *da;
float *ha;

cudaMemcpy ((void *) ha,              // DESTINATION
            (void *) da,              // SOURCE
            sizeof (float) * N,       // #bytes
            cudaMemcpyDeviceToHost);  // DIRECTION

cudaFree (da);
// display or process results here
free (ha);
The GPU Kernel

__global__ void darradd (float *da, float x, int N)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < N) da[i] = da[i] + x;
}
blockIdx: unique block ID
Numerically ascending: 0, 1, ...
blockDim: dimensions of the block = how many threads it has
blockDim.x, blockDim.y, blockDim.z; unused dimensions default to 1
threadIdx: unique per-block index
0, 1, ... within each block
Array Index Calculation Example

int i = blockIdx.x * blockDim.x + threadIdx.x;
Assuming blockDim.x = 64:
blockIdx.x = 0: i = 0 .. 63, accessing a[0] .. a[63]
blockIdx.x = 1: i = 64 .. 127, accessing a[64] .. a[127]
blockIdx.x = 2: i = 128 .. 191, accessing a[128] .. a[191]
and so on, starting at i = 192 and a[192]
CUDA Function Declarations
__device__ float DeviceFunc(): executed on the device, only callable from the device
__global__ void KernelFunc(): executed on the device, only callable from the host
__host__ float HostFunc(): executed on the host, only callable from the host
__global__ defines a kernel function
Must return void; can only call __device__ functions
__device__ and __host__ can be used together
Two different versions are generated
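A minimal sketch of such a function (the name is illustrative); the compiler emits a GPU version and a CPU version from the same source:

__host__ __device__ float clampf (float x, float lo, float hi)
{
   if (x < lo) return lo;
   if (x > hi) return hi;
   return x;   // callable from kernels and from ordinary host code
}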
__device__ Example
Add x to a[i] multiple times
__device__ float addmany (float a, float b, int count)
{
   while (count--)
      a += b;
   return a;
}

__global__ void darradd (float *da, float x, int N)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < N) da[i] = addmany (da[i], x, 10);
}
Kernel and Device Function Restrictions

__device__ functions cannot have their address taken
e.g., f = &addmany; (*f) ();
For functions executed on the device:
No recursion
darradd () { darradd () }
No static variable declarations inside the function
darradd () { static int canthavethis; }
No variable number of arguments
e.g., something like printf ()
My first CUDA Program
__global__ void arradd (float *a, float f, int N)      // runs on the GPU
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < N) a[i] = a[i] + f;
}

int main ()                                            // runs on the CPU
{
   float h_a[N];
   float *d_a;

   cudaMalloc ((void **) &d_a, SIZE);
   cudaThreadSynchronize ();
   cudaMemcpy (d_a, h_a, SIZE, cudaMemcpyHostToDevice);
   arradd <<< n_blocks, block_size >>> (d_a, 10.0, N);
   cudaThreadSynchronize ();
   cudaMemcpy (h_a, d_a, SIZE, cudaMemcpyDeviceToHost);
   CUDA_SAFE_CALL (cudaFree (d_a));
}
How to get high-performance #1
Programmer-managed scratchpad memory
Bring data in from global memory and reuse it
16KB, banked; accessed in parallel by 16 threads
The programmer needs to:
Decide what to bring in and when
Decide which thread accesses what and when
Coordination is paramount
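A minimal sketch of this pattern, assuming blockDim.x equals the tile size; each block stages a tile of a[] in the scratchpad, synchronizes, and then each thread reuses its neighbour's staged element (the kernel name and the computation are illustrative):

#define TILE 64

__global__ void smooth_tiled (float *a, int N)
{
   __shared__ float tile[TILE];                  // programmer-managed scratchpad
   int i = blockIdx.x * blockDim.x + threadIdx.x;

   if (i < N) tile[threadIdx.x] = a[i];          // 1. bring data in from global memory
   __syncthreads ();                             // 2. wait until the whole tile is loaded

   if (i < N && threadIdx.x > 0)                 // 3. reuse a neighbour's element from the
      a[i] = 0.5f * (tile[threadIdx.x] +         //    fast on-chip copy instead of
                     tile[threadIdx.x - 1]);     //    re-reading global memory
}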
How to get high-performance #2
Global memory accesses
32 threads access memory together
They can coalesce into a single reference; e.g., a[threadID] works well
Control flow
32 threads run together
If they diverge, there is a performance penalty
Texture cache
When you think there is locality
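A hedged sketch of the first two points above (the kernel name, the stride, and the divergent branch are illustrative only):

__global__ void access_patterns (float *a, float *out, int N)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i >= N) return;

   float good = a[i];             // consecutive threads read consecutive words:
                                  // the hardware can coalesce these into few transactions
   float bad  = a[(i * 16) % N];  // a large stride spreads a warp's accesses over many
                                  // memory segments: little or no coalescing

   if (threadIdx.x % 2 == 0)      // threads of the same warp take different paths:
      good += bad;                // the two paths are serialized (divergence penalty)

   out[i] = good;
}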
Are GPUs really that much faster than CPUs?
50x-200x speedups are typically reported
Recent work found:
Not enough effort goes into optimizing code for CPUs
But:
The learning curve and the expertise needed for CPUs are much larger
ECE Overview - ECE research Profile
Personnel and budget; partnerships with industry
Our areas of expertise: Biomedical Engineering, Communications, Computer Engineering, Electromagnetics, Electronics, Energy Systems, Photonics, Systems Control
- Slides from F. Najm (Chair) and T. Sargent (Research Vice Chair)
About our group
Computer Architecture
How to build the best possible system Best: performance, power, cost, etc.
Expertise in high-end systems
Micro-architecture; multi-processor and multi-core systems
Current Research Support:
AMD, IBM, NSERC, Qualcomm (planned)
Claims to fame
Memory Dependence Prediction
Commercially implemented and licensed
Snoop Filtering: IBM Blue Gene
UofT-DRDC Partnership
Examples of industry research contracts with ECE in the past 8 years
AMD Agile Systems Inc Altera ARISE Technologies Asahi Kasei Microsystems Bell Canada Bell Mobility Cellular Bioscrypt Inc Broadcom Corporation Ciclon Semiconductor Cybermation Inc Digital Predictive Systems Inc. DPL Science Eastman Kodak Electro Scientific Industries EMS Technologies Exar Corp FOX-TEK Firan Technology Group Fuji Electric Fujitsu Gennum H2Green Energy Corporation Honeywell ASCa, Inc. Hydro One Networks Inc. IBM Canada Ltd. IBM IMAX Corporation Intel Corporation Jazz Semiconductor KT Micro LG Electronics Maxim MPB Technologies Microsoft Motorola Northrop Grumman NXP Semiconductors ON Semiconductor Ontario Lottery and Gaming Corp Ontario Power Generation Inc. Panasonic Semiconductor Singapore Peraso Technologies Inc. Philips Electronics North America Redline Communications Inc. Research in Motion Ltd. Right Track CAD Robert Bosch Corporation Samsung Thales Co., Ltd Semiconductor Research Corporation Siemens Aktiengesellschaft Sipex Corporation STMicroelectronics Inc. Sun Microsystems of Canada Inc. Telus Mobility Texas Instruments Toronto Hydro-Electric System Toshiba Corporation Xilinx Inc.
Eight Research Groups
1. Biomedical Engineering
2. Communications
3. Computer Engineering
4. Electromagnetics
5. Electronics
6. Energy Systems
7. Photonics
8. Systems Control
Computer Engineering Group
Human-Computer Interaction
Willy Wong, Steve Mann
Multi-sensor information systems
Parham Aarabi
Computer Hardware
Jonathan Rose, Steve Brown, Paul Chow, Jason Anderson
Computer Architecture
Greg Steffan, Andreas Moshovos, Tarek Abdelrahman, Natalie Enright Jerger
Computer Security
David Lie, Ashvin Goel
Biomedical Engineering Group
Neurosystems
Berj L. Bardakjian, Roman Genov, Willy Wong, Hans Kunov, Moshe Eizenman
Rehabilitation
Milos Popovic, Tom Chau.
Medical Imaging
Michael Joy, Adrian Nachman, Richard Cobbold, Ofer Levi
Proteomics
Brendan Frey, Kevin Truong
Communications Group
Study of the principles, mathematics and algorithms that underpin how information is encoded, exchanged and processed
Three Sub-Groups:
1. Networks 2. Signal Processing 3. Information Theory
Sequence Analysis
Image Analysis and Computer Vision
Pattern recognition and detection
Embedded computer vision; computer vision and graphics
Networks
Quantum Cryptography and Computing
Computer Engineering
System Software
Michael Stumm, H-A. Jacobsen, Cristiana Amza, Baochun Li
Computer-Aided Design of Circuits
Farid Najm, Andreas Veneris, Jianwen Zhu, Jonathan Rose
Electronics Group
14 active professors; largest electronics group in Canada. Breadth of research topics:
Electronic device modelling, semiconductor technology, VLSI CAD and systems, FPGAs, DSP and mixed-mode ICs, biomedical microsystems, high-speed and mm-wave ICs and SoCs
Lab for (on-wafer) SoC and IC testing through 220 GHz (UofT-IBM Partnership)
Intelligent Sensory Microsystems
Mixed-signal VLSI circuits
Low-power, low-noise signal processing, computing and ADCs
On-chip micro-sensors
Electrical, chemical, optical
Project examples
Brain-chip interfaces On-chip biochemical sensors CMOS imagers
mm-Wave and 100+GHz systems on chip
Modelling mm-wave and noise performance of active and passive devices past 300 GHz
60-120 GHz multi-gigabit data rate phased-array radios
Single-chip 76-79 GHz automotive radar
170 GHz transceiver with on-die antennas
Electromagnetics Group
Metamaterials: From microwaves to optics
Super-resolving lenses for imaging and sensing
Small antennas
Multiband RF components
CMOS phase shifters
Electromagnetics of High-Speed Circuits
Signal integrity in high-speed digital systems
Microwave integrated circuit design, modeling and characterization
Computational Electromagnetics
Interaction of Electromagnetic Fields with Living Tissue
Antennas
Telecom and wireless systems: reflectarrays, wave electronics, integrated antennas, controlled-beam antennas, adaptive and diversity antennas
METAMATERIALS (MTMs)
Super-lens capable of resolving details down to λ/6
Scanning antennas with CMOS MTM chips
Small and broadband antennas
Computational Electromagnetics
Fast CAD for RF/ optical structures
Examples: microstrip spiral inductor, optical power splitter
Modeling of Metamaterials
Plasmonic Left-Handed Media
Leaky-Wave Antennas
Energy Systems Group
Power Electronics
High-Power (> 1.2 MW) converters: modeling, control, and digital control realization
Micro-Power Grids: converters for distributed resources, dc distribution systems, and HVdc systems
Low-Power Electronics: integrated power supplies and power management systems-on-chip for low-power electronics (computers, cell phones, PDAs, MP3 players, body implants)
Harvesting Energy from humans
Energy Systems Research
Matrix Converter for Micro-Turbine Generator
IC for cell phone power supplies
Voltage control system for wind power generators
Photonics Group
Photonics Group: Bio-Photonics
Systems Control Group
Basic & applied research in control engineering
World-leading group in control theory
Example projects:
Optical signal-to-noise ratio optimization with game theory
Erbium-doped fibre amplifier design
Analysis and design of digital watermarks for authentication
Nonlinear control theory
Application to magnetic levitation and micro-positioning systems