Acceleration of scientific computing
using graphics hardware
Graham Pullan
Whittle Lab
Engineering Department, University of Cambridge
28 May 2008
(I’ve added some notes that weren’t on the original slides to help readers of the online PDF version.)
Coming up...
• Background
• CPUs and GPUs
• GPU programming models
• An example – CFD
• Alternative devices
• Conclusions
Part 1: Background
Whittle Lab
I work here
You are here
Turbomachinery
Engine calculation
Courtesy Vicente Jerez Fidalgo, Whittle Lab
CFD basics
Body-fitted mesh. For each cell, conserve:
• mass
• momentum
• energy
and update flow properties
Approximate compute requirements
“Steady” models (no wake/blade interaction, etc)
1 blade 0.5 Mcells 1 CPU hour
1 stage (2 blades) 1.0 Mcells 3 CPU hours
1 component (5 stages) 5.0 Mcells 20 CPU hours
“Unsteady” models (with wakes, etc)
1 component (1000 blades) 500 Mcells 0.1 M CPU hours
Engine (4000 blades) 2 Gcells 1 M CPU hours
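Note (not on the original slides): for scale, 0.1 M CPU hours ≈ 100,000 / 8,760 ≈ 11 years on a single CPU, and 1 M CPU hours ≈ 114 CPU-years – hence the interest in acceleration.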
Graham’s coding experience:
• FORTRAN
• C
• MPI
Graham’s coding experience:
• C
• MPI
Part 2: CPUs and GPUs
Moore’s Law
“The complexity for minimum component costs has
increased at a rate of roughly a factor of two per year.
Certainly over the short term this rate can be expected to
continue.”
Gordon Moore (Intel), 1965
“OK, maybe a factor of two every two years.”
Gordon Moore (Intel), 1975 [paraphrased]
Was Moore right?
Source: ftp://download.intel.com/research/silicon/Gordon_Moore_ISSCC_021003.pdf
Source: Intel
Feature size
Source: ftp://download.intel.com/research/silicon/Gordon_Moore_ISSCC_021003.pdf
Source: Intel
Clock speed
Source: http://www.tomshardware.com/2005/11/21/the_mother_of_all_cpu_charts_2005/index.html
Power – the Clock speed limiter?
• 1 GHz CPU requires ≈ 25 W
• 3 GHz CPU requires ≈ 100 W
“The total of electricity consumed by major search
engines in 2006 approaches 5 GW.” – Wired / AMD
Source: http://www.hotchips.org/hc19/docs/keynote2.pdf
What to do with all these transistors?
Parallel computing
Multi-core chips are either:
– Instruction parallel
(Multiple Instruction, Multiple Data) – MIMD
or
– Data parallel
(Single Instruction, Multiple Data) – SIMD
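Note (not on the original slides): a minimal sketch of the data-parallel (SIMD) style in CUDA – one kernel, the same operation applied to every element of an array. The kernel name and sizes are illustrative only, not from the talk.

__global__ void scale_add(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
    if (i < n)
        y[i] = a * x[i] + y[i];                     /* same instruction, many data */
}

/* launched with enough 256-thread blocks to cover n elements:  */
/* scale_add<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);      */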
Today’s commodity MIMD chips: CPUs
Intel Core 2 Quad
• 4 cores
• 2.4 GHz
• 65nm features
• 582 million transistors
• 8MB on chip memory
Today’s commodity SIMD chips: GPUs
NVIDIA 8800 GTX
• 128 cores
• 1.35 GHz
• 90nm features
• 681 million transistors
• 768MB on board memory
CPUs vs GPUs
Source: http://www.eng.cam.ac.uk/~gp10006/research/Brandvik_Pullan_2008a_DRAFT.pdf
CPUs vs GPUs
Transistor usage:
Source: NVIDIA CUDA SDK documentation
Graphics pipeline
Source: ftp://download.nvidia.com/developer/presentations/2004/Perfect_Kitchen_Art/English_Evolution_of_GPUs.pdf
GPUs and scientific computing
GPUs are designed to apply the
same shading function
to many pixels simultaneously
GPUs and scientific computing
GPUs are designed to apply the
same function
to many data simultaneously
This is what most scientific computing needs!
Part 3: Programming methods
3 Generations of GPGPU (Owens, 2008)
• Making it work at all:
– Primitive functionality and tools (graphics APIs)
– Comparisons with CPU not rigorous
• Making it work better:
– Easier to use (higher level APIs)
– Understanding of how best to do it
• Doing it right:
– Stable, portable, modular building blocks
Source: http://www.ece.ucdavis.edu/~jowens/talks/intel-santaclara-070420.pdf
GPU – Programming for graphics
Courtesy, John Owens, UC Davis
Application specifies geometry – GPU rasterizes
Each fragment is shaded (SIMD)
Shading can use values from memory (textures)
Image can be stored for re-use
Source: http://www.ece.ucdavis.edu/~jowens/talks/intel-santaclara-070420.pdf
GPGPU programming (“old-school”)
Draw a quad
Run a SIMD program over each fragment
Gather is permitted from texture memory
Resulting buffer can be stored for re-use
Courtesy, John Owens, UC Davis
NVIDIA G80 hardware implementation
• Vertex/fragment processors replaced by Unified Shaders
• Now view GPU as massively parallel co-processor
• Set of 16 SIMD multiprocessors (MPs), each with 8 cores
Source: http://www.ece.wisc.edu/~kati/fpga2008/fpga2008%20workshop%20-%2006%20NVIDIA%20-%20Kirk.pdf
NVIDIA G80 hardware implementation
Divide the 128 cores into 16 multiprocessors (MPs)
• Each MP has:
– Registers
– Shared memory
– Read-only constant cache
– Read-only texture cache
NVIDIA’s CUDA programming model
• G80 chip supports MANY active threads: 12,288
• Threads are lightweight:
– Little creation overhead
– “instant” switching
– Efficiency achieved through 1000s of threads
• Threads are organised into blocks (1D, 2D, 3D)
• Blocks are further organised into a grid
Kernels, grids, blocks and threads
• Organisation of threads and blocks is key abstraction
• Software:
– Threads from one block may cooperate:
• Using data in shared memory
• Through synchronising
• Hardware:
– A block runs on one MP
– Hardware free to schedule any block on any MP
– More than one block can reside on one MP
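Note (not on the original slides): a minimal sketch of block-level cooperation – threads in one block share data through __shared__ memory and synchronise with __syncthreads(). The block-wise sum below assumes a power-of-two block size of 256; names are illustrative only.

__global__ void block_sum(const float *in, float *block_totals)
{
    __shared__ float s[256];                     /* visible to all threads in this block */
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];  /* each thread loads one value */
    __syncthreads();                             /* wait until all loads are done */

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            s[tid] += s[tid + stride];           /* pairwise partial sums */
        __syncthreads();
    }
    if (tid == 0)
        block_totals[blockIdx.x] = s[0];         /* one result per block */
}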
CUDA implementation
• CUDA implemented as extensions to C
• CUDA programs:
– explicitly manage host and device memory:
• allocation
• transfers
– set thread blocks and grid
– launch kernels
– are compiled with the CUDA nvcc compiler
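Note (not on the original slides): a bare-bones host-side sketch of the pattern listed above – allocate host and device memory, transfer, set the thread blocks and grid, launch, transfer back. Names and sizes are placeholders, not from the talk.

#include <stdlib.h>

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void)
{
    const int n = 1024;

    /* allocate host memory */
    float *h_x = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) h_x[i] = 1.0f;

    /* allocate device memory */
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));

    /* host -> device transfer */
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    /* set thread blocks and grid, launch kernel */
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    scale<<<grid, block>>>(d_x, n);

    /* device -> host transfer */
    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    free(h_x);
    return 0;
}

/* compiled with the CUDA compiler: nvcc sketch.cu -o sketch */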
Part 4: An example – CFD
Distribution function
f = f(c, x, t)
c is microscopic velocity
ρ = ∫ f dc
ρu = ∫ c f dc
u is macroscopic velocity
Boltzmann equation
The evolution of f:
∂f/∂t + c·∇f = (∂f/∂t)_collisions
Major simplification (the BGK collision model):
∂f/∂t + c·∇f = −(1/τ)(f − f^eq)
Lattice Boltzmann Method
Uniform mesh (lattice). Restrict the microscopic velocities to a finite set cα:
ρ = ∑α fα
ρu = ∑α fα cα
Macroscopic flow
For 2D, 9 velocities (the D2Q9 lattice) recover:
• the isothermal, incompressible Navier-Stokes equations
• with viscosity ν = (τ − 1/2) Δx²/Δt
Solution procedure
1. Evaluate the macroscopic properties:
ρ = ∑α fα ,  ρu = ∑α fα cα
2. Evaluate fα^eq(ρ, u)
3. Find
fα* = fα − (1/τ)(fα − fα^eq)
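Note (not on the original slides): the slides don’t give fα^eq explicitly. For step 2, the standard D2Q9 equilibrium is
fα^eq = wα ρ [ 1 + 3(cα·u)/c² + 9(cα·u)²/(2c⁴) − 3(u·u)/(2c²) ]
with lattice speed c = Δx/Δt and weights w0 = 4/9, w1–4 = 1/9, w5–8 = 1/36; this is the generic textbook form, not necessarily the exact expression used in the demo code.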
Solution procedure
Simple prescriptions at boundary nodes
CPU code: main.c
/* Memory allocation */
f0 = (float *)malloc(ni*nj*sizeof(float));
...
/* Main loop */
Stream (...args...);
Apply_BCs (...args...);
Collide (...args...);
GPU code: main.cu
/* allocate memory on host */
f0 = (float *)malloc(ni*nj*sizeof(float));
/* allocate memory on device */
cudaMallocPitch((void **)&f0_data, &pitch,
sizeof(float)*ni, nj);
cudaMallocArray(&f0_array, &desc, ni, nj);
/* Main loop */
Stream (...args...);
Apply_BCs (...args...);
Collide (...args...);
CPU code – collide.c
for (j=0; j<nj; j++) {
for (i=0; i<ni; i++) {
i2d = I2D(ni,i,j);
/* Flow properties */
density = ...function of f’s ...
vel_x = ... “
vel_y = ... “
/* Equilibrium f’s */
f0eq = ... function of density, vel_x, vel_y ...
f1eq = ... “
/* Collisions */
f0[i2d] = rtau1 * f0[i2d] + rtau * f0eq;
f1[i2d] = rtau1 * f1[i2d] + rtau * f1eq;
...
}
}
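Note (not on the original slides): rtau and rtau1 presumably encode the relaxation of step 3, since
fα − (1/τ)(fα − fα^eq) = (1 − 1/τ) fα + (1/τ) fα^eq
i.e. rtau = 1/τ and rtau1 = 1 − 1/τ.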
GPU code – collide.cu – kernel wrapper
void collide( ... args ...)
{
/* Set thread blocks and grid */
dim3 grid = dim3(ni/TILE_I, nj/TILE_J);
dim3 block = dim3(TILE_I, TILE_J);
/* Launch kernel */
collide_kernel<<<grid, block>>>(... args ...);
}
GPU code – collide.cu - kernel
/* Evaluate indices */
i = blockIdx.x*TILE_I + threadIdx.x;
j = blockIdx.y*TILE_J + threadIdx.y;
i2d = i + j*pitch/sizeof(float);
/* Read from device global memory */
f0now = f0_data[i2d];
f1now = f1_data[i2d];
/* Calc flow, feq, collide, as CPU code */
/* Write to device global memory */
f0_data[i2d] = rtau1 * f0now + rtau * f0eq;
f1_data[i2d] = rtau1 * f1now + rtau * f1eq;
GPU code – stream.cu – kernel wrapper
void stream( ... args ...)
{
/* Copy linear memory to CUDA array */
cudaMemcpy2DToArray(f1_array, 0, 0,
(void *)f1_data, pitch,sizeof(float)*ni, nj,
cudaMemcpyDeviceToDevice);
/* Make CUDA array a texture */
f1_tex.filterMode = cudaFilterModePoint;
cudaBindTextureToArray(f1_tex, f1_array);
/* Set threads and launch kernel */
dim3 grid = dim3(ni/TILE_I, nj/TILE_J);
dim3 block = dim3(TILE_I, TILE_J);
stream_kernel<<<grid, block>>>(... args ...);
}
GPU code – stream.cu – kernel
/* indices */
i = blockIdx.x*TILE_I + threadIdx.x;
j = blockIdx.y*TILE_J + threadIdx.y;
i2d = i + j*pitch/sizeof(float);
/* stream using texture fetches */
f1_data[i2d] = tex2D(f1_tex, (i-1), j);
f2_data[i2d] = tex2D(f2_tex, i, (j-1));
...
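Note (not on the original slides): the texture references used here (f1_tex, f2_tex, ...) are not shown on the slides; with the texture reference API of that CUDA generation they would be declared once at file scope, roughly as:

/* 2D float texture references, bound to the CUDA arrays in stream() */
texture<float, 2, cudaReadModeElementType> f1_tex;
texture<float, 2, cudaReadModeElementType> f2_tex;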
CPU / GPU demo
Results
• 2D Lattice Boltzmann code: 15x speedup GPU vs CPU
• Real CFD is more complex:
– more kernels
– 3D
• To improve performance, make use of shared memory
3D stencil operations
• Most CFD operations use nearest neighbour lookups
(stencil operations)
• e.g. 7 point stencil: centre point + 6 nearest neighbours
• Load data into shared memory
• Perform stencil ops
• Export results to device global memory
• Read in more data into shared memory
Stencil operations
3D sub-domain – threads in one plane
Source: http://www.eng.cam.ac.uk/~gp10006/research/Brandvik_Pullan_2008a_DRAFT.pdf
CUDA stencil kernel
__global__ void smooth_kernel(float sf, float
*a_data, float *b_data){
/* shared memory array */
__shared__ float a[16][3][5];
/* fetch first planes */
a[i][0][k] = a_data[i0m10];
a[i][1][k] = a_data[i000];
a[i][2][k] = a_data[i0p10];
__syncthreads();
/* compute */
b_data[i000] = sf1*a[i][1][k] +
    sfd6*(a[im1][1][k] + a[ip1][1][k] +
          a[i][0][k] + a[i][2][k] +
          a[i][1][km1] + a[i][1][kp1]);
/* load next "j" plane and repeat ...*/
Typical grid – CUDA partitioning
Each colour to a different multiprocessor
3D results
30x speedup GPU vs CPU
Part 5: NVIDIA – the only show in town?
NVIDIA
• 4 Tesla HPC GPUs
• 500 GFLOPs peak per GPU
• 1.5GB per GPU
AMD
• Firestream HPC GPU
• 500 GFLOPs
• 2GB
• available?
ClearSpeed
80 GFLOPs
35 W !
IBM Cell BE
8 SPEs × 25 GFLOPs each
Chip comparison (Giles 2008)
Source: http://www.cardiff.ac.uk/arcca/services/events/NovelArchitecture/Mike-Giles.pdf
Too much choice!
• Each device has
– different hardware characteristics
– different software (C extensions)
– different developer tools
• How can we write code for all SIMD devices for all applications?
Big picture – all devices, all problems?
Forget the big picture
Tackle the dwarves!
The View from Berkeley (7 “dwarves”)
1. Dense Linear Algebra
2. Sparse Linear Algebra
3. Spectral Methods
4. N-Body Methods
5. Structured Grids
6. Unstructured Grids
7. MapReduce
Source: http://view.eecs.berkeley.edu/wiki/Main_Page
The View from Berkeley (13 dwarves?)
1. Dense Linear Algebra
2. Sparse Linear Algebra
3. Spectral Methods
4. N-Body Methods
5. Structured Grids
6. Unstructured Grids
7. MapReduce
8. Combinational Logic
9. Graph Traversal
10. Dynamic Programming
11. Backtrack and Branch-and-Bound
12. Graphical Models
13. Finite State Machines
SBLOCK (Brandvik)
• Tackle structured grid, stencil operations dwarf
• Define kernel using high level Python abstraction
• Generate kernel for a range of devices from the same definition: CPU, GPU, Cell
• Use MPI to handle multiple devices
SBLOCK kernel definition
kind = "stencil"
bpin = ["a"]
bpout = ["b"]
lookup = ((-1, 0, 0), (0, 0, 0), (1, 0, 0), (0, -1, 0),
          (0, 1, 0), (0, 0, -1), (0, 0, 1))
calc = {"lvalue": "b",
        "rvalue": """sf1*a[0][0][0] +
                     sfd6*(a[-1][0][0] + a[1][0][0] +
                           a[0][-1][0] + a[0][1][0] +
                           a[0][0][-1] + a[0][0][1])"""}
SBLOCK – CPU implementation (C)
void smooth(float sf, float *a, float *b)
{
for (k=0; k < nk; k++) {
for (j=0; j < nj; j++) {
for (i=0; i < ni; i++) {
/* compute indices i000, im100, etc */
b[i000] = sf1*a[i000] +
sfd6*(a[im100] + a[ip100] +
a[i0m10] + a[i0p10]
+ a[i00m1] + a[i00p1]);
}
}
}
}
SBLOCK – GPU implementation (CUDA)
__global__ void smooth_kernel(float sf, float
*a_data, float *b_data){
/* shared memory array */
__shared__ float a[16][3][5];
/* fetch first planes */
a[i][0][k] = a_data[i0m10];
a[i][1][k] = a_data[i000];
a[i][2][k] = a_data[i0p10];
__syncthreads();
/* compute */
b_data[i000] = sf1*a[i][1][k] +
    sfd6*(a[im1][1][k] + a[ip1][1][k] +
          a[i][0][k] + a[i][2][k] +
          a[i][1][km1] + a[i][1][kp1]);
/* load next "j" plane and repeat ...*/
Benefits of SBLOCK
So long as the task fits the dwarf:
• Programmer need not learn every device library
• Optimal device code is produced
• Code is future-proofed (so long as back-ends are available)
Part 6: Conclusions
Conclusions
• Many science applications fit the SIMD model
• GPUs are commodity SIMD chips
• Good speedups (10x – 100x) can be achieved
• GPGPU is evolving (Owens, UC Davis):
1. Making it work at all (graphics APIs)
2. Doing it better (high level APIs)
3. Doing it right (portable, modular building blocks)
Acknowledgements and info
• Research student: Tobias Brandvik (CUED)
• Donation of GPU hardware: NVIDIA
http://dx.doi.org/10.1109/JPROC.2008.917757
http://www.gpgpu.org
http://www.oerc.ox.ac.uk/research/many-core-and-reconfigurable-supercomputing