Cuda Examples

This document provides an overview of parallel programming with CUDA Fortran. It describes what CUDA Fortran is, shows simple examples of CUDA Fortran code, and discusses features such as variable qualifiers, function/subroutine qualifiers, and kernel loop directives.

Uploaded by

Riccardo Ferrero

Parallel Programming with CUDA Fortran

Outline

What is CUDA Fortran

Simple Examples
CUDA Fortran Features
Using CUBLAS with CUDA Fortran
Compilation

© NVIDIA Corporation 2011

CUDA Fortran

CUDA is a scalable programming model for parallel computing

CUDA Fortran is the Fortran analog of CUDA C
  Program host and device code similar to CUDA C
  Host code is based on the Runtime API
  Fortran language extensions simplify data management

Co-defined by NVIDIA and PGI, implemented in the PGI Fortran compiler
  Separate from PGI Accelerator, which is a directive-based, OpenMP-like interface to CUDA


CUDA Programming

Heterogeneous programming model
  CPU and GPU are separate devices with separate memory spaces
  Host code runs on the CPU
    Handles data management for both the host and device
    Launches kernels, which are subroutines executed on the GPU
  Device code runs on the GPU
    Executed by many GPU threads in parallel
  Allows for incremental development


F90 Example

module simpleOps_m
contains
  subroutine inc(a, b)
    implicit none
    integer :: a(:)
    integer :: b
    integer :: i, n

    n = size(a)
    do i = 1, n
      a(i) = a(i)+b
    enddo
  end subroutine inc
end module simpleOps_m

program incTest
  use simpleOps_m
  implicit none
  integer, parameter :: n = 256
  integer :: a(n), b

  a = 1 ! array assignment
  b = 3
  call inc(a, b)

  if (all(a == 4)) then
    write(*,*) 'Success'
  endif
end program incTest


CUDA Fortran - Host Code

CUDA Fortran:

program incTest
  use cudafor
  use simpleOps_m
  implicit none
  integer, parameter :: n = 256
  integer :: a(n), b
  integer, device :: a_d(n)      ! array in device memory

  a = 1
  b = 3

  a_d = a                        ! host-to-device copy
  call inc<<<1,n>>>(a_d, b)      ! launch kernel: 1 block of n threads
  a = a_d                        ! device-to-host copy

  if (all(a == 4)) then
    write(*,*) 'Success'
  endif
end program incTest

F90:

program incTest
  use simpleOps_m
  implicit none
  integer, parameter :: n = 256
  integer :: a(n), b

  a = 1
  b = 3

  call inc(a, b)

  if (all(a == 4)) then
    write(*,*) 'Success'
  endif
end program incTest
CUDA Fortran - Device Code

CUDA Fortran:

module simpleOps_m
contains
  attributes(global) subroutine inc(a, b)
    implicit none
    integer :: a(:)
    integer, value :: b          ! scalar passed by value
    integer :: i

    i = threadIdx%x              ! each thread handles one element
    a(i) = a(i)+b
  end subroutine inc
end module simpleOps_m

F90:

module simpleOps_m
contains
  subroutine inc(a, b)
    implicit none
    integer :: a(:)
    integer :: b
    integer :: i, n

    n = size(a)
    do i = 1, n
      a(i) = a(i)+b
    enddo
  end subroutine inc
end module simpleOps_m


Extending to Larger Arrays

Previous example works for small arrays

call inc<<<1,n>>>(a_d,b)

A single thread block is limited to n=1024 threads (Fermi) or n=512 threads (pre-Fermi)

For larger arrays, change the first execution configuration parameter, the number of thread blocks (the 1 in <<<1,n>>>)
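The per-block thread limit depends on the device, so it can be queried at run time instead of being hard-coded. The following is a minimal sketch, assuming the cudaGetDeviceProperties interface from the cudafor module and that device 0 is the GPU of interest; the program name is illustrative and not part of the slides.

```fortran
program queryBlockLimit
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat

  ! Device numbering in the runtime API starts at 0 (assumption: use the first device)
  istat = cudaGetDeviceProperties(prop, 0)
  if (istat /= cudaSuccess) then
    write(*,*) 'cudaGetDeviceProperties failed: ', trim(cudaGetErrorString(istat))
    stop
  endif

  write(*,*) 'Device: ', trim(prop%name)
  write(*,*) 'Max threads per block: ', prop%maxThreadsPerBlock
end program queryBlockLimit
```

The printed limit is the largest n that a launch of the form <<<1,n>>> can use on that device.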


Execution Model

Software -> Hardware

Thread -> Thread Processor
  Threads are executed by thread processors

Thread Block -> Multiprocessor
  Thread blocks are executed on multiprocessors
  Thread blocks do not migrate
  Several concurrent thread blocks can reside on a multiprocessor

Grid -> Device
  A kernel is launched on a device as a grid of thread blocks
Execution Configuration

Execution configuration is specified in host code

call inc<<<blocksPerGrid, threadsPerBlock>>>(a_d,b)

Previous example used a single thread block

call inc<<<1,n>>>(a_d,b)

Multiple thread blocks

tPB = 256
call inc<<<ceiling(real(n)/tPB),tPB>>>(a_d,b)
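The block count above rounds n/tPB up so that every element gets a thread. The sketch below, in standard Fortran, compares the slide's real-valued expression with an integer-only alternative. The module and function names (launchConfig_m, blocksReal, blocksInt) are hypothetical helpers, not part of CUDA Fortran. One caveat worth noting: real(n) is single precision by default and represents integers exactly only up to 2**24, so for very large n the real-valued form may undercount.

```fortran
module launchConfig_m
  implicit none
contains
  ! Block count as written in the slides (single-precision division)
  pure integer function blocksReal(n, tPB)
    integer, intent(in) :: n, tPB
    blocksReal = ceiling(real(n)/tPB)
  end function blocksReal

  ! Same quantity using integer arithmetic only (hypothetical alternative)
  pure integer function blocksInt(n, tPB)
    integer, intent(in) :: n, tPB
    blocksInt = (n + tPB - 1)/tPB
  end function blocksInt
end module launchConfig_m

program launchConfigDemo
  use launchConfig_m
  implicit none
  ! n not a multiple of tPB: both forms round up
  write(*,*) blocksReal(1000, 256), blocksInt(1000, 256)
  ! n from the large-array example: exact multiple of tPB
  write(*,*) blocksReal(1024*1024, 256), blocksInt(1024*1024, 256)
  ! n = 2**24 + 1: not exactly representable as a default real
  write(*,*) blocksReal(16777217, 256), blocksInt(16777217, 256)
end program launchConfigDemo
```

For the array sizes used in these slides both forms agree (4 and 4096 blocks). In the last case the integer form gives 65537, while the real-valued form may come out one block short.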


Large Array - Host Code
program incTest
use cudafor
use simpleOps_m
implicit none
integer, parameter :: n = 1024*1024
integer, parameter :: tPB = 256
integer :: a(n), b
integer, device :: a_d(n)
a = 1
b = 3
a_d = a
call inc<<<ceiling(real(n)/tPB),tPB>>>(a_d, b)
a = a_d
if (all(a == 4)) then
write(*,*) 'Success'
endif
end program incTest
Large Array - Device Code

module simpleOps_m
contains
attributes(global) subroutine inc(a, b)
implicit none
integer :: a(:)
integer, value :: b
integer :: i, n
i = (blockIdx%x-1)*blockDim%x + threadIdx%x
n = size(a)
if (i <= n) a(i) = a(i)+b
end subroutine inc
end module simpleOps_m


Multidimensional Arrays - Host

Execution Configuration

call inc<<<blocksPerGrid, threadsPerBlock>>>(a_d,b)

Grid dimensions in blocks (blocksPerGrid) and block dimensions (threadsPerBlock) can be either integer or of type dim3

type (dim3)
integer (kind=4) :: x, y, z
end type


Multidimensional Arrays - Device

Predefined variables in device subroutines

Grid and block dimensions - gridDim, blockDim
Block and thread indices - blockIdx, threadIdx
Of type dim3
type (dim3)
integer (kind=4) :: x, y, z
end type

blockIdx and threadIdx fields have unit offset

1 <= blockIdx%x <= gridDim%x
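Because the indices start at 1, the global index formula subtracts 1 from blockIdx only. The following host-side sketch in standard Fortran emulates the index calculation for a small case and checks that every array element is covered exactly once. The names indexMap_m and globalIndex are hypothetical, and the loops stand in for the blocks and threads a real launch would create.

```fortran
module indexMap_m
  implicit none
contains
  ! Global 1-based index of a thread, using the unit-offset convention
  pure integer function globalIndex(blockIdx_x, blockDim_x, threadIdx_x)
    integer, intent(in) :: blockIdx_x, blockDim_x, threadIdx_x
    globalIndex = (blockIdx_x - 1)*blockDim_x + threadIdx_x
  end function globalIndex
end module indexMap_m

program indexMapDemo
  use indexMap_m
  implicit none
  integer, parameter :: n = 10, tPB = 4
  integer, parameter :: nBlocks = (n + tPB - 1)/tPB   ! 3 blocks cover 10 elements
  integer :: hits(n), ib, it, i

  hits = 0
  do ib = 1, nBlocks          ! emulates blockIdx%x = 1 .. gridDim%x
    do it = 1, tPB            ! emulates threadIdx%x = 1 .. blockDim%x
      i = globalIndex(ib, tPB, it)
      if (i <= n) hits(i) = hits(i) + 1   ! same bounds check a kernel needs
    enddo
  enddo

  if (all(hits == 1)) write(*,*) 'Each element covered exactly once'
end program indexMapDemo
```

The last block produces indices 11 and 12, which fall outside the array; the bounds check discards them.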


2D Example - Host Code
program incTest
use cudafor
use simpleOps_m
implicit none
integer, parameter :: nx=1024, ny=512
real :: a(nx,ny), b
real, device :: a_d(nx,ny)
type(dim3) :: grid, tBlock

a = 1; b = 3

tBlock = dim3(32,8,1)
grid = dim3(ceiling(real(nx)/tBlock%x), ceiling(real(ny)/tBlock%y), 1)
a_d = a
call inc<<<grid,tBlock>>>(a_d, b)
a = a_d

write(*,*) 'Max error: ', maxval(abs(a-4))

end program incTest


2D Example - Device Code

module simpleOps_m
contains
  attributes(global) subroutine inc(a, b)
    implicit none
    real :: a(:,:)
    real, value :: b
    integer :: i, j

    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    j = (blockIdx%y-1)*blockDim%y + threadIdx%y

    if (i<=size(a,1) .and. j<=size(a,2)) &
      a(i,j) = a(i,j) + b
  end subroutine inc
end module simpleOps_m


CUDA Fortran Features

Variable Qualifiers
Subroutine/Function Qualifiers
Kernel Loop Directives (CUF Kernels)

Variable Qualifiers

Analogous to CUDA C
  device
  constant
    Read-only memory (device code) cached on-chip
  shared
    On-chip, shared between threads of a thread block
Additional
  pinned
    Page-locked host memory
  value
    Pass-by-value dummy arguments in device code
Textures will be available in 12.0
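The qualifiers listed above can be combined in one small program. The following is a sketch only: the module, kernel, and variable names (scale_m, scale, factor_c) are made up for illustration. It shows a constant module variable set from host code, a value dummy argument, a device array, and a pinned host array.

```fortran
module scale_m
  implicit none
  real, constant :: factor_c            ! constant memory: read-only in device code
contains
  attributes(global) subroutine scale(a, n)
    integer, value :: n                 ! scalar passed by value
    real :: a(n)
    integer :: i
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    if (i <= n) a(i) = factor_c*a(i)
  end subroutine scale
end module scale_m

program scaleTest
  use cudafor
  use scale_m
  implicit none
  integer, parameter :: n = 1024, tPB = 256
  real, pinned, allocatable :: a(:)     ! page-locked host memory
  real, device :: a_d(n)                ! device memory

  allocate(a(n))
  a = 1.0
  factor_c = 2.0                        ! host assignment into constant memory
  a_d = a
  call scale<<<(n+tPB-1)/tPB, tPB>>>(a_d, n)
  a = a_d
  write(*,*) 'Max error: ', maxval(abs(a-2.0))
  deallocate(a)
end program scaleTest
```

A shared array would be declared inside the kernel in the same way, for example real, shared :: s(256), and is visible to all threads of one block.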
Function/Subroutine Qualifiers

Designated by attributes() specifier

attributes(host)
  called from host and runs on host (default)
attributes(global)
  kernel, called from host and runs on device
  subroutine only
  no other prefixes allowed (recursive, elemental, or pure)
attributes(device)
  called from and runs on device
  can only appear within a Fortran module
  only additional prefix allowed is function return type
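As an illustration of how the qualifiers interact, the sketch below defines a device function inside a module and calls it from a kernel. The names deviceFunc_m, square, and squareAll are hypothetical and not taken from the slides.

```fortran
module deviceFunc_m
  implicit none
contains
  ! Device function: called from device code and runs on the device.
  ! The only extra prefix used is the return type.
  attributes(device) real function square(x)
    real :: x
    square = x*x
  end function square

  ! Kernel: called from host code and runs on the device
  attributes(global) subroutine squareAll(a, n)
    integer, value :: n
    real :: a(n)
    integer :: i
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    if (i <= n) a(i) = square(a(i))
  end subroutine squareAll
end module deviceFunc_m

program deviceFuncTest
  use cudafor
  use deviceFunc_m
  implicit none
  integer, parameter :: n = 1000, tPB = 256
  real :: a(n)
  real, device :: a_d(n)

  a = 3.0
  a_d = a
  call squareAll<<<(n+tPB-1)/tPB, tPB>>>(a_d, n)
  a = a_d
  write(*,*) 'Max error: ', maxval(abs(a-9.0))
end program deviceFuncTest
```

The host program itself needs no attributes() specifier, since attributes(host) is the default.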

Kernel Loop Directives (CUF Kernels)

Automatic kernel generation and invocation for a host code region containing tightly nested loops
!$cuf kernel do(2) <<< *,* >>>
do j=1, ny
do i = 1, nx
a_d(i,j) = b_d(i,j) + c_d(i,j)
enddo
enddo

Can specify parts of execution configuration

!$cuf kernel do(2) <<<(*,*),(32,4)>>>

Reduction using CUF Kernels

Compiler recognizes use of scalar reduction and generates one result

rsum = 0.0
!$cuf kernel do <<<*,*>>>
do i = 1, nx
rsum = rsum + a_d(i)
enddo
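For context, the reduction fragment can be placed in a complete host program. This is a minimal sketch under the assumption that a_d holds nx ones, so the expected sum is nx; the program name is illustrative.

```fortran
program cufReduce
  use cudafor
  implicit none
  integer, parameter :: nx = 1024
  real :: a(nx), rsum
  real, device :: a_d(nx)
  integer :: i

  a = 1.0
  a_d = a                     ! host-to-device copy

  rsum = 0.0                  ! rsum is a host scalar
  !$cuf kernel do <<<*,*>>>
  do i = 1, nx
    rsum = rsum + a_d(i)      ! compiler generates the device reduction
  enddo

  write(*,*) 'Sum: ', rsum, ' Expected: ', real(nx)
end program cufReduce
```

No kernel is written by hand here: the loop stays in host code and the directive asks the compiler to generate and launch the kernel.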

Calling CUBLAS from CUDA Fortran

Module which defines interfaces to CUBLAS from CUDA Fortran
  use cublas
Interfaces in three forms
  Overloaded BLAS interfaces that take device array arguments
    call saxpy(n, a_d, x_d, incx, y_d, incy)
  Legacy CUBLAS interfaces
    call cublasSaxpy(n, a_d, x_d, incx, y_d, incy)
  Multi-GPU version (CUDA 4.0) that utilizes a handle h
    istat = cublasSaxpy_v2(h, n, a_d, x_d, incx, y_d, incy)
Mixing the three forms is allowed

Calling CUBLAS from CUDA Fortran
program cublasTest
use cublas
implicit none

real, allocatable :: a(:,:),b(:,:),c(:,:)

real, device, allocatable :: a_d(:,:),b_d(:,:),c_d(:,:)
integer :: k=4, m=4, n=4
real :: alpha=1.0, beta=2.0, maxError

allocate(a(m,k), b(k,n), c(m,n), a_d(m,k), b_d(k,n), c_d(m,n))

a = 1; a_d = a
b = 2; b_d = b
c = 3; c_d = c

call cublasSgemm('N','N',m,n,k,alpha,a_d,m,b_d,k,beta,c_d,m)

c=c_d
write(*,*) 'Maximum error: ', maxval(abs(c-14.0))

deallocate (a,b,c,a_d,b_d,c_d)

end program cublasTest
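The reference value 14.0 follows from the SGEMM definition C = alpha*A*B + beta*C: with k = 4, each element of A*B is 4*(1*2) = 8, so the result is 1.0*8 + 2.0*3 = 14. The same check can be sketched on the host in standard Fortran using the matmul intrinsic; the program name is illustrative.

```fortran
program sgemmReference
  implicit none
  integer, parameter :: m = 4, n = 4, k = 4
  real :: a(m,k), b(k,n), c(m,n)
  real :: alpha = 1.0, beta = 2.0

  a = 1; b = 2; c = 3

  ! Host-side equivalent of the cublasSgemm call with 'N','N' (no transposes)
  c = alpha*matmul(a, b) + beta*c

  write(*,*) 'Maximum error: ', maxval(abs(c-14.0))
end program sgemmReference
```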

Compilation

Source-to-source compilation (generates CUDA C)
pgfortran - PGI’s Fortran compiler
  All source code with .cuf or .CUF is compiled as CUDA Fortran enabled automatically
  Flag to target architecture (e.g. -Mcuda=cc20)
  -Mcuda=emu specifies emulation mode
  Flag to target toolkit version (e.g. -Mcuda=cuda4.0)
  -Mcuda=fastmath enables faster intrinsics (__sinf())
  -Mcuda=nofma turns off fused multiply-add
  -Mcuda=maxregcount:<n> limits register use per thread
  -Mcuda=ptxinfo prints memory usage per kernel

Summary

CUDA Fortran provides a convenient interface for parallel programming
Fortran analog to CUDA C
CUDA Fortran has strong typing that allows simplified data management
Fortran 90’s array features carried to GPU
More info available at
http://www.pgroup.com/cudafortran

Runtime API (Host)

Runtime API defined in cudafor module

  Device management (cudaGetDeviceCount, cudaSetDevice, ...)
  Host-device synchronization (cudaDeviceSynchronize)
  Memory management (cudaMalloc/cudaFree, cudaMemcpy, cudaMemcpyAsync, ...)
    Mixing cudaMalloc/cudaFree with Fortran allocate/deallocate on a given array is not supported
    For device data, counts are in units of elements, not bytes
  Stream management (cudaStreamCreate, cudaStreamSynchronize, ...)
  Event management (cudaEventCreate, cudaEventRecord, ...)
  Error handling (cudaGetLastError, ...)
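Several of these runtime calls can be combined in a short host program. The sketch below is illustrative only (the program name and array size are assumptions): it checks for a device, times a host-to-device cudaMemcpy with events, and passes the count in elements rather than bytes.

```fortran
program runtimeApiSketch
  use cudafor
  implicit none
  integer, parameter :: n = 1024*1024
  real :: a(n)
  real, device :: a_d(n)
  type(cudaEvent) :: startEvent, stopEvent
  real :: elapsedMs
  integer :: istat, nDevices

  ! Device management, with basic error handling
  istat = cudaGetDeviceCount(nDevices)
  if (istat /= cudaSuccess) then
    write(*,*) 'cudaGetDeviceCount: ', trim(cudaGetErrorString(istat))
    stop
  endif
  istat = cudaSetDevice(0)

  ! Event management: time a host-to-device copy
  istat = cudaEventCreate(startEvent)
  istat = cudaEventCreate(stopEvent)

  a = 1.0
  istat = cudaEventRecord(startEvent, 0)
  istat = cudaMemcpy(a_d, a, n)          ! count is n elements, not bytes
  istat = cudaEventRecord(stopEvent, 0)
  istat = cudaEventSynchronize(stopEvent)
  istat = cudaEventElapsedTime(elapsedMs, startEvent, stopEvent)

  write(*,*) 'Devices found: ', nDevices
  write(*,*) 'Copy time (ms): ', elapsedMs

  istat = cudaEventDestroy(startEvent)
  istat = cudaEventDestroy(stopEvent)
end program runtimeApiSketch
```

The same copy could be written as the assignment a_d = a; the explicit call is shown here because it returns a status code and accepts an element count.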

Device Intrinsics

syncthreads subroutine
  Barrier synchronization for all threads in thread block
gpu_time subroutine
  Returns value of clock cycle counter on GPU
Atomic functions
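A typical use of syncthreads is to separate writes to shared memory from reads by other threads in the same block. The following sketch reverses a small array within a single thread block. The names reverse_m, reverseBlock, and nThreads are hypothetical, and the block size of 64 is an arbitrary choice.

```fortran
module reverse_m
  implicit none
  integer, parameter :: nThreads = 64
contains
  attributes(global) subroutine reverseBlock(d)
    real :: d(nThreads)
    real, shared :: s(nThreads)   ! on-chip, shared by the threads of the block
    integer :: t, tr

    t = threadIdx%x
    tr = nThreads - t + 1
    s(t) = d(t)                   ! each thread stages one element
    call syncthreads()            ! wait until all of s has been written
    d(t) = s(tr)                  ! read an element written by another thread
  end subroutine reverseBlock
end module reverse_m

program reverseTest
  use cudafor
  use reverse_m
  implicit none
  real :: a(nThreads), r(nThreads)
  real, device :: d_d(nThreads)
  integer :: i

  do i = 1, nThreads
    a(i) = i
    r(i) = nThreads - i + 1       ! expected reversed values
  enddo

  d_d = a
  call reverseBlock<<<1, nThreads>>>(d_d)
  a = d_d
  write(*,*) 'Max error: ', maxval(abs(a-r))
end program reverseTest
```

Without the barrier, a thread could read s(tr) before the thread responsible for that element has written it.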
