KEMBAR78
Nvidia Opencl Best Practices Guide: Optimization | PDF | Parallel Computing | Thread (Computing)
0% found this document useful (0 votes)
98 views49 pages

Nvidia Opencl Best Practices Guide: Optimization

Nvidia tips

Uploaded by

Yttria Therbium
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
98 views49 pages

Nvidia Opencl Best Practices Guide: Optimization

Nvidia tips

Uploaded by

Yttria Therbium
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 49

Optimization

NVIDIA OpenCL
Best Practices Guide

Version 2.3

August 31, 2009


NVIDIA OpenCL Best Practices Guide

REVISIONS

Original release: July 2009

ii August 31, 2009


Table of Contents

Preface............................................................................................................................ v 
What Is This Document? v 
Who Should Read This Guide? v 
Recommendations and Best Practices v 
Contents Summary vi 
Chapter 1. Heterogeneous Computing with OpenCL .....................................................1 
1.1  Differences Between Host and Device 1 
1.2  What Runs on an OpenCL-Enabled Device? 2 
1.3  Maximum Performance Benefit 3 
Chapter 2. Performance Metrics .....................................................................................5 
2.1 Timing 5 
2.1.1 Using CPU Timers 5 
2.1.2 Using OpenCL GPU Timers 6 
2.2 Bandwidth 6 
2.2.1 Theoretical Bandwidth Calculation 6 
2.2.2 Effective Bandwidth Calculation 7 
2.2.3 Throughput Reported by the OpenCL Visual Profiler 7 
Chapter 3. Memory Optimizations..................................................................................9 
3.1 Data Transfer Between Host and Device 9 
3.1.1 Pinned Memory 9 
3.2 Device Memory Spaces 12 
3.2.1 Coalesced Access to Global Memory 13 
3.2.1.1 A Simple Access Pattern 14 
3.2.1.2 A Sequential but Misaligned Access Pattern 14 
3.2.1.3 Effects of Misaligned Accesses 15 
3.2.1.4 Strided Accesses 17 
3.2.2 Shared Memory 18 
3.2.2.1 Shared Memory and Memory Banks 18 
3.2.2.2 Shared Memory in Matrix Multiplication (C = AB) 19 
T
3.2.2.3 Shared Memory in Matrix Multiplication (C = AA ) 23 
3.2.2.4 Shared Memory Use by Kernel Arguments 25 
3.2.3 Local Memory 25 

August 31, 2009 iii


NVIDIA OpenCL Best Practices Guide

3.2.4 Texture Memory 26 


3.2.4.1 Textured Fetch vs. Global Memory Read 26 
3.2.4.2 Additional Texture Capabilities 26 
3.2.5 Constant Memory 27 
3.2.6 Registers 27 
3.2.6.1 Register Pressure 27 
Chapter 4. NDRange Optimizations..............................................................................29 
4.1 Occupancy 29 
4.2 Calculating Occupancy 29 
4.3 Hiding Register Dependencies 31 
4.4 Thread and Block Heuristics 31 
4.5 Effects of Shared Memory 32 
Chapter 5. Instruction Optimizations ...........................................................................35 
5.1 Arithmetic Instructions 35 
5.1.1 Division and Modulo Operations 36 
5.1.2 Reciprocal Square Root 36 
5.1.3 Other Arithmetic Instructions 36 
5.1.4 Math Libraries 37 
5.2 Memory Instructions 37 
Chapter 6. Control Flow................................................................................................39 
6.1  Branching and Divergence 39 
6.2  Branch Predication 39 
Appendix A. Recommendations and Best Practices .....................................................41 
A.1 Overall Performance Optimization Strategies 41 
A.2 High-Priority Recommendations 42 
A.3 Medium-Priority Recommendations 42 
A.4 Low-Priority Recommendations 42 

iv August 31, 2009


Preface

What Is This Document?


This Best Practices Guide is a manual to help developers obtain the best performance
from the NVIDIA® CUDA™ architecture using OpenCL. It presents established
optimization techniques and explains coding metaphors and idioms that can greatly
simplify programming for the CUDA architecture.
While the contents can be used as a reference manual, you should be aware that
some topics are revisited in different contexts as various programming and
configuration topics are explored. As a result, it is recommended that first-time
readers proceed through the guide sequentially. This approach will greatly improve
your understanding of effective programming practices and enable you to better use
the guide for reference later.

Who Should Read This Guide?


This guide is intended for programmers who have basic familiarity with OpenCL
and have already written successful OpenCL programs.
It refers to and relies on several other documents that you should have at your
disposal for reference:
‰ NVIDIA OpenCL Getting Started Guide
‰ NVIDIA OpenCL Programming Guide
‰ OpenCL Specification
In particular, it assumes knowledge of the mapping between the OpenCL and the
CUDA architecture and terminology described in Chapter 2 of the NVIDIA
OpenCL programming guide.

Recommendations and Best Practices


Throughout this guide, specific recommendations are made regarding the design
and implementation of OpenCL application. These recommendations are
categorized by priority, which is a blend of the effect of the recommendation and its
scope. Actions that present substantial improvements for most OpenCL

August 31, 2009 v


NVIDIA OpenCL Best Practices Guide

applications have the highest priority, while small optimizations that affect only very
specific situations are given a lower priority.
Before implementing lower priority recommendations, it is good practice to make
sure all higher priority recommendations that are relevant have already been applied.
This approach will tend to provide the best results for the time invested and will
avoid the trap of premature optimization.
The criteria of benefit and scope for establishing priority will vary depending on the
nature of the program. In this guide, they represent a typical case. Your code might
reflect different priority factors. Regardless of this possibility, it is good practice to
verify that no higher priority recommendations have been overlooked before
undertaking lower priority items.
Appendix A of this document lists all the recommendations and best practices,
grouping them by priority and adding some additional helpful observations.

Contents Summary
The remainder of this guide is divided into the following sections:
‰ Introduction to Parallel Computing with OpenCL: Important aspects of the
parallel programming architecture.
‰ Performance Metrics: How should performance be measured in OpenCL
applications and what are the factors that most influence performance?
‰ Memory Optimizations: Correct memory management is one of the most
effective means of improving performance. This chapter explores the different
kinds of memory available to OpenCL applications, and it explains in detail how
memory is handled behind the scenes.
‰ NDRanges Optimizations: How to make sure your OpenCL application is
exploiting all the available resources on the GPU.
‰ Instruction Optimizations: Certain operations run faster than others. Using
faster operations and avoiding slower ones often confers remarkable benefits.
‰ Control Flow: Carelessly designed control flow can force parallel code into
serial execution; whereas thoughtfully designed control flow can help the
hardware perform the maximum amount of work per clock cycle.
‰ Getting the Right Answer: How to debug code and how to handle differences
in how the CPU and GPU represent floating-point values.

vi August 31, 2009


Chapter 1.
Heterogeneous Computing with OpenCL

OpenCL programming involves running code on two different platforms: a host


system that relies on one or more CPUs to perform calculations, and a card
(frequently a graphics adapter) with one or more OpenCL-enabled NVIDIA GPUs
(the device).
While NVIDIA devices are primarily associated with rendering graphics, they also
are powerful arithmetic engines capable of running thousands of lightweight threads
in parallel. This capability makes them well suited to computations that can leverage
parallel execution well.
However, the device is based on a distinctly different design from the host system
and, to use OpenCL effectively, it’s important to understand those differences and
how they determine the performance of OpenCL applications.

1.1 Differences Between Host and Device


The primary differences occur in threading and memory access:
‰ Threading resources. Execution pipelines on host systems can support a
limited number of concurrent threads. Servers that have four quad-core
processors today can run only 16 threads in parallel (32 if the CPUs support
HyperThreading.) By comparison, the smallest executable unit of parallelism on
a device, called a warp, comprises 32 threads. All NVIDIA GPUs can support
768 active threads per multiprocessor, and some GPUs support 1,024 active
threads per multiprocessor. On devices that have 30 multiprocessors (such as
the NVIDIA® GeForce® GTX 280), this leads to more than 30,000 active
threads. In addition, devices can hold literally billions of threads scheduled to
run on these GPUs.
‰ Threads. Threads on a CPU are generally heavyweight entities. The operating
system must swap threads on and off execution channels to provide
multithreading capability. Context switches (when two threads are swapped) are
therefore slow and expensive. By comparison, GPUs run extremely lightweight
threads. In a typical system, hundreds of threads are queued up for work (in
warps of 32 threads). If the GPU processor must wait on one warp of threads,
it simply begins executing work on another. Because registers are allocated to
active threads, no swapping of registers and state occurs between GPU threads.
Resources stay allocated to the thread until it completes its execution.
‰ RAM. Both the host system and the device have RAM. On the host system,
RAM is generally equally accessible to all code (within the limitations enforced
by the operating system). On the device, RAM is divided virtually and physically

August 31, 2009 1


NVIDIA OpenCL Best Practices Guide

into different types, each of which has a special purpose and fulfills different
needs. The types of device RAM are explained in the NVIDIA OpenCL
Programming Guide and in Chapter 3 of this document.
These are the primary hardware differences between CPU hosts and GPU devices
with respect to parallel programming. Other differences are discussed as they arise
elsewhere in this document.

1.2 What Runs on an OpenCL-Enabled Device?


Because of the considerable differences between host and device, it’s important to
partition applications so that each hardware system is doing the work it does best.
The following issues should be considered when determining what parts of an
application to run on the device:
‰ The device is ideally suited for computations that can be run in parallel. That is,
data parallelism is optimally handled on the device. This typically involves
arithmetic on large data sets (such as matrices), where the same operation can
be performed across thousands, if not millions, of elements at the same time.
This is a requirement of good performance on OpenCL-enabled devices: The
software must use a large number of threads. The support for running
numerous threads in parallel derives from the CUDA architecture’s use of a
lightweight threading model.
‰ There should be some coherence in memory access by a kernel. Certain
memory access patterns enable the hardware to coalesce groups of data items to
be written and read in one operation. Data that cannot be laid out so as to
enable coalescing, or that doesn’t have enough locality to use textures
efficiently, will not enjoy much of a performance lift when used in
computations on OpenCL-enabled devices.
‰ Traffic along the Peripheral Component Interconnect (PCI) bus should be
minimized. In OpenCL, data values must be transferred from the host to the
device. These transfers are costly in terms of performance and so they should
be minimized. (See section 3.1.) This cost has several ramifications:
¾ The complexity of operations should justify the cost of moving data to the
device. Code that transfers data for brief use by a small number of threads
will see little or no performance lift. The ideal scenario is one in which many
threads perform a substantial amount of work.
For example, transferring two matrices to the device to perform a matrix
addition and then transferring the results back to the host will not realize
much performance benefit. The issue here is the number of operations
performed per data element transferred. For the preceding procedure,
assuming matrices of size NxN, there are N2 operations (additions) and 3N2
elements transferred, so the operations-to-transfer ratio is 1:3 or O(1).
Performance benefits can be more readily achieved when the ratio of
operations to elements transferred is higher. For example, a matrix
multiplication of the same matrices requires N3 operations (multiply-add), so
the ratio of operations to element transferred is O(N), in which case the
larger the matrix the greater the performance benefit. The types of

2 August 31, 2009


Introduction to Parallel Computing with OpenCL

operations are an additional factor, as additions versus trigonometric


functions have different complexity profiles. It is important to include
transfers to and from the device in determining where operations should be
performed.
¾ Data should be kept on the device as long as possible. Because transfers
should be minimized, programs that run multiple kernels on the same data
should favor leaving the data on the device between kernel calls, rather than
transferring intermediate results to the host and then sending them back to
the device for subsequent calculations. So if the data were already on the
device in the previous example, the matrix addition should be performed
locally on the device. This approach should be used even if one of the steps
in a sequence of calculations could be performed faster on the host. Even a
relatively slow kernel may be advantageous if it avoids one or more PCI
Express (PCIe) transfers. Section 3.1 provides further details, including the
measurements of bandwidth between host and device versus within the
device proper.

1.3 Maximum Performance Benefit

High Priority: To get the maximum benefit from OpenCL, focus first on finding ways
to parallelize sequential code.

The amount of performance benefit an application will realize by using OpenCL


depends entirely on the extent to which it can be parallelized. As mentioned
previously, code that cannot be sufficiently parallelized should run on the host,
unless doing so would result in excessive transfers between host and device.
Amdahl’s law specifies the maximum speed-up that can be expected by parallelizing
portions of a serial program. Essentially, it states that the maximum speed-up (S) of
a program is
1
1
where P is the fraction of the total serial execution time taken by the portion of code
that can be parallelized and N is the number of processors over which the parallel
portion of the code runs.
The larger N is (that is, the greater the number of processors), the smaller the P/N
fraction. It can be simpler to view N as a very large number, which essentially
transforms the equation into S 1 / 1 P. Now, if ¾ of a program is parallelized,
the maximum speed-up over serial code is 1 / 1 – ¾ = 4.
For most purposes, the key point is that the greater P is, the greater the speed-up.
An additional caveat is implicit in this equation, which is that if P is a small number
(so not substantially parallel), increasing N does little to improve performance. To
get the largest lift, best practices suggest spending most effort on increasing P; that
is, by maximizing the amount of code that can be parallelized.

August 31, 2009 3


Chapter 2.
Performance Metrics

When attempting to optimize OpenCL code, it pays to know how to measure


performance accurately and to understand the role that bandwidth plays in
performance measurement. This chapter discusses how to correctly measure
performance using CPU timers and OpenCL events. It then explores how
bandwidth affects performance metrics and how to mitigate some of the challenges
it poses.

2.1 Timing
OpenCL calls and kernel executions can be timed using either CPU or GPU timers.
This section examines the functionality, advantages, and pitfalls of both approaches.

2.1.1 Using CPU Timers


Any CPU timer can be used to measure the elapsed time of an OpenCL call. The
details of various CPU timing approaches are outside the scope of this document,
but developers should always be aware of the resolution their timing calls provide.
When using CPU timers, it is critical to remember that some OpenCL function calls
can be non-blocking; that is, they return control back to the calling CPU thread
prior to completing their work. All kernel execution enqueue calls are non-blocking;
so are all memory transfer enqueue calls with the blocking parameter set to true.
Therefore, to accurately measure the elapsed time for a particular call or sequence of
OpenCL calls, it is necessary to synchronize the CPU thread with the GPU by
calling clFinish() for all command queues immediately before starting and
stopping the CPU timer. clFinish()blocks the calling CPU thread until all
OpenCL calls previously issued by the thread are completed.
Although it is also possible to synchronize the CPU thread with a particular
command queue or event on the GPU, these synchronization functions are not
suitable for timing code in a specific command queue. clFinish() blocks the CPU
thread until all OpenCL commands previously enqueued into the given queue have
completed. clWaitForEvents() can be used to block until some events in a
particular command queue have been recorded by the GPU. Because the driver may
interleave execution of OpenCL calls from different command queues, calls in other
command queue may be included in the timing.

August 31, 2009 5


NVIDIA OpenCL Best Practices Guide

2.1.2 Using OpenCL GPU Timers


Each enqueue call optionally returns an event object that uniquely identifies the
enqueued command. The event object of a command can be used to measure its
execution time if as detailed in Section 5.9 and illustrated in Listing 2.1. Profiling can
be enabled by setting the CL_QUEUE_PROFILING_ENABLE flag in properties
argument of either clCreateCommandQueue or clSetCommandQueueProperty.
cl_ulong start, end;

clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
sizeof(cl_ulong), &end, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
sizeof(cl_ulong), &start, NULL);

float executionTimeInMilliseconds = (end - start) * 1.0e-6f;

Listing 2.1 How to time code using OpenCL events

Note that the timings are measured on the GPU clock, and so are operating system–
independent. The resolution of the GPU timer is approximately half a microsecond.

2.2 Bandwidth
Bandwidth is one of the most important gating factors for performance. Almost all
changes to code should be made in the context of how they affect bandwidth. As
described in Chapter 3 of this guide, bandwidth can be dramatically affected by the
choice of memory in which data is stored, how the data is stored and accessed, as
well as other factors.
To measure performance accurately, it is useful to calculate theoretical and effective
bandwidth. When the latter is much lower than the former, design or
implementation details are likely to reduce bandwidth, and it should be the primary
goal of subsequent optimization efforts to increase it.

High Priority: Use the effective bandwidth of your computation as a metric when
measuring performance and optimization benefits.

2.2.1 Theoretical Bandwidth Calculation


Theoretical bandwidth can be calculated using hardware specifications available in
the product literature. For example, the NVIDIA GeForce GTX 280 uses DDR
(double data rate) RAM with a memory clock rate of 1,107 MHz and a 512-bit wide
memory interface.
Using these data items, the peak theoretical memory bandwidth of the NVIDIA
GeForce GTX 280 is
1107 x 106 x 512/8 x 2 / 109 141.6 GB/sec

6 August 31, 2009


Performance Metrics

In this calculation, the memory clock rate is converted in to Hz, multiplied by the
interface width (divided by 8, to convert bits to bytes) and multiplied by 2 due to the
double data rate. Finally, this product is divided by 109 to convert the result to
GB/sec (GBps).
Note that some calculations use 1,0243 instead of 109 for the final calculation. In
such a case, the bandwidth would be 131.9 GBps. It is important to use the same
divisor when calculating theoretical and effective bandwidth, so that the comparison
is valid.

2.2.2 Effective Bandwidth Calculation


Effective bandwidth is calculated by timing specific program activities and by
knowing how data is accessed by the program. To do so, use this equation
Effective bandwidth = Br Bw / 109 / time
where the effective bandwidth is in units of GBps, Br is the number of bytes read
per kernel, Bw is the number of bytes written per kernel, and time is given in
seconds.
For example, to compute the effective bandwidth of a 2048 x 2048 matrix copy, the
following formula could be used:
Effective bandwidth 20482 x 4 x 2 / 109 / time
The number of elements is multiplied by the size of each element (4 bytes for a
float), multiplied by 2 (because of the read and write), divided by 109 (or 1,0243) to
obtain GB of memory transferred. This number is divided by the time in seconds to
obtain GBps.

2.2.3 Throughput Reported by the OpenCL Visual Profiler


The memory throughput reported in the summary table of the OpenCL visual
profiler, differs from the effective bandwidth obtained by the calculation in section
2.2.2 in several respects.
The first difference is that the OpenCL Visual Profiler measures throughput using a
subset of the GPU’s multiprocessors and then extrapolates that number to the
entire GPU, thus reporting an estimate of the data throughput.
The second and more important difference is that because the minimum memory
transaction size is larger than most word sizes, the memory throughput reported by
the profiler includes the transfer of data not used by the kernel.
The effective bandwidth calculation in section 2.2.2, however, includes only data
transfers that are relevant to the algorithm. As such, the effective bandwidth will be
smaller than the memory throughput reported by the OpenCL Visual Profiler and is
the number to use when optimizing memory performance.
However, it’s important to note that both numbers are useful. The profiler memory
throughput shows how close the code is to the hardware limit, and the comparison
of the effective bandwidth with the profiler number presents a good estimate of
how much bandwidth is wasted by suboptimal coalescing of memory accesses.

August 31, 2009 7


Chapter 3.
Memory Optimizations

Memory optimizations are the most important area for performance. The goal is to
maximize the use of the hardware by maximizing bandwidth. Bandwidth is best
served by using as much fast memory and as little slow-access memory as possible.
This chapter discusses the various kinds of memory on the host and device and how
best to set up data items to use the memory effectively.

3.1 Data Transfer Between Host and Device


The bandwidth between the device memory and the GPU is much higher
(141 GBps on the NVIDIA GeForce GTX 280, for example) than the bandwidth
between host memory and device memory (8 GBps on the PCI Express ×16 Gen2).
Hence, for best overall application performance, it is important to minimize data
transfer between the host and the device, even if that means running kernels on the
GPU that do not demonstrate any speed-up compared with running them on the
host CPU.

High Priority: Minimize data transfer between the host and the device, even if it
means running some kernels on the device that do not show performance gains when
compared with running them on the host CPU.

Intermediate data structures should be created in device memory, operated on by


the device, and destroyed without ever being mapped by the host or copied to host
memory.
Also, because of the overhead associated with each transfer, batching many small
transfers into one larger transfer performs significantly better than making each
transfer separately.
Finally, higher bandwidth between host and device is achieved when using page-
locked (or pinned) memory, as discussed in the NVIDIA OpenCL Programming Guide
and section 3.1.1 of this document.

3.1.1 Pinned Memory


Page-locked or pinned memory transfers attain the highest bandwidth between host
and device. On PCIe ×16 Gen2 cards, for example, pinned memory can attain
greater than 5 GBps transfer rates.

August 31, 2009 9


NVIDIA OpenCL Best Practices Guide

OpenCL applications do not have direct control over whether memory objects are
allocated in pinned memory or not, but they can create objects using the
CL_MEM_ALLOC_HOST_PTR flag and such objects are likely to be allocated in
pinned memory by the driver for best performance. The oclBandwidthTest program in
the NVIDIA GPU Computing SDK shows how to use these functions as well as
how to measure memory transfer performance. Additional examples of pinned
memory usage are provided in the oclSobelFilter and oclMedianFilter program samples
in the NVIDIA GPU Computing SDK.
Pinned memory should not be overused. Excessive use can reduce overall system
performance because pinned memory is a scarce resource. How much is too much
is difficult to tell in advance, so as with all optimizations, test the applications and
the systems they run on for optimal performance parameters.
The steps normally needed to use pinned memory are briefly summarized in the
following example.

1) Declare cl_mem buffer objects for the pinned host memory and the GPU
device GMEM, respectively, and standard pointers to reference pinned host
memory.
cl_context cxGPUContext;
cl_mem cmPinnedBufIn = NULL;
cl_mem cmPinnedBufOut = NULL;
cl_mem cmDevBufIn = NULL;
cl_mem cmDevBufOut = NULL;
unsigned char* cDataIn = NULL;
unsigned char* cDataOut = NULL;

2) Allocate cl_mem buffer objects for the pinned host memory and the GPU
device GMEM, respectively. Because these are time consuming operations, and
because many applications don’t need to change the size of these buffers within
time-critical code paths, these functions are commonly executed in an
application initialization function or event driven function (not in any program
loop to be executed quickly and frequently).
cmPinnedBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY |
CL_MEM_ALLOC_HOST_PTR, memSize,
NULL, NULL);
cmPinnedBufOut = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY |
CL_MEM_ALLOC_HOST_PTR, memSize,
NULL, NULL);

cmDevBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY,


memSize, NULL, NULL);
cmDevBufOut = clCreateBuffer(cxGPUContext,
CL_MEM_WRITE_ONLY, memSize,
NULL, NULL);

10 August 31, 2009


Memory Optimizations

3) Map standard pointer to reference the pinned host memory input and output
buffers with standard pointers.
cDataIn = (unsigned char*)clEnqueueMapBuffer(cqCommandQue,
cmPinnedBufIn, CL_TRUE,
CL_MAP_WRITE, 0, memSize, 0,
NULL, NULL, NULL);
cDataOut = (unsigned char*)clEnqueueMapBuffer(cqCommandQue,
cmPinnedBufOut, CL_TRUE,
CL_MAP_READ, 0, memSize, 0,
NULL, NULL, NULL);

4) Initialize or update the pinned memory content, using the standard host pointer
and standard host code. This might be done during program initialization
function or at any appropriate time by such means as an asynchronous data
acquisition function.
for(unsigned int i = 0; i < memSize; i++)
{
cDataIn[i] = (unsigned char)(i & 0xff);
}

5) Write data from pinned host memory to the GPU device GMEM any time in
the application that “fresh” data has been written to the pinned host memory.
This step #5, along with steps #6 and #7, commonly constitute a core
sequence (copy input data to GPU, compute on GPU, copy results back to
CPU) in an application with a recurring main loop, such as an application with a
GLUT display callback loop.

clEnqueueWriteBuffer(cqCommandQue, cmDevBufIn, CL_FALSE, 0,


szBuffBytes, cDataIn, 0, NULL, NULL);

6) Run computation kernel on the GPU device.

clEnqueueNDRangeKernel(cqCommandQue, …);

7) Read data from GPU device GMEM to pinned host memory. Note that this
example uses a blocking read to assure the read is complete (which would make
sense if the next step in the application was to display or otherwise use the
processed data from the host). Also note that this read would be unnecessary in
applications using CL-GL interop if the destination for the computed data is
only a graphics window, not main CPU memory.

clEnqueueReadBuffer(cqCommandQue, cmDevBufOut, CL_TRUE, 0,


szBuffBytes, cDataOut, 0, NULL, NULL)

August 31, 2009 11


NVIDIA OpenCL Best Practices Guide

3.2 Device Memory Spaces


CUDA devices use several memory spaces, which have different characteristics that
reflect their distinct usages in OpenCL applications. These memory spaces include
global, local, shared, texture, and registers, as shown in Figure 3.2. The CUDA local
memory space is not to be confused with OpenCL local memory, which maps to
CUDA shared memory.

To Host

Figure 3.2 The various memory spaces on a CUDA device

Of these different memory spaces, global and texture memory are the most
plentiful. There is a 16 KB per thread limit on local memory, a total of 64 KB of
constant memory, and a limit of 16 KB of shared memory, and either 8,192 or
16,384 32-bit registers per multiprocessor. Global, local, and texture memory have
the greatest access latency (although texture is cached), followed by constant
memory, registers, and shared memory.
The various principal traits of the memory types are shown in Table 3.1.
Table 3.1 Salient features of device memory

Memory Location Cached Access Scope Lifetime


on/off chip

Register On n/a R/W 1 thread Thread

Local Off No R/W 1 thread Thread

Shared On n/a R/W All threads in block Block

Global Off No R/W All threads + host Host allocation

12 August 31, 2009


Memory Optimizations

Constant Off Yes R All threads + host Host allocation

Texture Off Yes R All threads + host Host allocation

3.2.1 Coalesced Access to Global Memory


High Priority: Ensure global memory accesses are coalesced whenever possible.

Perhaps the single most important performance consideration in programming for


the CUDA architecture is coalescing global memory accesses. Global memory loads
and stores by threads of a half warp (16 threads) are coalesced by the device in as
few as one transaction (or two transactions in the case of 128-bit words) when
certain access requirements are met. To understand these access requirements,
global memory should be viewed in terms of aligned segments of 16 and 32 words.
Figure 3.3 helps explain coalescing of a half warp of 32-bit words, such as floats. It
shows global memory as rows of 64-byte aligned segments (16 floats). Two rows of
the same color represent a 128-byte aligned segment. A half warp of threads that
accesses the global memory is indicated at the bottom of the figure.

Figure 3.3 Linear memory segments and threads in a half warp

The access requirements for coalescing depend on the compute capability of the
device:
‰ On devices of compute capability 1.0 or 1.1, the k-th thread in a half warp must
access the k-th word in a segment aligned to 16 times the size of the elements
being accessed; however, not all threads need to participate.
‰ On devices of compute capability 1.2 or higher, coalescing is achieved for any
pattern of accesses that fits into a segment size of 32 bytes for 8-bit words,
64 bytes for 16-bit words, or 128 bytes for 32- and 64-bit words. Smaller
transactions may be issued to avoid wasting bandwidth. More precisely, the
following protocol is used to issue a memory transaction for a half warp:
¾ Find the memory segment that contains the address requested by the lowest
numbered active thread. Segment size is 32 bytes for 8-bit data, 64 bytes for
16-bit data, and 128 bytes for 32-, 64-, and 128-bit data.

August 31, 2009 13


NVIDIA OpenCL Best Practices Guide

¾ Find all other active threads whose requested address lies in the same
segment, and reduce the transaction size if possible:
Š If the transaction is 128 bytes and only the lower or upper half is used,
reduce the transaction size to 64 bytes.
Š If the transaction is 64 bytes and only the lower or upper half is used,
reduce the transaction size to 32 bytes.
¾ Carry out the transaction and mark the serviced threads as inactive.
¾ Repeat until all threads in the half warp are serviced.
These concepts are illustrated in the following simple examples.

3.2.1.1 A Simple Access Pattern


The first and simplest case of coalescing can be achieved by any device: the k-th
thread accesses the k-th word in a segment; the exception is that not all threads
need to participate. (See Figure 3.4.)

Figure 3.4 Coalesced access in which all threads but one access the
corresponding word in a segment

This access pattern results in a single 64-byte transaction, indicated by the red
rectangle. Note that even though one word is not requested, all data in the segment
are fetched. If accesses by threads were permuted within this segment, still one 64-
byte transaction would be performed by a device with compute capability 1.2 or
higher, but 16 serialized transactions would be performed by a device with compute
capability 1.1 or lower.

3.2.1.2 A Sequential but Misaligned Access Pattern


If sequential threads in a half warp access memory that is sequential but not aligned
with the segments, then a separate transaction results for each element requested on
a device with compute capability 1.1 or lower. On a device with compute capability
1.2 or higher, several different scenarios can arise depending on whether all
addresses for a half warp fall within a single 128-byte segment. If the addresses fall
within a 128-byte segment, then a single 128-byte transaction is performed, as
shown in Figure 3.5.

14 August 31, 2009


Memory Optimizations

Figure 3.5 Unaligned sequential addresses that fit within a single 128-
byte segment

If a half warp accesses memory that is sequential but split across two 128-byte
segments, then two transactions are performed. In the following case, illustrated in
Figure 3.6, one 64-byte transaction and one 32-byte transaction result.

Figure 3.6 Misaligned sequential addresses that fall within two 128-byte
segments

Device memory allocated through OpenCL is guaranteed to be aligned to at least


256 bytes. Therefore, choosing sensible thread block sizes, such as multiples of 16,
facilitates memory accesses by half warps that are aligned to segments. In addition,
the qualifiers __attribute__ ((aligned(8))) and
__attribute__ ((aligned(16))) can be used when defining structures to ensure
alignment to segments.

3.2.1.3 Effects of Misaligned Accesses


It is easy and informative to explore the ramifications of misaligned accesses using a
simple copy kernel, such as the one in Listing 3.5.
__kernel void offsetCopy(__global float *odata,
__global float* idata,
int offset)
{
int xid = get_global_id(0) + offset;
odata[xid] = idata[xid];
}

Listing 3.5 A copy kernel that illustrates misaligned accesses

In Listing 3.5, data is copied from the input array idata to the output array, both of
which exist in global memory. The kernel is executed within a loop in host code that

August 31, 2009 15


NVIDIA OpenCL Best Practices Guide

varies the parameter offset from 1 to 32. (Figures 3.5 and 3.6 correspond to
offsets of 1 and 17, respectively.) The effective bandwidth for the copy with various
offsets on an NVIDIA GeForce GTX 280 (with compute capability 1.3) and an
NVIDIA GeForce GTX 8800 (compute capability 1.0) are shown in Figure 3.7.

Figure 3.7 Performance of offsetCopy kernel

For the NVIDIA GeForce GTX 8800 device, global memory accesses with no
offset or with offsets that are multiples of 16 result in a single transaction per half
warp and an effective bandwidth of approximately 74 GBps. Otherwise, 16
transactions are issued per half warp resulting in an effective bandwidth of
approximately 7 GBps. This roughly 8x performance degradation is due to the fact
that 32 bytes, the minimum transaction size, are fetched for each thread. However,
only 4 bytes of data are used for each 32 bytes fetched—resulting in the 4/32=1/8
performance relative to the fully coalesced case. The two numbers also reflect the
different data represented by effective bandwidth (4 bytes) versus actual bandwidth
(32 bytes).
Because of this possible performance degradation, memory coalescing is the most
critical aspect of performance optimization of device memory. For the NVIDIA
GeForce GTX 280 device, the situation is less dire for misaligned accesses because,
in all cases, access by a half warp of threads in this kernel results in either one or
two transactions. As such, the effective bandwidth is between 120 GBps for a single
transaction and 70 GBps for two transactions per half warp. The number of
transactions issued for a half warp of threads depends on the offset and whether the
warp is even- or odd-numbered. For offsets of 0 or 16, each half warp results in a
single 64-byte transaction (Figure 3.4). For offsets of 1 through 7 or 9 through 15,
even-numbered warps result in a single 128-byte transaction (Figure 3.5) and odd-
numbered warps result in two transactions: one 64-byte and one 32-byte (Figure
3.6). For offsets of 8, even-numbered warps result in one 128-byte transaction and
odd-numbered warps result in two 32-byte transactions. The two 32-byte

16 August 31, 2009


Memory Optimizations

transactions, rather than a 64- and a 32-byte transaction, are responsible for the blip
at the offset of 8 in Figure 3.7.

3.2.1.4 Strided Accesses


Although the relaxed coalescing restrictions for devices with compute capability 1.2
or higher achieve one-half full bandwidth for the offset copy case just described,
performance on such devices can degrade when successive threads in a half warp
access memory locations that have non-unit strides. This pattern occurs frequently
when dealing with multidimensional data or matrices; for example, when a half warp
of threads accesses matrix elements columnwise and the matrix is stored in row-
major order.
To illustrate the effect of strided access on effective bandwidth, see the following
kernel strideCopy(), which copies data with a stride of stride elements between
threads from idata to odata.
__kernel void strideCopy(__global float* odata,
__global float* idata,
int stride)
{
int xid = get_global_id(0) * stride;
odata[xid] = idata[xid];
}

Listing 3.6 A kernel to illustrate non-unit stride data copy

Figure 3.8 illustrates a situation that can be created using the code in Listing 3.6;
namely, threads within a half warp access memory with a stride of 2. This action is
coalesced into a single 128-byte transaction on an NVIDIA GeForce GTX 280
(compute capability 1.3).

Figure 3.8 A half warp accessing memory with a stride of 2

Although a stride of 2 results in a single transaction, note that half the elements in
the transaction are not used and represent wasted bandwidth. As the stride
increases, the effective bandwidth decreases until the point where 16 transactions
are issued for the 16 threads in a half warp, as indicated in Figure 3.9.

August 31, 2009 17


NVIDIA OpenCL Best Practices Guide

Figure 3.9 Performance of strideCopy kernel

Note, however, that on the NVIDIA GTX 8800 device (compute capability 1.0),
any non-unit stride results in 16 separate transactions per half warp.
As illustrated in Figure 3.9, non-unit stride global memory accesses should be
avoided whenever possible. One method for doing so utilizes shared memory,
which is discussed in the next section.

3.2.2 Shared Memory


Because it is on-chip, shared memory (i.e. OpenCL __local memory) is much faster
than local and global memory. In fact, shared memory latency is roughly 100x lower
than global memory latency—provided there are no bank conflicts between the
threads, as detailed in the following section.

3.2.2.1 Shared Memory and Memory Banks


To achieve high memory bandwidth for concurrent accesses, shared memory is
divided into equally sized memory modules, called banks, that can be accessed
simultaneously. Therefore, any memory load or store of n addresses that spans n
distinct memory banks can be serviced simultaneously, yielding an effective
bandwidth that is n times as high as the bandwidth of a single bank.
However, if multiple addresses of a memory request map to the same memory bank,
the accesses are serialized. The hardware splits a memory request that has bank
conflicts into as many separate conflict-free requests as necessary, decreasing the
effective bandwidth by a factor equal to the number of separate memory requests.
The one exception here is when all threads in a half warp address the same shared
memory location, resulting in a broadcast.
To minimize bank conflicts, it is important to understand how memory addresses
map to memory banks and how to optimally schedule memory requests.

18 August 31, 2009


Memory Optimizations

Medium Priority: Accesses to shared memory should be designed to avoid serializing


requests due to bank conflicts.

Shared memory banks are organized such that successive 32-bit words are assigned
to successive banks and each bank has a bandwidth of 32 bits per clock cycle. The
bandwidth of shared memory is 32 bits per bank per clock cycle.
For devices of compute capability 1.x, the warp size is 32 threads and the number of
banks is 16. A shared memory request for a warp is split into one request for the
first half of the warp and one request for the second half of the warp. Note that no
bank conflict occurs if only one memory location per bank is accessed by a half
warp of threads. Refer to the NV IDIA OpenCL Programming Guide for more
information on how accesses and banks can be matched to avoid conflicts.

3.2.2.2 Shared Memory in Matrix Multiplication (C = AB)


Shared memory enables cooperation between threads in a block. When multiple
threads in a block use the same data from global memory, shared memory can be
used to access the data from global memory only once. Shared memory can also be
used to avoid uncoalesced memory accesses by loading and storing data in a
coalesced pattern from global memory and then reordering it in shared memory.
Aside from memory bank conflicts, there is no penalty for nonsequential or
unaligned accesses by a half warp in shared memory.
The use of shared memory is illustrated via the simple example of a matrix
multiplication C = AB for the case with A of dimension Mx16, B of dimension
16xN, and C of dimension MxN. To keep the kernels simple, M and N are multiples
of 16. A natural decomposition of the problem is to use a block and tile size of
16x16 threads. Therefore, in terms of 16x16 tiles, A is a column matrix, B is a row
matrix, and C is their outer product. (See Figure 3.10.) A grid of N/16 by M/16
blocks is launched, where each thread block calculates the elements of a different
tile in C from a single tile of A and a single tile of B.

August 31, 2009 19


NVIDIA OpenCL Best Practices Guide

Figure 3.10 A block-column matrix (A) multiplied by a block-row matrix


(B) and the resulting product matrix (C)

To do this, the simpleMultiply kernel (Listing 3.7) calculates the output elements
of a tile of matrix C.
__kernel void simpleMultiply(__global float* a,
__global float* b,
__global float* c,
int N)
{
int row = get_global_id(1);
int col = get_global_id(0);
float sum = 0.0f;
for (int i = 0; i < TILE_DIM; i++) {
sum += a[row*TILE_DIM+i] * b[i*N+col];
}
c[row*N+col] = sum;
}

Listing 3.7 Unoptimized matrix multiplication

In Listing 3.7, a, b, and c are pointers to global memory for the matrices A, B, and
C, respectively; blockDim.x, blockDim.y, and TILE_DIM are all 16. Each thread in
the 16x16 block calculates one element in a tile of C. row and col are the row and
column of the element in C being calculated by a particular thread. The for loop
over i multiplies a row of A by a column of B, which is then written to C.
The effective bandwidth of this kernel is only 8.7 GBps on an NVIDIA GeForce
GTX 280 and 0.7 GBps on an NVIDIA GeForce GTX 8800. To analyze
performance, it is necessary to consider how half warps of threads access global
memory in the for loop. Each half warp of threads calculates one row of a tile of C,
which depends on a single row of A and an entire tile of B as illustrated in Figure
3.11.

20 August 31, 2009


Memory Optimizations

Figure 3.11 Computing a row (half warp) of a tile in C using one row of A
and an entire tile of B

For each iteration i of the for loop, all threads in a half warp read the same value
from global memory (the index row*TILE_DIM+i is constant within a half warp),
resulting in 16 transactions for compute capability 1.1 or lower, and 1 transaction
for compute capability 1.2 or higher. Even though the operation requires only 1
transaction for compute capability 1.2 or higher, there is wasted bandwidth in the
transaction because only 4 bytes out of a 32-byte transaction are used. For each
iteration, the 16 threads in a half warp read a row of the B tile, which is a sequential
and coalesced access for all compute capabilities.
The performance on a device of any compute capability can be improved by reading
a tile of A into shared memory as shown in Listing 3.8.
__kernel void coalescedMultiply(__global float* a,
__global float* b,
__global float* c,
int N,
__local float aTile[TILE_DIM][TILE_DIM])
{
int row = get_global_id(1);
int col = get_global_id(0);
float sum = 0.0f;
int x = get_local_id(0);
int y = get_local_id(1);
aTile[y][x] = a[row*TILE_DIM+x];
for (int i = 0; i < TILE_DIM; i++) {
sum += aTile[y][i]* b[i*N+col];
}
c[row*N+col] = sum;
}

August 31, 2009 21


NVIDIA OpenCL Best Practices Guide

Listing 3.8 Using shared memory to improve the global memory load
efficiency in matrix multiplication

In Listing 3.8, each element in a tile of A is read from global memory only once, in a
fully coalesced fashion (with no wasted bandwidth), to shared memory. Within each
iteration of the for loop, a value in shared memory is broadcast to all threads in a
half warp.
In Listing 3.8, a synchronization barrier call is not needed after reading the tile of A
into shared memory because only threads within the half warp that write the data
into shared memory read the data. This kernel has an effective bandwidth of 14.3
GBps on an NVIDIA GeForce GTX 280, and 8.2 GBps on an NVIDIA GeForce
GTX 8800.
A further improvement can be made to how Listing 3.8 deals with matrix B. In
calculating a tile’s row of matrix C, the entire tile of B is read. The repeated reading
of the B tile can be eliminated by reading it into shared memory once (Listing 3.9).
__kernel void sharedABMultiply(__global float* a,
__global float* b,
__global float* c,
int N,
__local float aTile[TILE_DIM][TILE_DIM],
__local float bTile[TILE_DIM][TILE_DIM])
{
int row = get_global_id(1);
int col = get_global_id(0);
float sum = 0.0f;
int x = get_local_id(0);
int y = get_local_id(1);
aTile[y][threadIdx.x] = a[row*TILE_DIM+x];
bTile[y][threadIdx.x] = b[y*N+col];
barrier(CLK_LOCAL_MEM_FENCE);
for (int i = 0; i < TILE_DIM; i++) {
sum += aTile[y][i]* bTile[i][x];
}
c[row*N+col] = sum;
}

Listing 3.9 Improvement by reading additional data into shared memory

Note that in Listing 3.9, a barrier() call is required after reading the B tile because
a warp reads data from shared memory that were written to shared memory by
different warps. The effective bandwidth of this routine is 29.7 GBps on an
NVIDIA GeForce GTX 280 and 15.7 GBps on an NVIDIA GeForce GTX 8800.
Note that the performance improvement is not due to improved coalescing in either
case, but to avoiding redundant transfers from global memory.
The results of the various optimizations are summarized in Table 3.2.
Table 3.2 Performance improvements optimizing C = AB matrix multiply

Optimization NVIDIA GeForce NVIDIA GeForce


GTX 280 GTX 8800

No optimization 8.8 GBps 0.7 GBps

22 August 31, 2009


Memory Optimizations

Coalesced using shared


14.3 GBps 8.2 GBps
memory to store a tile of A

Using shared memory to


eliminate redundant reads 29.7 GBps 15.7 GBps
of a tile of B

Medium Priority: Use shared memory to avoid redundant transfers from global
memory.

3.2.2.3 Shared Memory in Matrix Multiplication (C = AAT)


A variant of the previous matrix multiplication can be used to illustrate how strided
accesses to global memory, as well as shared memory bank conflicts, are handled.
This variant simply uses the transpose of A rather than B, or C = AAT.
A simple implementation for C = AAT is shown in Listing 3.10.
__kernel void simpleMultiply(__global float *a,
__global float *c,
int M)
{
int row = get_global_id(1);
int col = get_global_id(0);
float sum = 0.0f;
for (int i = 0; i < TILE_DIM; i++) {
sum += a[row*TILE_DIM+i] * a[col*TILE_DIM+i];
}
c[row*M+col] = sum;
}

Listing 3.10 Unoptimized handling of strided accesses to global memory

In Listing 3.10, the row-th, col-th element of C is obtained by taking the dot product
of the row-th and col-th rows of A. The effective bandwidth for this kernel is
1.1 GBps on an NVIDIA GeForce GTX 280 and 0.5 GBps on an NVIDIA
GeForce GTX 8800. These results are substantially lower than the corresponding
measurements for the C = AB kernel. The difference is in how threads in a half
warp access elements of A in the second term, a[col*TILE_DIM+i], for each
iteration i. For a half warp of threads, col represents sequential columns of the
transpose of A, and therefore col*TILE_DIM represents a strided access of global
memory with a stride of 16. This results in uncoalesced memory accesses on devices
with compute capability 1.1 or lower and plenty of wasted bandwidth on devices
with compute capability 1.2 or higher. The way to avoid strided access is to use
shared memory as before, except in this case a half warp reads a row of A into a
column of a shared memory tile, as shown in Listing 3.11.
__kernel void coalescedMultiply(__global float *a,
__global float *c,
int M,
__local float aTile[TILE_DIM][TILE_DIM],
__local float transposedTile[TILE_DIM][TILE_DIM])
{

August 31, 2009 23


NVIDIA OpenCL Best Practices Guide

int row = get_global_id(1);


int col = get_global_id(0);
float sum = 0.0f;
int x = get_local_id(0);
int y = get_local_id(1);
aTile[y][x] = a[row*TILE_DIM+x];
transposedTile[x][y] =
a[(get_group_id(0) * get_num_groups(0) + y) * TILE_DIM + x];
barrier(CLK_LOCAL_MEM_FENCE);
for (int i = 0; i < TILE_DIM; i++) {
sum += aTile[y][i] * transposedTile[i][x];
}
c[row*M+col] = sum;
}

Listing 3.11 An optimized version of Listing 3.10 using coalesced reads


from global memory

Listing 3.11 uses the shared transposedTile to avoid uncoalesced accesses in the
second term in the dot product, and the shared aTile technique from the previous
example to avoid uncoalesced accesses in the first term. The effective bandwidth of
this kernel is 24.9 GBps on an NVIDIA GeForce GTX 280 and 13.2 GBps on an
NVIDIA GeForce GTX 8800. These results are slightly lower than those obtained
by the final kernel for C = AB. The cause of the difference is shared memory bank
conflicts.
The reads of elements in transposedTile within the for loop are free of conflicts,
because threads of each half warp read across rows of the tile, resulting in unit stride
across the banks. However, bank conflicts occur when copying the tile from global
memory into shared memory. To enable the loads from global memory to be
coalesced, data are read from global memory sequentially. However, this requires
writing to shared memory in columns, and because of the use of 16x16 tiles in
shared memory, this results in a stride between threads of 16 banks. These 16-way
bank conflicts are very expensive. The simple remedy is to pad the shared memory
array so that it has an extra column, as in the following line of code.
__local float transposedTile[TILE_DIM][TILE_DIM+1];
This padding eliminates the conflicts entirely, because now the stride between
threads is 17 banks, which, due to modular arithmetic used to compute bank
indices, is equivalent to a unit stride. After this change, the effective bandwidth is
30.4 GBps on an NVIDIA GeForce GTX 280 and 15.6 GBps on an NVIDIA
GeForce GTX 8800, which is comparable to the results from the last C = AB
kernel.
The results of these optimizations are summarized in Table 3.3.
Table 3.3 Performance improvements optimizing C = AAT matrix
multiplication

Optimization NVIDIA GeForce NVIDIA GeForce


GTX 280 GTX 8800

No optimization 1.1 GBps 0.5 GBps

24 August 31, 2009


Memory Optimizations

Using shared memory to


24.8 GBps 13.2 GBps
coalesce global reads

Removing bank conflicts 30.3 GBps 15.6 GBps

These results should be compared with those in Table 3.2. As can be seen from
these tables, judicious use of shared memory can dramatically improve performance.
The examples in this section have illustrated three ways to use shared memory:
‰ To enable coalesced accesses to global memory, especially to avoid large strides
(for general matrices, strides are much larger than 16)
‰ To eliminate (or reduce) redundant loads from global memory
‰ To avoid wasted bandwidth

3.2.2.4 Shared Memory Use by Kernel Arguments


Shared memory holds the parameters or arguments that are passed to kernels at
launch. In kernels with long argument lists, it can be valuable to put some
arguments into constant memory (and reference them there) rather than consume
shared memory.

Low Priority: For kernels with long argument lists, place some arguments into
constant memory to save shared memory.

3.2.3 Local Memory


CUDA local memory is so named because its scope is local to the thread, not
because of its physical location. In fact, local memory is off-chip. Hence, access to
local memory is as expensive as access to global memory. Like global memory, local
memory is not cached. In other words, the term “local” in the name does not imply
faster access. This is unlike OpenCL __local memory, which is on-chip and fast on
CUDA devices.
Local memory is used only to hold automatic variables. This is done by the compiler
when it determines that there is insufficient register space to hold the variable.
Automatic variables that are likely to be placed in local memory are large structures
or arrays that would consume too much register space and arrays that the compiler
determines may be indexed dynamically.
Inspection of the PTX assembly code reveals whether a variable has been placed in
local memory during the first compilation phases. If it has, it will be declared using
the .local mnemonic and accessed using the ld.local and st.local
mnemonics. If it has not, subsequent compilation phases might still decide
otherwise, if they find the variable consumes too much register space for the
targeted architecture.

August 31, 2009 25


NVIDIA OpenCL Best Practices Guide

3.2.4 Texture Memory


The read-only texture memory space is cached. Therefore, an image read costs one
device memory read only on a cache miss; otherwise, it just costs one read from the
texture cache. The texture cache is optimized for 2D spatial locality, so threads of
the same warp that read texture addresses that are close together will achieve best
performance. Texture memory is also designed for streaming reads with a constant
latency; that is, a cache hit reduces DRAM bandwidth demand, but not read latency.
In certain addressing situations, reading device memory through image objects can
be an advantageous alternative to reading device memory from global or constant
memory.

3.2.4.1 Textured Fetch vs. Global Memory Read


Device memory reads through image objects present several benefits over reads
from global memory:
‰ They are cached, potentially exhibiting higher bandwidth if there is 2D locality
in the texture fetches.
‰ Textures can be used to avoid uncoalesced loads from global memory.
‰ Packed data can be unpacked into separate variables in a single operation.
‰ 8-bit and 16-bit integer input data may be optionally converted to 32-bit
floating-point values in the range [0.0, 1.0] or [-1.0, 1.0].
However, within the same kernel call, the texture cache is not kept coherent with
respect to image writes, so that any image read to an address that has been written
to via an image write in the same kernel call returns undefined data. In other words,
a thread can safely read via an image object some memory location only if this
memory location has been updated by a previous kernel call or memory copy, but
not if it has been previously updated by the same thread or another thread from the
same kernel call.

3.2.4.2 Additional Texture Capabilities


The hardware provides other capabilities that might be useful for some applications,
such as image processing. (See Table 3.4.)
Table 3.4 Useful imaging-processing features for image objects

Feature Use Caveat

Filtering Fast, low-precision Valid only for floating-point


interpolation between texels image objects

Normalized texture Resolution-independent coding


coordinates

Addressing modes Automatic handling of Can be used only with


boundary cases¹ normalized image coordinates

26 August 31, 2009


Memory Optimizations

¹The automatic handling of boundary cases in the bottom row of Table 3.4 refers to how a texture coordinate is
resolved when it falls outside the valid addressing range. There are two options: clamp and repeat. If x is the
coordinate and N is the number of texels for a one-dimensional texture, then with clamp, x is replaced by 0 if x < 0
and by 1-1/N if 1 ≤x. With repeat, x is replaced by frac(x) where frac(x) = x – floor(x). Floor returns the largest
integer less than or equal to x. So, in clamp mode where N = 1, an x of 1.3 is clamped to 1.0; whereas in repeat
mode, it is converted to 0.3

3.2.5 Constant Memory


There is a total of 64 KB constant memory on a device. The constant memory space
is cached. As a result, a read from constant memory costs one memory read from
device memory only on a cache miss; otherwise, it just costs one read from the
constant cache.
For all threads of a half warp, reading from the constant cache is as fast as reading
from a register as long as all threads read the same address. Accesses to different
addresses by threads within a half warp are serialized, so cost scales linearly with the
number of different addresses read by all threads within a half warp.

3.2.6 Registers
Generally, accessing a register consumes zero extra clock cycles per instruction, but
delays may occur due to register read-after-write dependencies and register memory
bank conflicts.
The latency of read-after-write dependencies is approximately 24 cycles, but this
latency is completely hidden on multiprocessors that have at least 192 active threads
(that is, 6 warps).
The compiler and hardware thread scheduler will schedule instructions as optimally
as possible to avoid register memory bank conflicts. They achieve the best results
when the number of threads per block is a multiple of 64. Other than following this
rule, an application has no direct control over these bank conflicts. In particular,
there is no register-related reason to pack data into float4 or int4 types.

3.2.6.1 Register Pressure


Register pressure occurs when there are not enough registers available for a given
task. Even though each multiprocessor contains either 8,192 or 16,384 32-bit
registers, these are partitioned among concurrent threads.

August 31, 2009 27


Chapter 4.
NDRange Optimizations

One of the keys to good performance is to keep the multiprocessors on the device
as busy as possible. A device in which work is poorly balanced across the
multiprocessors will deliver suboptimal performance. Hence, it’s important to
design your application to use threads and blocks in a way that maximizes hardware
utilization and to limit practices that impede the free distribution of work. A key
concept in this effort is occupancy, which is explained in the following sections.
Another important concept is the management of system resources allocated for a
particular task. How to manage this resource utilization is discussed in the final
sections of this chapter.

4.1 Occupancy
Thread instructions are executed sequentially in CUDA, and, as a result, executing
other warps when one warp is paused or stalled is the only way to hide latencies and
keep the hardware busy. Some metric related to the number of active warps on a
multiprocessor is therefore important in determining how effectively the hardware is
kept busy. This metric is occupancy.
Occupancy is the ratio of the number of active warps per multiprocessor to the
maximum number of possible active warps. (To determine the latter number, see
the oclDeviceQuery program in the NVIDIA GPU Computing SDK or refer to
Appendix A in the NVIDIA OpenCL Programming Guide.) Another way to view
occupancy is the percentage of the hardware’s ability to process warps that are
actively in use.
Higher occupancy does not always equate to higher performance—there is a point
above which additional occupancy does not improve performance. However, low
occupancy always interferes with the ability to hide memory latency, resulting in
performance degradation.

4.2 Calculating Occupancy


One of several factors that determine occupancy is register availability. Register
storage enables threads to keep local variables nearby for low-latency access.
However, the set of registers (known as the register file) is a limited commodity that
all threads resident on a multiprocessor must share. Registers are allocated to an
entire block all at once. So, if each thread block uses many registers, the number of

August 31, 2009 29


NVIDIA OpenCL Best Practices Guide

thread blocks that can be resident on a multiprocessor is reduced, thereby lowering


the occupancy of the multiprocessor.
For purposes of calculating occupancy, the following factors can be important.
Devices with compute capability 1.1 or lower have 8,192 32-bit registers per
multiprocessor. Devices with compute capability 1.2 or 1.3 have 16,384 32-bit
registers per multiprocessor. Multiprocessors with compute capability 1.1 and lower
can have a maximum of 768 simultaneous threads resident (24 warps x 32 threads
per warp). This means that in a multiprocessor with 100 percent occupancy, every
thread can use 10 registers before occupancy is reduced. For compute capability 1.2
and 1.3, the corresponding number is 16 registers per thread (16,384 / (32 warps x
32 threads per warp)).
The preceding approach of determining how register count affects occupancy does
not take into account allocation granularity because register allocation is performed
per block. For example, on a device of compute capability 1.0, a kernel with 128-
thread blocks using 12 registers per thread results in an occupancy of 83 percent
with 5 active 128-thread blocks per multiprocessor, whereas a kernel with 256-
thread blocks using the same 12 registers per thread results in an occupancy of 66
percent because only two 256-thread blocks can reside on a multiprocessor. Not
only is register allocation performed per block, but it also is rounded to the nearest
256 registers per block on devices with compute capability 1.0 and 1.1, and it’s
rounded to the nearest 512 registers on devices with compute capability 1.2 and 1.3.
Because of these nuances in register allocation and the fact that a multiprocessor’s
shared memory is also partitioned between resident thread blocks, the relation
between register usage and occupancy can be difficult to determine.
NVIDIA provides an occupancy calculator in the form of an Excel spreadsheet that
enables developers to hone in on the optimal balance and to test different possible
scenarios. This spreadsheet, shown in Figure 4.1, is
CUDA_Occupancy_calculator.xls located in the tools directory of the SDK.

Figure 4.1 Use the CUDA GPU Occupancy Calculator to project occupancy

30 August 31, 2009


NDRange Optimizations

In addition to the calculator spreadsheet, occupancy can be determined using the


OpenCL Visual profiler.

4.3 Hiding Register Dependencies

Medium Priority: To hide latency arising from register dependencies, maintain at least
25 percent occupancy on devices with compute capability 1.1 and lower, and 18.75
percent occupancy on later devices.

Register dependencies arise when an instruction uses a result stored in a register


written by an instruction before it. The latency on current CUDA-enabled GPUs is
approximately 24 cycles, so threads must wait 24 cycles before using an arithmetic
result. However, this latency can be completely hidden by the execution of threads
in other warps. To hide arithmetic latency completely, multiprocessors should be
running at least 192 threads (6 warps). This equates to 25 percent occupancy on
devices with compute capability 1.1 and lower, and 18.75 percent occupancy on
devices with compute capability 1.2 and higher.

4.4 Thread and Block Heuristics

Medium Priority: The number of threads per block should be a multiple of 32 threads,
because this provides optimal computing efficiency and facilitates coalescing.

The dimension and size of blocks per grid and the dimension and size of threads
per block are both important factors. The multidimensional aspect of these
parameters allows easier mapping of multidimensional problems to OpenCL and
does not play a role in performance. As a result, this section discusses size but not
dimension.
Latency hiding and occupancy depend on the number of active warps per
multiprocessor, which is implicitly determined by the execution parameters along
with resource (register and shared memory) constraints. Choosing execution
parameters is a matter of striking a balance between latency hiding (occupancy) and
resource utilization.
Choosing the NDRange parameters should be done in tandem; however, there are
certain heuristics that apply to each parameter individually. When choosing the
number of blocks per grid or grid size (i.e. number of work groups in OpenCL
terminology), the primary concern is keeping the entire GPU busy. The number of
blocks in a grid should be larger than the number of multiprocessors so that all
multiprocessors have at least one block to execute. Furthermore, there should be
multiple active blocks per multiprocessor so that blocks that aren’t waiting for a

August 31, 2009 31


NVIDIA OpenCL Best Practices Guide

barrier() can keep the hardware busy. This recommendation is subject to


resource availability; therefore, it should be determined in the context of the
blocksize execution parameter, as well as shared memory usage. To scale to future
devices, the number of blocks per kernel launch should be in the hundreds, as
kernels with thousands of blocks will scale across multiple future generations.
When choosing the number of threads per block, or the blocksize, it is important to
remember that multiple concurrent blocks can reside on a multiprocessor, so
occupancy is not determined by blocksize alone. In particular, a larger blocksize
does not imply a higher occupancy. For example, on a device of compute capability
1.1 or lower, a kernel with a maximum blocksize of 512 threads results in an
occupancy of 66 percent because the maximum number of threads per
multiprocessor on such a device is 768. Hence, only a single block can be active per
multiprocessor. However, a kernel with 256 threads per block on such a device can
result in 100 percent occupancy with three resident active blocks.
As mentioned in section 4.1 et seq., higher occupancy does not always equate to
better performance. For example, improving occupancy from 66 percent to
100 percent generally does not translate to a similar increase in performance. A
lower occupancy kernel will have more registers available per thread than a higher
occupancy kernel, which may result in less register spilling to local memory. In fact,
once an occupancy of 50 percent has been reached, additional increases in
occupancy do not translate into improved performance.
There are many such factors involved in selecting blocksize, and inevitably some
experimentation is required. However, a few rules of thumb should be followed:
‰ Threads per block should be a multiple of warp size to avoid wasting
computation on underpopulated warps and to facilitate coalescing.
‰ A minimum of 64 threads per block should be used, but only if there are
multiple concurrent blocks per multiprocessor.
‰ Between 128 and 256 threads per block is a better choice and a good initial
range for experimentation with different block sizes.
Note that when a thread block allocates more than the available registers on a
multiprocessor, the kernel invocation fails, as it will when too much shared memory
or too many threads are requested.

4.5 Effects of Shared Memory


Shared memory can be helpful in several situations, such as helping to coalesce or
eliminate redundant access to global memory. However, it also can act as a
constraint on occupancy. In many cases, the amount of shared memory used in a
kernel is related to the block size, but the mapping of threads to shared memory
elements does not need to be one-to-one. For example, it may be desirable to use a
32x32 element shared memory array in a kernel, but because the maximum number
of threads per block is 512, it is not possible to launch a kernel with 32x32 threads
per block. In such cases, kernels with 32x16 or 32x8 threads can be launched with
each thread processing two or four elements, respectively, of the shared memory
array. The approach of using a thread to process multiple elements of a shared

32 August 31, 2009


NDRange Optimizations

memory array can be beneficial even if limits such as threads per block are not an
issue. This is because some common operations can be performed by a thread once
and the cost amortized over the number of shared memory elements processed by a
thread.
A useful technique to determine the sensitivity of performance to occupancy is
through experimentation with the amount of dynamically allocated shared memory.
In OpenCL, the size of any __local pointer argument is specified outside the kernel
using clSetKernelArg(). By simply increasing this amount, it is possible to effectively
reduce the occupancy of the kernel and measure its effect on performance.
As mentioned in the previous section, once an occupancy of more than 50 percent
has been reached, it generally does not pay to optimize parameters to obtain higher
occupancy ratios. The previous technique can be used to determine whether such a
plateau has been reached.

August 31, 2009 33


NVIDIA OpenCL Best Practices Guide

34 August 31, 2009


Chapter 5.
Instruction Optimizations

Awareness of how instructions are executed often permits low-level optimizations


that can be useful, especially in code that is run frequently (the so-called hot spot in
a program). Best practices suggest that this optimization be performed after all
higher-level optimizations have been completed.
In this chapter, throughputs are given in number of operations per clock cycle per
multiprocessor. For a warp size of 32, an instruction consists of 32 operations.
Therefore, if T is the number of operations per clock cycle, the instruction
throughput is one instruction every 32/T clock cycles. All throughputs are for one
multiprocessor. They must be multiplied by the number of multiprocessors in the
device to get throughput for the whole device.
The use of -cl-mad-enable build option, which lets the compiler group add and mul
instructions into a single FMAD instruction whenever possible, is recommended as
it can lead to large performance gains. FMAD truncates the intermediate result of
the multiplication however.
The use of -cl-fast-relaxed-math build option enables many aggressive compiler
optimizations and should be considered as well.
The CUDA architecture is a scalar architecture. Therefore, there is no performance
benefit from using vector types and instructions. These should only be used for
convenience. It is also in general better to have more work-items than fewer using
large vectors.

5.1 Arithmetic Instructions


Single-precision floats provide the best performance and their use is highly
encouraged.
The throughput of single-precision floating-point add, multiply, and multiply-add is
8 operations per clock cycle.
The throughput of single-precision reciprocal, reciprocal square root, and
native_logf(x)are 2 operations per clock cycle. (Refer to Appendix B of the
NVIDIA OpenCL Programming Guide.)
The throughput of 32-bit integer multiplication is 2 operations per clock cycle, but
mul24 (refer to Appendix B of the NVIDIA OpenCL Programming Guide) provides
signed and unsigned 24-bit integer multiplication with a throughput of 8 operations
per clock cycle. On future architectures, however, mul24 will be slower than 32-bit
integer multiplication, so you should provide two kernels, one using mul24 and the

August 31, 2009 35


NVIDIA OpenCL Best Practices Guide

other using generic 32-bit integer multiplication, to be called appropriately by the


application.

5.1.1 Division and Modulo Operations


Low Priority: Use shift operations to avoid expensive division and modulo calculations.

Integer division and modulo operations are particularly costly and should be avoided
or replaced with bitwise operations whenever possible: If n is a power of 2, (i/n) is
equivalent to (i log2(n)) and (i % n) is equivalent to (i & (n-1)).
The compiler will perform these conversions if n is literal. (For further information,
refer to Chapter 3 of the NVIDIA OpenCL Programming Guide).

5.1.2 Reciprocal Square Root


The reciprocal square root should always be invoked explicitly as rsqrt(). The
compiler optimizes 1.0f/sqrt(x) into rsqrt() only when this does not violate
IEEE-754 semantics.

5.1.3 Other Arithmetic Instructions


Low Priority: Avoid automatic conversion of doubles to floats.

The compiler must on occasion insert conversion instructions, introducing


additional execution cycles. This is the case for
‰ Functions operating on char or short whose operands generally need to be
converted to an int
‰ Double-precision floating-point constants (defined without any type suffix)
used as input to single-precision floating-point computations
The latter case can be avoided by using single-precision floating-point constants,
defined with an f suffix such as 3.141592653589793f, 1.0f, 0.5f. This
specification has accuracy implications in addition to its ramifications on
performance. The effects on accuracy are discussed in Chapter 7.
For single-precision code, use of the float type and the single-precision math
functions are highly recommended. When compiling for devices without native
double-precision support, such as devices of compute capability 1.2 and earlier, each
double variable is converted to single-precision floating-point format (but retains its
size of 64 bits) and double-precision arithmetic is demoted to single-precision
arithmetic.
It should also be noted that the math library function for complementary error
function, erfc(), is particularly fast with full single-precision accuracy.

36 August 31, 2009


Instruction Optimizations

5.1.4 Math Libraries


Medium Priority: Use the native math library whenever speed trumps precision.

Two types of runtime math operations are supported. They are


native_functionName() and functionName(). Functions using
native_functionName() map directly to the hardware level. They are faster but
provide somewhat lower accuracy. (Examples: native_sin(x), native_exp(x),
and so forth.) Functions using functionName() are slower but have higher
accuracy. (Examples: sin(x), exp(x), and so forth.) The throughput of
native_sin(x), native_cos(x), native_exp(x) is 1 operation per clock cycle,
while sin(x), cos(x), tan(x) are much more expensive and become even more so
(about an order of magnitude slower) if the absolute value of x needs to be reduced.
Moreover, in such cases, the argument-reduction code uses local memory, which
can affect performance even more because of the high latency of local memory.
More details are available in the NVIDIA OpenCL Programming Guide.

5.2 Memory Instructions

High Priority: Minimize the use of global memory. Prefer shared memory access
where possible.

Memory instructions include any instruction that reads from or writes to shared,
local, or global memory. The throughput of memory optimizations is 8 operations
per clock cycle. When accessing local or global memory, there are, in addition, 400
to 600 clock cycles of memory latency.
As an example, the throughput for the assignment operator in the following sample
code
__local float shared[32];
__global float* device;
shared[threadIdx.x] = device[threadIdx.x];
is 8 operations per clock cycle to issue a read from global memory, 8 operations per
clock cycle to issue a write to shared memory, but, crucially, there is a latency of 400
to 600 clock cycles to read data from global memory.
Much of this global memory latency can be hidden by the thread scheduler if there
are sufficient independent arithmetic instructions that can be issued while waiting
for the global memory access to complete. However, it is best to avoid accessing
global memory whenever possible.

August 31, 2009 37


Chapter 6.
Control Flow

6.1 Branching and Divergence

High Priority: Avoid different execution paths within the same warp.

Any flow control instruction (if, switch, do, for, while) can significantly affect
the instruction throughput by causing threads of the same warp to diverge; that is,
to follow different execution paths. If this happens, the different execution paths
must be serialized, increasing the total number of instructions executed for this
warp. When all the different execution paths have completed, the threads converge
back to the same execution path.
To obtain best performance in cases where the control flow depends on the thread
ID, the controlling condition should be written so as to minimize the number of
divergent warps.
This is possible because the distribution of the warps across the block is
deterministic as mentioned in section 2.1.1 of the NVIDIA OpenCL Programming
Guide. A trivial example is when the controlling condition depends only on
(threadIdx / WSIZE) where WSIZE is the warp size.
In this case, no warp diverges because the controlling condition is perfectly aligned
with the warps.

6.2 Branch Predication

Low Priority: Make it easy for the compiler to use branch predication in lieu of loops
or control statements.

Sometimes, the compiler may unroll loops or optimize out if or switch statements
by using branch predication instead. In these cases, no warp can ever diverge. The
programmer can also control loop unrolling using
#pragma unroll

For more information on this pragma, refer to the NVIDIA OpenCL Programming
Guide.
When using branch predication, none of the instructions whose execution depends
on the controlling condition is skipped. Instead, each such instruction is associated

August 31, 2009 39


NVIDIA OpenCL Best Practices Guide

with a per-thread condition code or predicate that is set to true or false according to
the controlling condition. Although each of these instructions is scheduled for
execution, only the instructions with a true predicate are actually executed.
Instructions with a false predicate do not write results, and they also do not evaluate
addresses or read operands.
The compiler replaces a branch instruction with predicated instructions only if the
number of instructions controlled by the branch condition is less than or equal to a
certain threshold: If the compiler determines that the condition is likely to produce
many divergent warps, this threshold is 7; otherwise it is 4.

40 August 31, 2009


Appendix A.
Recommendations and Best Practices

This appendix contains a list of all the recommendations for optimization and the
list of best practices that are explained in this document.

A.1 Overall Performance Optimization Strategies


Performance optimization revolves around three basic strategies:
‰ Maximizing parallel execution
‰ Optimizing memory usage to achieve maximum memory bandwidth
‰ Optimizing instruction usage to achieve maximum instruction throughput
Maximizing parallel execution starts with structuring the algorithm in a way that
exposes as much data parallelism as possible. Once the parallelism of the algorithm
has been exposed, it needs to be mapped to the hardware as efficiently as possible.
This is done by carefully choosing the NDRange of each kernel invocation. The
application should also maximize parallel execution at a higher level by explicitly
exposing concurrent execution on the device through streams, as well as maximizing
concurrent execution between host and device.
Optimizing memory usage starts with minimizing data transfers between the host
and the device because those transfers have much lower bandwidth than internal
device data transfers. Kernel access to global memory also should be minimized by
maximizing the use of shared memory on the device. Sometimes, the best
optimization might even be to avoid any data transfer in the first place by simply
recomputing the data whenever it is needed.
The effective bandwidth can vary by an order of magnitude depending on the access
pattern for each type of memory. The next step in optimizing memory usage is
therefore to organize memory accesses according to the optimal memory access
patterns. This optimization is especially important for global memory accesses,
because latency of access costs hundreds of clock cycles. Shared memory accesses,
in counterpoint, are usually worth optimizing only when there exists a high degree
of bank conflicts.
As for optimizing instruction usage, the use of arithmetic instructions that have low
throughput should be avoided. This suggests trading precision for speed when it
does not affect the end result, such as using intrinsics instead of regular functions or
single precision instead of double precision. Finally, particular attention must be
paid to control flow instructions due to the SIMT (single instruction multiple
thread) nature of the device.

August 31, 2009 41


NVIDIA OpenCL Best Practices Guide

A.2 High-Priority Recommendations


‰ To get the maximum benefit from OpenCL, focus first on finding ways to
parallelize sequential code. (Section 1.1.3)
‰ Use the effective bandwidth of your computation as a metric when measuring
performance and optimization benefits. (Section 2.2)
‰ Minimize data transfer between the host and the device, even if it means
running some kernels on the device that do not show performance gains when
compared with running them on the host CPU. (Section 3.1)
‰ Ensure global memory accesses are coalesced whenever possible. (Section 3.2.1)
‰ Minimize the use of global memory. Prefer shared memory access where
possible. (Section 5.2)
‰ Avoid different execution paths within the same warp. (Section 6.1)
‰ Use the -cl-mad-enable build option. (Chapter 5)

A.3 Medium-Priority Recommendations


‰ Accesses to shared memory should be designed to avoid serializing requests due
to bank conflicts. (Section 3.2.2.1)
‰ Use shared memory to avoid redundant transfers from global memory. (Section
3.2.2.2)
‰ To hide latency arising from register dependencies, maintain at least 25 percent
occupancy on devices with compute capability 1.1 and lower, and 18.75 percent
occupancy on later devices. (Section 4.3)
‰ The number of threads per block should be a multiple of 32 threads, because
this provides optimal computing efficiency and facilitates coalescing. (Section
4.4)
‰ Use the native math library whenever speed trumps precision. (Section 5.1.4)

A.4 Low-Priority Recommendations


‰ For kernels with long argument lists, place some arguments into constant
memory to save shared memory. (Section 3.2.2.4)
‰ Use shift operations to avoid expensive division and modulo calculations.
(Section 5.1.1)
‰ Avoid automatic conversion of doubles to floats. (Section 5.1.3)
‰ Make it easy for the compiler to use branch predication in lieu of loops or
control statements. (Section 6.2)

42 August 31, 2009


Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS,
LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING
PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR
OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED
WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR
PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no
responsibility for the consequences of use of such information or for any infringement of patents or other rights
of third parties that may result from its use. No license is granted by implication or otherwise under any patent
or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change
without notice. This publication supersedes and replaces all information previously supplied. NVIDIA
Corporation products are not authorized for use as critical components in life support devices or systems
without express written approval of NVIDIA Corporation.

Trademarks
NVIDIA, the NVIDIA logo, CUDA, GeForce, NVIDIA Quadro, and Tesla are trademarks or registered
trademarks of NVIDIA Corporation. Other company and product names may be trademarks of the respective
companies with which they are associated.

Copyright
© 2009 NVIDIA Corporation. All rights reserved.

You might also like