KEMBAR78
Understanding PTX, the Assembly Language of CUDA GPU Computing | NVIDIA Technical Blog
Developer Tools & Techniques

Understanding PTX, the Assembly Language of CUDA GPU Computing

Parallel thread execution (PTX) is a virtual machine instruction set architecture that has been part of CUDA from its beginning. You can think of PTX as the assembly language of the NVIDIA CUDA GPU computing platform. 

In this post, we’ll explain what that means, what PTX is for, and what you need to know about it to make the most of CUDA for your applications. We’ll start by walking through how CUDA generates, stores, and loads the code which ultimately runs on the GPU. Then we’ll show how PTX enables forward compatibility and how it can be used to allow domain-specific and other programming languages to target CUDA.

Instruction set architecture

An instruction set architecture (ISA) is the specification for what instructions a processor can execute, their format, the behavior of those instructions, and their binary encodings. Every processor has an ISA. For example, x86_64 is a CPU ISA. ARM64 is another. A GPU has an ISA as well. For NVIDIA GPUs, the ISA can be different for GPUs of different generations or even different product lines within a generation.

A virtual machine ISA is a specification for a set of supported instructions, formats, and behaviors for a virtual processor. That is, it is simply the ISA for an abstract processor, not one that is actually produced. Virtual machine ISAs might not specify a binary encoding for instructions, since that is only needed for running on a physical processor.

The role of PTX in the CUDA platform

To illustrate how PTX fits into the CUDA platform, the following example shows how a simple CUDA file is compiled. This source file contains a single kernel, the classic example that adds two vectors in parallel, as well as a skeleton of the application main function and a helper function.

__global__ void vecAdd(float* a, float* b, float* c, int n)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
    {
        c[index] = a[index] + b[index];
    }
}


void vecAddLauncher(float* a, float* b, float* c, int n)
{
   … // Utility function for launching the kernel
}


int main()
{
 … // Main function of the application
}

When this file is compiled with the NVIDIA CUDA compiler, NVCC, the source is split into code for the GPU and code for the CPU. The GPU code is sent to the GPU compiler and the CPU code is sent to the host compiler. The host compiler is not part of NVCC. NVCC calls into a host compiler that is either passed in on the command line or is the default compiler on the system.

This figure describes the high-level compiler flow for a program. On the left, the host CPU code is compiled by the host compiler and put into the executable. On the right, the GPU code is compiled by NVCC into PTX, and optionally into CUBIN, and then also placed into the executable.
Figure 1. High-level compiler flow for the code example from source to executable

For functions and kernels that will run on the GPU, the GPU compiler generates the assembly language for the CUDA platform: PTX. This assembly (PTX) is then run through an assembler, called ptxas, which generates executable binary code for the GPU. The GPU binary is called a cubin, which is short for CUDA binary. 

Compiling GPU code consists of two stages: first, the high-level language code (C++) is compiled into PTX. Then, the PTX is compiled into a cubin. Invocation of ptxas is done automatically by NVCC when binary output is requested.

This compilation path is similar to how popular compilers like clang operate. clang first compiles code to a virtual machine ISA called LLVM IR. LLVM stands for Low Level Virtual Machine, IR stands for Intermediate Representation. A second stage or backend compiler, called LLVM, then compiles the virtual machine representation, LLVM IR,  into executable code for a specific processor. One advantage of this structure is that the LLVM IR for a program can be compiled to a binary for any hardware architecture supported by the LLVM backend compiler.

PTX is similar to LLVM IR in that the PTX representation of a program can be compiled to a wide range of NVIDIA GPUs. Importantly, this compilation of PTX for a specific GPU can happen just-in-time (JIT) at application runtime. As shown in Figure 1, the executable for an application can embed both GPU binaries (cubins) and PTX code. Embedding the PTX in the executable enables CUDA to JIT compile the PTX to the appropriate cubin at application runtime. The JIT compiler for PTX is part of the NVIDIA GPU driver

Embedding PTX in the application enables running the first stage of compilation—high-level language to PTX—when the application is compiled. The second stage of compilation—PTX to cubin—can be delayed until application runtime. As illustrated below, doing this allows the application to run on a wider range of GPUs, including GPUs released well after the application was built. 

Below is the PTX code for the vecAdd kernel from the example above. Those who have seen assembly language for any platform should find the syntax and formatting of PTX familiar. It is not necessary to understand the details of the code. Rather, it is provided to give a glimpse into PTX and clarify further what PTX is: the assembly language of the CUDA platform.

.visible .entry _Z6vecAddPfS_S_j(
.param .u64 _Z6vecAddPfS_S_j_param_0,
.param .u64 _Z6vecAddPfS_S_j_param_1,
.param .u64 _Z6vecAddPfS_S_j_param_2,
.param .u32 _Z6vecAddPfS_S_j_param_3
)
{
.reg .pred %p<2>;
.reg .f32 %f<4>;
.reg .b32 %r<6>;
.reg .b64 %rd<11>;


ld.param.u64 %rd1, [_Z6vecAddPfS_S_j_param_0];
ld.param.u64 %rd2, [_Z6vecAddPfS_S_j_param_1];
ld.param.u64 %rd3, [_Z6vecAddPfS_S_j_param_2];
ld.param.u32 %r2, [_Z6vecAddPfS_S_j_param_3];
mov.u32 %r3, %tid.x;
mov.u32 %r4, %ntid.x;
mov.u32 %r5, %ctaid.x;
mad.lo.s32 %r1, %r5, %r4, %r3;
setp.ge.u32 %p1, %r1, %r2;
@%p1 bra $L__BB0_2;


cvta.to.global.u64 %rd4, %rd1;
mul.wide.u32 %rd5, %r1, 4;
add.s64 %rd6, %rd4, %rd5;
cvta.to.global.u64 %rd7, %rd2;
add.s64 %rd8, %rd7, %rd5;
ld.global.f32 %f1, [%rd8];
ld.global.f32 %f2, [%rd6];
add.f32 %f3, %f2, %f1;
cvta.to.global.u64 %rd9, %rd3;
add.s64 %rd10, %rd9, %rd5;
st.global.f32 [%rd10], %f3;


$L__BB0_2:
ret;
}

Compute capability and NVIDIA GPU hardware ISAs

All NVIDIA GPUs have a version identifier known as the compute capability, or CC number. Each compute capability has a major and a minor version number. For example, compute capability 8.6 has a major version of 8 and a minor version of 6. 

Like any processor, NVIDIA GPUs have a specific ISA. GPUs from different generations have different ISAs. These ISAs are identified by a version number which corresponds to the GPU’s compute capability. When a binary (cubin) is compiled, it is compiled for a specific compute capability. 

For example, GeForce and RTX GPUs from the NVIDIA Ampere generation have a compute capability of 8.6 and their cubin version is sm_86. All cubin versions have the format sm_XY where X and Y correspond to the major and minor numbers of a compute capability.

NVIDIA GPUs of different generations and even different products within a generation can have different ISAs. This is part of the reason for having PTX.

PTX: A versioned, virtual machine assembly language

PTX is a virtual machine ISA. As mentioned previously, a virtual machine ISA is a set of instructions supported by a hypothetical processor, not any particular real processor. Because virtual machine ISAs are slightly more abstract than real hardware ISAs, the CUDA assembler ptxas behaves more like a compiler than a traditional assembler. It compiles PTX programs to GPU binaries, or cubins. 

GPU features have grown since CUDA was first introduced. With new GPU hardware generations, new features are added to GPUs. The virtual machine that PTX describes has also been expanded to match. The changes made to the PTX specification usually involve adding new instructions.

As a result, there are different versions of PTX that support different instruction sets. The PTX version number indicates what instructions are available in the virtual architecture. Like cubin versions, these version numbers correspond to a GPU compute capability. 

For example, the NVIDIA Ampere generation GA100 GPU has a compute capability of 8.0. The version of PTX called compute_80 has all instructions supported by GA100. PTX versions are called compute_XY where X and Y correspond to the major and minor numbers of a compute capability.

GPU code compatibility

CUDA provides two mechanisms for code compatibility between different GPUs: binary compatibility and PTX JIT compatibility. 

Binary compatibility

NVIDIA GPUs are binary compatible within a major compute capability version, as long as the minor version is the same or higher. This means a cubin compiled for sm_86 can be loaded on any sm_8x where x is greater than or equal to 6.

For example, a cubin compiled for sm_86 (for example, the NVIDIA RTX A4000) can also be loaded and run on sm_89 (for example, the NVIDIA RTX 4000 Ada Generation). It will not load, however, on a device of compute capability 8.0, because the minor version of the compute capability of that GPU is lower than the minor version of the cubin.

NVIDIA GPUs are not binary compatible across major compute capability versions. A cubin compiled for sm_86 will not load and run on a compute capability 9.0 (NVIDIA Hopper architecture) or later GPU.

PTX JIT compatibility

Embedding PTX in an executable provides a mechanism for compatibility across GPUs of different compute capabilities, including different major versions, within a single binary file. As illustrated in the executable in Figure 1, both PTX and cubin can be stored in the final application executable. PTX and cubin can also be stored in libraries. 

When the PTX code is stored in an application or library binary, it can be JIT compiled for the GPU it is being loaded on. For example, if the application or library contains PTX targeting compute_70, that PTX can be JIT compiled for any GPU of compute capability 7.0 or higher, including compute capability 8.x, 9.x, 10.x, and 12.x. 

PTX cannot be JIT compiled for compute capabilities lower than the PTX version. For example,  PTX targeting compute_70 cannot be JIT compiled for a compute capability 5.x or 6.x GPU. 

Fatbins

When CUDA applications or libraries are built, they have a container called a fatbin. The fatbin can contain multiple cubins and PTX versions of the GPU code. For example, the fatbin in the executable shown in Figure 2 contains PTX for compute_70 as well as cubins for sm_70, sm_80, and sm_86. This means that the application already has binary code for compute capability 7.0, 8.0, and 8.6 GPUs. The sm_86 cubin can also be loaded if the application is run on a compute capability 8.9 GPU. 

This is an image showing an executable fatbin for GPUs, which includes the CPU binary code, the PTX for compute_70, and the cubins for SM_70, SM_80, and SM_86.
Figure 2. An executable that has prebuilt binary code for multiple different GPUs, as well as PTX

The compute_70 PTX can be used to JIT compile for any GPU of compute capability 7.0 or higher, so this application can be run on GPUs newer than the targets for which cubins are available. For example, this application can run on a GPU of compute capability 9.0, 10.0, or 12.0 without the need to rebuild the application. Table 1 shows how each of the embedded cubins and PTX enable compatibility in this specific example.

CC 7.0CC7.5CC8.0CC8.6CC8.9CC 9.0CC 10.0CC 12.0Future CCs
PTX compute_70✔️✔️✔️✔️✔️✔️✔️✔️✔️
cubin sm_70✔️✔️
cubin sm_80✔️✔️✔️
cubin sm_86✔️✔️
Table 1. Compute capabilities that can run each part of the fatbin shown in Figure 2

The PTX is compiled by the driver when the application starts or when code is first used on the GPU. For details about controlling when the JIT compilation occurs, see the section on Lazy Loading in the CUDA Programming Guide.

Benefits of PTX

Using PTX as an intermediate code format, the CUDA platform enables developers to build application binaries that will run on GPUs that have not yet been created. An application compiled for the NVIDIA Turing architecture (CC 7.5) in 2018 can run on an NVIDIA Blackwell (CC 12.0) GPU in 2025, and on GPUs released in the future. 

By embedding the PTX representation of GPU code in executables or libraries, the CUDA driver can JIT compile the PTX code at runtime for an architecture that wasn’t even conceived when the application was compiled. For developers that distribute binary versions of their application or libraries, this allows the application or library to run on future GPU architectures without the need to update the binary.

By serving as the assembly language for the CUDA GPU computing platform, PTX also provides a representation that can be targeted by compilers of any language. For example, a domain-specific language (DSL) compiler can generate PTX code, which can then run on NVIDIA GPUs. OpenAI Triton is an example of a DSL that generates PTX.

Developers interested in making DSLs should explore NVVM IR and libNVVM, as these will likely be preferable to implementing bespoke PTX generation.

Handwriting PTX

The PTX virtual machine ISA is documented by NVIDIA. It is possible to handwrite PTX code. Like other assembly languages, this is often not a good choice for large software projects. Higher level languages provide better developer productivity than directly addressing assembly or virtual assembly languages. 

Writing PTX by hand should be considered similar to writing CPU assembly code by hand: it allows experts with insights into how the processor will execute the code to exert fine-grained control of the instructions in the final executable. While it is possible to find performance improvements in doing this, it is not usually necessary or advisable for most developers. 

That said, some developers do choose to pursue optimizing code by directly writing PTX for inner loop code that will be run billions of times or more. On such code, even small performance gains are multiplied by the large trip count, making the effort of careful and manual optimization worthwhile. We plan to provide some examples of this type of optimization in a future post.

libcu++, which is included in the CUDA Toolkit, provides a cuda::ptx namespace that provides functions that map directly to PTX instructions. This makes it easy to use specific PTX instructions within a C++ application. For more information on the cuda::ptx namespace, see the libcu++ documentation. Additionally, NVIDIA provides documentation on directly inlining PTX within C++ code

Summary

PTX is a virtual machine ISA that can be thought of as the assembly language for the CUDA GPU computing platform. PTX is an essential part of the CUDA GPU computing platform. Higher level languages compile to PTX, and then PTX compiles to binary code at compile time or at runtime.

By embedding PTX code in their binaries, applications and libraries can achieve cross-generation compatibility within a single binary. Additionally, compilers for other programming or domain-specific languages can compile to PTX and then use ptxas or JIT compilation to generate binary code capable of running on NVIDIA GPUs.

As a developer, you will maximize the compatibility of your GPU code by including PTX in your application or library.

Acknowledgments

Thanks to Rob Armstrong and Jake Hemstad for contributing to this post.

Discuss (1)

Tags