Releases: NVIDIA/cudnn-frontend
v1.15.0-release
cudnn frontend v1.15 release notes
cudnn frontend v1.15 is the preferred cudnn frontend version for cuDNN version 9.13.1 and above.
New API
- Introduced a new `cudnn.Graph` API that enables interoperability between `torch` tensors and the cudnn frontend API. Sample code for performing a matmul with bias addition:
```python
import cudnn
import torch

handle = cudnn.create_handle()

B, M, N, K = 16, 128, 128, 512
a_gpu = torch.randn(B, M, K, device="cuda", dtype=torch.bfloat16)
b_gpu = torch.randn(B, K, N, device="cuda", dtype=torch.bfloat16)
d_gpu = torch.randn(1, M, N, device="cuda", dtype=torch.bfloat16)

with cudnn.Graph(
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
    inputs=["mm::A", "mm::B", "bias::bias"],
    outputs=["bias::OUT_0"],
) as graph:
    AB = graph.matmul(name="mm", A=a_gpu, B=b_gpu)
    C = graph.bias(name="bias", input=AB, bias=d_gpu)
    C.set_output(True)

# Positional arguments follow the order declared in `inputs` above.
c_gpu = graph(a_gpu, b_gpu, d_gpu, handle=handle)
```
All notebooks under samples/python have been updated to showcase the flexibility of this API.
- cudnn frontend now supports building editable pip wheels in place.
- The cudnn frontend `Graph` now includes a `warmup` method that triggers kernel loading by performing a fake graph capture. This improves startup time for the first kernel launch in the actual run and prevents deadlocks when used with other modules (e.g., NCCL).
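A minimal usage sketch, assuming the `Graph` from the sample above; the release notes do not spell out the signature, so the argument-free call is an assumption:

```python
# Hypothetical sketch: trigger kernel loading ahead of the hot path.
# Assumes `graph`, `a_gpu`, `b_gpu`, `d_gpu`, and `handle` from the sample
# above, and that warmup() takes no required arguments (an assumption).
graph.warmup()
c_gpu = graph(a_gpu, b_gpu, d_gpu, handle=handle)  # first real launch avoids kernel-load cost
```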
Improvements
SDPA
- Introduced `set_score_max` and `set_score_sum_exp` to allow the kernel to output the max attention score and the sum of exponents (see the note after this list).
- Updated support surface checks. (SDPA bprop does not support the combination of `s_q == 1` and `s_kv == 1`.)
- SDPA bprop now automatically applies a padding mask if the sequence length is not a multiple of the tile size.
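For context, the max attention score and sum of exponents are presumably the standard numerically stable softmax statistics: for a row of attention scores $s_1, \dots, s_n$,

$$m = \max_i s_i, \qquad \ell = \sum_i e^{s_i - m},$$

from which $\mathrm{softmax}(s)_i = e^{s_i - m} / \ell$ can be reconstructed without overflow.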
Matmul
- Added support for `COMPLEX_FP32` and `COMPLEX_FP64` datatypes. (Requires cuDNN v9.14.0 or later.)
Normalizations
- Updated samples to prioritize `fe::HeurMode_t::A` over `fe::HeurMode_t::FALLBACK`.
Others
- Added support for a new parameter to enable negative scales in the Block Scale DeQuantize operation.
- Improved logging to clearly illustrate the different stages of graph creation.
- The `swish` function now accepts a `swish_beta` parameter.
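For reference, `swish_beta` is the usual $\beta$ parameter of the swish activation,

$$\mathrm{swish}_\beta(x) = x \cdot \sigma(\beta x) = \frac{x}{1 + e^{-\beta x}},$$

with $\beta = 1$ recovering standard swish (SiLU).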
Samples
- Added samples demonstrating how to perform sink attention forward and backward propagation with the C++ API. (Requires cuDNN v9.13.0 or later.)
- Added samples demonstrating "Block Scale Matmul Quantize". (Requires cuDNN v9.14.0 or later.)
- Added a sample demonstrating how ragged (packed) tensors work with cuDNN SDPA (test_sdpa_with_caching.py). The sample also demonstrates simple caching and graph capture techniques that can improve execution time.
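The caching in that sample follows the usual build-once-per-shape pattern. A generic sketch of the idea (illustrative only, not the sample's exact code):

```python
# Illustrative build-once cache: key graphs by problem shape so the expensive
# graph build / plan step runs once per unique shape; later calls just execute.
_graph_cache = {}

def get_or_build_graph(shape_key, build_fn):
    if shape_key not in _graph_cache:
        _graph_cache[shape_key] = build_fn()  # build + finalize plans once
    return _graph_cache[shape_key]            # cheap lookup on reuse
```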
Bug Fixes
- Fixed an issue where the SDPA node was accessing tensor dimensions before they were inferred, leading to a crash.
Benchmarks
- Updated results with cuDNN 9.13.1 for B200 and GB300.
v1.14.1-release
📢 cuDNN Frontend v1.14.1 — Release Notes
🚀 Improvements
🔹 SDPA
- Improved support checks for Hopper backward propagation (bprop) to fix a bug (introduced in `9.11` and fixed in `9.13`) affecting certain large head-dimension combinations of d_qk and d_v.
- Added support for a sink parameter for the SDPA operation. Refer to the new samples `fp16_fwd_with_sink_token.cpp` and `fp16_bwd_with_sink_token.cpp`.
🔹 Pointwise
- Added support for a configurable `beta` parameter for swish.
✅ Recommended Action: Upgrade to cuDNN Frontend v1.14.1 for full compatibility with cuDNN 9.13.0+, improved SDPA support, additional normalization support, and deviceless graph compilation features.
v1.14.0-release
📢 cuDNN Frontend v1.14 — Release Notes
Preferred version for: cuDNN 9.12.0 and above
Minimum Python version: 3.9 (previously 3.8, now obsolete)
Updated pip wheels: Available for Python 3.13
🚀 Improvements
🔹 SDPA
- Introduced a unified SDPA node → reduces graph creation latency and simplifies SDPA graph creation. (No API changes required from users.)
- Improved support checks for SDPA FP8 forward propagation (fprop).
- Improved support checks for Hopper backward propagation (bprop) to fix a bug (introduced in `9.11`) affecting certain large head-dimension combinations of d_qk and d_v.
- Added new SDPA samples with:
  - Paged prefill
  - Ragged Q tensor decode
🔹 Normalizations
- Added support for fused LayerNorm with ReLU.
- Included sample: LayerNorm with ReLU bitmask dump.
🔹 Matmul
- Added Python sample for low-precision FP8/FP4 matrix multiplications.
🔹 Other Updates
- Added Python bindings for deviceless graph compilation. → Sample: `test_deviceless_aot_compilation.py`
- Addressed GitHub issue #151.
✅ Recommended Action: Upgrade to cuDNN Frontend v1.14.0 for full compatibility with cuDNN 9.12.0+, improved SDPA support, additional normalization support, and deviceless graph compilation features.
v1.13.0-release
cudnn frontend v1.13 release notes
cudnn frontend v1.13 is the preferred cudnn frontend version for cudnn version 9.11.0 and above.
New API
Introduces the device descriptor, which allows deviceless compilation of a cudnn graph for a target GPU. See the newly added sample and documentation.
Improvements
SDPA
- Introduced `generate_stats` as a replacement for `is_inference`, to improve clarity. When `generate_stats` is true, the output will contain the stats tensor. When migrating from `is_inference` (which is now deprecated), note that `generate_stats` has the opposite meaning, so pass it the negation of the bool that was passed to `is_inference` (see the sketch after this list).
- Improved support checks for left and right diagonal bands in conjunction with the diagonal alignment.
- Improved error handling for large head dimension (d > 128) in SDPA bprop.
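A minimal migration sketch, assuming the Python `sdpa` call (graph setup and the `q`, `k`, `v` tensors are elided):

```python
# Migration sketch: generate_stats is the logical negation of is_inference.
is_inference = False  # old flag: training, so stats were produced

# Before (deprecated):
#   o, stats = graph.sdpa(name="sdpa", q=q, k=k, v=v, is_inference=is_inference)
# After: pass the negation of the old flag.
o, stats = graph.sdpa(name="sdpa", q=q, k=k, v=v, generate_stats=not is_inference)
```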
Normalizations
- Added support for fused LayerNorm with ReLU, and samples for LayerNorm with ReLU bitmask dump.
Others
- Published improved SDPA training benchmarks for fp8 and fp16/bf16 graph patterns.
- Enabled int4 weight-only quantization for matmul. See example.
- Allow block scale dequantize (required for low-precision matmul) to take a 2-D scale factor.
- Allow reductions to accept deterministic as an attribute.
- Added pybinds for block scale dequantize.
Bug Fixes
- Fixed the sliding window `attn_score_modifier` function, allowing it to set true negative infinity.
v1.12.1 release
This release builds on top of the 1.12.0 release.
Bug fix
- Fixes an issue where d=256 was marked as not supported on Hopper.
Minor Enhancements
- Addressed several comments from code review.
- Improved the cmake workflow. See PR 125
Benchmark Results
- Published results of using the cuDNN backend for the default `torch.sdpa` op in comparison to other backends. See Llama-3.2-1B-Training for reference.
- Published comparison results of sdpa() against other backends. See sdpa_benchmark_bf16_training.
v1.12.0 release
cudnn frontend v1.12 release notes
cudnn frontend v1.12 is the preferred cudnn frontend version for cudnn version 9.9.0 and above.
cudnn_frontend v1.12 is the minimum cudnn frontend version required to work with cuda 13.0 and above.
Updated the dlpack version, and the cmake minimum required version is now 3.18.
New API
- Allows compilation and loading of cudnn frontend with cudnn-jit packages.
- Introduces the Adaptive Layernorm (fprop and bprop) operation in cudnn:
```cpp
std::array<std::shared_ptr<Tensor_attributes>, 3>
adalayernorm(std::shared_ptr<Tensor_attributes>& input,
             std::shared_ptr<Tensor_attributes>& scale,
             std::shared_ptr<Tensor_attributes>& bias,
             AdaLayernorm_attributes attributes);

std::array<std::shared_ptr<Tensor_attributes>, 3>
adalayernorm_backward(std::shared_ptr<Tensor_attributes> dy,
                      std::shared_ptr<Tensor_attributes> x,
                      std::shared_ptr<Tensor_attributes> scale,
                      AdaLayernorm_backward_attributes options);
```
Please refer to samples for usage.
- The cudnn frontend python API introduces two decorator functions, `cudnn.jit` and `cudnn.graph`, for simpler graph creation in python. Refer to the matmul sample for usage; a rough sketch is shown below.
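A rough sketch of the decorator style (the decorator arguments and builder signature here are assumptions for illustration, not the confirmed API; the matmul sample in the repository is authoritative):

```python
import cudnn

# Hypothetical sketch of decorator-based graph creation. The heur_modes
# argument and the builder signature are assumptions, not confirmed API.
@cudnn.jit(heur_modes=[cudnn.heur_mode.A])
@cudnn.graph
def build_matmul(graph):
    a = graph.tensor(name="A", dim=[1, 128, 512], stride=[65536, 512, 1],
                     data_type=cudnn.data_type.BFLOAT16)
    b = graph.tensor(name="B", dim=[1, 512, 128], stride=[65536, 128, 1],
                     data_type=cudnn.data_type.BFLOAT16)
    c = graph.matmul(name="matmul", A=a, B=b)
    c.set_output(True).set_data_type(cudnn.data_type.BFLOAT16)
    return [a, b], [c]  # hypothetical: declare graph inputs and outputs
```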
Improvements
SDPA
- Allows large embedded dimension (d > 128) for fprop across Ampere, Hopper, and Blackwell architectures for bf16/fp16.
- Added better validation checks for sliding window attention for cudnn version 9.9.0 and below.
- Sliding window attention now supports cases where s_q > s_kv.
- The sdpa_fp8 operation now pads correctly with negative infinity in the masking operation rather than a large negative value. This improves the numerical stability of the sdpa operation with the fp8 data type.
- Paged attention now supports page tables in a packed format.
Normalizations
- Allow zero-centered scale in layer norm. Refer to this sample for usage.
Others
- cudnn frontend now supports serialization of dynamic kernel cache.
Bug Fixes
- Fixed the dlopen of cudart.so to look for the binary with the version name.
- Correctly fail when SDPA bprop is called on Blackwell with embedded dimension (d) > 128.
v1.11.0 release
cudnn frontend v1.11 release notes
cudnn frontend v1.11 is the preferred cudnn frontend version for cudnn version 9.8.0 and above. With cuDNN frontend v1.11, the minimum supported cudnn version is 9.0.0.
Note: The FE will continue to build and run with cudnn_v8 until explicitly marked as a compilation failure.
New API
- cudnn frontend v1.11 adds a flexible score modifier to the python SDPA API. Samples showcasing soft capping of the attention scores and an arrow mask are available in the cudnn_frontend/test/python/test_flexible_sdpa.py file. A sample usage of the score modifier is shown below:
```python
score_mod=partial(
    custom_mask,
    mod_tensor=mod_tensor,
    neg_inf=neg_inf_tensor,
    seq_len_q=seq_len_q,
    seq_len_kv=seq_len_kv,
)
```
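In context, the modifier plugs into the SDPA call roughly as follows (a sketch: `custom_mask` and the `mod_tensor`/`neg_inf_tensor` handles come from test_flexible_sdpa.py, and the keyword set shown here is abbreviated):

```python
from functools import partial

# Sketch: pass a custom score modifier to the python SDPA API. custom_mask is
# a user-defined chain of pointwise ops over the attention score matrix, as
# defined in cudnn_frontend/test/python/test_flexible_sdpa.py.
o, stats = graph.sdpa(
    name="sdpa",
    q=q, k=k, v=v,
    is_inference=False,
    score_mod=partial(
        custom_mask,
        mod_tensor=mod_tensor,
        neg_inf=neg_inf_tensor,
        seq_len_q=seq_len_q,
        seq_len_kv=seq_len_kv,
    ),
)
```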
- The Concatenate operation merges two or more tensors into one along the specified axis. The user may also specify an in-place merge.

```cpp
std::shared_ptr<Tensor_attributes>
concatenate(std::vector<std::shared_ptr<Tensor_attributes>>, Concatenate_attributes);
```
- pip wheels compatible with the Windows x86_64 architecture are now available on PyPI.
- The sdpa paged attention API now supports a ragged Q tensor when used with cudnn version 9.7.0 and above.
Improvements
- Users can now pass the CMake flag `-DCMAKE_CXX_FLAGS="-DNV_CUDNN_FRONTEND_DISABLE_LOGGING"` to disable logging in the cuDNN frontend.
- Adds a new sample to showcase native cudagraph creation from cudnn for the sdpa bprop operation. Also fixed a bug when using the update_cuda_graph API to update the cuda graph for the sdpa bprop operation.
- Updates the create_container_and_page_table example function to use the layout desired for the more performant kernel.
Bug Fixes
- Fixes a memory leak in the test harness for some legacy tests that use ragged tensors.
- Fixes a bug introduced in the benchmarking script that prevented the sdpa cudnn operation from being executed; the `use_padding_mask` attribute had been made mandatory for the sdpa operation, and this has been fixed as well.
- Updates the paged attention sample to avoid illegal memory access when changing the dimensions of the tensors in the sample.
- Updates the DgradDReluBNBwdWeight sample to perform the correct operation for the dgrad + drelu fusion.
v1.10.0 release
cudnn frontend v1.10 release notes
cudnn frontend v1.10 is the preferred cudnn frontend to be used for cudnn backend 9.7.0 and later, as it adds Blackwell-specific features.
New API
- cudnn Frontend v1.10 introduces two new operators, block_scale_quantize and block_scale_dequantize, to specify the scaling and de-scaling of low-precision datatypes supported from Blackwell GPUs onwards.
- `create_execution_plan(int64_t const engine_id, std::unordered_map<KnobType_t, int64_t> const &knobs)` allows creation of a custom execution plan with a hardcoded engine and knobs. Added a sample in `samples/cpp/misc/custom_plan.cpp` to showcase how to work with different `Engine` and `Knobs`.
Improvements
- Users can now query behavior notes of a particular execution plan using the `get_behavior_notes(std::vector<BehaviorNote_t> &notes) const` and `get_behavior_notes_for_plan_at_index(int64_t const index, std::vector<BehaviorNote_t> &notes) const` functions.
- SDPA operations now accept both left and right window sizes with respect to the diagonal. See Attention.md for more details.
- SDPA operations now accept a diagonal alignment for the attention score matrix, used to describe the above window. When `s_q != s_kv` and causal mask is on, this can be used to specify whether the diagonal is top left or bottom right.
- Bottom right causal masking can now be enabled on the sdpa_fp8 operation.
Bug fixes
- Fixed a regression in cuDNN FrontEnd v1.9.0 where the softmax node would override user-set dims and strides for softmax_stats and m_zinv. This also affected the sdpa_forward and sdpa_fp8_forward nodes.
New samples
- Added an example to showcase how native cuda graphs can be constructed
from the SDPA operation graph.
v1.9.0 release
cudnn frontend v1.9 release notes
New API
Enhancements to flash attention API
- `SDPA_attributes` and `SDPA_bprop_attributes` now accept a score_mod function through the `set_score_mod` and `set_score_mod_bprop` APIs. The function accepts a custom chain of pointwise operations which operate on the attention score matrix. Some common functors like causal mask, sliding window mask, soft capping, etc. have been added to the headers as reference. More examples of usage have been added in the samples for fprop and bprop.
- Added support for THD format and sliding window mask.
- Added support for THD format and bottom right causal mask.
- Added support for bottom right causal masking with sliding window mask.
- Added a new parameter called `set_max_total_seq_len_q`/`set_max_total_seq_len_kv` on the sdpa bprop node. This helps reduce the workspace size required when running with the THD format.
Improvements
- Allow creation of serialized json for dgrad, wgrad, and resample operations.
- Added more diagnostic messaging for when the compiled version of cudnn does not match the run-time version of cudnn.
Bug fixes
- Fixed an issue where log messages had unparseable data at the end of messages.
- Fixed an issue where building the python pip wheel would hang.
- Fixed natively creating cuda graphs for SDPA with alibi masks.
New samples
- Added a new sample for Layernorm with dynamic shapes and a kernel cache to showcase reduced plan build time when using the kernel cache.
v1.8.0 release
cudnn frontend v1.8 release notes
New API
Paged Attention API
SDPA forward operation now supports paged attention on cudnn 9.5.0 and later by setting the appropriate page table descriptors. `SDPA_attributes` now accepts `set_paged_attention_k_table` and `set_paged_attention_v_table` to input these descriptors. Please refer to the samples for usage: cpp samples, python samples. See docs for more API details. Paged attention allows for more efficient memory usage by storing K/V caches in non-contiguous memory and using page tables to reconstruct them. For more information, refer to the cudnn_graph Library and the Paged Attention paper.
cuda Graph API
cudnn graph now allows the user to directly build a native cuda_graph for a given sub_graph (requires cudnn 9.5.0). There are two APIs:

- `populate_cuda_graph`: adds the cudnn nodes to the empty cuda_graph provided as input.
- `update_cuda_graph`: updates the populated cuda graph with the necessary data pointers.

See docs and backend documentation for more details.
Enhancements
- Kernel caches for dynamic shapes are now supported in python. Added a sample to showcase usage.
- `graph.deselect_engines(str: )` now has a python equivalent through pybind11.
- `graph.tensor(...)` can now accept `int64_t` scalars directly. (Previously limited to int32_t, float, and fp16 data types.)
- fp8 sdpa attention now allows dropout and padding mask. Requires cudnn 9.5.0 and above.
- More enhancements to pointwise output stride inferencing (for the broadcast operation). For non-unary operands, the broadcasted tensor can now be either at IN_0 or IN_1.
- SDPA backward operation now allows d up to 256 for Hopper. Requires cudnn 9.5.0 and above.
Bug fixes
- Fixed an issue while querying `cudnnGetLastErrorString()` from the backend. The error_t object will now have a more meaningful message.
- Fixed build issues seen with the clang-19 compiler.
- Fixed an issue where it was assumed that a graph with bias in sdpa_bprop would always have a dbias.