
Releases: NVIDIA/cudnn-frontend

v1.15.0-release

10 Oct 18:30
0b1577c


cudnn frontend v1.15 release notes

cudnn frontend v1.15 is the preferred cudnn frontend version for cuDNN version 9.13.1 and above.

New API

  • Introduced a new cudnn.Graph API that enables interoperability between torch.tensors and the cudnn frontend API. Sample code for performing a matmul with bias addition:
import torch
import cudnn

# Create a cudnn handle; it is passed to the graph call below.
handle = cudnn.create_handle()

B, M, N, K = 16, 128, 128, 512

a_gpu = torch.randn(B, M, K, device="cuda", dtype=torch.bfloat16)
b_gpu = torch.randn(B, K, N, device="cuda", dtype=torch.bfloat16)
d_gpu = torch.randn(1, M, N, device="cuda", dtype=torch.bfloat16)

with cudnn.Graph(
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
    inputs=["mm::A", "mm::B", "bias::bias"],
    outputs=["bias::OUT_0"],
) as graph:
    AB = graph.matmul(
        name="mm",
        A=a_gpu,
        B=b_gpu,
    )
    C = graph.bias(name="bias", input=AB, bias=d_gpu)
    C.set_output(True)

c_gpu = graph(a_gpu, b_gpu, d_gpu, handle=handle)

All notebooks under samples/python have been updated to showcase the flexibility of this API.

  • cudnn frontend now supports building editable pip wheels in place.
  • The cudnn frontend Graph now includes a warmup method that triggers kernel loading by performing a fake graph capture. This improves startup time for the first kernel launch in the actual run and prevents deadlocks when used alongside other modules (e.g., NCCL).

Improvements

SDPA

  • Introduced set_score_max and set_score_sum_exp to allow the kernel to output the maximum attention score and the sum of exponents (see the sketch after this list).
  • Updated support surface checks. (SDPA bprop does not support the combination of s_q==1 and s_kv==1.)
  • SDPA bprop now automatically applies a padding mask if the sequence length is not a multiple of the tile size.
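
A minimal C++ sketch of wiring these outputs; the setter names come from the note above, but the argument types and surrounding setup (pre-created output tensors) are assumptions rather than the confirmed API:

namespace fe = cudnn_frontend;

// Assumes a built fe::graph::Graph `graph`; `score_max` and `score_sum_exp`
// are pre-created graph tensors to receive the extra outputs (an assumption).
auto sdpa_attributes = fe::graph::SDPA_attributes()
                           .set_name("sdpa")
                           .set_generate_stats(true)
                           .set_score_max(score_max)           // assumed: row-wise max of attention scores
                           .set_score_sum_exp(score_sum_exp);  // assumed: row-wise sum of exponents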

Matmul

  • Added support for COMPLEX_FP32 and COMPLEX_FP64 datatypes. (Requires cuDNN v9.14.0 or later.)

Normalizations

  • Updated samples to prioritize fe::HeurMode_t::A over fe::HeurMode_t::FALLBACK.

Others

  • Added support for a new parameter to enable negative scales in the Block Scale DeQuantize operation.
  • Improved logging to clearly illustrate the different stages of graph creation.
  • The swish function now accepts a swish_beta parameter.
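
A hedged pointwise sketch of the new parameter; the SWISH_FWD mode and the set_swish_beta setter name are assumptions derived from the note above, not confirmed API:

namespace fe = cudnn_frontend;

// Assumes an existing fe::graph::Graph `graph` and input tensor `x`.
auto swish_out = graph.pointwise(
    x,
    fe::graph::Pointwise_attributes()
        .set_mode(fe::PointwiseMode_t::SWISH_FWD)  // swish activation, forward pass (assumed mode name)
        .set_swish_beta(1.702f));                  // assumed setter for the new beta parameter
swish_out->set_output(true);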

Samples

  • Added samples demonstrating how to perform sink attention forward and backward propagation with the C++ API. (Requires cuDNN v9.13.0 or later.)
  • Added samples demonstrating "Block Scale Matmul Quantize". (Requires cuDNN v9.14.0 or later.)
  • Added a sample demonstrating how ragged (packed) tensors work with cuDNN SDPA (test_sdpa_with_caching.py). The sample also demonstrates simple caching and graph capture techniques that can improve execution time.

Bug Fixes

  • Fixed an issue where the SDPA node was accessing tensor dimensions before they were inferred, leading to a crash.

Benchmarks

  • Updated results with cuDNN 9.13.1 for B200 and GB300.

v1.14.1-release

05 Sep 04:42
1a7b4b7


📢 cuDNN Frontend v1.14.1 — Release Notes

🚀 Improvements

🔹 SDPA

  • Improved support checks for Hopper backward propagation (bprop) to fix a bug (introduced in 9.11 and fixed in 9.13) affecting certain large head-dimension combinations of d_qk and d_v.
  • Added support for a sink parameter for the SDPA operation. Refer to the new samples:
    • fp16_fwd_with_sink_token.cpp
    • fp16_bwd_with_sink_token.cpp

🔹 Pointwise

  • Added support for a configurable beta parameter for swish

🔹 Other Updates

  • See PRs #163 and #165 for the custom DLPack path

✅ Recommended Action: Upgrade to cuDNN Frontend v1.14.1 for full compatibility with cuDNN 9.13.0+, improved SDPA support, additional normalization support, and deviceless graph compilation features.

v1.14.0-release

14 Aug 05:54
a7e19ae


📢 cuDNN Frontend v1.14 — Release Notes

Preferred version for: cuDNN 9.12.0 and above
Minimum Python version: 3.9 (support for Python 3.8 has been dropped)
Updated pip wheels: Available for Python 3.13


🚀 Improvements

🔹 SDPA

  • Introduced a unified SDPA node → reduces graph creation latency & simplifies the SDPA graph creation.
    (No API changes required from users.)
  • Improved support checks for SDPA FP8 forward propagation (fprop).
  • Improved support checks for Hopper backward propagation (bprop) to fix a bug (introduced in 9.11) affecting certain large head-dimension combinations of d_qk and d_v.
  • Added new SDPA samples with:
    • Paged prefill
    • Ragged Q tensor decode

🔹 Matmul

  • Added Python sample for low-precision FP8/FP4 matrix multiplications.

Recommended Action: Upgrade to cuDNN Frontend v1.14.0 for full compatibility with cuDNN 9.12.0+, improved SDPA support, additional normalization support, and deviceless graph compilation features.

v1.13.0-release

17 Jul 05:05
9793df5


cudnn frontend v1.13 release notes

cudnn frontend v1.13 is the preferred cudnn frontend version for cudnn version 9.11.0 and above.

New API

Introduces a device descriptor, which allows for device-less compilation of a cudnn graph targeting a given GPU. See the newly added sample and documentation.

Improvements

SDPA

  • Introduced generate_stats as a replacement for is_inference to improve clarity. When generate_stats is true, the output will contain the stats tensor. When migrating from is_inference (now deprecated), note that generate_stats has the opposite meaning: pass it the negation of the bool previously passed to is_inference (see the sketch after this list).

  • Improved support checks for left and right diagonal bands in conjunction with the diagonal alignment.

  • Improved error handling for large head dimension (d > 128) in sdpa bprop.
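
A minimal C++ sketch of the migration described above, assuming the usual SDPA_attributes setup:

namespace fe = cudnn_frontend;

// Before (deprecated): stats were produced when is_inference was false.
auto attrs_old = fe::graph::SDPA_attributes().set_is_inference(false);

// After: generate_stats has the opposite meaning, so negate the old flag.
auto attrs_new = fe::graph::SDPA_attributes().set_generate_stats(true);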

Others

  • Published improved SDPA training benchmarks for fp8 and fp16/bf16 graph patterns.

  • Enabled int4 weight-only quantization for matmul. See the example.

  • Allowed block scale dequantize (required for low-precision matmul) to take a 2-D scale factor.

  • Allowed reductions to accept deterministic as an attribute.

  • Added pybinds for block scale dequantize.

Bug Fixes

  • Fixed the sliding window attn_score_modifier function so that it can set true negative infinity.

v1.12.1 release

09 Jul 00:43
f937055


This release builds on top of the 1.12.0 release.

Bug fix

  • Fixed an issue where d=256 was incorrectly marked as unsupported on Hopper

Minor Enhancements

  • Addressed several comments from code review.
  • Improved the cmake workflow. See PR #125

Benchmark Results

  • Published results using the cuDNN backend for the default torch.sdpa op in comparison to other backends. See Llama-3.2-1B-Training for reference.
  • Published comparison results of sdpa() against other backends. See sdpa_benchmark_bf16_training.

v1.12.0 release

19 May 20:59
666996f


cudnn frontend v1.12 release notes

cudnn frontend v1.12 is the preferred cudnn frontend version for cudnn version 9.9.0 and above.

cudnn frontend v1.12 is the minimum cudnn frontend version required to work with cuda 13.0 and above.

Updated the dlpack version and raised the minimum required cmake version to 3.18.

New API

  • Allows compilation and loading of the cudnn frontend with cudnn-jit packages.

  • Introduces the Adaptive Layernorm (fprop and bprop) operation in cudnn.

std::array<std::shared_ptr<Tensor_attributes>, 3>
adalayernorm(std::shared_ptr<Tensor_attributes>& input,
             std::shared_ptr<Tensor_attributes>& scale,
             std::shared_ptr<Tensor_attributes>& bias,
             AdaLayernorm_attributes attributes);

std::array<std::shared_ptr<Tensor_attributes>, 3>
adalayernorm_backward(std::shared_ptr<Tensor_attributes> dy,
                      std::shared_ptr<Tensor_attributes> x,
                      std::shared_ptr<Tensor_attributes> scale,
                      AdaLayernorm_backward_attributes options);

Please refer to samples for usage.
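
A hedged usage sketch based on the forward signature above; the attribute setters shown are assumptions modeled on the other normalization ops:

namespace fe = cudnn_frontend;

// Assumes an existing fe::graph::Graph `graph` and tensors `x`, `scale`, `bias`.
auto [y, mean, inv_variance] = graph.adalayernorm(
    x, scale, bias,
    fe::graph::AdaLayernorm_attributes()
        .set_name("adalayernorm")
        .set_forward_phase(fe::NormFwdPhase_t::TRAINING));  // assumed: stats produced in training phase
y->set_output(true);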

  • The cudnn frontend python API introduces two decorator functions, cudnn.jit and cudnn.graph, for simpler graph creation in python. Refer to the matmul sample for usage.

Improvements

SDPA

  • Allows large embedded dimension (d > 128) for fprop across Ampere, Hopper, and Blackwell architectures for bf16/fp16.

  • Added better validation checks for sliding window attention for cudnn version 9.9.0 and below.

  • Sliding window attention now supports cases where s_q > s_kv.

  • The sdpa_fp8 operation now pads correctly with negative infinity in the masking operation rather than with a large negative value. This improves the numerical stability of the sdpa operation with the fp8 data type.

  • Paged attention now supports page tables in a packed format.

Normalizations

  • Allow zero-centered scale in layer norm. Refer to this sample for usage.

Others

  • cudnn frontend now supports serialization of dynamic kernel cache.

Bug Fixes

  • Fixed the dlopen of cudart.so to look for the binary with the version name.

  • Correctly fail when SDPA bprop is called on Blackwell with embedded dimension (d) > 128.

v1.11.0 release

20 Mar 23:48
8801fd7



cudnn frontend v1.11 release notes

cudnn frontend v1.11 is the preferred cudnn frontend version for cudnn version 9.8.0 and above. With cuDNN frontend v1.11, the minimum supported cudnn version is 9.0.0.

Note: The FE will continue to build and run with cudnn_v8 until this is explicitly marked as a compilation failure.

New API

  • cudnn frontend v1.11 adds a flexible score modifier to the python SDPA API. Samples showcasing soft capping of the attention scores and an arrow mask are available in the cudnn_frontend/test/python/test_flexible_sdpa.py file. A sample usage of the score modifier is shown below:
        from functools import partial

        # custom_mask is a user-defined chain of pointwise ops over the attention
        # scores; the resulting callable is passed as score_mod to the SDPA call.
        score_mod=partial(
            custom_mask,
            mod_tensor=mod_tensor,
            neg_inf=neg_inf_tensor,
            seq_len_q=seq_len_q,
            seq_len_kv=seq_len_kv,
        )
  • The Concatenate operation merges two or more tensors into one along the specified axis. The user may also specify an in-place merge.
std::shared_ptr<Tensor_attributes>
concatenate(std::vector<std::shared_ptr<Tensor_attributes>>, Concatenate_attributes);
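
A minimal usage sketch of the signature above; the attribute setter names are assumptions, not confirmed API:

namespace fe = cudnn_frontend;

// Assumes an existing fe::graph::Graph `graph` and tensors `t0`, `t1`.
auto concat_out = graph.concatenate(
    {t0, t1},
    fe::graph::Concatenate_attributes()
        .set_axis(0)               // assumed: axis along which to merge
        .set_in_place_index(0));   // assumed: optional in-place merge into operand 0
concat_out->set_output(true);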
  • pip wheels compatible with windows x86_64 architecture are now available on pypi.

  • The sdpa paged attention API now supports a ragged Q tensor when used with cudnn version 9.7.0 and above.

Improvements

  • Users can now pass the CMake flag -DCMAKE_CXX_FLAGS="-DNV_CUDNN_FRONTEND_DISABLE_LOGGING" to disable logging in the cuDNN frontend.

  • Added a new sample to showcase native cuda graph creation from cudnn for the sdpa bprop operation. Also fixed a bug when using the update_cuda_graph API to update the cuda graph for the sdpa bprop operation.

  • Updated the create_container_and_page_table example function to use the layout desired for the more performant kernel.

Bug Fixes

  • Fixed a memory leak in the test harness for some legacy tests that use ragged tensors.

  • Fixed a bug in the benchmarking script that prevented the cudnn sdpa operation from being executed; the use_padding_mask attribute had inadvertently been made mandatory for the sdpa operation.

  • Updated the paged attention sample to avoid illegal memory access when the dimensions of the tensors in the sample are changed.

  • Updated the DgradDReluBNBwdWeight sample to perform the correct operation for the dgrad + drelu fusion.

v1.10.0 release

28 Jan 05:43
91b7532


cudnn frontend v1.10 release notes

cudnn frontend v1.10 is the preferred cudnn frontend to be used with cudnn backend 9.7.0 and later, as it adds Blackwell-specific features.

New API

  • cudnn frontend v1.10 introduces two new operators, block_scale_quantize and block_scale_dequantize, to specify the scaling and de-scaling of low-precision datatypes supported from Blackwell GPUs onwards.

  • create_execution_plan(int64_t const engine_id, std::unordered_map<KnobType_t, int64_t> const &knobs) allows creation of a custom execution plan with a hardcoded engine and knobs. Added a sample in samples/cpp/misc/custom_plan.cpp to showcase how to work with different engines and knobs.
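
A minimal sketch of the API above; the specific knob type and value are illustrative assumptions:

namespace fe = cudnn_frontend;

// Assumes `graph` has already been built and validated.
std::unordered_map<fe::KnobType_t, int64_t> knobs = {
    {fe::KnobType_t::SPLIT_K, 1},  // illustrative knob/value, not a recommendation
};
auto status = graph.create_execution_plan(0 /*engine_id*/, knobs);
if (!status.is_good()) {
    // Fall back to heuristic-based plan creation.
}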

Improvements

  • Users can now query the behavior notes of a particular execution plan using the get_behavior_notes(std::vector<BehaviorNote_t> &notes) const and get_behavior_notes_for_plan_at_index(int64_t const index, std::vector<BehaviorNote_t> &notes) const functions (see the sketch after this list).

  • SDPA operations now accept both left and right window sizes with respect to the diagonal. See Attention.md for more details.

  • SDPA operations now accept a diagonal alignment for the attention score matrix, used to describe the window above. When s_q != s_kv and the causal mask is enabled, this can be used to specify whether the diagonal is top-left or bottom-right aligned.

  • Bottom-right causal masking can now be enabled on the sdpa_fp8 operation.
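
A minimal sketch of the behavior-note query introduced above; the signatures come from the notes, while the surrounding setup is assumed:

namespace fe = cudnn_frontend;

// Assumes `graph` has already built its execution plans.
std::vector<fe::BehaviorNote_t> notes;
auto status = graph.get_behavior_notes(notes);  // notes for the plan that will execute

// Or inspect a specific candidate plan by index:
status = graph.get_behavior_notes_for_plan_at_index(0, notes);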

Bug fixes

  • Fixed a regression in cuDNN frontend v1.9.0 where the softmax node would override user-set dims and strides for softmax_stats and m_zinv. This also affected the sdpa_forward and sdpa_fp8_forward nodes.

New samples

  • Added an example to showcase how native cuda graphs can be constructed
    from the SDPA operation graph.

v1.9.0 release

20 Dec 19:22
ee971b1


cudnn frontend v1.9 release notes

New API

Enhancements to flash attention API

  • SDPA_attributes and SDPA_bprop_attributes now accept a score_mod function through the set_score_mod and set_score_mod_bprop APIs. The function accepts a custom chain of pointwise operations which operate on the attention score matrix. Some common functors like causal mask, sliding window mask, and soft capping have been added to the headers as a reference. More examples of usage have been added in the samples for fprop and bprop (see the sketch after this list).

  • Added support for THD format with sliding window mask.

  • Added support for THD format with bottom-right causal mask.

  • Added support for bottom-right causal masking combined with sliding window mask.

  • Added new setters set_max_total_seq_len_q/set_max_total_seq_len_kv on the sdpa bprop node. These help reduce the workspace size required when running with the THD format.
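
A hedged sketch of attaching a score modifier in C++; the callback signature is an assumption modeled on the description above, not the confirmed API:

namespace fe = cudnn_frontend;

// Hypothetical callback shape (assumed): receives the graph and the attention
// score tensor, and returns a modified score tensor built from pointwise ops.
auto score_mod = [](fe::graph::Graph& g,
                    std::shared_ptr<fe::graph::Tensor_attributes> score) {
    // Simplified soft cap: squash the scores through tanh
    // (a full soft cap would also scale before and after).
    return g.pointwise(score,
                       fe::graph::Pointwise_attributes()
                           .set_mode(fe::PointwiseMode_t::TANH_FWD));
};
sdpa_attributes.set_score_mod(score_mod);  // setter named in the notes; signature assumed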

Improvements

  • Allow creation of serialized json for dgrad, wgrad, and resample operations.

  • Added a more detailed diagnostic message for when the compiled version of cudnn does not match the run-time version of cudnn.

Bug fixes

  • Fixed an issue where log messages contained unparseable data at the end.

  • Fixed an issue where building the python pip wheel would hang.

  • Fixed native cuda graph creation for SDPA with alibi masks.

New samples

  • Added a new sample for Layernorm with dynamic shapes and a kernel cache to showcase reduced plan build time when using the kernel cache.

v1.8.0 release

23 Oct 18:44
936021b


cudnn frontend v1.8 release notes

New API

Paged Attention API

The SDPA forward operation now supports paged attention on cudnn 9.5.0 and later by setting the appropriate page table descriptors. SDPA_attributes now accepts set_paged_attention_k_table and set_paged_attention_v_table to input these descriptors. Please refer to the samples for usage: cpp samples, python samples. See the docs for more API details. Paged attention allows for more efficient memory usage by storing K/V caches in non-contiguous memory and using page tables to reconstruct them. For more information, refer to the cudnn_graph library and the Paged Attention paper.
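
A minimal sketch of wiring the page-table descriptors named above; the surrounding tensor setup is assumed:

namespace fe = cudnn_frontend;

// Assumes graph tensors `page_table_k` and `page_table_v` describing the K/V page tables.
sdpa_attributes.set_paged_attention_k_table(page_table_k)
               .set_paged_attention_v_table(page_table_v);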

cuda Graph API

The cudnn graph now allows the user to directly build a native cuda graph for a given sub-graph (requires cudnn 9.5.0). There are two APIs:

  • populate_cuda_graph: adds the cudnn nodes to the empty cuda_graph provided as input.
  • update_cuda_graph: updates the populated cuda graph with the necessary data pointers.

See the docs and backend documentation for more details.
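
A hedged sketch of the two-step flow; the exact argument lists are assumptions:

// Assumes `graph`, `handle`, `variant_pack` (uid -> device pointer map), and
// `workspace` already exist.
cudaGraph_t cuda_graph;
cudaGraphCreate(&cuda_graph, 0);

// One-time: insert the cudnn nodes into the empty CUDA graph.
graph.populate_cuda_graph(handle, variant_pack, workspace, cuda_graph);

// Per iteration: rebind data pointers without rebuilding the CUDA graph.
graph.update_cuda_graph(handle, variant_pack, workspace, cuda_graph);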

Enhancements

  • The kernel cache for dynamic shapes is now supported in python. Added a sample to showcase usage.

  • graph.deselect_engines now has a python equivalent through pybind11.

  • graph.tensor(...) can now accept int64_t scalars directly. (Previously limited to int32_t, float, and fp16 data types.)

  • fp8 sdpa attention now allows dropout and padding mask. Requires cudnn 9.5.0 and above.

  • More enhancements to pointwise output stride inferencing (for broadcast operations). For non-unary operands, the broadcast tensor can now be either at IN_0 or IN_1.

  • The SDPA backward operation now allows d up to 256 on Hopper. Requires cudnn 9.5.0 and above.

Bug fixes

  • Fixed an issue when querying cudnnGetLastErrorString() from the backend; the error_t object will now carry a more meaningful message.

  • Fixed build issues seen with the clang-19 compiler.

  • Fixed an issue where it was assumed that a graph with bias in sdpa_bprop would always have a dbias.