Releases: NVIDIA/cudnn-frontend
v1.15.0-release
cudnn frontend v1.15 release notes
cudnn frontend v1.15 is the preferred cudnn frontend version for cuDNN version 9.13.1 and above.
New API
- Introduced a new `cudnn.Graph` API that enables interoperability between `torch` tensors and the cudnn frontend API. Sample code for performing a matmul with bias addition:
```python
import cudnn
import torch

handle = cudnn.create_handle()

B, M, N, K = 16, 128, 128, 512
a_gpu = torch.randn(B, M, K, device="cuda", dtype=torch.bfloat16)
b_gpu = torch.randn(B, K, N, device="cuda", dtype=torch.bfloat16)
d_gpu = torch.randn(1, M, N, device="cuda", dtype=torch.bfloat16)

with cudnn.Graph(
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
    inputs=["mm::A", "mm::B", "bias::bias"],
    outputs=["bias::OUT_0"],
) as graph:
    AB = graph.matmul(name="mm", A=a_gpu, B=b_gpu)
    C = graph.bias(name="bias", input=AB, bias=d_gpu)
    C.set_output(True)

# Positional arguments follow the order declared in `inputs` above.
c_gpu = graph(a_gpu, b_gpu, d_gpu, handle=handle)
```
All notebooks under samples/python have been updated to showcase the flexibility of this API.
- cudnn frontend now supports building editable pip wheels in place.
- The cudnn frontend `Graph` now includes a `warmup` method that triggers kernel loading by performing a fake graph capture. This improves startup time for the first kernel launch in the actual run and prevents deadlocks when used with other modules (e.g., NCCL).
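A minimal usage sketch, assuming the `Graph` from the sample above; the release notes do not spell out the signature, so the argument-free call is an assumption:

```python
# Hypothetical sketch: trigger kernel loading ahead of the hot path.
# Assumes `graph`, `a_gpu`, `b_gpu`, `d_gpu`, and `handle` from the sample
# above, and that warmup() takes no required arguments (an assumption).
graph.warmup()
c_gpu = graph(a_gpu, b_gpu, d_gpu, handle=handle)  # first real launch avoids kernel-load cost
```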
Improvements
SDPA
- Introduced `set_score_max` and `set_score_sum_exp` to allow the kernel to output the max attention score and the sum of exponents (see the note after this list).
- Updated support surface checks. (SDPA bprop does not support the combination of `s_q == 1` and `s_kv == 1`.)
- SDPA bprop now automatically applies a padding mask if the sequence length is not a multiple of the tile size.
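For context, the max attention score and sum of exponents are presumably the standard numerically stable softmax statistics: for a row of attention scores $s_1, \dots, s_n$,

$$m = \max_i s_i, \qquad \ell = \sum_i e^{s_i - m},$$

from which $\mathrm{softmax}(s)_i = e^{s_i - m} / \ell$ can be reconstructed without overflow.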
Matmul
- Added support for `COMPLEX_FP32` and `COMPLEX_FP64` datatypes. (Requires cuDNN v9.14.0 or later.)
Normalizations
- Updated samples to prioritize `fe::HeurMode_t::A` over `fe::HeurMode_t::FALLBACK`.
Others
- Added support for a new parameter to enable negative scales in the Block Scale DeQuantize operation.
- Improved logging to clearly illustrate the different stages of graph creation.
- The `swish` function now accepts a `swish_beta` parameter.
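For reference, `swish_beta` is the usual $\beta$ parameter of the swish activation,

$$\mathrm{swish}_\beta(x) = x \cdot \sigma(\beta x) = \frac{x}{1 + e^{-\beta x}},$$

with $\beta = 1$ recovering standard swish (SiLU).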
Samples
- Added samples demonstrating how to perform sink attention forward and backward propagation with the C++ API. (Requires cuDNN v9.13.0 or later.)
- Added samples demonstrating "Block Scale Matmul Quantize". (Requires cuDNN v9.14.0 or later.)
- Added a sample demonstrating how ragged (packed) tensors work with cuDNN SDPA (test_sdpa_with_caching.py). The sample also demonstrates simple caching and graph capture techniques that can improve execution time.
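The caching in that sample follows the usual build-once-per-shape pattern. A generic sketch of the idea (illustrative only, not the sample's exact code):

```python
# Illustrative build-once cache: key graphs by problem shape so the expensive
# graph build / plan step runs once per unique shape; later calls just execute.
_graph_cache = {}

def get_or_build_graph(shape_key, build_fn):
    if shape_key not in _graph_cache:
        _graph_cache[shape_key] = build_fn()  # build + finalize plans once
    return _graph_cache[shape_key]            # cheap lookup on reuse
```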
Bug Fixes
- Fixed an issue where the SDPA node was accessing tensor dimensions before they were inferred, leading to a crash.
Benchmarks
- Updated results with cuDNN 9.13.1 for B200 and GB300.
v1.14.1-release
📢 cuDNN Frontend v1.14.1 — Release Notes
🚀 Improvements
🔹 SDPA
- Improved support checks for Hopper backward propagation (bprop) to fix a bug (introduced in `9.11` and fixed in `9.13`) affecting certain large head-dimension combinations of d_qk and d_v.
- Added support for a sink parameter for the SDPA operation. Refer to the new samples `fp16_fwd_with_sink_token.cpp` and `fp16_bwd_with_sink_token.cpp`.
🔹 Pointwise
- Added support for a configurable `beta` parameter for swish.
✅ Recommended Action: Upgrade to cuDNN Frontend v1.14.1 for full compatibility with cuDNN 9.13.0+, improved SDPA support, additional normalization support, and deviceless graph compilation features.
v1.14.0-release
📢 cuDNN Frontend v1.14 — Release Notes
Preferred version for: cuDNN 9.12.0 and above
Minimum Python version: 3.9 (previously 3.8, now obsolete)
Updated pip wheels: Available for Python 3.13
🚀 Improvements
🔹 SDPA
- Introduced a unified SDPA node → reduces graph creation latency and simplifies SDPA graph creation. (No API changes required from users.)
- Improved support checks for SDPA FP8 forward propagation (fprop).
- Improved support checks for Hopper backward propagation (bprop) to fix a bug (introduced in `9.11`) affecting certain large head-dimension combinations of d_qk and d_v.
- Added new SDPA samples with:
  - Paged prefill
  - Ragged Q tensor decode
🔹 Normalizations
- Added support for fused LayerNorm with ReLU.
- Included sample: LayerNorm with ReLU bitmask dump.
🔹 Matmul
- Added Python sample for low-precision FP8/FP4 matrix multiplications.
🔹 Other Updates
- Added Python bindings for deviceless graph compilation. → Sample: `test_deviceless_aot_compilation.py`
- Addressed GitHub issue #151.
✅ Recommended Action: Upgrade to cuDNN Frontend v1.14.0 for full compatibility with cuDNN 9.12.0+, improved SDPA support, additional normalization support, and deviceless graph compilation features.
v1.13.0-release
cudnn frontend v1.13 release notes
cudnn frontend v1.13 is the preferred cudnn frontend version for cudnn version 9.11.0 and above.
New API
Introduces the device descriptor, which allows deviceless compilation of a cudnn graph for a target GPU. See the newly added sample and documentation.
Improvements
SDPA
- Introduced `generate_stats` as a replacement for `is_inference`, to improve clarity. When `generate_stats` is true, the output will contain the stats tensor. When migrating from `is_inference` (which is now deprecated), note that `generate_stats` has the opposite meaning, so pass it the negation of the bool that was passed to `is_inference` (see the sketch after this list).
- Improved support checks for left and right diagonal bands in conjunction with the diagonal alignment.
- Improved error handling for large head dimension (d > 128) in SDPA bprop.
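A minimal migration sketch, assuming the Python `sdpa` call (graph setup and the `q`, `k`, `v` tensors are elided):

```python
# Migration sketch: generate_stats is the logical negation of is_inference.
is_inference = False  # old flag: training, so stats were produced

# Before (deprecated):
#   o, stats = graph.sdpa(name="sdpa", q=q, k=k, v=v, is_inference=is_inference)
# After: pass the negation of the old flag.
o, stats = graph.sdpa(name="sdpa", q=q, k=k, v=v, generate_stats=not is_inference)
```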
Normalizations
- Added support for fused LayerNorm with ReLU, and samples for LayerNorm with ReLU bitmask dump.
Others
- Published improved SDPA training benchmarks for fp8 and fp16/bf16 graph patterns.
- Enabled int4 weight-only quantization for matmul. See example.
- Allow block scale dequantize (required for low-precision matmul) to take a 2-D scale factor.
- Allow reductions to accept deterministic as an attribute.
- Added pybinds for block scale dequantize.
Bug Fixes
- Fixed the sliding window `attn_score_modifier` function, allowing it to set true negative infinity.
v1.12.1 release
This release builds on top of the 1.12.0 release.
Bug fix
- Fixes an issue where d=256 was marked as not supported on Hopper.
Minor Enhancements
- Addressed several comments from code review.
- Improved the cmake workflow. See PR 125
Benchmark Results
- Published results of using the cuDNN backend for the default `torch.sdpa` op in comparison to other backends. See Llama-3.2-1B-Training for reference.
- Published comparison results of sdpa() against other backends. See sdpa_benchmark_bf16_training.
v1.12.0 release
cudnn frontend v1.12 release notes
cudnn frontend v1.12 is the preferred cudnn frontend version for cudnn version 9.9.0 and above.
cudnn_frontend v1.12 is the minimum cudnn frontend version required to work with cuda 13.0 and above.
Updated the dlpack version, and the cmake minimum required version is now 3.18.
New API
- Allows compilation and loading of cudnn frontend with cudnn-jit packages.
- Introduces the Adaptive Layernorm (fprop and bprop) operation in cudnn:
```cpp
std::array<std::shared_ptr<Tensor_attributes>, 3>
adalayernorm(std::shared_ptr<Tensor_attributes>& input,
             std::shared_ptr<Tensor_attributes>& scale,
             std::shared_ptr<Tensor_attributes>& bias,
             AdaLayernorm_attributes attributes);

std::array<std::shared_ptr<Tensor_attributes>, 3>
adalayernorm_backward(std::shared_ptr<Tensor_attributes> dy,
                      std::shared_ptr<Tensor_attributes> x,
                      std::shared_ptr<Tensor_attributes> scale,
                      AdaLayernorm_backward_attributes options);
```
Please refer to samples for usage.
- The cudnn frontend python API introduces two decorator functions, `cudnn.jit` and `cudnn.graph`, for simpler graph creation in python. Refer to the matmul sample for usage; a rough sketch is shown below.
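A rough sketch of the decorator style (the decorator arguments and builder signature here are assumptions for illustration, not the confirmed API; the matmul sample in the repository is authoritative):

```python
import cudnn

# Hypothetical sketch of decorator-based graph creation. The heur_modes
# argument and the builder signature are assumptions, not confirmed API.
@cudnn.jit(heur_modes=[cudnn.heur_mode.A])
@cudnn.graph
def build_matmul(graph):
    a = graph.tensor(name="A", dim=[1, 128, 512], stride=[65536, 512, 1],
                     data_type=cudnn.data_type.BFLOAT16)
    b = graph.tensor(name="B", dim=[1, 512, 128], stride=[65536, 128, 1],
                     data_type=cudnn.data_type.BFLOAT16)
    c = graph.matmul(name="matmul", A=a, B=b)
    c.set_output(True).set_data_type(cudnn.data_type.BFLOAT16)
    return [a, b], [c]  # hypothetical: declare graph inputs and outputs
```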
Improvements
SDPA
- Allows large embedded dimension (d > 128) for fprop across Ampere, Hopper, and Blackwell architectures for bf16/fp16.
- Added better validation checks for sliding window attention for cudnn version 9.9.0 and below.
- Sliding window attention now supports cases where s_q > s_kv.
- The sdpa_fp8 operation now pads correctly with negative infinity in the masking operation rather than a large negative value. This improves the numerical stability of the sdpa operation with the fp8 data type.
- Paged attention now supports page tables in a packed format.
Normalizations
- Allow zero-centered scale in layer norm. Refer to this sample for usage.
Others
- cudnn frontend now supports serialization of dynamic kernel cache.
Bug Fixes
- Fixed the dlopen of cudart.so to look for the binary with the version name.
- Correctly fail when SDPA bprop is called on Blackwell with embedded dimension (d) > 128.
v1.11.0 release
cudnn frontend v1.11 release notes
cudnn frontend v1.11 is the preferred cudnn frontend version for cudnn version 9.8.0 and above. With cuDNN frontend v1.11, the minimum supported cudnn version is 9.0.0.
Note: The FE will continue to build and run with cudnn_v8 until explicitly marked as a compilation failure.
New API
- cudnn frontend v1.11 adds a flexible score modifier to the python SDPA API. Samples showcasing soft capping of the attention scores and an arrow mask are available in the cudnn_frontend/test/python/test_flexible_sdpa.py file. A sample usage of the score modifier is shown below:
```python
score_mod=partial(
    custom_mask,
    mod_tensor=mod_tensor,
    neg_inf=neg_inf_tensor,
    seq_len_q=seq_len_q,
    seq_len_kv=seq_len_kv,
)
```
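In context, the modifier plugs into the SDPA call roughly as follows (a sketch: `custom_mask` and the `mod_tensor`/`neg_inf_tensor` handles come from test_flexible_sdpa.py, and the keyword set shown here is abbreviated):

```python
from functools import partial

# Sketch: pass a custom score modifier to the python SDPA API. custom_mask is
# a user-defined chain of pointwise ops over the attention score matrix, as
# defined in cudnn_frontend/test/python/test_flexible_sdpa.py.
o, stats = graph.sdpa(
    name="sdpa",
    q=q, k=k, v=v,
    is_inference=False,
    score_mod=partial(
        custom_mask,
        mod_tensor=mod_tensor,
        neg_inf=neg_inf_tensor,
        seq_len_q=seq_len_q,
        seq_len_kv=seq_len_kv,
    ),
)
```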
- The Concatenate operation merges two or more tensors into one along the specified axis. The user may also specify an in-place merge.

```cpp
std::shared_ptr<Tensor_attributes>
concatenate(std::vector<std::shared_ptr<Tensor_attributes>>, Concatenate_attributes);
```
- pip wheels compatible with the Windows x86_64 architecture are now available on PyPI.
- The sdpa paged attention API now supports a ragged Q tensor when used with cudnn version 9.7.0 and above.
Improvements
- Users can now pass the CMake flag `-DCMAKE_CXX_FLAGS="-DNV_CUDNN_FRONTEND_DISABLE_LOGGING"` to disable logging in the cuDNN frontend.
- Adds a new sample to showcase native cudagraph creation from cudnn for the sdpa bprop operation. Also fixed a bug when using the update_cuda_graph API to update the cuda graph for the sdpa bprop operation.
- Updates the create_container_and_page_table example function to use the layout desired for the more performant kernel.
Bug Fixes
- Fixes a memory leak in the test harness for some legacy tests that use ragged tensors.
- Fixes a bug introduced in the benchmarking script that prevented the sdpa cudnn operation from being executed; the `use_padding_mask` attribute had been made mandatory for the sdpa operation, and this has been fixed as well.
- Updates the paged attention sample to avoid illegal memory access when changing the dimensions of the tensors in the sample.
- Updates the DgradDReluBNBwdWeight sample to perform the correct operation for the dgrad + drelu fusion.
v1.10.0 release
cudnn frontend v1.10 release notes
cudnn frontend v1.10 is the preferred cudnn frontend to be used for cudnn backend 9.7.0 and later, as it adds Blackwell-specific features.
New API
- cudnn Frontend v1.10 introduces two new operators, block_scale_quantize and block_scale_dequantize, to specify the scaling and de-scaling of low-precision datatypes supported from Blackwell GPUs onwards.
- `create_execution_plan(int64_t const engine_id, std::unordered_map<KnobType_t, int64_t> const &knobs)` allows creation of a custom execution plan with a hardcoded engine and knobs. Added a sample in `samples/cpp/misc/custom_plan.cpp` to showcase how to work with different `Engine` and `Knobs`.
Improvements
- Users can now query behavior notes of a particular execution plan using the `get_behavior_notes(std::vector<BehaviorNote_t> &notes) const` and `get_behavior_notes_for_plan_at_index(int64_t const index, std::vector<BehaviorNote_t> &notes) const` functions.
- SDPA operations now accept both left and right window sizes with respect to the diagonal. See Attention.md for more details.
- SDPA operations now accept a diagonal alignment for the attention score matrix, used to describe the above window. When `s_q != s_kv` and causal mask is on, this can be used to specify whether the diagonal is top left or bottom right.
- Bottom right causal masking can now be enabled on the sdpa_fp8 operation.
Bug fixes
- Fixed a regression in cuDNN FrontEnd v1.9.0 where the softmax node would override user-set dims and strides for softmax_stats and m_zinv. This also affected the sdpa_forward and sdpa_fp8_forward nodes.
New samples
- Added an example to showcase how native cuda graphs can be constructed
from the SDPA operation graph.
v1.9.0 release
cudnn frontend v1.9 release notes
New API
Enhancements to flash attention API
- `SDPA_attributes` and `SDPA_bprop_attributes` now accept a score_mod function through the `set_score_mod` and `set_score_mod_bprop` APIs. The function accepts a custom chain of pointwise operations which operate on the attention score matrix. Some common functors like causal mask, sliding window mask, soft capping, etc. have been added to the headers as reference. More examples of usage have been added in the samples for fprop and bprop.
- Added support for THD format and sliding window mask.
- Added support for THD format and bottom right causal mask.
- Added support for bottom right causal masking with sliding window mask.
- Added a new parameter called `set_max_total_seq_len_q`/`set_max_total_seq_len_kv` on the sdpa bprop node. This helps reduce the workspace size required when running with the THD format.
Improvements
- Allow creation of serialized json for dgrad, wgrad, and resample operations.
- Added more diagnostic messaging for when the compiled version of cudnn does not match the run-time version of cudnn.
Bug fixes
- Fixed an issue where log messages had unparseable data at the end of messages.
- Fixed an issue where building the python pip wheel would hang.
- Fixed natively creating cuda graphs for SDPA with alibi masks.
New samples
- Added a new sample for Layernorm with dynamic shapes and a kernel cache to showcase reduced plan build time when using the kernel cache.
v1.8.0 release
cudnn frontend v1.8 release notes
New API
Paged Attention API
SDPA forward operation now supports paged attention on cudnn 9.5.0 and later by setting the appropriate page table descriptors. `SDPA_attributes` now accepts `set_paged_attention_k_table` and `set_paged_attention_v_table` to input these descriptors. Please refer to the samples for usage: cpp samples, python samples. See docs for more API details. Paged attention allows for more efficient memory usage by storing K/V caches in non-contiguous memory and using page tables to reconstruct them. For more information, refer to the cudnn_graph Library and the Paged Attention paper.
cuda Graph API
cudnn graph now allows the user to directly build a native cuda_graph for a given sub_graph (requires cudnn 9.5.0). There are two APIs:

- `populate_cuda_graph`: adds the cudnn nodes to the empty cuda_graph provided as input.
- `update_cuda_graph`: updates the populated cuda graph with the necessary data pointers.

See docs and backend documentation for more details.
Enhancements
- Kernel caches for dynamic shapes are now supported in python. Added a sample to showcase usage.
- `graph.deselect_engines(str: )` now has a python equivalent through pybind11.
- `graph.tensor(...)` can now accept `int64_t` scalars directly. (Previously limited to int32_t, float, and fp16 data types.)
- fp8 sdpa attention now allows dropout and padding mask. Requires cudnn 9.5.0 and above.
- More enhancements to pointwise output stride inferencing (for the broadcast operation). For non-unary operands, the broadcasted tensor can now be either at IN_0 or IN_1.
- SDPA backward operation now allows d up to 256 for Hopper. Requires cudnn 9.5.0 and above.
Bug fixes
- Fixed an issue while querying `cudnnGetLastErrorString()` from the backend. The error_t object will now have a more meaningful message.
- Fixed build issues seen with the clang-19 compiler.
- Fixed an issue where it was assumed that a graph with bias in sdpa_bprop would always have a dbias.