[cuDNN] cuDNN SDPA (Flash Attention) Backward by eqy · Pull Request #122510 · pytorch/pytorch · GitHub

Conversation

eqy
Collaborator

@eqy eqy commented Mar 22, 2024

#113713
Currently passing trivial smoke tests, but I just totally pattern-matched bits and pieces of the autograd defs.

Will also collect benchmark data.

CC @drisspg

cc @csarofeen @ptrblck @xwang233

@eqy eqy added module: cudnn, module: cuda, open source, ciflow/trunk, topic: not user facing, and module: multi-headed-attention labels Mar 22, 2024
@eqy eqy requested review from albanD and soulitzer as code owners March 22, 2024 18:25
@pytorch-bot

pytorch-bot bot commented Mar 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122510

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit b53102b with merge base 91d565d:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@eqy eqy changed the title from [WIP][cuDNN] cuDNN SDPA Backward to [WIP][cuDNN] cuDNN SDPA (Flash Attention) Backward Mar 22, 2024
auto [mha_graph, Q, K, V, attn_scale, Seed, Offset, O, Do, Stats, Dq, Dk, Dv] = graph_and_tensors_backward_values;
std::unordered_map<std::shared_ptr<fe::graph::Tensor_attributes>, void*> variant_pack = {
// inputs
{Q, q.data_ptr()},
Collaborator

shouldn't some of these be const_data_ptr?

Collaborator Author

Checked this again and it currently seems to be a limitation of cuDNN's variant pack, which only accepts void * pointers.
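For illustration, a minimal self-contained sketch (the TensorAttributes struct is a stand-in for fe::graph::Tensor_attributes, not the real cudnn_frontend type) of why a const_data_ptr()-style pointer would need a const_cast at this boundary: the variant pack's value type is a plain void *.

#include <memory>
#include <unordered_map>

// Stand-in for fe::graph::Tensor_attributes; only the shape of the map matters here.
struct TensorAttributes {};

int main() {
  auto Q = std::make_shared<TensorAttributes>();
  const float q_storage[4] = {0.f, 1.f, 2.f, 3.f};
  const void* q_const_ptr = q_storage;  // what a const_data_ptr() would hand us

  // The variant pack maps tensor attributes to mutable void*, so the const has
  // to be cast away at the API boundary even if cuDNN only reads the inputs.
  std::unordered_map<std::shared_ptr<TensorAttributes>, void*> variant_pack;
  variant_pack[Q] = const_cast<void*>(q_const_ptr);
  return 0;
}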

AT_CUDNN_FRONTEND_CHECK(mha_graph->check_support(handle));
AT_CUDNN_FRONTEND_CHECK(mha_graph->build_plans(handle));
return std::make_tuple(
mha_graph, Q, K, V, attn_scale, Seed, Offset, O, DO, STATS, DQ, DK, DV);
Collaborator

A lot of these should be std::move()'d
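For context, a toy sketch (Graph and TensorAttr below are placeholders, not the actual MHA graph types) of what moving the shared_ptrs into the returned tuple looks like; each avoided copy saves an atomic refcount increment/decrement.

#include <memory>
#include <tuple>
#include <utility>

struct Graph {};       // placeholder for fe::graph::Graph
struct TensorAttr {};  // placeholder for fe::graph::Tensor_attributes

std::tuple<std::shared_ptr<Graph>, std::shared_ptr<TensorAttr>> build_graph() {
  auto mha_graph = std::make_shared<Graph>();
  auto Q = std::make_shared<TensorAttr>();
  // std::move transfers ownership into the tuple instead of copying each shared_ptr.
  return std::make_tuple(std::move(mha_graph), std::move(Q));
}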

tags: nondeterministic_seeded

- func: _scaled_dot_product_cudnn_attention(Tensor query, Tensor key, Tensor value, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor output, Tensor logsumexp, Tensor philox_seed, Tensor philox_offset)
- func: _scaled_dot_product_cudnn_attention(Tensor query, Tensor key, Tensor value, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor output, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, SymInt max_q, SymInt max_k, Tensor philox_seed, Tensor philox_offset, Tensor debug_attn_mask)
Collaborator

Why is this a special function? @jbschlosser and @andrewor14 have spent literal months removing these cudnn variants for conv and batchnorm. I really don't think we should be doing the same again with sdpa...

Collaborator Author

I think in this case it's because cuDNN is just a backend, in the same way that flash, mem-efficient, and math are separate backends: the rules for dispatching between them aren't expected to overlap (e.g., cuDNN SDPA isn't expected to exactly cover the support matrix of flash or mem-efficient attention).

Contributor

@albanD and I had a long discussion about this on Friday. Historically there has been a similar pattern of aten op bloat followed by the need to unify a large number of backends behind composite explicit ops. The two exemplars of this are torch's cuDNN ops and batch norm, for which consolidation is still ongoing by @andrewor14.

The conclusion was that for now we are okay with paying the backend tech debt (added ops and added coverage surface) for the sake of velocity.

@eqy eqy force-pushed the cudnn_sdp_backward branch from 68b1513 to d382f3d April 5, 2024 00:11
@eqy
Collaborator Author

eqy commented Apr 8, 2024

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cudnn_sdp_backward onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cudnn_sdp_backward && git pull --rebase)

const Tensor& dropoutoffset,
cudnnHandle_t& handle,
MHAParams& params) {
auto dtype = fe::DataType_t::HALF;
Contributor

does this support fp32?

Collaborator Author

According to these docs: https://docs.nvidia.com/deeplearning/cudnn/latest/developer/graph-api.html#fused-flash-attention-bprop not at the moment, but I'll ask the cuDNN team about the roadmap.
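A hedged sketch of what the resulting dtype gating could look like; the enums below are stand-ins for at::ScalarType and fe::DataType_t so the snippet is self-contained, and it assumes only fp16/bf16 I/O is accepted by the fused graph for now.

#include <stdexcept>

// Stand-ins for at::ScalarType and fe::DataType_t.
enum class ScalarType { Half, BFloat16, Float };
enum class IoDataType { HALF, BFLOAT16 };

IoDataType to_cudnn_io_dtype(ScalarType t) {
  switch (t) {
    case ScalarType::Half:
      return IoDataType::HALF;
    case ScalarType::BFloat16:
      return IoDataType::BFLOAT16;
    default:
      // Per the cuDNN docs linked above, fp32 I/O isn't supported by the fused
      // flash-attention backward graph today, so reject it up front.
      throw std::runtime_error("cuDNN SDPA backward: unsupported dtype");
  }
}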

}
auto mha_graph = std::make_shared<fe::graph::Graph>();
mha_graph->set_io_data_type(dtype)
.set_intermediate_data_type(fe::DataType_t::FLOAT)
Contributor

Nit: maybe leave a comment at the top with some of the choices made about the graph construction
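For illustration, the kind of preamble comment that could go at the top of the builder; the bullet points are assumptions about the design (inferred from the surrounding snippets), not statements taken from the PR, and the code lines simply restate the graph setup shown above.

// Graph construction choices (sketch):
// - I/O tensors (Q, K, V, O, dO, dQ, dK, dV) use the input dtype (fp16 here),
//   while intermediate values are kept in fp32 for accuracy.
// - The philox seed/offset from the forward pass are fed back in so dropout is
//   replayed identically in the backward graph.
// - Graphs are keyed by MHAParams so repeated calls with the same problem
//   configuration can reuse a built plan.
auto mha_graph = std::make_shared<fe::graph::Graph>();
mha_graph->set_io_data_type(dtype)
    .set_intermediate_data_type(fe::DataType_t::FLOAT);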

std::vector<int64_t>(v.strides().begin(), v.strides().end())));
auto attn_scale =
mha_graph->tensor(fe::graph::Tensor_attributes()
.set_name("attn_scale")
Contributor

does this support arbitrary attention bias tensors?

Collaborator Author

I think newer versions (of cuDNN) do, and I will add that in a follow-up. Currently we might be limited by existing builds that use e.g., cuDNN 8.9.2.
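If that follow-up lands, the gating would presumably be a runtime cuDNN version check along these lines (cudnn_sdpa_supports_attn_bias is a hypothetical helper and the 8903 threshold is a placeholder, not a confirmed minimum version for attention-bias support):

#include <cudnn.h>

// For cuDNN 8, cudnnGetVersion() encodes 8.9.2 as 8902, so anything at or
// above the placeholder threshold below (and all of cuDNN 9) would pass.
bool cudnn_sdpa_supports_attn_bias() {
  return cudnnGetVersion() >= 8903;
}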

std::vector<int64_t>(o.strides().begin(), o.strides().end())));
auto STATS = mha_graph->tensor(
fe::graph::Tensor_attributes()
.set_name("stats")
Contributor

nit: The softmaxstats is the logsumexp of the attention scores, right? If softmaxstats is cuDNN's nomenclature that's fine, but you may want to align the naming with the other kernels.
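For reference, assuming cuDNN's softmax stats follows the usual flash-attention convention, it is the row-wise logsumexp of the scaled attention scores, which the backward pass uses to re-materialize the softmax without storing the full score matrix:

stats_i = log(sum_j exp(s * q_i . k_j)) = logsumexp_j(s * q_i . k_j)

where s is the attn_scale applied to the query-key dot products.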

.set_stride(
std::vector<int64_t>(dO.strides().begin(), dO.strides().end())));
auto sdpa_backward_options = fe::graph::SDPA_backward_attributes()
.set_name("flash_attention_backward")
Contributor

Maybe use a different name, to avoid confusion with the existing flash impl?

auto workspace_size = mha_graph->get_workspace_size();
auto workspace_ptr =
c10::cuda::CUDACachingAllocator::get()->allocate(workspace_size);
TORCH_INTERNAL_ASSERT(!workspace_size || workspace_ptr.get());
Contributor

FYI, this has bitten some people: TORCH_INTERNAL_ASSERT is only active when building with debug. This first assert probably makes sense as internal, but do you think the execution of the graph should be a normal TORCH_CHECK?
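A sketch of the suggestion (the helper and message text are illustrative, not from the PR): TORCH_CHECK fires in release builds too, which fits conditions driven by runtime state such as allocation success or graph execution status.

#include <c10/util/Exception.h>
#include <cstddef>

// Illustrative only: TORCH_CHECK raises c10::Error with the given message in
// all build types, unlike debug-only assertions.
void check_workspace(std::size_t workspace_size, void* workspace_ptr) {
  TORCH_CHECK(
      workspace_size == 0 || workspace_ptr != nullptr,
      "cuDNN SDPA backward: failed to allocate ",
      workspace_size,
      " bytes of workspace");
}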

dv/*Tensor& dV*/,
philox_seed/*Tensor& dropoutseed*/,
philox_offset/*Tensor& dropoutoffset*/);
return std::make_tuple(dq, dk, dv);
Contributor

std::move?

Contributor

@drisspg drisspg left a comment

Looks good, mostly just small nits, and if you have any perf numbers that would be great!

@eqy
Collaborator Author

eqy commented Apr 17, 2024

Let me see if I can just rerun my existing forward script on forward + backward now...

@eqy
Collaborator Author

eqy commented Apr 18, 2024

@pytorchmergebot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cudnn_sdp_backward onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cudnn_sdp_backward && git pull --rebase)

@eqy
Collaborator Author

eqy commented Apr 27, 2024

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@kit1980 kit1980 removed the Reverted label Apr 29, 2024
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
Pull Request resolved: pytorch#122510
Approved by: https://github.com/drisspg
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
Pull Request resolved: #122510
Approved by: https://github.com/drisspg

Labels

ciflow/periodic, ciflow/trunk, Merged, module: cuda, module: cudnn, open source, topic: not user facing


8 participants