[cuDNN][SDPA] Match `query`'s memory layout ordering for `output` in cuDNN SDPA by eqy · Pull Request #138354 · pytorch/pytorch

Conversation

@eqy
Collaborator

@eqy eqy commented Oct 18, 2024

For #138340

We might consider more sophisticated logic here, but the corresponding logic in other backends doesn't seem to do anything fancy for non-BSHD/BHSD cases:

```cpp
res = at::empty({B, M, num_heads, Kv}, query.options());
```

Ended up going with a more general approach that handles more or less arbitrary layouts, sketched below.
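For illustration, a minimal sketch of what "matching the query's layout ordering" means: allocate the output contiguously in the query's physical dimension order, then permute back to the logical order. The helper name `empty_matching_layout` and the Python rendering are hypothetical; the actual implementation lives in the C++ backend.

```python
import torch

def empty_matching_layout(query: torch.Tensor, out_sizes) -> torch.Tensor:
    # Hypothetical helper, not the PR's actual code: order dimensions from
    # largest to smallest stride to recover the query's physical layout...
    order = sorted(range(query.dim()), key=lambda d: query.stride(d), reverse=True)
    physical_sizes = [out_sizes[d] for d in order]
    # ...allocate contiguously in that physical order...
    buf = torch.empty(physical_sizes, device=query.device, dtype=query.dtype)
    # ...and permute back so logical dim d lands where the query has it.
    inverse = [order.index(d) for d in range(query.dim())]
    return buf.permute(inverse)

# e.g. a BHSD-logical query stored BSHD-physically yields a BSHD-physical output:
q = torch.empty(2, 8, 128, 64).transpose(1, 2).contiguous().transpose(1, 2)
out = empty_matching_layout(q, q.shape)
assert out.stride() == q.stride()
```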

cc @csarofeen @ptrblck @xwang233 @msaroufim @drisspg @mikaylagawarecki

@eqy eqy added module: cudnn Related to torch.backends.cudnn, and CuDNN support module: cuda Related to torch.cuda, and CUDA support in general open source topic: bug fixes topic category module: multi-headed-attention labels Oct 18, 2024
@eqy eqy added this to the 2.5.1 milestone Oct 18, 2024
@eqy eqy requested a review from drisspg October 18, 2024 18:54
@eqy eqy requested a review from syed-ahmed as a code owner October 18, 2024 18:54
@pytorch-bot

pytorch-bot bot commented Oct 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138354

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 27360a9 with merge base 2ce2e4d:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@eqy eqy added the topic: not user facing topic category label Oct 18, 2024
@eqy
Collaborator Author

eqy commented Oct 18, 2024

CC @ngimel @Skylion007
When discussing with @drisspg we realized that this fix might cause the grad output stride to no longer match the output's stride in a common cases. Normally this is not an issue but current cuDNN >= v9.5.0 has a bug where the grad output stride is incorrectly assumed to be the same as output stride, and the workaround for this means that if we fix this it may incur an extra .contiguous in the backward until we upgrade to the cuDNN release with this fix. (It's done and should be released soon)

@Skylion007
Collaborator

> CC @ngimel @Skylion007 When discussing with @drisspg we realized that this fix might cause the grad output stride to no longer match the output's stride in common cases. Normally this is not an issue, but current cuDNN >= v9.5.0 has a bug where the grad output stride is incorrectly assumed to be the same as the output stride, and the workaround for this means that if we land this fix it may incur an extra `.contiguous` call in the backward until we upgrade to the cuDNN release that fixes the bug. (That fix is done and should be released soon.)

Not a problem for the backport as we use a much lower version of CUDNN though?

@eqy eqy changed the title [cuDNN][SDPA] Prefer BSHD by default for packed/non-contig in BHSD query [cuDNN][SDPA] Match query's memory layout ordering for output in cuDNN SDPA Oct 18, 2024
@eqy eqy added ciflow/trunk Trigger trunk jobs on your pull request ciflow/inductor ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR labels Oct 19, 2024
drisspg added a commit that referenced this pull request Oct 22, 2024
# Summary
Currently we have a `cudnn_order` that says, on H100 with a new enough cuDNN backend (we ship a 9.1 version in OSS), to try to run cuDNN attention first. We have already encountered a few bugs with the release of 2.5:

1. #138529
2. huggingface/diffusers#9704
3. #138354

In light of the above we are going to make the cuDNN backend opt-in by default.

This can be done easily with the context manager for choosing backends, i.e.:
```python
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```

This PR puts the cuDNN backend at the lowest precedence in the backend list, meaning that the Math backend will always be chosen ahead of it unless disabled (which is what the context manager does).
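Besides the context manager, recent PyTorch also exposes global toggles for the same choice (a sketch; the selection behavior around these calls is as described above):

```python
import torch

# Globally disable (or re-enable) the cuDNN SDPA backend.
torch.backends.cuda.enable_cudnn_sdp(False)
print(torch.backends.cuda.cudnn_sdp_enabled())  # False
```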


Cc atalman

cc mikaylagawarecki

[ghstack-poisoned]
pytorchbot pushed a commit that referenced this pull request Oct 22, 2024
Pull Request resolved: #138522
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/malfet

(cherry picked from commit 9a9a0ab)
@pytorchmergebot
Collaborator

Successfully rebased `defaultbshd` onto `refs/remotes/origin/viable/strict`, please pull locally before adding more changes (for example, via `git checkout defaultbshd && git pull --rebase`)

@ngimel
Collaborator

ngimel commented Nov 4, 2024

> this fix might cause the grad output stride to no longer match the output's stride in common cases

Out of curiosity, what would be a common case where gradOutput stride doesn't match output? This happens almost always today, because gradOutput would typically be permuted, and output is contiguous.
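For context, a small repro sketch of why gradOutput is typically permuted: the usual transformer epilogue transposes the SDPA output and merges heads before the output projection, so the gradient arriving at SDPA's backward is a transposed view (shapes below are arbitrary).

```python
import torch

B, H, S, D = 2, 8, 16, 8
out = torch.randn(B, H, S, D, requires_grad=True)  # stand-in for SDPA output (BHSD)
out.register_hook(lambda g: print(g.stride(), g.is_contiguous()))
# Typical epilogue: (B, H, S, D) -> (B, S, H*D) for the output projection.
loss = out.transpose(1, 2).reshape(B, S, H * D).sum()
loss.backward()  # hook prints non-contiguous, BSHD-ordered strides
```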

@eqy
Collaborator Author

eqy commented Nov 4, 2024

The potential copy due to the gradOutput vs. output stride issue should be resolved once 9.5.1 is shipped with the wheels, and we can gate that behind a cuDNN version check
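A sketch of how such a gate could look from the Python side (hypothetical; the actual check would live next to the C++ workaround, and the 9.5.1 threshold comes from the discussion above):

```python
import torch

# torch.backends.cudnn.version() encodes e.g. cuDNN 9.5.1 as 90501,
# or returns None when cuDNN is unavailable.
cudnn_version = torch.backends.cudnn.version()
needs_grad_out_restride = cudnn_version is not None and cudnn_version < 90501
```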

@eqy
Collaborator Author

eqy commented Nov 4, 2024

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@drisspg
Contributor

drisspg commented Nov 5, 2024

@StrongerXi how?

@StrongerXi
Contributor

> @StrongerXi how?

I have no clue; I saw it on my PR, which was odd, and then I saw that it's also failing on main (the above link).

@Skylion007
Collaborator

Skylion007 commented Nov 20, 2024

> The potential copy due to the gradOutput vs. output stride issue should be resolved once 9.5.1 is shipped with the wheels, and we can gate that behind a cuDNN version check

Okay, the cuDNN upgrade is available in the CUDA 12.6 binaries. Feel free to add the gate in a new PR.

@drisspg drisspg mentioned this pull request Nov 20, 2024
pytorchmergebot pushed a commit that referenced this pull request Mar 13, 2025
Update `cuDNN SDPA` meta registration to match the memory layout behavior in #138354

Pull Request resolved: #148921
Approved by: https://github.com/drisspg, https://github.com/jbschlosser