[ROCm] logsumexp on ROCm needs scaling back to natural base. #156903

xinyazhang · 2025-06-26T00:01:38Z

This is a temporary solution that makes context parallelism work until the logsumexp behavior changes land in AOTriton.

After discussion, we are not going to release AOTriton 0.10.1 to fix this, for two reasons:

* Even if the interface is not changed, changing the behavior of the returned logsumexp tensor should still be considered an ABI break. Such changes do not fall into the "ABI compatible" category and should be postponed to the next release.
* AOTriton 0.11 is scheduled to be released before the end of July, which is less than five weeks away.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

pytorch-bot · 2025-06-26T00:01:42Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156903

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

linux.aws.h100.8 instance is down, potentially longer queue on linux.aws.h100

✅ No Failures

As of commit cfa0de7 with merge base 9894d43:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

xinyazhang · 2025-06-26T00:04:59Z

@pytorchbot label "topic: not user facing"

functionstackx · 2025-06-26T00:43:09Z

thanks for the fix @xinyazhang

Can a unit test be added for the previous regression in the future?

The original repro script from #156012 would probably make a good unit test, as it doesn't take much time to run.

xinyazhang · 2025-06-26T16:28:18Z

> Can a unit test be added for the previous regression in the future?

If you mean aligning the logsumexp tensor's behavior with the CUTLASS backend, that will be part of the AOTriton 0.11 integration PR.

We need to test the behavior change in AOTriton's own UT first.

xinyazhang · 2025-06-26T20:29:04Z

cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu, unstable) is unstable ATM

functionstackx · 2025-06-26T20:39:24Z

> > Can a unit test be added for the previous regression in the future?
>
> If you mean aligning the logsumexp tensor's behavior with the CUTLASS backend, that will be part of the AOTriton 0.11 integration PR.

I think I was pointing more at a general unit test that checks that context-parallel SDPA has the same numerics as single-GPU SDPA, for both NVIDIA and AMD.
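To illustrate why such a numerics check depends on the base of the logsumexp tensor, here is a minimal sketch in plain Python. It is not PyTorch's context-parallel implementation: `attend` and `merge` are hypothetical helpers, and a single query with scalar values stands in for real tensors. It assumes partial attention results are combined by weighting each with `exp(lse_part - lse_total)`, which is only correct when the logsumexp values are natural-base.

```python
import math


def attend(scores, values):
    """Softmax-weighted average of `values` plus the natural-base logsumexp of `scores`."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = sum(w * v for w, v in zip(weights, values)) / total
    return out, m + math.log(total)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial results; assumes both lse values are natural-base."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    out = out_a * math.exp(lse_a - lse) + out_b * math.exp(lse_b - lse)
    return out, lse


scores = [0.3, -1.2, 2.0, 0.7]
values = [1.0, 2.0, 3.0, 4.0]

# Reference: attention over the whole sequence on a "single device".
reference, _ = attend(scores, values)

# Two shards, merged with natural-base logsumexp values.
out_a, lse_a = attend(scores[:2], values[:2])
out_b, lse_b = attend(scores[2:], values[2:])
merged, _ = merge(out_a, lse_a, out_b, lse_b)

# The same merge fed base-2 logsumexp values (natural value times log2(e)).
LOG2_E = 1.0 / math.log(2.0)
wrong, _ = merge(out_a, lse_a * LOG2_E, out_b, lse_b * LOG2_E)

print(f"reference={reference:.6f} merged={merged:.6f} wrong={wrong:.6f}")
```

With natural-base values the merged output matches the reference, while feeding base-2 values into the same merge shifts the weighting between shards. A real unit test would compare sharded and unsharded SDPA outputs in the same spirit.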

XilunWu

Overall looks good to me!

XilunWu · 2025-07-02T21:21:01Z

torch/distributed/tensor/experimental/_attention.py

+            need_scaling = True
+            # Note: it is possible that CK is seleted but not compiled in the binary.
+            if _is_ck_supported and _preferred_rocm_fa_library() == _CK_BACKEND:
+                # Unsure about CK's behavior, keep logsumexp untouched
+                need_scaling = False
+            if need_scaling:
+                logsumexp *= 0.6931471805599453


Is this equivalent to:

if _is_ck_supported and _preferred_rocm_fa_library() == _CK_BACKEND: logsumexp *= 0.6931471805599453

xinyazhang · 2025-07-02

if not (_is_ck_supported and _preferred_rocm_fa_library() == _CK_BACKEND): logsumexp *= 0.6931471805599453

This is the equivalent form.

I used the more verbose form to make the logic easier to read.
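The equivalence discussed in this review thread can be checked exhaustively. The sketch below is illustrative only: the function names are hypothetical, and two plain booleans stand in for `_is_ck_supported` and `_preferred_rocm_fa_library() == _CK_BACKEND`.

```python
from itertools import product


def verbose_form(is_ck_supported, prefers_ck):
    """Mirrors the structure of the code in the diff: scale unless CK is usable and selected."""
    need_scaling = True
    if is_ck_supported and prefers_ck:
        need_scaling = False
    return need_scaling


def compact_form(is_ck_supported, prefers_ck):
    """One-line form with the condition negated."""
    return not (is_ck_supported and prefers_ck)


def unnegated_form(is_ck_supported, prefers_ck):
    """One-line form without the negation, as first proposed in the review."""
    return is_ck_supported and prefers_ck


cases = list(product([False, True], repeat=2))
verbose_matches_compact = all(verbose_form(*c) == compact_form(*c) for c in cases)
verbose_matches_unnegated = all(verbose_form(*c) == unnegated_form(*c) for c in cases)

print(verbose_matches_compact, verbose_matches_unnegated)
```

Only the negated one-liner agrees with the verbose form on all four input combinations; the un-negated condition would scale exactly when the diff does not.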

# Overview

Previously we were using a base-2 logsumexp (L) tensor between the forward and backward passes to eliminate unnecessary conversions. However, this causes quite a few problems:

* PyTorch's Context Parallelism system requires a natural-base (base-e) L tensor
  + See pytorch/pytorch#156012 for the bug report and pytorch/pytorch#156903 for a temporary solution.
* The AITER ASM backward kernel uses a natural-base L tensor

# Major Changes

* [kernel] Return a natural-base L tensor in the forward kernel, and translate it to base 2 in the backward kernel when loading
* [test] Add `test_logsumexp_scaling` to confirm the scaling is correct.
* [build] Set `TRITON_STORE_BINARY_ONLY=1` to avoid caching intermediate files. This massively reduces the size of the `triton-cache` directory
* [compiler] Bump to the latest Triton compiler to avoid the updated kernel causing a GPU segmentation fault in UT `Split-False-l1-dtype2-0.5-CausalOff-64-64-hdim160-5-3` on MI300X
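The translation between natural-base and base-2 L values comes down to a factor of ln 2, which is the constant `0.6931471805599453` used in the temporary PyTorch fix. The sketch below assumes the base-2 L value is defined as log2 of the same sum of exponentials. That definition is an assumption made for illustration, not a statement about AOTriton's kernels, and the helper names are hypothetical.

```python
import math

LN2 = math.log(2.0)  # natural log of 2, approximately 0.6931471805599453


def logsumexp_natural(scores):
    """Natural-base logsumexp: ln(sum(exp(s)))."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def logsumexp_base2(scores):
    """Base-2 variant: log2(sum(2 ** (s / ln 2))), which equals log2(sum(exp(s)))."""
    scaled = [s / LN2 for s in scores]
    m = max(scaled)
    return m + math.log2(sum(2.0 ** (t - m) for t in scaled))


scores = [0.3, -1.2, 2.0, 0.7]
l_e = logsumexp_natural(scores)
l_2 = logsumexp_base2(scores)

# Multiplying by ln 2 converts base 2 to natural base; dividing converts back.
print(l_2 * LN2, l_e)
```

Under this assumption, scaling a base-2 L tensor by ln 2 recovers the natural-base value that consumers such as context parallelism expect.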

jithunnair-amd · 2025-07-14T02:48:59Z

@pytorchbot merge -f "CI failures unrelated"

pytorchmergebot · 2025-07-14T02:50:26Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

jeffdaily · 2025-07-22T15:28:32Z

@pytorchbot merge

pytorch-bot · 2025-07-22T15:28:47Z

To add the ciflow label ciflow/rocm please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

pytorch-bot · 2025-07-22T15:28:47Z

To add the ciflow label ciflow/inductor please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

pytorchmergebot · 2025-07-22T15:30:40Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here


Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.11b:

* Invoke AITER Assembly kernels on gfx942/gfx950 when inputs meet requirements
  - AITER ASM kernels deliver over 500TFLOPS training performance. See [AOTriton 0.11b Release Page](https://github.com/ROCm/aotriton/releases/tag/0.11b) for more details.
* Now returns natural based `logsumexp` tensor, matching CUDA's behavior
  - PR #156903 is reverted in this PR as well since it is not needed anymore.
* Enables `CausalVariant.LOWER_RIGHT`

The build system changes drastically along with new packaging scheme of AOTriton 0.11

* AOTriton 0.11 packs GPU images separately from AOTriton runtime
* `aotriton.cmake` now selectively downloads image packs according to `PYTORCH_ROCM_ARCH`
* `aotriton.cmake` now only use pre-compiled runtime library that exactly matches the ROCM in the build environment. For PyTorch builds with ROCm versions not listed in the file, the build process will build AOTriton runtime without GPU images from source
  - This avoids any further ABI breaks like ROCM 6.4 -> 7.0
  - recursive git clone is disabled since building AOTriton runtime does not require submodules.

Bug fixes:

* Fix a kernel bug introduced when implementing SWA

Known Problems:

* gfx1100 target (Radeon RX 7000 Series) is moved back to experimental status due to accuracy issues. Triton compiler fixes are needed to restore the support status.
* Enabling TF32 tests affects accuracy for later non-TF32 tests on ROCM 7.0. This issue is under investigation.

Pull Request resolved: #161754
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily


pytorch-bot bot added the module: rocm and oncall: distributed labels Jun 26, 2025

pytorchbot added the open source label Jun 26, 2025

xinyazhang requested review from XilunWu and fegin and removed request for fegin June 26, 2025 00:04

pytorch-bot bot added the topic: not user facing topic category label Jun 26, 2025

xinyazhang requested review from jeffdaily, jithunnair-amd and pruthvistony June 26, 2025 16:30

xinyazhang marked this pull request as ready for review June 26, 2025 20:28

jeffdaily previously approved these changes Jun 30, 2025


jeffdaily changed the title ~~[ROCM] logsumexp on ROCM needs scaling back to natural base.~~ [ROCm] logsumexp on ROCm needs scaling back to natural base. Jun 30, 2025

pytorch-bot bot added the ciflow/inductor and ciflow/rocm labels Jun 30, 2025

xinyazhang mentioned this pull request Jul 1, 2025

Return natural based logsumexp and set TRITON_STORE_BINARY_ONLY=1 ROCm/aotriton#108

Merged

XilunWu previously approved these changes Jul 2, 2025


pytorchmergebot added the merging label Jul 14, 2025

pytorchmergebot added the Merged label Jul 14, 2025

pytorchmergebot closed this in 1ea9cde Jul 14, 2025

xinyazhang force-pushed the xinyazhang/issue-156012 branch from 4a2fc29 to 21e735c Compare July 21, 2025 17:00

pytorch-bot bot removed the ciflow/rocm and ciflow/inductor-rocm labels Jul 21, 2025

xinyazhang requested a review from jeffdaily July 21, 2025 17:00

xinyazhang and others added 3 commits July 21, 2025 17:46

fix lint

930ffae

Fix lint

e929a1c

simplify

cfa0de7

jeffdaily approved these changes Jul 21, 2025


pytorch-bot bot added the ciflow/trunk, ciflow/inductor, and ciflow/rocm labels Jul 22, 2025

pytorch-bot bot removed the ciflow/rocm and ciflow/inductor labels Jul 22, 2025

pytorchmergebot added the merging label Jul 22, 2025

pytorchmergebot closed this in 823e223 Jul 22, 2025

pytorchmergebot removed the merging label Jul 22, 2025

xinyazhang added a commit to ROCm/pytorch that referenced this pull request Aug 28, 2025

Revert "[ROCm] logsumexp on ROCm needs scaling back to natural base. (p…

67139d5

…ytorch#156903)" This reverts commit 823e223.

xinyazhang added a commit to ROCm/pytorch that referenced this pull request Aug 29, 2025

Revert "[ROCm] logsumexp on ROCm needs scaling back to natural base. (p…

0b7d856

…ytorch#156903)" This reverts commit 823e223.

xinyazhang mentioned this pull request Aug 29, 2025

[ROCm] Bump AOTriton to 0.11b #161754

Closed

vexilligera mentioned this pull request Sep 26, 2025

ROCm context parallel backward lse not scaled #163958

Open


11 participants
