[ROCm] Bump AOTriton to 0.11b by xinyazhang · Pull Request #161754 · pytorch/pytorch · GitHub

Conversation

@xinyazhang
Collaborator

@xinyazhang xinyazhang commented Aug 29, 2025

Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.11b:

  • Invoke AITER assembly kernels on gfx942/gfx950 when inputs meet requirements
    • AITER ASM kernels deliver over 500 TFLOPS training performance. See the
      AOTriton 0.11b Release Page for more details.
  • Now returns a natural-log-based logsumexp tensor, matching CUDA's behavior
    • PR #156903 is reverted in this PR as well, since it is no longer needed.
  • Enables CausalVariant.LOWER_RIGHT

The build system changes drastically along with the new packaging scheme of
AOTriton 0.11:

  • AOTriton 0.11 packs GPU images separately from the AOTriton runtime
  • aotriton.cmake now selectively downloads image packs according to
    PYTORCH_ROCM_ARCH
  • aotriton.cmake now only uses a pre-compiled runtime library that exactly
    matches the ROCm version in the build environment. For PyTorch builds with
    ROCm versions not listed in the file, the build process builds the AOTriton
    runtime from source, without GPU images
    • This avoids any further ABI breaks like ROCm 6.4 -> 7.0
    • Recursive git clone is disabled, since building the AOTriton runtime does
      not require submodules
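
To illustrate the arch-selective download described above, here is a minimal
CMake sketch of the idea. The variable handling mirrors PYTORCH_ROCM_ARCH, but
the pack names and URL layout are hypothetical, not AOTriton's actual scheme:

```cmake
# Hypothetical sketch of per-arch image-pack selection; the real
# aotriton.cmake logic, pack names, and URL layout differ.
set(PYTORCH_ROCM_ARCH "gfx90a;gfx942" CACHE STRING "Requested ROCm architectures")

foreach(__arch IN LISTS PYTORCH_ROCM_ARCH)
  # One image pack per architecture, fetched only when that arch is requested.
  set(__pack_url "https://example.com/aotriton/0.11b/aotriton-images-${__arch}.tar.gz")
  message(STATUS "Would download image pack for ${__arch}: ${__pack_url}")
  # A real implementation would call file(DOWNLOAD ...) and verify a checksum here.
endforeach()
```

Downloading only the requested packs keeps build downloads proportional to the
number of target architectures instead of shipping every GPU image to every build.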

Bug fixes:

  • Fix a kernel bug introduced when implementing SWA (sliding-window attention)

Known Problems:

  • gfx1100 target (Radeon RX 7000 Series) is moved back to experimental status
    due to accuracy issues. Triton compiler fixes are needed to restore the
    support status.
  • Enabling TF32 tests affects accuracy for later non-TF32 tests on ROCM 7.0.
    This issue is under investigation.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

@pytorch-bot

pytorch-bot bot commented Aug 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161754

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 008f831 with merge base 403a3a3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: rocm AMD GPU support for Pytorch oncall: distributed Add this issue/PR to distributed oncall triage queue labels Aug 29, 2025
@pytorch-bot pytorch-bot bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/rocm Trigger "default" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 labels Sep 3, 2025
@jithunnair-amd
Collaborator

jithunnair-amd commented Sep 3, 2025

rocm-mi300 workflow passed (except for one unrelated timeout): https://hud.pytorch.org/pytorch/pytorch/pull/161754?sha=0add8c2ad827f5562b94b22a1e17aa3d5092951d#rocm-mi300

rocm workflow passed: https://hud.pytorch.org/pytorch/pytorch/pull/161754?sha=0add8c2ad827f5562b94b22a1e17aa3d5092951d#rocm

trunk passed: https://hud.pytorch.org/pytorch/pytorch/pull/161754?sha=0add8c2ad827f5562b94b22a1e17aa3d5092951d#trunk

Latest commit 008f831 merely moves the message, so test results from the previous commit should be good enough. Merging now to allow sufficient time for internal builds to adjust to the aotriton 0.11b changes.

@jithunnair-amd
Collaborator

@jeffdaily please approve and merge

@jeffdaily
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 3, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

CMAKE_CACHE_ARGS
-DAOTRITON_TARGET_ARCH:STRING=${PYTORCH_ROCM_ARCH}
-DCMAKE_INSTALL_PREFIX:FILEPATH=${__AOTRITON_INSTALL_DIR}
CMAKE_ARGS
Contributor

This aotriton_build_from_source function is at least going to need -DHIP_PLATFORM=amd set as well, to avoid errors on Linux and Windows like https://github.com/ROCm/TheRock/actions/runs/17467234227/job/49606011053#step:11:66811

 [1902/7918] Performing configure step for 'aotriton_runtime'
CMake Error at /opt/python/cp313-cp313/lib/python3.13/site-packages/_rocm_sdk_devel/lib/cmake/hip/hip-config.cmake:144 (message):
  Unexpected HIP_PLATFORM:
Call Stack (most recent call first):
  CMakeLists.txt:64 (find_package)

The source for that hip-config.cmake is https://github.com/ROCm/rocm-systems/blob/2202dcfe806766804648a9f38de35f555351e7fa/projects/clr/hipamd/hip-config.cmake.in#L111-L121

if(HIP_PLATFORM STREQUAL "amd")
  set(HIP_RUNTIME "rocclr")
  set(HIP_COMPILER "clang")
  include( "${hip_LIB_INSTALL_DIR}/cmake/hip/hip-config-amd.cmake" )
elseif(HIP_PLATFORM STREQUAL "nvidia")
  set(HIP_RUNTIME "cuda")
  set(HIP_COMPILER "nvcc")
  include( "${hip_LIB_INSTALL_DIR}/cmake/hip/hip-config-nvidia.cmake" )
else()
  message(FATAL_ERROR "Unexpected HIP_PLATFORM: " ${HIP_PLATFORM})
endif()
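
A minimal sketch of the workaround suggested above, assuming the runtime is
configured through an ExternalProject-style call like the one in this diff.
The surrounding arguments are abbreviated; only the HIP_PLATFORM cache entry
is new, and this is a sketch rather than the actual aotriton.cmake change:

```cmake
# Sketch only: pin HIP_PLATFORM explicitly so hip-config.cmake takes its
# "amd" branch even when hipconfig-based auto-detection fails.
# Other ExternalProject_Add arguments (URL, build/install commands) omitted.
ExternalProject_Add(aotriton_runtime
  CMAKE_CACHE_ARGS
    -DAOTRITON_TARGET_ARCH:STRING=${PYTORCH_ROCM_ARCH}
    -DCMAKE_INSTALL_PREFIX:FILEPATH=${__AOTRITON_INSTALL_DIR}
    -DHIP_PLATFORM:STRING=amd  # normally auto-detected; set here as a workaround
)
```

Pinning the value sidesteps the FATAL_ERROR branch shown above, at the cost of
hard-coding the AMD platform (which is what the PyTorch ROCm build targets anyway).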

I'm working on rebasing TheRock's downstream patch https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/patches/pytorch/main/pytorch/hipified/0002-Support-FLASH_ATTENTION-MEM_EFF_ATTENTION-via.-aotri.patch . I'll try to split it into baseline Linux fixes like that one and the deeper changes needed for Windows support.

Collaborator Author

@xinyazhang xinyazhang Sep 4, 2025

The real problem is

execute_process(COMMAND ${hip_HIPCONFIG_EXECUTABLE} --platform
      OUTPUT_VARIABLE HIP_PLATFORM
      OUTPUT_STRIP_TRAILING_WHITESPACE)

does not work as expected on Windows. Normally, setting HIP_PLATFORM is not necessary.

Collaborator Author

Normally setting HIP_PLATFORM is not necessary

In addition, it's not due to a missing GPU installation. I checked with the 7.0RC2 Docker image; CPU-only instances correctly report the platform as amd.

Contributor

Huh... we're seeing issues on both Linux and Windows there. I'll debug a bit. I see what you're saying - the logic in there should infer the HIP_PLATFORM CMake variable via hipconfig --platform.

Collaborator Author

The interesting part is that HIP_PLATFORM becomes mandatory even on Linux. In my experience this variable is always auto-configured.

Contributor

🤦

if("@HIP_INSTALLS_HIPCC@")

is getting templated as this in TheRock's builds:

if("OFF")
  if (WIN32)
    set_and_check(hip_HIPCC_EXECUTABLE "${hip_BIN_INSTALL_DIR}/hipcc.exe")
    set_and_check(hip_HIPCONFIG_EXECUTABLE "${hip_BIN_INSTALL_DIR}/hipconfig.exe")
  else()
    set_and_check(hip_HIPCC_EXECUTABLE "${hip_BIN_INSTALL_DIR}/hipcc")
    set_and_check(hip_HIPCONFIG_EXECUTABLE "${hip_BIN_INSTALL_DIR}/hipconfig")
  endif()
endif()
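
For context, CMake's if() treats the literal string OFF as a false constant
even when quoted, so once @HIP_INSTALLS_HIPCC@ is templated to OFF the whole
guarded block is skipped and hip_HIPCONFIG_EXECUTABLE is never defined, which
is why the later hipconfig --platform detection fails. A standalone sketch
(runnable with cmake -P):

```cmake
# if("OFF") is false: quoted "OFF" is still a CMake false constant,
# exactly what happens when if("@HIP_INSTALLS_HIPCC@") templates to if("OFF").
if("OFF")
  set(hip_HIPCONFIG_EXECUTABLE "/opt/rocm/bin/hipconfig")  # never reached
endif()

if(NOT DEFINED hip_HIPCONFIG_EXECUTABLE)
  message(STATUS "hip_HIPCONFIG_EXECUTABLE undefined; platform detection would fail")
endif()
```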

Contributor

Filed ROCm/TheRock#1402 to solve that. What you have here (omitting -DHIP_PLATFORM=amd) should be fine then.

We have other issues downstream to triage though: ROCm/TheRock#1401 (comment)

Contributor

We're investigating that and other failures on Linux when building using TheRock at ROCm/TheRock#1408

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.11b:

* Invoke AITER Assembly kernels on gfx942/gfx950 when inputs meet requirements
  - AITER ASM kernels deliver over 500TFLOPS training performance. See
    [AOTriton 0.11b Release Page](https://github.com/ROCm/aotriton/releases/tag/0.11b) for more
    details.
* Now returns natural based `logsumexp` tensor, matching CUDA's behavior
  - PR pytorch#156903 is reverted in this PR as well since it is not needed anymore.
* Enables `CausalVariant.LOWER_RIGHT`

The build system changes drastically along with new packaging scheme of
AOTriton 0.11

* AOTriton 0.11 packs GPU images separately from AOTriton runtime
* `aotriton.cmake` now selectively downloads image packs according to
  `PYTORCH_ROCM_ARCH`
* `aotriton.cmake` now only uses a pre-compiled runtime library that exactly
  matches the ROCm version in the build environment. For PyTorch builds with
  ROCm versions not listed in the file, the build process builds the AOTriton
  runtime from source, without GPU images
  - This avoids any further ABI breaks like ROCM 6.4 -> 7.0
  - recursive git clone is disabled since building AOTriton runtime does not
    require submodules.

Bug fixes:

* Fix a kernel bug introduced when implementing SWA

Known Problems:

* gfx1100 target (Radeon RX 7000 Series) is moved back to experimental status
  due to accuracy issues. Triton compiler fixes are needed to restore the
  support status.
* Enabling TF32 tests affects accuracy for later non-TF32 tests on ROCM 7.0.
  This issue is under investigation.

Pull Request resolved: pytorch#161754
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
xinyazhang added a commit to ROCm/pytorch that referenced this pull request Sep 29, 2025
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Oct 7, 2025
Fixes: pytorch#163958

Cherry-pick pytorch#161754
Cherry-pick pytorch#162330
Cherry-pick pytorch#163373
Cherry-pick pytorch#163745

Note TF32 support is still being plagued by `HIPBLASLT_ALLOW_TF32`,
which should be handled by another PR due to its complexity.

---------

Co-authored-by: Aaryaman Vasishta <aaryaman.vasishta@amd.com>
Co-authored-by: Scott Todd <scott.todd0@gmail.com>

Labels

ci-no-td Do not run TD on this PR ciflow/trunk Trigger trunk jobs on your pull request keep-going Don't stop on first failure, keep running tests until the end Merged module: inductor module: rocm AMD GPU support for Pytorch oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: rocm topic: performance topic category triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
