Enable MI355X PyTorch CI testing. by saienduri · Pull Request #158889 · pytorch/pytorch · GitHub

@saienduri
Collaborator

@saienduri saienduri commented Jul 23, 2025

This PR consists of all the changes required to enable PyTorch ROCm CI on MI355X nodes.

  • Rework the aotriton CMake configuration to rely on HIP_VERSION instead of ROCM_VERSION, since aotriton depends on HIP. HIP loosely tracks the ROCm major version, but the two are not actually synchronized, as observed in the ROCm 7 alpha build.
  • Bump the composable-kernel submodule to df6023e305f389bbf7249b0c4414e649f3ad6598 for MI350 compatibility.
  • Extend the "change docker permissions" step to the MI355X runners as well. This step applies the permission change to the test folder that is required for artifacts to upload successfully in k8s Docker.
  • Create a new rocm-mi355 workflow that triggers core PyTorch tests nightly at 2:30 am PST.
  • Successfully tested running the test suites listed in rocm-mi355.yml on MI355 runners by temporarily hacking rocm-mi300.yml: https://hud.pytorch.org/pytorch/pytorch/commit/ca7d5fae112558ee3dde7ec3ce32e94b13f877fd#rocm-mi300
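
The reasoning behind the HIP_VERSION change in the first bullet can be illustrated with a minimal Python sketch. This is not the CMake logic from this PR: the function names, the lookup table, and the version strings below are all hypothetical, chosen only to show why a lookup keyed on the HIP version can succeed where one keyed on the ROCm version would miss.

```python
from typing import Dict, Optional, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a dotted version string such as '6.4.2'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


# Hypothetical table: prebuilt bundle names keyed by HIP (major, minor).
# The names and versions are placeholders, not real aotriton artifacts.
PREBUILT_BY_HIP: Dict[Tuple[int, int], str] = {
    (6, 4): "bundle-hip6.4",
    (6, 5): "bundle-hip6.5",
}


def select_bundle(hip_version: str) -> Optional[str]:
    """Look up a bundle by HIP version; None means no prebuilt match."""
    return PREBUILT_BY_HIP.get(major_minor(hip_version))


# Made-up toolchain in which the ROCm and HIP versions disagree.
rocm_version = "7.0.0"
hip_version = "6.5.1"

print(select_bundle(hip_version))   # keyed on HIP: finds "bundle-hip6.5"
print(select_bundle(rocm_version))  # keyed on ROCm: finds nothing (None)
```

Under these assumed values, keying on the ROCm version would wrongly conclude that no compatible bundle exists, which is the kind of mismatch the bullet describes.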

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@saienduri saienduri requested a review from a team as a code owner July 23, 2025 00:40
@pytorch-bot

pytorch-bot bot commented Jul 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158889

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 New Failures, 2 Cancelled Jobs, 1 Unrelated Failure

As of commit ccabf80 with merge base bc379ae:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the "module: rocm" and "release notes: releng" labels Jul 23, 2025
@saienduri saienduri added the ciflow/periodic, ciflow/rocm, ciflow/inductor-rocm, ciflow/rocm-mi300, and ciflow/periodic-rocm-mi300 labels Jul 23, 2025
@saienduri saienduri changed the title from "Enable MI355X ROCm CI testing." to "Enable MI355X PyTorch CI testing." Jul 23, 2025
@jithunnair-amd
Collaborator

@pytorchbot merge -f "CI failures unrelated. 2 failing ROCm CI jobs were due to unrelated timeouts"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


lamikr added a commit to ROCm/pytorch that referenced this pull request Jul 25, 2025
Original patch from saienduri <saimanas.enduri@amd.com>

This PR consists of all the changes required to enable PyTorch ROCm CI on MI355X nodes.

- Rework the aotriton CMake configuration to rely on `HIP_VERSION` instead of `ROCM_VERSION`, since aotriton depends on HIP. HIP loosely tracks the ROCm major version, but the two are not actually synchronized, as observed in the ROCm 7 alpha build.
- Bump the composable-kernel submodule to [df6023e305f389bbf7249b0c4414e649f3ad6598](https://github.com/ROCm/composable_kernel/tree/df6023e305f389bbf7249b0c4414e649f3ad6598) for MI350 compatibility.
- Extend the "change docker permissions" step to the MI355X runners as well. This step applies the permission change to the test folder that is required for artifacts to upload successfully in k8s Docker.
- Create a new rocm-mi355 workflow that triggers core PyTorch tests nightly at 2:30 am PST.
- Successfully tested running the test suites listed in rocm-mi355.yml on MI355 runners by temporarily hacking rocm-mi300.yml: https://hud.pytorch.org/pytorch/pytorch/commit/ca7d5fae112558ee3dde7ec3ce32e94b13f877fd#rocm-mi300
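
As a side note on the nightly schedule mentioned in the list above: GitHub Actions evaluates `schedule` cron expressions in UTC, so a local time such as 2:30 am PST has to be converted first. The sketch below shows that conversion under the assumption of a fixed UTC-8 offset with no daylight saving; the helper name is hypothetical, and the actual cron line in rocm-mi355.yml is not shown in this conversation.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST treated as a fixed UTC-8 offset, ignoring daylight saving.
PST = timezone(timedelta(hours=-8), name="PST")


def daily_cron_utc(hour: int, minute: int, tz: timezone = PST) -> str:
    """Return a five-field daily cron expression in UTC for a local time."""
    # The date is arbitrary; only the time-of-day conversion matters here.
    local = datetime(2025, 1, 1, hour, minute, tzinfo=tz)
    utc = local.astimezone(timezone.utc)
    return f"{utc.minute} {utc.hour} * * *"


print(daily_cron_utc(2, 30))  # prints "30 10 * * *"
```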

Unlike the original patch, this version does not change the __AOTRITON_SHA256_LIST
for rocm 6.5. (The change would cause a sha256 error after the aotriton download.)
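
For context, a sha256 error of this kind arises when the digest of the downloaded archive does not match the digest recorded for it. The following is a minimal Python sketch of such a check, not the CMake code used by PyTorch; the function name and the payload are hypothetical.

```python
import hashlib


def sha256_matches(payload: bytes, expected_hex: str) -> bool:
    """Return True if the SHA-256 digest of payload equals expected_hex."""
    return hashlib.sha256(payload).hexdigest() == expected_hex.strip().lower()


# Hypothetical stand-in for a downloaded archive.
archive = b"example archive contents"
recorded = hashlib.sha256(archive).hexdigest()

print(sha256_matches(archive, recorded))         # digest matches the record
print(sha256_matches(archive + b"x", recorded))  # any change breaks the match
```

If the recorded digest is updated while the archive that is actually downloaded stays the same, the comparison fails in the same way as the second call.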

fixes: ROCm/TheRock#1119

Signed-off-by: Mika Laitio <mika.laitio@amd.com>