[ROCm CI] Migrate to MI325 Capacity. #159059

saienduri · 2025-07-24T17:25:24Z

This PR moves PyTorch CI capacity from the MI300 cluster to a new, larger MI325 cluster. Both GPUs share the gfx942 architecture, and our test plans do not change within an architecture, so this PR also pools them under a single runner label, `linux.rocm.gpu.gfx942.<#gpus>`, to reduce overhead and confusion.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben
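For readers who script against runner labels, here is a minimal sketch of how the pooled label format from the description, `linux.rocm.gpu.gfx942.<#gpus>`, could be parsed. The names `parse_rocm_label`, `RocmRunner`, and `same_pool` are hypothetical helpers for illustration and are not part of PyTorch CI; the regular expression assumes the GPU count is a positive integer and the architecture is a `gfx` identifier.

```python
import re
from typing import NamedTuple

# Assumed label shape, taken from the PR description:
#   linux.rocm.gpu.<arch>.<#gpus>   e.g. linux.rocm.gpu.gfx942.2
_LABEL_RE = re.compile(
    r"^linux\.rocm\.gpu\.(?P<arch>gfx[0-9a-f]+)\.(?P<gpus>[1-9][0-9]*)$"
)


class RocmRunner(NamedTuple):
    """Hypothetical parsed form of a ROCm runner label."""
    arch: str
    gpus: int


def parse_rocm_label(label: str) -> RocmRunner:
    """Split a runner label into its architecture and GPU count."""
    match = _LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"not an architecture-based ROCm runner label: {label!r}")
    return RocmRunner(match.group("arch"), int(match.group("gpus")))


def same_pool(label_a: str, label_b: str) -> bool:
    """Two labels draw on the same architecture pool if their arch matches."""
    return parse_rocm_label(label_a).arch == parse_rocm_label(label_b).arch
```

Under these assumptions, `parse_rocm_label("linux.rocm.gpu.gfx942.2")` yields the architecture `gfx942` with 2 GPUs, regardless of whether the job lands on MI300 or MI325 hardware.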

pytorch-bot · 2025-07-24T17:25:29Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159059

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit f430e83 with merge base 67e68e0:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, lf.linux.12xlarge, unstable) (gh) (#158876)
sccache: error: couldn't connect to server
rocm-mi300 / linux-noble-rocm-py3.12-mi300 / test (default, 5, 6, linux.rocm.gpu.gfx942.2, unstable) (gh)
inductor/test_helion_kernels.py::HelionTests::test_add_kernel

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2025-07-24T17:25:29Z

The committers listed above are authorized under a signed CLA.

✅ login: saienduri / name: Sai Enduri (0dacfcf, bce8608, 3f8867e, 5290a57, 7bf32fb, d98bbd0, 5d8fd0a, 5dde5cf, d38f190, b48c035, a1b0e93, decda43, f430e83, 20f0347, b1f9948, 9259a0d, 61c791f, 4f34382, e0abf11)
✅ login: deedongala (7f511fb, 266abb2, 8d55b60, 61a20c3, 9c2bb9d, 1e27422, 6ea9306, c517bfe, 570dd7c, 44b277e, 6a7a186, fdb49f4, c17b52b, ec1d936)

atalman · 2025-07-30T18:14:58Z

@pytorchmergebot merge -i

pytorchmergebot · 2025-07-30T18:17:23Z

Merge started

Your change will be merged while ignoring the following 2 checks: pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, lf.linux.12xlarge, unstable), rocm-mi300 / linux-noble-rocm-py3.12-mi300 / test (default, 5, 6, linux.rocm.gpu.gfx942.2, unstable)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

jithunnair-amd · 2025-07-30T19:45:23Z

@pytorchbot merge -f "Force merging since this is time-sensitive (losing MI300 capacity today EOD), and we already verified the full ROCm workflows"

pytorchmergebot · 2025-07-30T19:45:40Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

jithunnair-amd · 2025-07-30T19:45:53Z

@pytorchbot merge -f "Force merging since this is time-sensitive (losing MI300 capacity today EOD), and we already verified the full ROCm workflows"

pytorchmergebot · 2025-07-30T19:47:38Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
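The two merge paths the bot describes can be sketched as a small helper that builds the comment text. This is only an illustration under stated assumptions: `merge_command` is a hypothetical function, the `@pytorchbot` handle and the `-i` and `-f "<reason>"` forms are copied from the comments in this thread, and the helper rejects reasons containing double quotes because the bot's quoting rules are not stated here.

```python
from typing import Optional


def merge_command(force_reason: Optional[str] = None) -> str:
    """Build the bot comment for a merge request (hypothetical helper).

    With no reason, prefer -i, which ignores current failures but still
    lets pending tests finish. With a reason, emit a force merge (-f),
    which the bot message describes as a last resort.
    """
    if force_reason is None:
        return "@pytorchbot merge -i"
    reason = force_reason.strip()
    if not reason:
        raise ValueError("a force merge needs a non-empty reason")
    if '"' in reason:
        # Quoting behavior is an unknown here, so avoid embedded quotes.
        raise ValueError("reason must not contain double quotes")
    return f'@pytorchbot merge -f "{reason}"'
```

Calling `merge_command()` gives the `-i` form, while `merge_command("losing capacity today")` gives the force form with the reason quoted.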


This PR moves PyTorch CI capacity from mi300 to a new, larger mi325 cluster. Both of these GPUs are the same architecture gfx942 and our testing plans don't change within an architecture, so we pool them under the same label `linux.rocm.gpu.gfx942.<#gpus>` with this PR as well to reduce overhead and confusion.

Pull Request resolved: #159059
Approved by: https://github.com/jithunnair-amd, https://github.com/atalman

Co-authored-by: deedongala <deekshitha.dongala@amd.com>


deedongala and others added 10 commits July 24, 2025 09:28

Update actionlint.yaml

44b277e

Update and rename rocm-mi300.yml to rocm-mi325.yml

266abb2

Update and rename periodic-rocm-mi300.yml to periodic-rocm-mi325.yml

c17b52b

Update and rename inductor-rocm-mi300.yml to inductor-rocm-mi325.yml

6a7a186

Update inductor-perf-test-nightly-rocm.yml

fdb49f4

Update and rename inductor-rocm-mi325.yml to inductor-rocm-mi300.yml

1e27422

Update and rename periodic-rocm-mi325.yml to periodic-rocm-mi300.yml

570dd7c

Update and rename rocm-mi325.yml to rocm-mi300.yml

61a20c3

Trigger CI

e0abf11

Trigger CI

0dacfcf

saienduri requested a review from a team as a code owner July 24, 2025 17:25

pytorch-bot bot added the `topic: not user facing` label (topic category) Jul 24, 2025

facebook-github-bot added the `module: rocm` label (AMD GPU support for Pytorch) Jul 24, 2025

saienduri mentioned this pull request Jul 24, 2025

[DO NOT MERGE] Test new MI325X capacity #159053

Closed

saienduri added the `ciflow/rocm-mi300` (Trigger "default" config CI on ROCm MI300), `ciflow/periodic-rocm-mi300` (Trigger "distributed" config CI on ROCm MI300), and `ciflow/inductor-rocm` (Trigger "inductor" config CI on ROCm) labels Jul 24, 2025

Trigger CI

a1b0e93

pytorchbot added the open source label Jul 24, 2025

pytorch-bot bot removed the `ciflow/inductor-rocm` (Trigger "inductor" config CI on ROCm), `ciflow/rocm-mi300` (Trigger "default" config CI on ROCm MI300), and `ciflow/periodic-rocm-mi300` (Trigger "distributed" config CI on ROCm MI300) labels Jul 25, 2025

Trigger CI

5d8fd0a

saienduri added the `ciflow/inductor-rocm` (Trigger "inductor" config CI on ROCm), `ciflow/rocm-mi300` (Trigger "default" config CI on ROCm MI300), and `ciflow/periodic-rocm-mi300` (Trigger "distributed" config CI on ROCm MI300) labels Jul 25, 2025

Trigger CI

bce8608

pytorch-bot bot removed the `ciflow/inductor-rocm` label (Trigger "inductor" config CI on ROCm) Jul 25, 2025

Address lint

9259a0d

pytorch-bot bot removed the `ciflow/inductor`, `ciflow/rocm` (Trigger "default" config CI on ROCm), `ciflow/inductor-rocm` (Trigger "inductor" config CI on ROCm), `ciflow/rocm-mi300` (Trigger "default" config CI on ROCm MI300), and `ciflow/periodic-rocm-mi300` (Trigger "distributed" config CI on ROCm MI300) labels Jul 29, 2025

Trigger CI

f430e83

saienduri added the `ciflow/inductor-rocm` (Trigger "inductor" config CI on ROCm), `ciflow/rocm-mi300` (Trigger "default" config CI on ROCm MI300), and `ciflow/periodic-rocm-mi300` (Trigger "distributed" config CI on ROCm MI300) labels Jul 29, 2025

saienduri requested a review from huydhn July 30, 2025 00:08

atalman approved these changes Jul 30, 2025

pytorch-bot bot added the `ciflow/trunk` label (Trigger trunk jobs on your pull request) Jul 30, 2025

pytorchmergebot added the merging label Jul 30, 2025

pytorchmergebot closed this in 53d68b9 Jul 30, 2025

pytorchmergebot added Merged and removed merging labels Jul 30, 2025

This was referenced Aug 1, 2025

[ROCm CI] Migrate to MI325 Capacity pytorch/ao#2662

Merged

[ROCm CI] Migrate to MI325 Capacity #159649

Closed

[ROCm CI] Migrate to MI325 Capacity pytorch/pytorch-integration-testing#60

Merged

pytorchmergebot pushed a commit that referenced this pull request Aug 4, 2025

[ROCm CI] Migrate to MI325 Capacity (#159649)

1d3eef2

Migrate mi300s to gfx942. Related to #159059

Pull Request resolved: #159649
Approved by: https://github.com/huydhn

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025

[ROCm CI] Migrate to MI325 Capacity (pytorch#159649)

cb2b7a9

Migrate mi300s to gfx942. Related to pytorch#159059

Pull Request resolved: pytorch#159649
Approved by: https://github.com/huydhn

8 participants
