[ROCm] Limit number of values per thread for reductions on three dimensions by doru1004 · Pull Request #159652 · pytorch/pytorch · GitHub

Conversation

@doru1004
Contributor

@doru1004 doru1004 commented Aug 1, 2025

In the current implementation of three-dimensional reductions for AMD GPUs, the number of values per thread is unbounded and can reach the hundreds of thousands for certain tensors, which is bad for performance. This patch fixes the issue by increasing parallelism and thereby lowering the number of values per thread to a reasonable limit, i.e. fewer than 2048 values per thread. The performance gains can be between 10x-17x for examples where the number of values per thread was originally very high.
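For context, the fix amounts to capping the per-thread workload by growing the number of CTAs (thread blocks) that cooperate on each output element. The sketch below is a hypothetical Python model of that logic, not the actual C++ code in PyTorch's reduction kernel: the 256-thread block size, the function names, and the ceil-division layout are assumptions; only the 2048 cap comes from this PR.

```python
MAX_VALUES_PER_THREAD = 2048  # empirically chosen threshold from this PR


def limit_values_per_thread(reduction_size: int, ctas_per_output: int,
                            threads_per_cta: int = 256) -> int:
    """Double ctas_per_output until each thread reduces at most
    MAX_VALUES_PER_THREAD values (hypothetical model, not kernel code)."""

    def values_per_thread(ctas: int) -> int:
        # ceil-divide the reduction among all threads cooperating on one output
        total_threads = ctas * threads_per_cta
        return -(-reduction_size // total_threads)

    while values_per_thread(ctas_per_output) > MAX_VALUES_PER_THREAD:
        ctas_per_output *= 2
    return ctas_per_output


# With the (5079670, 128) reproducer shape below, reducing over dim 1 means
# each output element sums 5079670 values; under these assumptions the model
# grows ctas_per_output from 1 to 16 to get under the cap.
print(limit_values_per_thread(5_079_670, 1))
```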

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@doru1004 doru1004 requested review from eqy and syed-ahmed as code owners August 1, 2025 16:35
@pytorch-bot

pytorch-bot bot commented Aug 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159652

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 70792b5 with merge base 1465757:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the `release notes: cuda` label Aug 1, 2025
Contributor

@petrex petrex left a comment


Question: Was the choice of 2048 as the threshold for "values per thread" purely heuristic? It would be helpful to add a comment or reference explaining why this value was chosen and whether it is empirically optimal.

Contributor

@petrex petrex left a comment


Another question: Is there an upper bound for `config.ctas_per_output *= 2;`?

@jerrymannil
Contributor

jerrymannil commented Aug 1, 2025

Reproducer:

import time
import torch

shapes = [(1, 2, 3, 420, 648, 128),
    (1, 2, 3, 420, 648, 128),
    (5079670, 128)
]

dims = [(3, 4),
    (-3, -2),
    (1,)
]

for i, shape in enumerate(shapes):
    x = torch.randn(shape, device='cuda', dtype=torch.float)
    for _ in range(20):
        _ = torch.sum(x, dims[i])
    torch.cuda.synchronize()

    start_time = time.time()
    for _ in range(100):
        _ = torch.sum(x, dims[i])
    torch.cuda.synchronize()
    end_time = time.time()
    mean_time = (end_time - start_time)/100
    print(f"Avg time for shape {shape}: {mean_time * 1e6:.2f} us")

Before:
Avg time for shape (1, 2, 3, 420, 648, 128): 4408.10 us
Avg time for shape (1, 2, 3, 420, 648, 128): 4428.89 us
Avg time for shape (5079670, 128): 1458.86 us

After:
Avg time for shape (1, 2, 3, 420, 648, 128): 223.73 us
Avg time for shape (1, 2, 3, 420, 648, 128): 218.85 us
Avg time for shape (5079670, 128): 1461.55 us

@pruthvistony pruthvistony added the `topic: not user facing`, `rocm`, `ciflow/rocm`, `ciflow/inductor-rocm`, `ciflow/rocm-mi300`, and `ciflow/periodic-rocm-mi300` labels and removed the `release notes: cuda` label Aug 1, 2025
@pytorch-bot

pytorch-bot bot commented Aug 1, 2025

To add the ciflow label ciflow/periodic-rocm-mi300 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the `ciflow/periodic-rocm-mi300` label Aug 1, 2025
@pruthvistony pruthvistony added the `ciflow/periodic-rocm-mi300` label Aug 1, 2025
@doru1004
Contributor Author

doru1004 commented Aug 4, 2025

Question: Was the choice of 2048 as the threshold for "values per thread" purely heuristic? It would be helpful to add a comment or reference explaining why this value was chosen and whether it is empirically optimal.

It was indeed empirically determined. I'll add a comment.

@doru1004
Contributor Author

doru1004 commented Aug 4, 2025

Another question: Is there an upper bound for `config.ctas_per_output *= 2;`?

Based on the previous semantics, there does not appear to be one.
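Although no explicit cap is placed on `config.ctas_per_output`, the doubling is implicitly bounded: each doubling roughly halves the values-per-thread count, so the loop can run at most about log2(initial_values_per_thread / 2048) times before the cap is reached. The snippet below is an illustrative back-of-the-envelope model of that bound, not PyTorch code; the function name and the assumption that each doubling halves the workload are hypothetical.

```python
import math


def max_doublings(initial_values_per_thread: int, cap: int = 2048) -> int:
    """Upper bound on how many times ctas_per_output doubles, assuming each
    doubling halves the per-thread workload (illustrative model only)."""
    if initial_values_per_thread <= cap:
        return 0
    return math.ceil(math.log2(initial_values_per_thread / cap))


# A workload starting at ~300,000 values per thread needs only 8 doublings
# to get under the 2048 cap (300000 / 2**8 is about 1172).
print(max_doublings(300_000))
```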

@pytorch-bot pytorch-bot bot removed the `ciflow/rocm`, `ciflow/inductor-rocm`, `ciflow/rocm-mi300`, and `ciflow/periodic-rocm-mi300` labels Aug 4, 2025
@janeyx99 janeyx99 added the `triaged` label Aug 4, 2025
@doru1004 doru1004 changed the title [AMDGPU] Limit number of values per thread for reductions on three dimensions [ROCm] Limit number of values per thread for reductions on three dimensions Aug 5, 2025
@pytorch-bot pytorch-bot bot added the `module: rocm` label Aug 5, 2025
jerrymannil added a commit to ROCm/pytorch that referenced this pull request Aug 6, 2025
…nsions (#2460)


cherry-pick of pytorch#159652
okakarpa pushed a commit to ROCm/pytorch that referenced this pull request Aug 6, 2025
…nsions (#2460)


cherry-pick of pytorch#159652
@jerrymannil
Contributor

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the `ciflow/trunk` label Aug 12, 2025
@pytorch-bot

pytorch-bot bot commented Aug 12, 2025

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the `ciflow/trunk` label Aug 12, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

chuanhaozhuge pushed a commit that referenced this pull request Aug 14, 2025
…nsions (#159652)


Pull Request resolved: #159652
Approved by: https://github.com/jeffdaily
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Aug 15, 2025
…nsions (pytorch#159652)


Pull Request resolved: pytorch#159652
Approved by: https://github.com/jeffdaily
chuanhaozhuge pushed a commit that referenced this pull request Aug 18, 2025
…nsions (#159652)


Pull Request resolved: #159652
Approved by: https://github.com/jeffdaily
can-gaa-hou pushed a commit to can-gaa-hou/pytorch that referenced this pull request Aug 22, 2025
…nsions (pytorch#159652)


Pull Request resolved: pytorch#159652
Approved by: https://github.com/jeffdaily
jerrymannil pushed a commit to ROCm/pytorch that referenced this pull request Sep 5, 2025
…nsions (pytorch#159652)


Pull Request resolved: pytorch#159652
Approved by: https://github.com/jeffdaily
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
…nsions (pytorch#159652)


Pull Request resolved: pytorch#159652
Approved by: https://github.com/jeffdaily

Merged

Labels

`module: rocm`, `open source`, `rocm`, `topic: not user facing`, `triaged`
