[ROCm] Limit number of values per thread for reductions on three dimensions by jerrymannil · Pull Request #2460 · ROCm/pytorch · GitHub

Conversation

@jerrymannil
Collaborator

@jerrymannil jerrymannil commented Aug 5, 2025

In the current implementation of reductions in three dimensions for AMD GPUs, the number of values per thread is unbounded and can end up in the hundreds of thousands for certain tensors, which is bad for performance. This patch fixes the issue by increasing the parallelism, lowering the number of values per thread to a reasonable limit, i.e. fewer than 2048 values per thread. The performance gains can be between 10x and 17x for certain examples where the number of values per thread was originally very high.
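As a rough illustration of the idea only (not the actual kernel-configuration code in PyTorch's reduction implementation), the sketch below shows how a launch heuristic might bound values per thread by splitting a large reduction across more blocks. The `MAX_VALUES_PER_THREAD` cap comes from the description above; the function name and exact heuristic are hypothetical.

```python
# Hypothetical sketch of bounding values-per-thread for a GPU reduction.
# Names and the exact heuristic are illustrative, not the PyTorch code.
import math

MAX_VALUES_PER_THREAD = 2048  # cap suggested by the PR description


def pick_splits(reduction_elems: int, threads_per_block: int) -> int:
    """Return how many blocks should cooperate on one output element so that
    each thread handles at most MAX_VALUES_PER_THREAD input values."""
    per_thread = math.ceil(reduction_elems / threads_per_block)
    if per_thread <= MAX_VALUES_PER_THREAD:
        return 1  # one block per output element is already enough
    # Increase parallelism: spread the reduction over more blocks.
    return math.ceil(per_thread / MAX_VALUES_PER_THREAD)


# Example: a reduction of 100M elements with 256 threads per block would
# otherwise give ~390k values per thread; this asks for ~191 splits instead.
print(pick_splits(reduction_elems=100_000_000, threads_per_block=256))
```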

cherry-pick of pytorch#159652

Cherry-picked to release/2.8 branch via #2469

@jerrymannil jerrymannil self-assigned this Aug 5, 2025
@rocm-repo-management-api

rocm-repo-management-api bot commented Aug 5, 2025

Jenkins build for 53fbf7866d00ae012510c041cb5c3e13d3a1c214 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@jerrymannil
Collaborator Author

reproducer details at pytorch#159652 (comment)
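The linked comment has the exact reproducer. A generic timing harness along the same lines (the tensor shape here is chosen only for illustration and is not the shape from the upstream issue) might look like this:

```python
# Generic benchmark sketch for a reduction with very large reduced dimensions
# on a ROCm/CUDA device. The shape is illustrative; the real reproducer is in
# pytorch#159652.
import torch


def time_sum(shape, dim, iters=50):
    x = torch.randn(*shape, device="cuda")
    # Warm up so one-time setup does not skew the timing.
    for _ in range(5):
        x.sum(dim=dim)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        x.sum(dim=dim)
    end.record()
    torch.cuda.synchronize()
    print(f"{shape} sum(dim={dim}): {start.elapsed_time(end) / iters:.3f} ms")


# A 3D tensor whose reduced dimensions are very large.
time_sum((4, 2_000_000, 8), dim=(1, 2))
```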

Collaborator

@pruthvistony pruthvistony left a comment


Please check for any regressing models; the numbers may need to be rebased.

@jerrymannil jerrymannil merged commit 5cd45f9 into release/2.7 Aug 6, 2025
2 of 6 checks passed
@jerrymannil jerrymannil deleted the 2.7_reduce_sum_fix branch August 6, 2025 18:13
@jerrymannil
Collaborator Author

! cherry-pick --onto release/2.8

okakarpa pushed a commit that referenced this pull request Aug 6, 2025
…nsions (#2460)

In the current implementation of reductions in three dimensions for AMD
GPUs, the number of values per thread is unbounded and can end up in the
hundreds of thousands for certain tensors, which is bad for performance.
This patch fixes the issue by increasing the parallelism, lowering the
number of values per thread to a reasonable limit, i.e. fewer than 2048
values per thread. The performance gains can be between 10x and 17x for
certain examples where the number of values per thread was originally
very high.

cherry-pick of pytorch#159652
@okakarpa
Collaborator

okakarpa commented Aug 6, 2025

Created branch autogenerated/release/2.8_cherry-pick_pr-2460 and #2469

jerrymannil added a commit that referenced this pull request Aug 6, 2025
…d for reductions on three dimensions (#2469)

Cherry-pick of #2460

Co-authored-by: Jerry Mannil <65309407+jerrymannil@users.noreply.github.com>
tvukovic-amd pushed a commit that referenced this pull request Aug 20, 2025
…d for reductions on three dimensions (#2469)

Cherry-pick of #2460

Co-authored-by: Jerry Mannil <65309407+jerrymannil@users.noreply.github.com>