[ROCm] No-fence global reduce #161180
Conversation
Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/161180.
Note: links to docs will display an error until the docs builds have been completed. 1 new failure as of commit 59333da with merge base 7376111; the following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This fix provides much better perf than the acquire/release fence solution in #160979. Reproducer: Results (MI300X):
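A minimal sketch of the approach (illustrative only, not the actual Reduce.cuh change; `commit_to_global` and the int-chunking scheme are assumptions): each block's final write of its partial result to the global reduce buffer is performed as int-sized atomic exchanges whose return values are consumed, instead of plain stores followed by `__threadfence()`.

```cpp
// Hedged sketch: commit a partial reduction result to the global reduce
// buffer using atomics-with-return rather than plain stores + __threadfence().
template <typename value_t>
__device__ void commit_to_global(value_t value, value_t* dst) {
  constexpr int num_int_per_val = sizeof(value_t) / sizeof(int);
  int* src_words = reinterpret_cast<int*>(&value);
  int* dst_words = reinterpret_cast<int*>(dst);
  int sink = 0;
  #pragma unroll
  for (int i = 0; i < num_int_per_val; ++i) {
    // atomicExch performs the store as a device-wide atomic and returns the
    // old value; consuming that return is the "atomics with a return" idea
    // quoted in the diff comment below.
    sink += atomicExch(&dst_words[i], src_words[i]);
  }
  (void)sink;  // the old values are not needed, only the committed stores
}
```

The intent, per the PR description, is that an atomic with a returned value is committed to a globally visible level of the memory hierarchy, so the separate device-wide fence before the semaphore handshake is no longer needed.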
aten/src/ATen/native/cuda/Reduce.cuh
Outdated
```cpp
// Here we preempt need for fences by committing stores to global memory.
// We do so by converting the stores to atomics with a return.
int constexpr num_int_per_val = sizeof(value)/sizeof(int);
CUDA_KERNEL_ASSERT(num_int_per_val>=1);
```
Since num_int_per_val is a constexpr, we can use a static_assert here.
```diff
- CUDA_KERNEL_ASSERT(num_int_per_val>=1);
+ static_assert(num_int_per_val>=1);
```
Removed that assert; small value sizes are now handled instead.
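For illustration, a hedged sketch of the compile-time dispatch this thread converges on: because `num_int_per_val` is `constexpr`, a `static_assert` would catch the zero case at compile time, while the merged change instead branches on the size. The helper name `store_reduced` and the fallback branch for types narrower than `int` are assumptions, not the actual Reduce.cuh handling.

```cpp
// Hypothetical sketch only; store_reduced and the narrow-type fallback are
// assumptions, not the merged Reduce.cuh code.
template <typename value_t>
__device__ void store_reduced(value_t value, value_t* dst) {
  constexpr int num_int_per_val = sizeof(value_t) / sizeof(int);
  if constexpr (num_int_per_val >= 1) {
    // Wide enough to be committed as int-sized atomic exchanges.
    int* src_words = reinterpret_cast<int*>(&value);
    int* dst_words = reinterpret_cast<int*>(dst);
    #pragma unroll
    for (int i = 0; i < num_int_per_val; ++i) {
      atomicExch(&dst_words[i], src_words[i]);
    }
  } else {
    // value_t is narrower than int (e.g. a 2-byte accumulator): assumed
    // fallback to a plain store plus an explicit device-wide fence.
    *dst = value;
    __threadfence();
  }
}
```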
Cherry-pick of pytorch#160979. Less-performant fix until pytorch#161180 is finalized.
* The global reduction path in the reduction kernel currently has two threadfence operations.
* The first threadfence is executed by all threads in all blocks, whereas the second threadfence is only run by threads in a single block.
* For AMD GPUs, threadfence is a heavyweight operation, especially when run by all threads in the system (due to cross-XCD synchronizations).
* So using fine-grained fences gives a significant performance boost for AMD GPUs.
* We do a release fence when threads write to the reduce buffer in global memory, and then do an acquire fence when threads read from the reduce buffer; see the sketch after the reproducer.

Co-authors: @amd-hhashemi, @jeffdaily

**Reproducer**:

```python
import time
import torch

shapes = [(2, 896, 59, 91), ]
dims = [(2, 3), ]

for i, shape in enumerate(shapes):
    x = torch.randn(shape, device='cuda', dtype=torch.bfloat16)
    x = x.to(memory_format=torch.channels_last)
    for _ in range(20):
        _ = torch.sum(x, dims[i], keepdim=True, dtype=torch.bfloat16)
    torch.cuda.synchronize()
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    start_evt.record()
    for _ in range(100):
        _ = torch.sum(x, dims[i], keepdim=True, dtype=torch.bfloat16)
    end_evt.record()
    torch.cuda.synchronize()
    print(f"Avg time for shape {shape}: {start_evt.elapsed_time(end_evt) / 100 * 1e3:.2f} us")
```

Fixes SWDEV-545710
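As referenced in the last bullet above, a minimal sketch of the release/acquire pattern (assumptions: `__atomic_thread_fence` as the fine-grained fence builtin available under HIP/Clang, plus hypothetical helper names and semaphore layout; the actual #160979 diff may use a different intrinsic):

```cpp
// Hedged sketch of the release/acquire fencing described above; not the
// literal #160979 change.
__device__ void publish_partial(float* reduce_buf, int idx, float partial,
                                int* semaphore) {
  reduce_buf[idx] = partial;
  // Release fence: make this block's writes to reduce_buf visible before the
  // semaphore increment that signals "my partial result is ready".
  __atomic_thread_fence(__ATOMIC_RELEASE);
  atomicAdd(semaphore, 1);
}

__device__ float consume_partial(const float* reduce_buf, int idx) {
  // Acquire fence: run only by the last block, after it has observed the
  // semaphore reach the expected count, before it reads the partials.
  __atomic_thread_fence(__ATOMIC_ACQUIRE);
  return reduce_buf[idx];
}
```

Compared with two full `__threadfence()` calls, the one-sided fences avoid the all-thread, cross-XCD synchronization cost called out in the bullets above.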
This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.
Successfully rebased 5b54c73 to 59333da (Compare).
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 1 job has failed, first few of them are: trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m1-14). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -f "unrelated macos build failure; all other CI including ciflow/trunk is passing"

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This change removes the need for fences in global_reduce by converting the stores to reduce_buffer[] into atomics+return. This is crucial for perf on architectures with split caches (e.g. MI300), where fences are inherently costly. Cherry-pick of pytorch#161180.
This change removes the need for fences in global_reduce by converting the stores to reduce_buffer[] into atomics+return. This is crucial for perf on architectures with split caches (e.g. MI300), where fences are inherently costly.

Pull Request resolved: pytorch#161180
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd