[ROCm] Use opportunistic fastatomics based on heuristics by jerrymannil · Pull Request #159430 · pytorch/pytorch · GitHub

Conversation

@jerrymannil
Contributor

@jerrymannil jerrymannil commented Jul 29, 2025

  • Opportunistic fast atomics work better for small sizes, since there is a higher chance of lanes doing atomics on the same address (see the sketch below)
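
For intuition, here is a minimal CUDA-style sketch of this kind of warp-aggregated ("opportunistic") atomic add, assuming a 32-lane warp. `opportunistic_atomic_add` is a hypothetical name, and the actual ROCm kernel in this PR uses AMD wave-level primitives rather than `__match_any_sync`; this only illustrates the idea: lanes targeting the same address elect a leader, the leader gathers the peers' values with shuffles, and issues a single atomicAdd for the whole group.

```
__device__ void opportunistic_atomic_add(float* addr, float value) {
    // Group together the active lanes that are writing to the same address.
    unsigned peers  = __match_any_sync(__activemask(), (unsigned long long)addr);
    int      lane   = threadIdx.x & (warpSize - 1);
    int      leader = __ffs(peers) - 1;  // lowest lane of the group leads

    // The leader accumulates every peer's value; all peers execute the
    // same shuffle sequence, so the group stays converged.
    float sum = value;
    unsigned rest = peers & ~(1u << leader);
    while (rest) {
        int src = __ffs(rest) - 1;
        float v = __shfl_sync(peers, value, src);
        if (lane == leader) sum += v;
        rest &= rest - 1;
    }

    // One atomic RMW per group instead of one per colliding lane.
    if (lane == leader) atomicAdd(addr, sum);
}
```

In the reproducer below, `ind` draws ~5M random indices from ~1.6M rows, so many threads contend on the same destination addresses; that is the case this path accelerates, and why the heuristic prefers it for small sizes, where such collisions are likely.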

Co-author: @amd-hhashemi

Reproducer:

```
import time
import torch

x = torch.randn((1_632_960, 128), device='cuda', dtype=torch.float)
ind = torch.randint(0, x.size(0), size=(5_079_670,), device='cuda')
src = torch.randn((5_079_670, 128), device='cuda', dtype=torch.float)

# Warmup iterations so the timed loop excludes one-time setup costs
for _ in range(20):
    x.index_add_(0, ind, src)
torch.cuda.synchronize()  # drain warmup kernels before starting the clock

start_time = time.time()
for _ in range(100):
    x.index_add_(0, ind, src)
torch.cuda.synchronize()
end_time = time.time()
mean_time = (end_time - start_time) / 100
print(f"Avg time for index_add_: {mean_time * 1e6:.2f} us")
```

Perf numbers:

```
Before:
Avg time for index_add_: 25652.16 us

After:
Avg time for index_add_: 2675.15 us
```

That is roughly a 9.6× speedup on this shape.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@pytorch-bot

pytorch-bot bot commented Jul 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159430

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 44e97dd with merge base 1ebcba4:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: rocm AMD GPU support for Pytorch release notes: cuda release notes category labels Jul 29, 2025
@jerrymannil jerrymannil marked this pull request as draft July 29, 2025 23:29
@pruthvistony pruthvistony added topic: not user facing topic category ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/rocm Trigger "default" config CI on ROCm ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 and removed release notes: cuda release notes category labels Jul 30, 2025
@pytorch-bot

pytorch-bot bot commented Jul 30, 2025

To add the ciflow label ciflow/rocm please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot

pytorch-bot bot commented Jul 30, 2025

To add the ciflow label ciflow/periodic please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot

pytorch-bot bot commented Jul 30, 2025

To add the ciflow label ciflow/inductor-rocm please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot

pytorch-bot bot commented Jul 30, 2025

To add the ciflow label ciflow/rocm-mi300 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot

pytorch-bot bot commented Jul 30, 2025

To add the ciflow label ciflow/periodic-rocm-mi300 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed ciflow/rocm Trigger "default" config CI on ROCm ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 labels Jul 30, 2025
@pruthvistony pruthvistony added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR rocm This tag is for PRs from ROCm team ciflow/rocm Trigger "default" config CI on ROCm ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 labels Jul 30, 2025
@pruthvistony pruthvistony marked this pull request as ready for review July 31, 2025 18:54
@jerrymannil
Contributor Author

@pruthvistony @jithunnair-amd
Updated PR description with reproducer and numbers

pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Jul 31, 2025
@pruthvistony pruthvistony requested review from atalman and malfet July 31, 2025 21:35
@jerrymannil
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

* Opportunistic fast atomics works better with small sizes, since there is more chance of lanes doing atomics on the same address
@pytorchmergebot
Collaborator

Successfully rebased patch-1 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout patch-1 && git pull --rebase)

@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Aug 1, 2025
@jerrymannil
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 1, 2025
@pytorchmergebot
Collaborator

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)
Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

@jerrymannil
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@jerrymannil jerrymannil deleted the patch-1 branch August 12, 2025 19:00
chuanhaozhuge pushed a commit that referenced this pull request Aug 14, 2025
Pull Request resolved: #159430
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Aug 15, 2025
chuanhaozhuge pushed a commit that referenced this pull request Aug 18, 2025
can-gaa-hou pushed a commit to can-gaa-hou/pytorch that referenced this pull request Aug 22, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025