[ROCm] Improve reduction sum performance by jerrymannil · Pull Request #160466 · pytorch/pytorch · GitHub

Conversation

@jerrymannil (Contributor) commented Aug 12, 2025

  • Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128
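For context, the gist of the change is that the ROCm reduction kernel only takes the vectorized-load path when the reduced, fastest-striding dimension is long enough for the wider loads to pay off. The sketch below is a rough Python illustration of that kind of gating, assuming a 16-byte (float4-style) vector width and simple alignment checks; the names `should_vectorize` and `MIN_DIM0_FOR_VEC` are made up for illustration and this is not the actual Reduce.cuh logic.

```
# Rough illustration only -- not the actual ATen/Reduce.cuh heuristic.
# Assumption: 16-byte (float4-style) vectorized loads are worthwhile once the
# contiguous reduced dimension is long enough and the data is suitably aligned.

MIN_DIM0_FOR_VEC = 128   # threshold from this PR's description (hypothetical constant name)
VEC_BYTES = 16           # e.g. 4 x float32

def should_vectorize(dim0: int, itemsize: int, base_addr: int = 0) -> bool:
    """Decide whether a reduction over a contiguous dimension of length `dim0`
    (element size `itemsize` bytes) could use vectorized input loads."""
    aligned = base_addr % VEC_BYTES == 0 and (dim0 * itemsize) % VEC_BYTES == 0
    return dim0 >= MIN_DIM0_FOR_VEC and aligned

print(should_vectorize(128, 4))  # True: matches the float32 (5079670, 128) case below
print(should_vectorize(64, 4))   # False: reduced dimension too short
```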

**Reproducer:**

```
import time
import torch

shapes = [
    (5079670, 128)
]

dims = [
    (1)  # reduce over the last (fastest-striding) dimension
]

for i, shape in enumerate(shapes):
    x = torch.randn(shape, device='cuda', dtype=torch.float)
    # warmup iterations
    for _ in range(10):
        w = torch.sum(x, dims[i])
    torch.cuda.synchronize()
    print(w.size())

    # timed iterations
    start_time = time.time()
    for _ in range(50):
        _ = torch.sum(x, dims[i])
    torch.cuda.synchronize()
    end_time = time.time()
    mean_time = (end_time - start_time)/50
    print(f"Avg time for shape {shape}: {mean_time * 1e6:.2f} us")
```

**Before (MI300X):**
Avg time for shape (5079670, 128): 1629.99 us

**After (MI300X):**
Avg time for shape (5079670, 128): 1008.59 us
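
For this shape that works out to roughly a 1.6x speedup (1629.99 / 1008.59 ≈ 1.62), i.e. about 38% less time per reduction.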

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

* Use input vectorization for reduction_on_fastest_striding_dimension when dim >= 0
pytorch-bot bot commented Aug 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160466

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit bff4a8a with merge base 4d419a7:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: rocm AMD GPU support for Pytorch release notes: cuda release notes category labels Aug 12, 2025
jerrymannil added a commit to ROCm/pytorch that referenced this pull request Aug 12, 2025
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 0

cherry-pick of pytorch#160466
@petrex (Contributor) left a comment

lgtm

@jeffdaily jeffdaily added release notes: rocm mandatorylabel ciflow/rocm Trigger "default" config CI on ROCm and removed release notes: cuda release notes category labels Aug 12, 2025
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Aug 13, 2025
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 0

cherry-pick of pytorch#160466

Fixes SWDEV-546136
@jeffdaily (Collaborator) commented:

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 13, 2025
@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

jerrymannil added a commit to ROCm/pytorch that referenced this pull request Aug 13, 2025
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128

cherry-pick of pytorch#160466

Fixes SWDEV-546136
@jerrymannil jerrymannil deleted the patch-1 branch August 13, 2025 18:49
dhonnappa-amd pushed a commit to ROCm/pytorch that referenced this pull request Aug 13, 2025
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 0

cherry-pick of pytorch#160466

Fixes SWDEV-546136
chuanhaozhuge pushed a commit that referenced this pull request Aug 14, 2025
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128

Pull Request resolved: #160466
Approved by: https://github.com/petrex, https://github.com/jeffdaily
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Aug 15, 2025
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128

cherry-pick of pytorch#160466

Fixes SWDEV-546136
chuanhaozhuge pushed a commit that referenced this pull request Aug 18, 2025
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128

Pull Request resolved: #160466
Approved by: https://github.com/petrex, https://github.com/jeffdaily
can-gaa-hou pushed a commit to can-gaa-hou/pytorch that referenced this pull request Aug 22, 2025
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128

Pull Request resolved: pytorch#160466
Approved by: https://github.com/petrex, https://github.com/jeffdaily
jerrymannil added a commit to ROCm/pytorch that referenced this pull request Sep 5, 2025
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128

cherry-pick of pytorch#160466

Fixes SWDEV-546136
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128

Pull Request resolved: pytorch#160466
Approved by: https://github.com/petrex, https://github.com/jeffdaily

Labels

  • ciflow/rocm (Trigger "default" config CI on ROCm)
  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • Merged
  • module: rocm (AMD GPU support for Pytorch)
  • open source
  • release notes: rocm (mandatorylabel)


5 participants