[ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config by hongxiayang · Pull Request #135397 · pytorch/pytorch · GitHub

Conversation


@hongxiayang (Collaborator) commented on Sep 6, 2024

Fixes #132964

This change optimizes torch.sum() performance on the ROCm platform by increasing max_values_per_thread in setReduceConfig().
With more values per thread, the reduction launches fewer thread blocks, which improves performance for large tensors.
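To illustrate with hypothetical launch parameters: for the 2^30-element test tensor below, 512 threads per block and max_values_per_thread = 256 would yield roughly 2^30 / (512 × 256) = 8192 blocks, while a cap of 1024 values per thread cuts that to 2048 blocks, each streaming a longer contiguous chunk of memory.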

Test:
Tested on MI300X and H100. For the test case below, MI300X bandwidth improved from ~1690 GByte/s to 3205 GByte/s, slightly better than H100 (3136 GByte/s). Other tensor sizes were also tested and showed improvements.

```python
import torch
from triton.testing import do_bench

x = torch.randn(2**30, device='cuda')

# Median kernel time in milliseconds for a full reduction.
ms = do_bench(lambda: x.sum(dim=-1))

# Total bytes the kernel must read, in (decimal) GBytes.
bandwidth_gbyte = x.numel() * x.dtype.itemsize / (10**9)

time_s = ms / 1000

# Effective memory bandwidth in GByte/s.
bw_per_second = bandwidth_gbyte / time_s

print(bw_per_second)
```
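For context, the change is confined to the reduction launch heuristics in aten/src/ATen/native/cuda/Reduce.cuh. A minimal sketch of the idea, with hypothetical values (the actual constants and surrounding code in setReduceConfig() may differ):

```cpp
// Sketch only: gate a larger per-thread workload behind the ROCm build flag,
// so each thread accumulates more elements and fewer blocks are launched.
#ifdef USE_ROCM
constexpr int max_values_per_thread = 1024;  // hypothetical ROCm value
#else
constexpr int max_values_per_thread = 256;   // hypothetical CUDA value
#endif
```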

Co-author: @carlobertolli

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo


pytorch-bot bot commented Sep 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135397

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 6cd7c04 with merge base de74aaf:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the labels ciflow/rocm (Trigger "default" config CI on ROCm), module: rocm (AMD GPU support for PyTorch), and release notes: cuda (release notes category) on Sep 6, 2024
@hongxiayang hongxiayang changed the title [ROCm] slow tensor sum optimization by increasing max_values_per_thread in reduce config [ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config Sep 6, 2024
@hongxiayang hongxiayang marked this pull request as ready for review September 6, 2024 22:52
@hongxiayang hongxiayang requested a review from malfet September 9, 2024 21:39
@hongxiayang (Collaborator, Author) commented

Hi @malfet, can you help merge this PR? The two test failures are unrelated. Thank you!

@jithunnair-amd added the rocm label (This tag is for PRs from ROCm team) on Sep 10, 2024
@malfet (Contributor) commented Sep 10, 2024

@pytorchbot merge -f "Lint is green"

@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@jithunnair-amd (Collaborator) commented

@pytorchmergebot cherry-pick --help


pytorch-bot bot commented Sep 10, 2024

❌ 🤖 pytorchbot command failed:

```
@pytorchbot cherry-pick: error: the following arguments are required: --onto, -c/--classification

usage: @pytorchbot cherry-pick --onto ONTO [--fixes FIXES] -c
                               {regression,critical,fixnewfeature,docs,release}
```

Try @pytorchbot --help for more info.

@jithunnair-amd (Collaborator) commented

@pytorchmergebot cherry-pick --onto release/2.5 -c critical

pytorchbot pushed a commit that referenced this pull request Sep 10, 2024
[ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config (#135397)

Pull Request resolved: #135397
Approved by: https://github.com/eqy, https://github.com/malfet
(cherry picked from commit eb38ee2)
@pytorchbot (Collaborator) commented

Cherry picking #135397

The cherry pick PR is at #135624 and it is recommended to link a critical cherry pick PR with an issue.


yushangdi pushed a commit that referenced this pull request Sep 12, 2024
[ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config (#135397)
hongxiayang added a commit to ROCm/pytorch that referenced this pull request Sep 12, 2024
[ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config (pytorch#135397) (#1588)
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Sep 13, 2024
Follow-up to pytorch#135397.
AMD GPUs perform better with fewer thread blocks, so increase min_values_per_thread as well. This helped improve [CvT](https://github.com/facebookresearch/FAMBench/tree/main/benchmarks/cvt) benchmark performance on MI300X.

Co-author: @carlobertolli
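The follow-up touches the same launch heuristics. A hedged sketch with hypothetical values (the constant name comes from the commit message; the actual numbers in setReduceConfig() may differ):

```cpp
// Sketch only: raising the floor as well keeps small and medium reductions
// from launching more thread blocks than AMD GPUs handle efficiently.
#ifdef USE_ROCM
constexpr int min_values_per_thread = 32;  // hypothetical ROCm value
#else
constexpr int min_values_per_thread = 16;  // hypothetical CUDA value
#endif
```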
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
[ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config (pytorch#135397)
@functionstackx (Contributor) commented

Thanks @hongxiayang! I can confirm that this fixes it.

jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Oct 23, 2024
[ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config (pytorch#135397) (#1588) (cherry picked from commit 4360582)
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Oct 23, 2024
Follow-up to pytorch#135397: increase min_values_per_thread as well. (cherry picked from commit c1b6f60)
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Mar 17, 2025
[ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config (pytorch#135397) (#1588)
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Mar 17, 2025
Follow-up to pytorch#135397: increase min_values_per_thread as well.

Labels

ciflow/rocm (Trigger "default" config CI on ROCm), Merged, module: rocm (AMD GPU support for PyTorch), open source, release notes: cuda (release notes category), rocm (This tag is for PRs from ROCm team)

Development

Successfully merging this pull request may close these issues:

ROCm MI300X sum() way slower than H100 (#132964)

7 participants