more aggressive persistent reduction #161055
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161055. Note: links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view them below.
✅ No failures as of commit 9887644 with merge base 18b4fdd.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
You have a benchmark run?
This implies that compilation time could increase in general, since there is an extra config to autotune? Any concerns with this?
@pytorchbot merge
Merge failed. Reason: This PR needs a `release notes:` label. If not, please add the `topic: not user facing` label. To add a label, you can comment to pytorchbot, for example `@pytorchbot label "topic: not user facing"`. For more information, see the bot documentation. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: a command run by the merge workflow failed. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Gives an 18% speedup on rms norm (2048, 32768). And we have seen other instances where inductor is not aggressive enough about codegen-ing persistent reductions, e.g. 39% on [this kernel from torch ao](pytorch#159769 (comment)). Codegen-ing persistent reductions can be risky if you run out of registers. Here, I'm effectively making the persistent reduction one option among the looped-reduction configs by setting RBLOCK == rnumel, so that we can still fall back to a looped reduction as needed. As criteria:
- there needs to be significant memory savings from doing a persistent reduction (by keeping memory in registers and avoiding another iteration over the input)
- we should not be coalescing on the x dimension, otherwise a large RBLOCK will inhibit coalescing
- we should not be especially register or arithmetic intensive (this last part uses mem_ops_per_thread, but could be improved)

Still need to do a dashboard run, although I'm not sure we get a lot of large-RBLOCK reductions in our benchmarks.

Pull Request resolved: pytorch#161055
Approved by: https://github.com/jansel
    def __init__(self, *args, dynamic_scale_rblock=True, **kwargs):
        super().__init__(*args, **kwargs)
        self.dynamic_scale_rblock = dynamic_scale_rblock
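For context, a hypothetical sketch of how a constructor flag like dynamic_scale_rblock could gate the extra large-RBLOCK candidate. The function name, config shape, base sizes, and cap below are all illustrative assumptions, not the actual Inductor code (and the review comment below asks whether the flag is currently consumed at all).

    # Hypothetical sketch only; names and thresholds are assumptions, not Inductor internals.
    def candidate_rblocks(rnumel, base_rblocks=(8, 64, 512), dynamic_scale_rblock=True,
                          max_persistent_rblock=32768):
        """Return looped-reduction RBLOCK candidates, optionally plus a persistent-style one."""
        rblocks = list(base_rblocks)
        if dynamic_scale_rblock:
            persistent = 1 << (rnumel - 1).bit_length()   # next power of two >= rnumel
            if persistent <= max_persistent_rblock:       # illustrative register-pressure cap
                rblocks.append(persistent)                # the RBLOCK == rnumel candidate
        return rblocks

    print(candidate_rblocks(2048))   # [8, 64, 512, 2048]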
is this used anywhere? @eellison
Stack from ghstack (oldest at bottom):
Gives an 18% speedup on rms norm (2048, 32768). We have also seen other instances where inductor is not aggressive enough about codegen-ing persistent reductions, e.g. 39% on this kernel from torch ao.
Codegen-ing persistent reductions can be risky if you run out of registers. Here, I'm effectively making the persistent reduction one option among the looped-reduction configs by setting RBLOCK == rnumel, so that we can still fall back to a looped reduction as needed.
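To make the RBLOCK == rnumel point concrete, here is a minimal plain-Python model (an illustration only, not Inductor or Triton code): a looped reduction walks the row in RBLOCK-sized tiles, so choosing RBLOCK equal to the row length collapses the loop to a single pass, which is exactly the persistent case.

    # Plain-Python model of a looped row reduction; illustration only.
    def looped_row_sum(row, rblock):
        acc = 0.0
        # ceil(len(row) / rblock) iterations; rblock == len(row) gives a single
        # pass, i.e. the "persistent" configuration.
        for start in range(0, len(row), rblock):
            acc += sum(row[start:start + rblock])
        return acc

    row = [float(i) for i in range(2048)]
    assert looped_row_sum(row, 64) == looped_row_sum(row, len(row)) == sum(row)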
As criteria:
- there needs to be significant memory savings from doing a persistent reduction (by keeping memory in registers and avoiding another iteration over the input)
- we should not be coalescing on the x dimension, otherwise a large RBLOCK will inhibit coalescing
- we should not be especially register or arithmetic intensive (this last part uses mem_ops_per_thread, but could be improved)
Still need to do a dashboard run, although I'm not sure we get a lot of large-RBLOCK reductions in our benchmarks.
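A rough sketch of the gating criteria above; every name, field, and threshold here is an illustrative assumption rather than the actual Inductor heuristic.

    from dataclasses import dataclass

    # Illustrative assumptions only; not the real Inductor heuristic or field names.
    @dataclass
    class ReductionInfo:
        saved_bytes: int            # memory traffic avoided by staying in registers
        coalesced_on_x: bool        # loads already coalesce along the x dimension
        mem_ops_per_thread: int     # proxy for register / arithmetic intensity

    def want_persistent_candidate(info, min_saved_bytes=1 << 20, max_mem_ops=8):
        if info.saved_bytes < min_saved_bytes:      # (1) needs significant memory savings
            return False
        if info.coalesced_on_x:                     # (2) a large RBLOCK would hurt x-coalescing
            return False
        if info.mem_ops_per_thread > max_mem_ops:   # (3) skip register/arithmetic-heavy kernels
            return False
        return True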
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben