[ez][inductor] add a few outer dimension reduction cases for LOAF by shunting314 · Pull Request #162028 · pytorch/pytorch · GitHub

Conversation


@shunting314 shunting314 commented Sep 3, 2025

pytorch-bot bot commented Sep 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162028

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3dcc3f9 with merge base 8171d60:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@shunting314
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 5, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Sep 6, 2025
Make sure TemplateBuffer & ComputedBuffer have the same dependencies prefix.

Pull Request resolved: #162221
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162028
pytorchmergebot pushed a commit that referenced this pull request Sep 6, 2025
Skipping renaming causes wrong dependencies when mutations are involved.

Test:

CUDA_VISIBLE_DEVICES=4,5,6 TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/distributed/test_compute_comm_reordering.py TestComputeCommReorderingMultiProc.test_reorder_compute_for_overlap

Both the all-reduce and wait-tensor IR nodes contain a MutationBuffer for this test.

Pull Request resolved: #162303
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #162028, #162221
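The fix above can be pictured with a toy model (hypothetical names `deps`, `mutations`, `rename`; not inductor's real IR): when a rename happens, it must reach both a reader's dependency set and the mutation record, otherwise a node ends up depending on a stale buffer name.

```python
# Toy model of buffer renaming with mutations (hypothetical structures,
# not inductor's actual data model). A reader node depends on the buffer
# that an in-place op mutates; a rename that skipped either table would
# leave the other one pointing at a stale name.
deps = {"wait_tensor": {"buf0"}}    # reader node -> buffers it reads
mutations = {"all_reduce": "buf0"}  # mutating node -> buffer it writes in place

def rename(old: str, new: str) -> None:
    """Rename a buffer everywhere it is referenced."""
    for ds in deps.values():
        if old in ds:
            ds.discard(old)
            ds.add(new)
    for node, buf in mutations.items():
        if buf == old:
            mutations[node] = new

rename("buf0", "buf1")
# Both tables now agree on the new name, so the reader still
# correctly depends on the buffer the mutation writes.
```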
pytorchmergebot pushed a commit that referenced this pull request Sep 7, 2025
LOAF previously could skip these fusion opportunities, causing some tests to fail.

Test:
- TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda

Pull Request resolved: #162311
Approved by: https://github.com/jansel
ghstack dependencies: #162028, #162221, #162303
daisyden pushed a commit to daisyden/pytorch that referenced this pull request Sep 8, 2025
…torch#162028)

For the fusion failure reported in pytorch#93718, LOAF can fuse the outer-dimension softmax into a single kernel, bringing a 1.87x speedup for the example shape mentioned in the issue.

Pull Request resolved: pytorch#162028
Approved by: https://github.com/jansel, https://github.com/eellison
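For reference, an "outer dimension" softmax reduces along the first (outer) axis of a 2D tensor, so each column sums to 1. A minimal pure-Python sketch (illustration only, not PyTorch or inductor code; the function name is hypothetical) of the computation this PR lets LOAF fuse into a single kernel:

```python
import math

def outer_dim_softmax(rows):
    """Softmax along dim=0 of a row-major 2D list: each *column* sums to 1.
    This is the "outer dimension" reduction pattern the PR targets."""
    ncols = len(rows[0])
    # Column-wise max for numerical stability.
    col_max = [max(r[j] for r in rows) for j in range(ncols)]
    exp = [[math.exp(r[j] - col_max[j]) for j in range(ncols)] for r in rows]
    col_sum = [sum(e[j] for e in exp) for j in range(ncols)]
    return [[e[j] / col_sum[j] for j in range(ncols)] for e in exp]

out = outer_dim_softmax([[1.0, 2.0], [3.0, 4.0]])
# Each column of `out` sums to 1.
```

In eager PyTorch this corresponds to `torch.softmax(x, dim=0)`; the reduction strides across rows, which is why the fusion heuristics treat it differently from the inner-dimension (dim=-1) case.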
@github-actions github-actions bot deleted the gh/shunting314/214/head branch October 6, 2025 02:10


4 participants