[ez][inductor] add a few outer dimension reduction cases for LOAF by shunting314 · Pull Request #162028 · pytorch/pytorch · GitHub

Conversation


@shunting314 shunting314 commented Sep 3, 2025

pytorch-bot bot commented Sep 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162028

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3dcc3f9 with merge base 8171d60:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@shunting314
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 5, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Sep 6, 2025
Make sure TemplateBuffer & ComputedBuffer have the same dependencies prefix.

Pull Request resolved: #162221
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #162028
pytorchmergebot pushed a commit that referenced this pull request Sep 6, 2025
Skipping renaming causes wrong dependencies when mutations are involved.

Test:

CUDA_VISIBLE_DEVICES=4,5,6 TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/distributed/test_compute_comm_reordering.py TestComputeCommReorderingMultiProc.test_reorder_compute_for_overlap

Both the all-reduce and wait-tensor IR nodes contain a MutationBuffer for this test.

Pull Request resolved: #162303
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #162028, #162221
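The fix above can be pictured with a toy model (hypothetical names `deps`, `mutations`, `rename`; not inductor's real IR): when a rename happens, it must reach both a reader's dependency set and the mutation record, otherwise a node ends up depending on a stale buffer name.

```python
# Toy model of buffer renaming with mutations (hypothetical structures,
# not inductor's actual data model). A reader node depends on the buffer
# that an in-place op mutates; a rename that skipped either table would
# leave the other one pointing at a stale name.
deps = {"wait_tensor": {"buf0"}}    # reader node -> buffers it reads
mutations = {"all_reduce": "buf0"}  # mutating node -> buffer it writes in place

def rename(old: str, new: str) -> None:
    """Rename a buffer everywhere it is referenced."""
    for ds in deps.values():
        if old in ds:
            ds.discard(old)
            ds.add(new)
    for node, buf in mutations.items():
        if buf == old:
            mutations[node] = new

rename("buf0", "buf1")
# Both tables now agree on the new name, so the reader still
# correctly depends on the buffer the mutation writes.
```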
pytorchmergebot pushed a commit that referenced this pull request Sep 7, 2025
LOAF previously could skip these fusion opportunities, causing some tests to fail.

Test:
- TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda

Pull Request resolved: #162311
Approved by: https://github.com/jansel
ghstack dependencies: #162028, #162221, #162303
daisyden pushed a commit to daisyden/pytorch that referenced this pull request Sep 8, 2025
…torch#162028)

For the fusion failure reported in pytorch#93718, LOAF can fuse the outer-dimension softmax into a single kernel, bringing a 1.87x speedup for the example shape mentioned in the issue.

Pull Request resolved: pytorch#162028
Approved by: https://github.com/jansel, https://github.com/eellison
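For reference, an "outer dimension" softmax reduces along the first (outer) axis of a 2D tensor, so each column sums to 1. A minimal pure-Python sketch (illustration only, not PyTorch or inductor code; the function name is hypothetical) of the computation this PR lets LOAF fuse into a single kernel:

```python
import math

def outer_dim_softmax(rows):
    """Softmax along dim=0 of a row-major 2D list: each *column* sums to 1.
    This is the "outer dimension" reduction pattern the PR targets."""
    ncols = len(rows[0])
    # Column-wise max for numerical stability.
    col_max = [max(r[j] for r in rows) for j in range(ncols)]
    exp = [[math.exp(r[j] - col_max[j]) for j in range(ncols)] for r in rows]
    col_sum = [sum(e[j] for e in exp) for j in range(ncols)]
    return [[e[j] / col_sum[j] for j in range(ncols)] for e in exp]

out = outer_dim_softmax([[1.0, 2.0], [3.0, 4.0]])
# Each column of `out` sums to 1.
```

In eager PyTorch this corresponds to `torch.softmax(x, dim=0)`; the reduction strides across rows, which is why the fusion heuristics treat it differently from the inner-dimension (dim=-1) case.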
@github-actions github-actions bot deleted the gh/shunting314/214/head branch October 6, 2025 02:10


4 participants