[ez][inductor] add a few outer dimension reduction cases for LOAF #162028
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162028
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 3dcc3f9 with merge base 8171d60. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Make sure TemplateBuffer & ComputedBuffer have the same dependencies prefix. Pull Request resolved: #162221 Approved by: https://github.com/jansel, https://github.com/eellison ghstack dependencies: #162028
Skipping renaming causes wrong dependencies when mutations are involved. Test: CUDA_VISIBLE_DEVICES=4,5,6 TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/distributed/test_compute_comm_reordering.py TestComputeCommReorderingMultiProc.test_reorder_compute_for_overlap Both the all-reduce and wait-tensor IR nodes contain a MutationBuffer for this test. Pull Request resolved: #162303 Approved by: https://github.com/eellison, https://github.com/jansel ghstack dependencies: #162028, #162221
LOAF previously could skip these fusion opportunities, causing some tests to fail. Test: - TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1 python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size4_num_block_pointers_1_num_triton_kernels_1_reduction_op4_cuda Pull Request resolved: #162311 Approved by: https://github.com/jansel ghstack dependencies: #162028, #162221, #162303
Stack from ghstack (oldest at bottom):
For the fusion failure reported in #93718, LOAF can fuse the outer-dimension softmax into a single kernel, bringing a 1.87x speedup for the example shape mentioned in the issue.
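For context, an outer-dimension softmax reduces along dim 0, so each reduction walks a strided column rather than a contiguous row — the access pattern that made fusion hard here. The sketch below is a minimal pure-Python illustration of that reduction order; the shape is illustrative, not the exact shape from issue #93718, and `softmax_outer` is a hypothetical helper, not Inductor code.

```python
import math

def softmax_outer(x):
    """Softmax along dim 0 (the outer dimension) of a 2D list-of-lists.

    For each column j, reduce over rows i. In row-major storage, x[i][j]
    for fixed j is a strided access, unlike an inner (dim=1) softmax.
    """
    rows, cols = len(x), len(x[0])
    out = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        # Numerically stable softmax over the column.
        col_max = max(x[i][j] for i in range(rows))
        exps = [math.exp(x[i][j] - col_max) for i in range(rows)]
        denom = sum(exps)
        for i in range(rows):
            out[i][j] = exps[i] / denom
    return out

# Each column of the result sums to 1.
y = softmax_outer([[0.0, 1.0], [0.0, 3.0]])
```

In eager PyTorch this corresponds to `torch.softmax(x, dim=0)`; with `TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1`, Inductor can reorder loops after fusion so the max, exp, sum, and divide land in one kernel.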
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben