[inductor] benchmark fusion #108193
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/108193
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 unrelated failure) As of commit 43daabe with merge base b600aed: BROKEN TRUNK - the following job failed but was also present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Benchmark fusion helps with HuggingFace (link). Check the green cells representing speedups. A few things are worth mentioning.
As an important follow-up, I'll apply this to the loop ordering PR and try to find some useful patterns.
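For anyone who wants to try the feature, below is a minimal sketch of turning it on through the Inductor config. The `benchmark_fusion` knob name is assumed from this PR and is not stated in the thread, so it may differ across PyTorch versions.

```python
# Minimal sketch, assuming the config knob added by this PR is named
# `benchmark_fusion`; adjust for your PyTorch version.
import torch
import torch._inductor.config as inductor_config

inductor_config.benchmark_fusion = True  # benchmark fusion candidates instead of relying on heuristics alone

def f(x, y):
    return torch.nn.functional.relu(x @ y).sum(dim=-1)

compiled_f = torch.compile(f, backend="inductor")
x = torch.randn(1024, 1024, device="cuda")
y = torch.randn(1024, 1024, device="cuda")
print(compiled_f(x, y).shape)
```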
@pytorchbot successfully started a revert job. Check the current status here.
@shunting314 your PR has been successfully reverted.
This reverts commit 73cc5d1. Reverted #108193 on behalf of https://github.com/izaitsevfb due to Trying to unblock the revert of #108690, please rebase and reland. ([comment](#108193 (comment)))
reland #108193 Pull Request resolved: #112450 Approved by: https://github.com/jansel
This PR is split out of #108193. It adds the ability to insert an assertion after each Triton kernel call to make sure all tensor arguments are not nan/inf. It helped me find a few bugs while working on benchmark fusion (due to messing up some kernel/graph level state when generating kernel code).

Right now we have to disable cudagraphs to enable the nan/inf checks; otherwise we see errors like: https://gist.github.com/shunting314/053db66c4f121e5f4c5de159bf0032ed . My best guess is that it's due to the GPU->CPU copy during capturing for cudagraphs. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @eellison if there is an easy way to make it work with cudagraphs. But even if the nan-checker is not compatible with cudagraphs, that's probably still fine since it's just for debugging purposes.

Test command:
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_NAN_ASSERTS=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training --disable-cudagraphs
```

Pull Request resolved: #112091 Approved by: https://github.com/eellison, https://github.com/jansel
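Roughly, the check inserted after each kernel call amounts to something like the sketch below. This is an illustration of the idea only, not the exact code Inductor emits; the helper name and buffer names are hypothetical.

```python
# Sketch of the kind of check TORCHINDUCTOR_NAN_ASSERTS enables: every tensor
# argument of a just-launched kernel is verified to be free of NaN/Inf.
import torch

def assert_no_nan_or_inf(name: str, t: torch.Tensor) -> None:
    # .item() forces a GPU->CPU sync, which is why this clashes with cudagraph capture.
    assert not torch.isnan(t).any().item(), f"{name} contains NaN"
    assert not torch.isinf(t).any().item(), f"{name} contains Inf"

# Hypothetical usage right after a generated Triton kernel launch:
# triton_fused_kernel_0[grid](buf0, buf1, ...)
# assert_no_nan_or_inf("buf0", buf0)
# assert_no_nan_or_inf("buf1", buf1)
```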
…re (#113039) Recent work (#108193 and #109275) unveiled that bigger Triton kernels can regress performance due to increased register pressure, which in turn lowers thread occupancy. After taking a look at Triton's internals, I see an opportunity to reduce register pressure by decreasing the amount of work each thread does. I'm bumping up `num_warps` to achieve this. The change should only affect reduction cases.

I'm seeing a real compilation time reduction with this change, which is likely due to smaller LLVM IR: https://hud.pytorch.org/benchmark/compilers?startTime=Mon%2C%2023%20Oct%202023%2017%3A57%3A40%20GMT&stopTime=Mon%2C%2006%20Nov%202023%2018%3A57%3A40%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=hoy-reduction&lCommit=f2d31b83aa170914018407d88a76d5951153b316&rBranch=main&rCommit=64f326097be8ac66ff057365f3bed2d64c697563 The slight performance improvement may be noise; if not, the lower register pressure could explain it. Ideally, we should improve Triton to automatically reroll large kernels into an inner loop without hurting vectorization. That's something I'm considering on the LLVM side.

I'm also seeing that the fused kernel provided in #108193 gets better performance by benefiting from lower register pressure: PTXAS shows a usage of 32 registers compared to 55 previously.

Pull Request resolved: #113039 Approved by: https://github.com/shunting314
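To make the `num_warps` effect concrete, here is a small back-of-the-envelope sketch (not Inductor's actual tuning heuristic) of how raising `num_warps` shrinks the per-thread workload of a reduction tile, which is what relieves register pressure.

```python
# Back-of-the-envelope illustration only, not Inductor's heuristic.
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def elements_per_thread(rblock: int, num_warps: int) -> float:
    """How many reduction elements each thread handles for one tile of size rblock."""
    return rblock / (num_warps * WARP_SIZE)

for nw in (4, 8, 16):
    print(f"num_warps={nw:2d} -> {elements_per_thread(2048, nw):5.1f} elements/thread")
# Fewer elements per thread means fewer values live in registers at once,
# lowering register pressure at the cost of more cross-warp reduction traffic.
```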
Pull Request resolved: pytorch#108193 Approved by: https://github.com/jansel
This reverts commit ec0cdcd. Reverted pytorch#108193 on behalf of https://github.com/ZainRizvi due to This test is breaking trunk. In the future please make sure to add the ciflow/trunk label before force merging any PR to ensure your code doesn't break those tests ([comment](pytorch#108193 (comment)))
Pull Request resolved: pytorch#108193 Approved by: https://github.com/jansel
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @Xia-Weiwen @ngimel @anijain2305