[inductor] benchmark fusion #108193
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/108193
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 unrelated failure) As of commit 43daabe with merge base b600aed: BROKEN TRUNK - the following job failed but was also present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Benchmark fusion helps with HuggingFace (link). Check the green cells representing speedups. A few things are worth mentioning.
As an important follow-up, I'll apply this to the loop ordering PR and try to find some useful patterns.
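For anyone who wants to try the feature, below is a minimal sketch of turning it on through the Inductor config. The `benchmark_fusion` knob name is assumed from this PR and is not stated in the thread, so it may differ across PyTorch versions.

```python
# Minimal sketch, assuming the config knob added by this PR is named
# `benchmark_fusion`; adjust for your PyTorch version.
import torch
import torch._inductor.config as inductor_config

inductor_config.benchmark_fusion = True  # benchmark fusion candidates instead of relying on heuristics alone

def f(x, y):
    return torch.nn.functional.relu(x @ y).sum(dim=-1)

compiled_f = torch.compile(f, backend="inductor")
x = torch.randn(1024, 1024, device="cuda")
y = torch.randn(1024, 1024, device="cuda")
print(compiled_f(x, y).shape)
```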
@pytorchbot successfully started a revert job. Check the current status here.
@shunting314 your PR has been successfully reverted.
This reverts commit 73cc5d1. Reverted #108193 on behalf of https://github.com/izaitsevfb due to Trying to unblock the revert of #108690, please rebase and reland. ([comment](#108193 (comment)))
reland #108193 Pull Request resolved: #112450 Approved by: https://github.com/jansel
This PR is split out of #108193. It adds the ability to insert an assertion after each Triton kernel call to make sure all tensor arguments are not nan/inf. It helped me find a few bugs while working on benchmark fusion (due to messing up some kernel/graph level state when generating kernel code).

Right now we have to disable cudagraphs to enable the nan/inf checks; otherwise we see errors like: https://gist.github.com/shunting314/053db66c4f121e5f4c5de159bf0032ed . My best guess is that it's due to the GPU->CPU copy during capturing for cudagraphs. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @eellison if there is an easy way to make it work with cudagraphs. But even if the nan-checker is not compatible with cudagraphs, that's probably still fine since it's just for debugging purposes.

Test command:
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_NAN_ASSERTS=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training --disable-cudagraphs
```

Pull Request resolved: #112091 Approved by: https://github.com/eellison, https://github.com/jansel
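Roughly, the check inserted after each kernel call amounts to something like the sketch below. This is an illustration of the idea only, not the exact code Inductor emits; the helper name and buffer names are hypothetical.

```python
# Sketch of the kind of check TORCHINDUCTOR_NAN_ASSERTS enables: every tensor
# argument of a just-launched kernel is verified to be free of NaN/Inf.
import torch

def assert_no_nan_or_inf(name: str, t: torch.Tensor) -> None:
    # .item() forces a GPU->CPU sync, which is why this clashes with cudagraph capture.
    assert not torch.isnan(t).any().item(), f"{name} contains NaN"
    assert not torch.isinf(t).any().item(), f"{name} contains Inf"

# Hypothetical usage right after a generated Triton kernel launch:
# triton_fused_kernel_0[grid](buf0, buf1, ...)
# assert_no_nan_or_inf("buf0", buf0)
# assert_no_nan_or_inf("buf1", buf1)
```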
…re (#113039) Recent work (#108193 and #109275) unveiled that bigger Triton kernels can regress performance due to increased register pressure, which in turn lowers thread occupancy. After taking a look at Triton's internals, I see an opportunity to reduce register pressure by decreasing the amount of work each thread does. I'm bumping up `num_warps` to achieve this. The change should only affect reduction cases.

I'm seeing a real compilation time reduction with this change, which is likely due to smaller LLVM IR: https://hud.pytorch.org/benchmark/compilers?startTime=Mon%2C%2023%20Oct%202023%2017%3A57%3A40%20GMT&stopTime=Mon%2C%2006%20Nov%202023%2018%3A57%3A40%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=hoy-reduction&lCommit=f2d31b83aa170914018407d88a76d5951153b316&rBranch=main&rCommit=64f326097be8ac66ff057365f3bed2d64c697563 The slight performance improvement may be noise; if not, the lower register pressure could explain it. Ideally, we should improve Triton to automatically reroll large kernels into an inner loop without hurting vectorization. That's something I'm considering on the LLVM side.

I'm also seeing that the fused kernel provided in #108193 gets better performance by benefiting from lower register pressure: PTXAS shows a usage of 32 registers compared to 55 previously.

Pull Request resolved: #113039 Approved by: https://github.com/shunting314
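To make the `num_warps` effect concrete, here is a small back-of-the-envelope sketch (not Inductor's actual tuning heuristic) of how raising `num_warps` shrinks the per-thread workload of a reduction tile, which is what relieves register pressure.

```python
# Back-of-the-envelope illustration only, not Inductor's heuristic.
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def elements_per_thread(rblock: int, num_warps: int) -> float:
    """How many reduction elements each thread handles for one tile of size rblock."""
    return rblock / (num_warps * WARP_SIZE)

for nw in (4, 8, 16):
    print(f"num_warps={nw:2d} -> {elements_per_thread(2048, nw):5.1f} elements/thread")
# Fewer elements per thread means fewer values live in registers at once,
# lowering register pressure at the cost of more cross-warp reduction traffic.
```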
Pull Request resolved: pytorch#108193 Approved by: https://github.com/jansel
This reverts commit ec0cdcd. Reverted pytorch#108193 on behalf of https://github.com/ZainRizvi due to This test is breaking trunk. In the future please make sure to add the ciflow/trunk label before force merging any PR to ensure your code doesn't break those tests ([comment](pytorch#108193 (comment)))
Pull Request resolved: pytorch#108193 Approved by: https://github.com/jansel
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @Xia-Weiwen @ngimel @anijain2305