[inductor] benchmark fusion by shunting314 · Pull Request #108193 · pytorch/pytorch · GitHub

Conversation

@shunting314
Contributor

@shunting314 shunting314 commented Aug 29, 2023

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Aug 29, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/108193

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 43daabe with merge base b600aed:

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov

[ghstack-poisoned]
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov

[ghstack-poisoned]
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov anijain2305

[ghstack-poisoned]
shunting314 added a commit that referenced this pull request Aug 30, 2023
ghstack-source-id: 8937919
Pull Request resolved: #108193
@shunting314
Contributor Author

shunting314 commented Aug 30, 2023

Benchmark fusion helps with huggingface (link). Check the green cells, which represent speedups. A few things worth mentioning:

  1. A few models fail; I'll need to debug those.
  2. Benchmark fusion increases compilation time, so I'll disable it by default.

As an important follow-up, I'll apply this on top of the loop ordering PR and try to find some useful patterns.
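For readers who want to try the feature, here is a rough sketch of how it can be turned on, assuming the Inductor config knob is named `benchmark_fusion` (check `torch/_inductor/config.py` for the exact name and default); this is an illustration, not code from the PR:

```python
# Hedged sketch: enable benchmark-based fusion decisions in Inductor.
# The `benchmark_fusion` knob is assumed here; verify it in torch/_inductor/config.py.
import torch
import torch._inductor.config as inductor_config

inductor_config.benchmark_fusion = True  # time fused vs. unfused candidates at compile time

def pointwise_then_reduce(x):
    return (x.sin() + x.cos()).sum(dim=-1)

compiled = torch.compile(pointwise_then_reduce)
x = torch.randn(1024, 1024, device="cuda")
out = compiled(x)  # first call compiles; the benchmarking adds compile-time overhead
```

The extra compile time mentioned in point 2 comes from timing the candidate kernels during compilation, which is why keeping the knob off by default makes sense.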

shunting314 added a commit that referenced this pull request Aug 30, 2023
ghstack-source-id: 09b5bd9
Pull Request resolved: #108193
shunting314 added a commit that referenced this pull request Aug 31, 2023
ghstack-source-id: 36137bb
Pull Request resolved: #108193
@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@shunting314 your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Oct 27, 2023
This reverts commit 73cc5d1.

Reverted #108193 on behalf of https://github.com/izaitsevfb due to Trying to unblock the revert of #108690, please rebase and reland. ([comment](#108193 (comment)))
@facebook-github-bot facebook-github-bot deleted the gh/shunting314/78/head branch October 30, 2023 14:24
pytorchmergebot pushed a commit that referenced this pull request Oct 31, 2023
shunting314 added a commit that referenced this pull request Nov 1, 2023
This PR is split out of #108193. It adds the ability to insert an assertion after each Triton kernel call to make sure no tensor argument contains nan/inf. It helped me find a few bugs when working on benchmark fusion (caused by messing up some kernel/graph level state when generating kernel code).

Right now we have to disable cudagraphs to enable the nan/inf checks; otherwise we see errors like: https://gist.github.com/shunting314/053db66c4f121e5f4c5de159bf0032ed . My best guess is that it's due to a GPU->CPU copy during cudagraph capture. cc eellison in case there is an easy way to make it work with cudagraphs. But even if the nan-checker is not compatible with cudagraphs, that's probably still fine since it's only for debugging purposes.


Test command:
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_NAN_ASSERTS=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training --disable-cudagraphs
```
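For intuition only, a minimal sketch of the kind of check such an assertion performs on a kernel's tensor arguments; the helper name and call pattern below are hypothetical, not the exact generated wrapper code:

```python
# Hypothetical sketch of a post-kernel nan/inf assertion; the real generated
# wrapper code may differ in naming and structure.
import torch

def assert_no_nan_or_inf(name: str, t: torch.Tensor) -> None:
    # isnan/isinf run on-device; evaluating the result as a Python bool forces a
    # GPU->CPU sync, which is consistent with the cudagraph-capture issue above.
    if torch.isnan(t).any() or torch.isinf(t).any():
        raise AssertionError(f"tensor {name!r} contains nan/inf after kernel call")

buf0 = torch.randn(8, device="cuda")
# ... a Triton kernel writing into buf0 would run here ...
assert_no_nan_or_inf("buf0", buf0)  # fails loudly right after the offending kernel
```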



pytorchmergebot pushed a commit that referenced this pull request Nov 2, 2023
Pull Request resolved: #112091
Approved by: https://github.com/eellison, https://github.com/jansel
pytorchmergebot pushed a commit that referenced this pull request Nov 7, 2023
…re (#113039)

Recent work (#108193 and #109275) revealed that bigger Triton kernels can regress performance due to increased register pressure, which in turn lowers thread occupancy. Taking a look at the Triton internals, I see an opportunity to reduce register pressure by decreasing the amount of work each thread does. I'm bumping up `num_warps` to achieve this. The change should only affect reduction cases.

I'm seeing a real compilation time reduction with this change, likely due to smaller LLVM IR:
https://hud.pytorch.org/benchmark/compilers?startTime=Mon%2C%2023%20Oct%202023%2017%3A57%3A40%20GMT&stopTime=Mon%2C%2006%20Nov%202023%2018%3A57%3A40%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=hoy-reduction&lCommit=f2d31b83aa170914018407d88a76d5951153b316&rBranch=main&rCommit=64f326097be8ac66ff057365f3bed2d64c697563

The slight performance improvement may just be noise; if it is real, the lower register pressure would explain it.

Ideally, we should improve Triton to automatically reroll a large kernel into an inner loop without hurting vectorization. That's something I'm considering on the LLVM side.

I'm also seeing the fused kernel from #108193 get better performance thanks to the lower register pressure: PTXAS shows a usage of 32 registers compared to 55 previously.
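To make the `num_warps` knob concrete, here is a standalone Triton reduction sketch (not the Inductor-generated kernel from this PR): with more warps per program instance, each thread handles fewer elements, which tends to lower per-thread register usage at the cost of more cross-warp communication.

```python
# Minimal sketch: a Triton row-sum reduction where num_warps controls how many
# warps cooperate on one program instance.
import torch
import triton
import triton.language as tl

@triton.jit
def row_sum_kernel(x_ptr, out_ptr, n_cols, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0)
    tl.store(out_ptr + row, tl.sum(x, axis=0))

x = torch.randn(128, 1024, device="cuda")
out = torch.empty(128, device="cuda")
# num_warps=8 instead of the more common 4: each thread covers fewer elements.
row_sum_kernel[(x.shape[0],)](x, out, x.shape[1], BLOCK=1024, num_warps=8)
```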

Pull Request resolved: #113039
Approved by: https://github.com/shunting314
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023
This reverts commit ec0cdcd.

Reverted pytorch#108193 on behalf of https://github.com/ZainRizvi due to This test is breaking trunk. In the future please make sure to add the ciflow/trunk label before force merging any PR to ensure your code doesn't break those tests ([comment](pytorch#108193 (comment)))
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023