[inductor] let codegen not rely on node order #107320

shunting314 · 2023-08-16T19:05:06Z

Stack from ghstack (oldest at bottom):

We'd like to benchmark fusion (either for autotuning or for gathering data to find patterns that can guide optimizations). There is a circular dependency that prevents us from doing this: to benchmark a fusion, we need to do codegen before all the fusions are done. However, codegen currently relies on xSchedulerNode.last_usage information to decide which buffers are not needed at all and thus don't even need to be allocated/written (Scheduler.removed_buffers tracks this). xSchedulerNode.last_usage can only be computed once the order of all the nodes has been decided, but each fusion pass (fuse_nodes_once) can also change the node order. So we know the final node order only after all the fusions have completed. That blocks us from doing codegen during fusion (before all fusions are done).

The chain of dependencies below summarizes the problem (a -> b means a depends on b, i.e. b has to happen before a):

  benchmark one fusion decision -> codegen -> xSchedulerNode.last_usage -> node order -> all fusions have completed
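
To make the order dependence concrete, here is a minimal sketch on a toy model. The function `compute_last_usage`, the node names and the `reads` mapping are all hypothetical stand-ins, not Inductor's actual data structures; the point is only that reordering nodes changes which node holds the last use of a buffer.

```python
from typing import Dict, List, Set


def compute_last_usage(
    order: List[str], reads: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """For each node, return the buffers whose final read happens at that node.

    Toy model (assumption): `order` is the scheduled node order and
    `reads[node]` is the set of buffer names that node reads.
    """
    seen_later: Set[str] = set()
    last_usage: Dict[str, Set[str]] = {}
    # Walk backwards: the first time a buffer shows up is its last use.
    for node in reversed(order):
        last_usage[node] = reads[node] - seen_later
        seen_later |= reads[node]
    return last_usage


reads = {"n1": {"buf0"}, "n2": {"buf0", "buf1"}, "n3": {"buf1"}}

# Same nodes, two different orders: buf0's last use moves from n2 to n1.
before = compute_last_usage(["n1", "n2", "n3"], reads)
after = compute_last_usage(["n2", "n1", "n3"], reads)
print(before["n2"], after["n1"])
```

Any pass that swaps `n1` and `n2` invalidates the previously computed result, which is why the information is only reliable after the final order is known.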

In fact, codegen only needs to decide whether a buffer has purely local usages (if so, it's a candidate for removal). That can be decided once we know all the users of each buffer, so we can avoid relying on xSchedulerNode.last_usage here.
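
A minimal sketch of that idea follows, again on a toy model. The helper `removable_buffers` and its `writes`, `users` and `graph_outputs` arguments are hypothetical names chosen for illustration, not the scheduler's real API. A buffer written inside a kernel is treated as a removal candidate when every one of its users lives in that same kernel and it is not a graph output; no node ordering is consulted.

```python
from typing import Dict, Set


def removable_buffers(
    kernel_nodes: Set[str],
    writes: Dict[str, Set[str]],
    users: Dict[str, Set[str]],
    graph_outputs: Set[str],
) -> Set[str]:
    """Return buffers written in this kernel that have only local usages.

    Assumptions: `writes[node]` lists buffers a node produces, and
    `users[buf]` lists every node that reads `buf`.
    """
    candidates: Set[str] = set()
    for node in kernel_nodes:
        for buf in writes.get(node, set()):
            if buf in graph_outputs:
                continue  # must be materialized for the caller
            # Local-only: every user is inside the kernel being generated.
            if users.get(buf, set()) <= kernel_nodes:
                candidates.add(buf)
    return candidates


writes = {"n1": {"buf0"}, "n2": {"buf1"}}
users = {"buf0": {"n2"}, "buf1": {"n3"}}

fused = removable_buffers({"n1", "n2"}, writes, users, graph_outputs=set())
unfused = removable_buffers({"n1"}, writes, users, graph_outputs=set())
print(fused, unfused)
```

With `n1` and `n2` fused, `buf0` never leaves the kernel and is a candidate; with `n1` alone, its user `n2` is external, so it has to be kept. The answer depends only on the user sets and the kernel's membership, not on where the kernel sits in the schedule.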

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov

[ghstack-poisoned]

pytorch-bot · 2023-08-16T19:05:08Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/107320

✅ No Failures

As of commit 067f47c with merge base 138e289:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

shunting314 · 2023-08-16T20:32:54Z

Seeing a lot of CI failures. Converting to draft for now; will change it back once I fix them.

ghstack-source-id: a7f85f6 Pull Request resolved: #107320

ghstack-source-id: 1d15a88 Pull Request resolved: #107320

shunting314 · 2023-08-17T23:10:53Z

Test failures fixed. Ready for review now.

torch/_inductor/scheduler.py

ghstack-source-id: 0a3f081 Pull Request resolved: #107320

torch/_inductor/config.py

ghstack-source-id: d9b41b2 Pull Request resolved: #107320

torch/_inductor/scheduler.py

ghstack-source-id: 747f3c0 Pull Request resolved: #107320

torch/_inductor/scheduler.py

torch/_inductor/codegen/common.py

torch/_inductor/scheduler.py

torch/_inductor/codegen/triton.py

torch/_inductor/scheduler.py

jansel · 2023-08-25T18:07:44Z

Feel free to re-request review once Peter's comments are addressed.

ghstack-source-id: 9d54bcb Pull Request resolved: #107320

shunting314 · 2023-08-29T22:18:02Z

@pytorchbot merge

pytorchmergebot · 2023-08-29T22:21:17Z

Merge failed

Reason: This PR needs a `release notes:` label
If your changes are user facing and intended to be a part of release notes, please use a label starting with `release notes:`.

If not, please add the `topic: not user facing` label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

shunting314 · 2023-08-29T22:23:24Z

@pytorchbot label "topic: not user facing"

shunting314 · 2023-08-29T22:23:44Z

@pytorchbot merge

pytorchmergebot · 2023-08-29T22:25:34Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

[inductor] let codegen not rely on node order

b89cd30

[ghstack-poisoned]

github-actions bot added module: inductor ciflow/inductor labels Aug 16, 2023

shunting314 requested review from Chillee, eellison and jansel August 16, 2023 19:19

shunting314 marked this pull request as draft August 16, 2023 20:32

shunting314 added a commit that referenced this pull request Aug 16, 2023

[inductor] let codegen not rely on node order

fd44b3c

ghstack-source-id: a7f85f6 Pull Request resolved: #107320

shunting314 added a commit that referenced this pull request Aug 17, 2023

[inductor] let codegen not rely on node order

00794dc

ghstack-source-id: 1d15a88 Pull Request resolved: #107320

shunting314 marked this pull request as ready for review August 17, 2023 23:10

peterbell10 reviewed Aug 18, 2023

torch/_inductor/scheduler.py (outdated, resolved)

shunting314 mentioned this pull request Aug 21, 2023

[inductor] no-side-effect codegen #107617

Closed

shunting314 added a commit that referenced this pull request Aug 23, 2023

[inductor] let codegen not rely on node order

5db8ae4

ghstack-source-id: 0a3f081 Pull Request resolved: #107320

jansel reviewed Aug 23, 2023

torch/_inductor/config.py (resolved)

shunting314 added a commit that referenced this pull request Aug 24, 2023

[inductor] let codegen not rely on node order

b3d2e15

ghstack-source-id: d9b41b2 Pull Request resolved: #107320

shunting314 commented Aug 24, 2023

torch/_inductor/scheduler.py (resolved)

shunting314 requested review from jansel and peterbell10 August 24, 2023 22:50

shunting314 added a commit that referenced this pull request Aug 25, 2023

[inductor] let codegen not rely on node order

c097782

ghstack-source-id: 747f3c0 Pull Request resolved: #107320

peterbell10 requested changes Aug 25, 2023

torch/_inductor/scheduler.py (outdated, resolved)

torch/_inductor/codegen/common.py (outdated, resolved)

torch/_inductor/scheduler.py (outdated, resolved)

torch/_inductor/codegen/triton.py (outdated, resolved)

shunting314 commented Aug 25, 2023

torch/_inductor/scheduler.py (outdated, resolved)

jansel removed their request for review August 25, 2023 18:07

shunting314 added 2 commits August 29, 2023 11:09

shunting314 added a commit that referenced this pull request Aug 29, 2023

[inductor] let codegen not rely on node order

864f417

ghstack-source-id: 9d54bcb Pull Request resolved: #107320

shunting314 requested review from jansel and peterbell10 August 29, 2023 19:02

jansel approved these changes Aug 29, 2023

peterbell10 approved these changes Aug 29, 2023

pytorch-bot bot added the ciflow/trunk label Aug 29, 2023

pytorchmergebot added the merging label Aug 29, 2023

pytorchmergebot removed the merging label Aug 29, 2023

pytorch-bot bot added the topic: not user facing label Aug 29, 2023

pytorchmergebot added the merging label Aug 29, 2023

shunting314 mentioned this pull request Aug 29, 2023

[inductor] benchmark fusion #108193

Closed

pytorchmergebot added Merged and removed merging labels Aug 30, 2023

pytorchmergebot closed this in 556bfe7 Aug 30, 2023

ezyang added a commit that referenced this pull request Sep 17, 2023

Revert "[inductor] let codegen not rely on node order (#107320)"

4948bbc

This reverts commit 556bfe7. [ghstack-poisoned]

ezyang added a commit that referenced this pull request Sep 17, 2023

Revert "[inductor] let codegen not rely on node order (#107320)"

b7f6faa

This reverts commit 556bfe7. ghstack-source-id: 71c3a3f Pull Request resolved: #109466

facebook-github-bot deleted the gh/shunting314/72/head branch September 29, 2023 14:24
