[inductor] don't fuse two nodes if likely increase peak memory #138756

shunting314 · 2024-10-23T22:26:58Z

Stack from ghstack (oldest at bottom):

Partially fixing #138685

Add a (relatively safe?) heuristic to skip fusion if fusing could potentially increase peak memory.

The doc string mainly explains what this PR is doing:

        The implementation is more like a heuristic since we don't really know if we are at peak
        or not when trying to fuse these two nodes. The order of nodes may change later which makes the
        peak memory estimation hard.
        Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes:
        1. find all buffers read by each node with a single user. These buffers are supposed to
           be reused if we don't fuse these 2 nodes
        2. find the intersection of these buffers for the two nodes and sum the total buffer size.
           If we don't fuse these two nodes, we can at least avoid this much memory allocation.
           Note that the extra memory allocation does not necessarily cause a peak memory increase.
           This is just a heuristic.
        We return true only if the saving from fusion cannot offset the extra memory allocation.
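
As a rough illustration of the two steps in the docstring above, here is a minimal sketch in plain Python. It is not the code from this PR: `Buf`, `reuse_key`, the helper names, and the `factor` parameter are all hypothetical, and it assumes that two buffers count as interchangeable for reuse when their device, dtype, and byte size match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buf:
    """Hypothetical stand-in for a buffer read by a scheduler node."""

    name: str
    device: str
    dtype: str
    nbytes: int
    num_users: int  # number of nodes that read this buffer

    @property
    def reuse_key(self):
        # Assumption: buffers with matching fields could share storage.
        return (self.device, self.dtype, self.nbytes)


def single_user_reuse_keys(reads):
    """Step 1: reuse keys of the inputs that have exactly one user."""
    return {b.reuse_key for b in reads if b.num_users == 1}


def extra_allocation_lower_bound(reads1, reads2):
    """Step 2: sum the sizes in the intersection of both nodes' keys."""
    common = single_user_reuse_keys(reads1) & single_user_reuse_keys(reads2)
    return sum(nbytes for _device, _dtype, nbytes in common)


def fusion_may_increase_peak_memory(reads1, reads2, fusion_saving_bytes, factor=1):
    """True only if the fusion saving cannot offset the extra allocation.

    `factor` is an illustrative knob, not a value taken from the PR.
    """
    overhead = extra_allocation_lower_bound(reads1, reads2)
    return overhead > factor * fusion_saving_bytes


# Two nodes, each reading one single-user 4 KiB buffer plus a shared input.
buf0 = Buf("buf0", "cuda:0", "float32", 4096, num_users=1)
buf1 = Buf("buf1", "cuda:0", "float32", 4096, num_users=1)
shared = Buf("buf2", "cuda:0", "float32", 1024, num_users=2)

print(extra_allocation_lower_bound([buf0, shared], [buf1, shared]))
print(fusion_may_increase_peak_memory([buf0, shared], [buf1, shared], 1024))
```

In this toy case the shared input is ignored because it has two users, so the lower bound is the 4096 bytes of the matching single-user buffers. Whether fusion is skipped then depends only on how that compares with the estimated saving.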

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

[ghstack-poisoned]

pytorch-bot · 2024-10-23T22:27:03Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138756

✅ No Failures

As of commit 2c40d7e with merge base e6ff07f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 843d49c Pull Request resolved: #138756

shunting314 · 2024-10-24T18:27:19Z

The perf of revision 6a7fa50 here is neutral. But I probably need to change how we decide whether to skip fusion here. Filtering for inputs with a single user seems too restrictive.

jansel

Failing tests

Does this impact compile time at all?

eellison

We have recently seen other cases of inductor increasing memory use (see internal link). I would rather we do the full solution, i.e., with tensor liveness ranges, peak memory calculation, etc.

shunting314 · 2024-10-25T17:45:59Z

> Does this impact compile time at all?

Compilation time looks fine:

I'll fix the tests.

shunting314 · 2024-10-25T17:49:51Z

> I would rather we do the full solution, i.e., with tensor liveness ranges, peak memory calculation, etc.

I imagine the full solution can be INCREMENTALLY built upon this one.

Right now the check in this PR is very strict (but safe). We consider two inputs reusable only when they both have a single user. But this can be extended to check whether the transitive user sets of these inputs overlap. If they don't, their live ranges can be non-overlapping and we can reuse the memory.
The estimate of the current peak memory can be a parameter when we decide whether to fuse or not.
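
The proposed extension can be sketched as a reachability check. This is only an illustration of the idea in the comment above, not code from the PR: the `users` mapping (name to the names of its direct users) and both function names are hypothetical.

```python
def transitive_users(start, users):
    """Collect every node reachable from `start` by following user edges."""
    seen = set()
    stack = list(users.get(start, ()))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(users.get(node, ()))
    return seen


def may_have_disjoint_live_ranges(buf_a, buf_b, users):
    """True if no node transitively uses both buffers."""
    return transitive_users(buf_a, users).isdisjoint(transitive_users(buf_b, users))


# Hypothetical graph: buf0 -> n1 -> n3, buf1 -> n2 -> n4 (no shared consumer).
independent = {"buf0": ["n1"], "buf1": ["n2"], "n1": ["n3"], "n2": ["n4"]}
# Same graph, but n2 now also feeds n3, so both buffers reach n3.
joined = {"buf0": ["n1"], "buf1": ["n2"], "n1": ["n3"], "n2": ["n3"]}

print(may_have_disjoint_live_ranges("buf0", "buf1", independent))
print(may_have_disjoint_live_ranges("buf0", "buf1", joined))
```

Disjoint transitive user sets only mean the live ranges can be non-overlapping under some node order. A real implementation would still have to account for the order the scheduler finally picks.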

ghstack-source-id: 5c3f0e5 Pull Request resolved: #138756

jansel

Is that compile time regression CI real?

laithsakka · 2024-10-29T17:12:47Z

> Is that compile time regression CI real?

If you are talking about the pr_time benchmarks: it's 1.5%, not a huge jump, but it's real. The add_loop benchmarks are very stable.

2024-10-29T00:57:30.6097670Z REGRESSION: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 24428182108 is 1.51% higher than expected 24064639114 ±+1.50% if this is an expected regression, please update the expected results.
2024-10-29T00:57:30.6099173Z 
2024-10-29T00:57:30.6099662Z please update all results that changed significantly, and not only the failed ones
2024-10-29T00:57:30.6101303Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 40517874538 -1.16% is within expected 40992578178 ±2.50%
2024-10-29T00:57:30.6102305Z 
2024-10-29T00:57:30.6103789Z REGRESSION: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 23173449187 is 1.54% higher than expected 22822864522 ±+1.50% if this is an expected regression, please update the expected results.
2024-10-29T00:57:30.6105262Z
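
The percentages in the log above can be reproduced from the raw instruction counts. The sketch below assumes, based only on the log wording, that the check flags a regression when the relative increase over the expected value exceeds the tolerance. The function names are hypothetical, and how the real benchmark treats decreases beyond the tolerance is not modeled.

```python
def relative_change(actual, expected):
    """Signed relative difference of `actual` from `expected`."""
    return (actual - expected) / expected


def check(actual, expected, tolerance):
    """Return (status, change), treating only increases as regressions."""
    change = relative_change(actual, expected)
    status = "REGRESSION" if change > tolerance else "PASS"
    return status, change


# (name, actual, expected, tolerance) copied from the CI log above.
results = [
    ("add_loop_inductor", 24428182108, 24064639114, 0.015),
    ("add_loop_inductor_dynamic_gpu", 40517874538, 40992578178, 0.025),
    ("add_loop_inductor_gpu", 23173449187, 22822864522, 0.015),
]

for name, actual, expected, tolerance in results:
    status, change = check(actual, expected, tolerance)
    print(f"{status}: {name} {change:+.2%} (tolerance ±{tolerance:.2%})")
```

Both failing benchmarks land just above the 1.50% tolerance (1.51% and 1.54%), which is why a small shift in the baseline is enough to flip them between passing and failing.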

shunting314 · 2024-11-04T18:34:14Z

@pytorchbot merge

pytorchmergebot · 2024-11-04T18:35:53Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

laithsakka · 2024-11-05T00:06:42Z

@shunting314 curious whether the error went away by itself after the rebase or whether you had to change something in the code.
I'm trying to evaluate the effectiveness of this test versus how bothersome it is, and whether I should keep it.

shunting314 · 2024-11-05T01:56:58Z

@laithsakka the error went away by itself after rebasing.

laithsakka · 2024-11-05T02:02:12Z

Sounds good. It seems like the diff regressed two benchmarks by about 1%, which is less than the threshold.
I will put up a PR, though, to increase the expected values for those to avoid flakiness, and send it for review.

laithsakka · 2024-11-05T02:05:16Z

PASS: benchmark ('add_loop_inductor', 'compile_time_instruction_count') pass, actual result 24603225123 +1.41% is within expected 24260000000 ±1.50%

PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 40744976754 +0.90% is within expected 40380000000 ±2.50%

PASS: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') pass, actual result 23331151962 +1.40% is within expected 23010000000 ±1.50%

laithsakka · 2024-11-05T02:06:21Z

#139703

see comments end of #138756 cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames [ghstack-poisoned]

see comments end of #138756 I am also refreshing all values Pull Request resolved: #139703 Approved by: https://github.com/bobrenjc93

To collect memory snapshot for a generated wrapper, run the wrapper with `--cuda-memory-snapshot`. E.g.

```
python /tmp/torchinductor_shunting/tmpyhtfwdlv/wp/cwpulanbieu4beruc6w5uc3podcs2x3rzdk5okftu37c4k3bnd4b.py --cuda-memory-snapshot
```

gives me:

<img width="800" alt="Screenshot 2024-11-05 at 3 53 47 PM" src="https://github.com/user-attachments/assets/82edd2d6-df57-488e-a390-8fa5fc00ba5f">

Pull Request resolved: #138429
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #139136, #138756

[inductor] don't fuse two reductions if likely increase peak memory

4c9556f

[ghstack-poisoned]

shunting314 mentioned this pull request Oct 23, 2024

[Inductor] don't set XBLOCK larger than xnumel #138730

Closed

pytorch-bot bot added ciflow/inductor module: inductor labels Oct 23, 2024

shunting314 requested review from Chillee, eellison and jansel October 23, 2024 22:30

shunting314 changed the title ~~[inductor] don't fuse two reductions if likely increase peak memory~~ [inductor] don't fuse two nodes if likely increase peak memory Oct 23, 2024

shunting314 added a commit that referenced this pull request Oct 23, 2024

[inductor] don't fuse two nodes if likely increase peak memory

2ec327c

ghstack-source-id: 843d49c Pull Request resolved: #138756

This was referenced Oct 23, 2024

[inductor] collect memory snapshort in the wrapper #138429

Closed

[Inductor] auto-chunker #136702

Open

jansel requested changes Oct 25, 2024

eellison reviewed Oct 25, 2024

shunting314 added a commit that referenced this pull request Oct 25, 2024

[inductor] don't fuse two nodes if likely increase peak memory

a47a069

ghstack-source-id: 5c3f0e5 Pull Request resolved: #138756

shunting314 requested review from eellison and jansel October 25, 2024 21:34

jansel requested changes Oct 26, 2024

shunting314 mentioned this pull request Oct 29, 2024

[inductor] patterns to remove pointless view/permute pairs #139136

Closed

shunting314 added 2 commits November 4, 2024 00:01

shunting314 added the topic: not user facing topic category label Nov 4, 2024

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 4, 2024

pytorchmergebot added the merging label Nov 4, 2024

pytorchmergebot added the Merged label Nov 4, 2024

pytorchmergebot closed this in 8881108 Nov 4, 2024

pytorchmergebot removed the merging label Nov 4, 2024

laithsakka mentioned this pull request Nov 5, 2024

increase add_loop benchmark and refresh all results! #139703

Closed

pytorchmergebot pushed a commit that referenced this pull request Nov 5, 2024

increase add_loop benchmark and refresh all results! (#139703)

de4216b

see comments end of #138756 I am also refreshing all values Pull Request resolved: #139703 Approved by: https://github.com/bobrenjc93

github-actions bot deleted the gh/shunting314/180/head branch December 5, 2024 02:16
