KEMBAR78
[inductor] don't fuse two nodes if likely increase peak memory by shunting314 · Pull Request #138756 · pytorch/pytorch · GitHub
Skip to content

Conversation

@shunting314
Copy link
Contributor

@shunting314 shunting314 commented Oct 23, 2024

Stack from ghstack (oldest at bottom):

Partially fixing #138685

Add a (relatively safe?) heuristics to skip fusion if we can potentially increasing peak memory.

The doc string mainly explains what this PR is doing:

        The implementation is more like a heuristic since we don't really know if we are at peak
        or not when trying to fuse these two ndoes. The order of nodes may change later which makes the
        peak memory estimation hard.
        Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes:
        1. find all buffers read by each node with a single user. These buffers are supposed to
           be reused if we don't fuses these 2 nodes
        2. find the intersection of these buffers for the two node and sum the total buffer size.
           If we don't fuse these two nodes, we can at lease avoid this much memory allocation.
           Note that the extra memory allocation is not necessarily causing peak memory increase.
           This is just a heuristic.
        We return true only if the saving for fusion can not trade off the extra memory allocation.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@pytorch-bot
Copy link

pytorch-bot bot commented Oct 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138756

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2c40d7e with merge base e6ff07f (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…ak memory"


Partially fixing #138685 

Add a (relatively safe?) heuristics to skip fusion if we can potentially increasing peak memory.

The doc string mainly explains what this PR is doing:
```
        The implementation is more like a heuristic since we don't really know if we are at peak
        or not when trying to fuse these two ndoes. The order of nodes may change later which makes the
        peak memory estimation hard.
        Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes:
        1. find all buffers read by each node with a single user. These buffers are supposed to
           be reused if we don't fuses these 2 nodes
        2. find the intersection of these buffers for the two node and sum the total buffer size.
           If we don't fuse these two nodes, we can at lease avoid this much memory allocation.
           Note that the extra memory allocation is not necessarily causing peak memory increase.
           This is just a heuristic.
        We return true only if the saving for fusion can not trade off the extra memory allocation.
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
@shunting314 shunting314 changed the title [inductor] don't fuse two reductions if likely increase peak memory [inductor] don't fuse two nodes if likely increase peak memory Oct 23, 2024
…mory"


Partially fixing #138685 

Add a (relatively safe?) heuristics to skip fusion if we can potentially increasing peak memory.

The doc string mainly explains what this PR is doing:
```
        The implementation is more like a heuristic since we don't really know if we are at peak
        or not when trying to fuse these two ndoes. The order of nodes may change later which makes the
        peak memory estimation hard.
        Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes:
        1. find all buffers read by each node with a single user. These buffers are supposed to
           be reused if we don't fuses these 2 nodes
        2. find the intersection of these buffers for the two node and sum the total buffer size.
           If we don't fuse these two nodes, we can at lease avoid this much memory allocation.
           Note that the extra memory allocation is not necessarily causing peak memory increase.
           This is just a heuristic.
        We return true only if the saving for fusion can not trade off the extra memory allocation.
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
shunting314 added a commit that referenced this pull request Oct 23, 2024
@shunting314
Copy link
Contributor Author

The perf of revision 6a7fa50 here is neutral. But I probably need change how we decide skipping fusion here. Filtering inputs with a single user seems too restrictive.

Copy link
Contributor

@jansel jansel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failing tests

Does this impact compile time at all?

Copy link
Contributor

@eellison eellison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have seen other cases of inductor increasing memory use, recently, see internal link. I would rather we do the full solution, i.e., with tensor liveness ranges, peak memory calculation, etc.

@shunting314
Copy link
Contributor Author

Does this impact compile time at all?

Compilation time looks fine:
Screenshot 2024-10-25 at 10 44 18 AM

I'll fix the tests.

@shunting314
Copy link
Contributor Author

I would rather we do the full solution, i.e., with tensor liveness ranges, peak memory calculation, etc.

I imagine the fully solution can be INCREMENTALLY built upon this one.

  1. right now the check in this PR is very strict (but safe). We consider two inputs can be reused when they both have a single user. But this can be extended to check if the transitive user set of these inputs have overlapping. If not, there live-range can be non-overlapping and we can reuse the memory
  2. The estimation of the current peak memory can be an parameter when we decide fusing or not.

…mory"


Partially fixing #138685 

Add a (relatively safe?) heuristics to skip fusion if we can potentially increasing peak memory.

The doc string mainly explains what this PR is doing:
```
        The implementation is more like a heuristic since we don't really know if we are at peak
        or not when trying to fuse these two ndoes. The order of nodes may change later which makes the
        peak memory estimation hard.
        Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes:
        1. find all buffers read by each node with a single user. These buffers are supposed to
           be reused if we don't fuses these 2 nodes
        2. find the intersection of these buffers for the two node and sum the total buffer size.
           If we don't fuse these two nodes, we can at lease avoid this much memory allocation.
           Note that the extra memory allocation is not necessarily causing peak memory increase.
           This is just a heuristic.
        We return true only if the saving for fusion can not trade off the extra memory allocation.
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
shunting314 added a commit that referenced this pull request Oct 25, 2024
Copy link
Contributor

@jansel jansel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that compile time regression CI real?

…mory"


Partially fixing #138685 

Add a (relatively safe?) heuristics to skip fusion if we can potentially increasing peak memory.

The doc string mainly explains what this PR is doing:
```
        The implementation is more like a heuristic since we don't really know if we are at peak
        or not when trying to fuse these two ndoes. The order of nodes may change later which makes the
        peak memory estimation hard.
        Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes:
        1. find all buffers read by each node with a single user. These buffers are supposed to
           be reused if we don't fuses these 2 nodes
        2. find the intersection of these buffers for the two node and sum the total buffer size.
           If we don't fuse these two nodes, we can at lease avoid this much memory allocation.
           Note that the extra memory allocation is not necessarily causing peak memory increase.
           This is just a heuristic.
        We return true only if the saving for fusion can not trade off the extra memory allocation.
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
@laithsakka
Copy link
Contributor

Is that compile time regression CI real?
if you are taking about the pr_time benchamrks
its 1.5% not a huge jump but its real add_loop benchmarks are very stable.

2024-10-29T00:57:30.6097670Z REGRESSION: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 24428182108 is 1.51% higher than expected 24064639114 ±+1.50% if this is an expected regression, please update the expected results.
2024-10-29T00:57:30.6099173Z 
2024-10-29T00:57:30.6099662Z please update all results that changed significantly, and not only the failed ones
2024-10-29T00:57:30.6101303Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 40517874538 -1.16% is within expected 40992578178 ±2.50%
2024-10-29T00:57:30.6102305Z 
2024-10-29T00:57:30.6103789Z REGRESSION: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 23173449187 is 1.54% higher than expected 22822864522 ±+1.50% if this is an expected regression, please update the expected results.
2024-10-29T00:57:30.6105262Z 

…mory"


Partially fixing #138685 

Add a (relatively safe?) heuristics to skip fusion if we can potentially increasing peak memory.

The doc string mainly explains what this PR is doing:
```
        The implementation is more like a heuristic since we don't really know if we are at peak
        or not when trying to fuse these two ndoes. The order of nodes may change later which makes the
        peak memory estimation hard.
        Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes:
        1. find all buffers read by each node with a single user. These buffers are supposed to
           be reused if we don't fuses these 2 nodes
        2. find the intersection of these buffers for the two node and sum the total buffer size.
           If we don't fuse these two nodes, we can at lease avoid this much memory allocation.
           Note that the extra memory allocation is not necessarily causing peak memory increase.
           This is just a heuristic.
        We return true only if the saving for fusion can not trade off the extra memory allocation.
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
…mory"


Partially fixing #138685 

Add a (relatively safe?) heuristics to skip fusion if we can potentially increasing peak memory.

The doc string mainly explains what this PR is doing:
```
        The implementation is more like a heuristic since we don't really know if we are at peak
        or not when trying to fuse these two ndoes. The order of nodes may change later which makes the
        peak memory estimation hard.
        Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes:
        1. find all buffers read by each node with a single user. These buffers are supposed to
           be reused if we don't fuses these 2 nodes
        2. find the intersection of these buffers for the two node and sum the total buffer size.
           If we don't fuse these two nodes, we can at lease avoid this much memory allocation.
           Note that the extra memory allocation is not necessarily causing peak memory increase.
           This is just a heuristic.
        We return true only if the saving for fusion can not trade off the extra memory allocation.
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
…mory"


Partially fixing #138685 

Add a (relatively safe?) heuristics to skip fusion if we can potentially increasing peak memory.

The doc string mainly explains what this PR is doing:
```
        The implementation is more like a heuristic since we don't really know if we are at peak
        or not when trying to fuse these two ndoes. The order of nodes may change later which makes the
        peak memory estimation hard.
        Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes:
        1. find all buffers read by each node with a single user. These buffers are supposed to
           be reused if we don't fuses these 2 nodes
        2. find the intersection of these buffers for the two node and sum the total buffer size.
           If we don't fuse these two nodes, we can at lease avoid this much memory allocation.
           Note that the extra memory allocation is not necessarily causing peak memory increase.
           This is just a heuristic.
        We return true only if the saving for fusion can not trade off the extra memory allocation.
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
@shunting314 shunting314 added the topic: not user facing topic category label Nov 4, 2024
@shunting314
Copy link
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 4, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@laithsakka
Copy link
Contributor

@shunting314 curios if the error went by itself after rebase or if you had to change something in the code.
trying to evaluate the effectiveness of this test vs how bothering it is and weather i shall keep it.

@shunting314
Copy link
Contributor Author

@laithsakka the error gone by itself after rebasing.

@laithsakka
Copy link
Contributor

sounds good seems like the diff regressed two benchmarks by 1% which is less than threshold
I will put a pr though to increment expected value for those to avoid flakiness, I will add for
review.

Screenshot 2024-11-04 at 6 00 44 PM Screenshot 2024-11-04 at 6 00 11 PM

@laithsakka
Copy link
Contributor

PASS: benchmark ('add_loop_inductor', 'compile_time_instruction_count') pass, actual result 24603225123 +1.41% is within expected 24260000000 ±1.50%

PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 40744976754 +0.90% is within expected 40380000000 ±2.50%

PASS: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') pass, actual result 23331151962 +1.40% is within expected 23010000000 ±1.50%

@laithsakka
Copy link
Contributor

#139703

laithsakka added a commit that referenced this pull request Nov 5, 2024
see comments end of #138756


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Nov 5, 2024
see comments end of #138756
I am also refreshing all values

Pull Request resolved: #139703
Approved by: https://github.com/bobrenjc93
pytorchmergebot pushed a commit that referenced this pull request Nov 6, 2024
To collect memory snapshot for a generated wrapper, run the wrapper with `--cuda-memory-snapshot`. E.g.
```
python /tmp/torchinductor_shunting/tmpyhtfwdlv/wp/cwpulanbieu4beruc6w5uc3podcs2x3rzdk5okftu37c4k3bnd4b.py --cuda-memory-snapshot
```
gives me:

<img width="800" alt="Screenshot 2024-11-05 at 3 53 47 PM" src="https://github.com/user-attachments/assets/82edd2d6-df57-488e-a390-8fa5fc00ba5f">

Pull Request resolved: #138429
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #139136, #138756
@github-actions github-actions bot deleted the gh/shunting314/180/head branch December 5, 2024 02:16
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
see comments end of pytorch#138756
I am also refreshing all values

Pull Request resolved: pytorch#139703
Approved by: https://github.com/bobrenjc93
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
To collect memory snapshot for a generated wrapper, run the wrapper with `--cuda-memory-snapshot`. E.g.
```
python /tmp/torchinductor_shunting/tmpyhtfwdlv/wp/cwpulanbieu4beruc6w5uc3podcs2x3rzdk5okftu37c4k3bnd4b.py --cuda-memory-snapshot
```
gives me:

<img width="800" alt="Screenshot 2024-11-05 at 3 53 47 PM" src="https://github.com/user-attachments/assets/82edd2d6-df57-488e-a390-8fa5fc00ba5f">

Pull Request resolved: pytorch#138429
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: pytorch#139136, pytorch#138756
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants