[pytorch] Add decomp rule for scaled_dot_product_attention #108180

larryliu0820 · 2023-08-29T19:55:09Z

Stack from ghstack (oldest at bottom):

-> [pytorch] Add decomp rule for scaled_dot_product_attention #108180

`scaled_dot_product_attention` used to be decomposed pre-autograd because it called `_scaled_dot_product_attention_math`, which only has a `CompositeImplicitAutograd` kernel. As a result, it was decomposed into finer-grained ops.

However, recent PRs (#103826, #105131) added new logic to `scaled_dot_product_attention`, and it now calls `_scaled_dot_product_flash_attention`, which has a CPU kernel. As a result, `_scaled_dot_product_flash_attention` shows up in the output of `torch.export()`. This PR adds a decomposition that ensures `scaled_dot_product_attention` is still decomposed the same way as before, i.e., by going through `_scaled_dot_product_attention_math`. Note that this decomp rule should be excluded by inductor.
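For readers unfamiliar with the math path, here is a minimal pure-Python sketch of the finer-grained steps that scaled dot-product attention breaks down into (matmul, scale, softmax, matmul). It is an illustration only, not the PyTorch implementation or the decomposition added by this PR. The helper names (`matmul`, `transpose`, `softmax_rows`, `sdpa_math`) are hypothetical, inputs are plain 2-D lists, and masking, dropout, and causal handling are omitted.

```python
import math


def matmul(a, b):
    """Plain matrix product of two 2-D lists (rows of a times columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a 2-D list."""
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Numerically stable softmax applied to each row independently."""
    out = []
    for row in a:
        m = max(row)  # subtract the row max before exponentiating
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def sdpa_math(q, k, v):
    """Hypothetical sketch: softmax(q @ k^T / sqrt(d)) @ v.

    Assumes no attention mask, no dropout, and is_causal=False.
    """
    scale = 1.0 / math.sqrt(len(q[0]))  # d is the last dimension of q
    scores = [[s * scale for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)


# One query attending equally to two keys averages the two value rows.
q = [[0.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(sdpa_math(q, k, v))  # → [[2.0, 3.0]]
```

Each helper corresponds to one small op, which is the kind of granularity an exported graph ends up with when the attention op is decomposed instead of appearing as a single fused kernel call.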

Differential Revision: D48762000

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov

SherlockNoMad

Thanks a lot for the fix.

pytorch-bot · 2023-08-29T20:34:12Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/108180

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit fcadac3 with merge base 95cacb7:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

linux-focal-rocm5.6-py3.8 / test (default, 1, 3, linux.rocm.gpu)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

larryliu0820 · 2023-08-30T00:01:53Z

@pytorchbot merge

pytorchmergebot · 2023-08-30T00:03:28Z

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

larryliu0820 · 2023-08-30T00:04:40Z

@pytorchbot merge

pytorchmergebot · 2023-08-30T00:06:18Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2023-08-30T00:21:32Z

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Failing merge rule: Core Maintainers

Pull Request resolved: #108180. ghstack-source-id: 199140502 @exported-using-ghexport

Pull Request resolved: #108180. ghstack-source-id: 199155539 @exported-using-ghexport

larryliu0820 · 2023-08-30T06:26:38Z

@pytorchbot merge

pytorchmergebot · 2023-08-30T06:29:06Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2023-08-30T06:39:25Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-rocm5.6-py3.8 / test (default, 1, 3, linux.rocm.gpu)

larryliu0820 · 2023-08-30T15:49:56Z

@pytorchbot merge -f "failure seems unrelated"

pytorchmergebot · 2023-08-30T15:52:02Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

github-actions bot added module: inductor ciflow/inductor labels Aug 29, 2023

larryliu0820 requested review from SherlockNoMad, angelayi and ezyang August 29, 2023 19:56

SherlockNoMad approved these changes Aug 29, 2023

pytorch-bot bot added the ciflow/trunk label Aug 30, 2023

pytorchmergebot added the merging label Aug 30, 2023

pytorchmergebot removed the merging label Aug 30, 2023

larryliu0820 added the topic: not user facing label Aug 30, 2023

pytorchmergebot added the merging label Aug 30, 2023

pytorchmergebot removed the merging label Aug 30, 2023

pytorchmergebot added the merging label Aug 30, 2023

pytorchmergebot removed the merging label Aug 30, 2023

pytorchmergebot added the merging label Aug 30, 2023

pytorchmergebot added Merged and removed merging labels Aug 30, 2023

pytorchmergebot closed this in 0fb1c05 Aug 30, 2023

facebook-github-bot deleted the gh/larryliu0820/43/head branch September 3, 2023 14:24
