[FlexAttention] Fix max-autotune bug with captured buffer grads by drisspg · Pull Request #141531 · pytorch/pytorch · GitHub

Conversation

@drisspg
Contributor

@drisspg drisspg commented Nov 26, 2024

Stack from ghstack (oldest at bottom):

Summary

Fix tensor argument ordering when autotuning flex attention, and change how we enable scatter codegen for Triton. We used to go through the existing store_output Triton codegen path, but now we short-circuit and generate the correct expression earlier on.

Instead of relying on arg.python_defs to thread arguments through via input_buffers, this lets us reuse the same mutated-buffer infrastructure we already used for multiple outputs.

Test cases added for both default and max-autotune-no-cudagraphs modes.
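
A minimal sketch of the pattern this exercises (illustrative only, not the exact test added in this PR; names and shapes are made up): a score_mod that reads a captured buffer requiring grad, compiled with max-autotune-no-cudagraphs so the backward kernel has to scatter into the captured buffer's grad.

    import torch
    from torch.nn.attention.flex_attention import flex_attention

    # Captured buffer whose grad the backward kernel scatters into.
    bias = torch.randn(128, device="cuda", requires_grad=True)

    def score_mod(score, b, h, q_idx, kv_idx):
        # Reads the captured buffer, so its gradient has to be threaded
        # through the generated Triton kernel as a mutated buffer.
        return score + bias[q_idx]

    q, k, v = (
        torch.randn(2, 4, 128, 64, device="cuda", requires_grad=True)
        for _ in range(3)
    )

    compiled_fa = torch.compile(flex_attention, mode="max-autotune-no-cudagraphs")
    out = compiled_fa(q, k, v, score_mod=score_mod)
    out.sum().backward()
    assert bias.grad is not None  # grad for the captured buffer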

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @Chillee @yanboliang @BoyuanFeng

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Nov 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141531

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit fb56ad9 with merge base f472b3a:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

drisspg added a commit that referenced this pull request Nov 26, 2024
@drisspg drisspg requested a review from Chillee November 26, 2024 01:01
@drisspg drisspg added the topic: not user facing, ciflow/trunk, and module: flex attention labels Nov 26, 2024
[ghstack-poisoned]
drisspg added a commit that referenced this pull request Nov 27, 2024
@drisspg
Contributor Author

drisspg commented Nov 27, 2024

The problem here is that we pass the node an input that, in the non-autotune case, ends up being the last buffer argument when generating the args, while in max-autotune it ends up being second to last because we have an explicit output node. I need to figure out how to reorder the output nodes from the kernel.

Confirmed by doing a hacky swap in the autotune process:

        # Hacky swap: exchange the last input tensor with the output tensor so the
        # captured-buffer argument lands where the non-autotune path expects it.
        input_tensors = list(input_tensors)
        tmp = input_tensors[-1]
        input_tensors[-1] = output_tensor
        output_tensor = tmp

[ghstack-poisoned]
drisspg added a commit that referenced this pull request Nov 27, 2024
[ghstack-poisoned]
drisspg added a commit that referenced this pull request Dec 3, 2024
[ghstack-poisoned]
drisspg added a commit that referenced this pull request Dec 3, 2024
[ghstack-poisoned]
drisspg added a commit that referenced this pull request Dec 4, 2024
[ghstack-poisoned]
drisspg added a commit that referenced this pull request Dec 4, 2024
Collaborator

@Chillee Chillee left a comment


Could we add a couple more tests? Specifically I'd like a test with multiple captured grads.

@drisspg
Contributor Author

drisspg commented Dec 4, 2024

Yeah, we have this test for the default compile path, but I can add one for autotuning.
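
For reference, a sketch of what a multiple-captured-grads test could look like under autotuning (illustrative names and shapes, not necessarily what landed): two captured buffers, both requiring grad, so the kernel carries two mutated grad buffers whose argument order has to match the template.

    import torch
    from torch.nn.attention.flex_attention import flex_attention

    # Two captured buffers, both requiring grad.
    head_bias = torch.randn(4, device="cuda", requires_grad=True)
    pos_bias = torch.randn(128, device="cuda", requires_grad=True)

    def score_mod(score, b, h, q_idx, kv_idx):
        return score + head_bias[h] + pos_bias[kv_idx]

    q, k, v = (
        torch.randn(2, 4, 128, 64, device="cuda", requires_grad=True)
        for _ in range(3)
    )

    compiled_fa = torch.compile(flex_attention, mode="max-autotune-no-cudagraphs")
    compiled_fa(q, k, v, score_mod=score_mod).sum().backward()
    assert head_bias.grad is not None and pos_bias.grad is not None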

[ghstack-poisoned]
drisspg added a commit that referenced this pull request Dec 4, 2024
@drisspg
Contributor Author

drisspg commented Dec 4, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team. Raised by workflow job.

Failing merge rule: Core Maintainers

@drisspg
Contributor Author

drisspg commented Dec 4, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
…rch#141531)

Pull Request resolved: pytorch#141531
Approved by: https://github.com/Chillee
AmdSampsa pushed a commit to AmdSampsa/pytorch that referenced this pull request Dec 9, 2024
…rch#141531)

@github-actions github-actions bot deleted the gh/drisspg/87/head branch January 4, 2025 02:04