[CPU] Support GQA for flash attention #157893
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157893
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 289b47c with merge base b146ca7.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Generally OK, just simplify the test cases a little bit to remove the duplicated code.
| @parametrize("dtype", [torch.float64, torch.float32, torch.bfloat16, torch.float16]) | ||
| @parametrize("n_heads", [[65, 5], [16, 4], [27, 1], [5, 1]]) | ||
| @parametrize("train", [False, True]) | ||
| def test_scaled_dot_product_fused_attention_gqa_vs_math_cpu( |
Combine this one with `test_scaled_dot_product_fused_attention_mask_vs_math_cpu` to remove the duplicated code, for example:
### impls
def test_sdpa_vs_math_cpu_helper(...):
    ...

def test_scaled_dot_product_fused_attention_mask_vs_math_cpu():
    test_sdpa_vs_math_cpu_helper(...)

def test_scaled_dot_product_fused_attention_gqa_vs_math_cpu():
    test_sdpa_vs_math_cpu_helper(...)
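A slightly fuller sketch of the suggested shared-helper refactor is below. It is hypothetical: parameter names, shapes, tolerances, and the test class are illustrative and not the actual signatures in the test suite; only the two test names and the `n_heads` pairs come from the diff above.

```python
# Hypothetical sketch of the shared-helper refactor suggested in the review.
# Shapes, tolerances, and the class/base class are illustrative assumptions.
import torch
import torch.nn.functional as F
from torch.testing._internal.common_utils import (
    TestCase, parametrize, instantiate_parametrized_tests, run_tests,
)

class TestSDPACpuVsMath(TestCase):
    def _sdpa_vs_math_cpu_helper(self, dtype, train, q_heads, kv_heads, use_mask=False):
        # Build inputs; with kv_heads < q_heads this exercises the GQA path.
        # `train` only toggles requires_grad here; a real test would also check gradients.
        q = torch.randn(2, q_heads, 32, 64, dtype=dtype, requires_grad=train)
        k = torch.randn(2, kv_heads, 32, 64, dtype=dtype, requires_grad=train)
        v = torch.randn(2, kv_heads, 32, 64, dtype=dtype, requires_grad=train)
        mask = torch.zeros(2, 1, 32, 32, dtype=dtype) if use_mask else None
        enable_gqa = q_heads != kv_heads

        # Fused CPU kernel vs. a float64 math reference.
        actual = F.scaled_dot_product_attention(
            q, k, v, attn_mask=mask, enable_gqa=enable_gqa
        )
        expected = F.scaled_dot_product_attention(
            q.double(), k.double(), v.double(),
            attn_mask=mask.double() if mask is not None else None,
            enable_gqa=enable_gqa,
        )
        tol = 1e-5 if dtype == torch.float64 else 2e-2  # illustrative tolerances
        self.assertEqual(actual, expected.to(dtype), atol=tol, rtol=tol)

    @parametrize("dtype", [torch.float64, torch.float32, torch.bfloat16, torch.float16])
    @parametrize("train", [False, True])
    def test_scaled_dot_product_fused_attention_mask_vs_math_cpu(self, dtype, train):
        self._sdpa_vs_math_cpu_helper(dtype, train, q_heads=4, kv_heads=4, use_mask=True)

    @parametrize("dtype", [torch.float64, torch.float32, torch.bfloat16, torch.float16])
    @parametrize("n_heads", [[65, 5], [16, 4], [27, 1], [5, 1]])  # [q_heads, kv_heads]
    @parametrize("train", [False, True])
    def test_scaled_dot_product_fused_attention_gqa_vs_math_cpu(self, dtype, n_heads, train):
        self._sdpa_vs_math_cpu_helper(dtype, train, q_heads=n_heads[0], kv_heads=n_heads[1])

instantiate_parametrized_tests(TestSDPACpuVsMath)

if __name__ == "__main__":
    run_tests()
```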
Thanks, UT updated.
Force-pushed from b7aa830 to 289b47c.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
As many models require GQA, we support it in flash attention for the CPU path. Approved by: https://github.com/mingfeima, https://github.com/jansel
Summary: For `scaled_dot_product_attention(..., enable_gqa=True)`:
- the Math backend passes the flag through, performing the extra [KV broadcast](https://github.com/pytorch/pytorch/blob/6e07d6a0ff386d99d8c2f1d25978b0683988a4cb/aten/src/ATen/native/transformers/attention.cpp#L902) if it is set to True
- the Flash backend has no flag and relies on correct indexing in the C++ kernel
- Export used to default to Math for `enable_gqa=True`, but #157893 landed and enabled Flash. At the same time, there's an export-only [decomp](https://github.com/pytorch/pytorch/blob/6e07d6a0ff386d99d8c2f1d25978b0683988a4cb/torch/_decomp/decompositions.py#L4968) redirecting flash -> math, calling with `enable_gqa` unset, because that info isn't available. This led to https://fb.workplace.com/groups/1028545332188949/posts/1264609398582540 crashing, calling the Math non-GQA variant with GQA inputs.

This change assumes GQA for seqlen mismatches in the export decomp, setting `enable_gqa = <q seqlen> != <kv seqlen>`, relying on prior backend checks to raise on invalid input shapes.

Test Plan: test_export

Reviewed By: angelayi

Differential Revision: D78524147

Pull Request resolved: #158604
Approved by: https://github.com/angelayi, https://github.com/drisspg
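For context, a minimal sketch of the GQA semantics referenced above (sizes are made up for illustration, not taken from the PR): the Math backend's extra KV broadcast is equivalent to repeating each KV head across its group of query heads before running ordinary attention.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes: 8 query heads grouped over 2 KV heads.
B, Hq, Hkv, L, S, E = 2, 8, 2, 16, 16, 64
q = torch.randn(B, Hq, L, E)
k = torch.randn(B, Hkv, S, E)
v = torch.randn(B, Hkv, S, E)

# The "KV broadcast": expand each KV head across its group of Hq // Hkv
# query heads, then run ordinary (non-GQA) attention as a reference.
k_rep = k.repeat_interleave(Hq // Hkv, dim=1)
v_rep = v.repeat_interleave(Hq // Hkv, dim=1)
ref = F.scaled_dot_product_attention(q, k_rep, v_rep)

# With enable_gqa=True the backend handles the head-count mismatch itself
# (Math via the broadcast, Flash via indexing in the kernel).
out = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)
torch.testing.assert_close(out, ref, rtol=1e-4, atol=1e-4)
```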
As many models require GQA, we support it in flash attention for the CPU path.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168
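A minimal usage sketch of what this enables (shapes and dtype chosen for illustration; assumes a build that includes this PR so the CPU flash kernel accepts GQA inputs):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Illustrative GQA shapes on CPU: 16 query heads attend using only 4 KV heads.
q = torch.randn(2, 16, 128, 64, dtype=torch.bfloat16)
k = torch.randn(2, 4, 128, 64, dtype=torch.bfloat16)
v = torch.randn(2, 4, 128, 64, dtype=torch.bfloat16)

# Restrict dispatch to the flash backend, which this PR teaches to handle GQA on CPU.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)

print(out.shape)  # torch.Size([2, 16, 128, 64]) -- output follows the query layout
```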