[Flex attention] Fix flex attention head broadcast by Isalia20 · Pull Request #163426 · pytorch/pytorch · GitHub

Conversation

@Isalia20
Collaborator

@Isalia20 Isalia20 commented Sep 20, 2025

Fixes part of #163314

In particular, Bug 1: H=None Broadcasting Produces Incorrect Results

This fixes a shape bug when slicing a BlockMask on the Q-tile axis with an int (`mask[:, :, i]`). That form of indexing collapses the Q dimension, so kv_num_blocks/kv_indices lose their expected [B, H, Q_tiles, …] shape. Because the shape is lost, the kernel's stride math reads from the wrong offsets even though the mask_mod remains "interpretable". The result is silent numerical mismatches compared to regular SDPA, especially during single-position decoding or H broadcasting.
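
For illustration, a minimal sketch (not code from this PR) of the indexing behavior in question, assuming the public `create_block_mask` API and the default block size; the exact shapes depend on Q_LEN and BLOCK_SIZE:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# H=None broadcasts over heads, so the block tables have a singleton head dim:
# kv_num_blocks is [B=1, H=1, Q_tiles], kv_indices is [B=1, H=1, Q_tiles, KV_tiles].
bm = create_block_mask(causal, B=1, H=None, Q_LEN=1024, KV_LEN=1024, device="cpu")
print(bm.kv_num_blocks.shape)         # e.g. torch.Size([1, 1, 8])

# Before this fix, an int index on the Q-tile axis collapsed that dimension,
# so the kernel's [B, H, Q_tiles, ...] stride math no longer matched the data:
collapsed = bm[:, :, 0]
print(collapsed.kv_num_blocks.shape)  # Q-tile dim dropped (pre-fix behavior)

# Indexing with a tensor keeps the dimension, which is the shape the kernel expects:
kept = bm[:, :, torch.tensor([0])]
print(kept.kv_num_blocks.shape)       # e.g. torch.Size([1, 1, 1])
```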

The fact that the B=None, H=None case works is accidental: with a singleton batch/head, the kernel maps to index 0 via `sparse_idx_z = off_zq % 1` and `sparse_idx_hq = off_hq % 1`, and with a single Q tile `q_start // SPARSE_Q_MULTIPLE = 0`. The missing Q-tiles stride is multiplied by 0, so the bad offset from the collapsed Q axis doesn't move the pointer and it happens to read the first tile correctly. Once H > 1 or there are multiple Q tiles, those terms become nonzero and the kernel indexes with the wrong strides, which causes silent errors.
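
A purely illustrative bit of arithmetic (not the actual Triton kernel code) spelling out why every term cancels in the singleton case:

```python
# Illustrative only: mirrors the offset terms described above, not the real kernel.
SPARSE_Q_MULTIPLE = 1                  # placeholder value for this example
off_zq, off_hq, q_start = 0, 0, 0      # singleton batch, singleton head, single Q tile

sparse_idx_z = off_zq % 1              # % 1 because the batch dim is a singleton -> 0
sparse_idx_hq = off_hq % 1             # % 1 because the head dim is a singleton  -> 0
q_tile = q_start // SPARSE_Q_MULTIPLE  #                                          -> 0

wrong_stride = 12345                   # stand-in for whatever bad stride the collapsed tensor yields
offset = sparse_idx_z * wrong_stride + sparse_idx_hq * wrong_stride + q_tile * wrong_stride
print(offset)  # 0 -> the bad strides never move the pointer, so tile 0 is read correctly

# With H > 1 the head term (off_hq % H) can be nonzero, and with multiple Q tiles
# q_tile can be nonzero; those terms then multiply the wrong strides and the kernel
# reads from bogus offsets, producing silent numerical mismatches.
```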

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Chillee @drisspg @yanboliang @BoyuanFeng

@pytorch-bot

pytorch-bot bot commented Sep 20, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163426

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit bfe0614 with merge base 3938175:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot

pytorch-bot bot commented Sep 20, 2025

To add the ciflow label ciflow/inductor please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@Skylion007
Collaborator

Good backport candidate into our latest RC!

@Skylion007 Skylion007 modified the milestones: 2.10.0, 2.9.0 Sep 20, 2025
@albanD albanD removed their request for review September 22, 2025 15:18
Contributor

@drisspg drisspg left a comment


Also, I think this might change the shape of the returned tensors, right? That actually might be a good thing, but I would double check that this doesn't break other BlockMask tests.

@janeyx99 janeyx99 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Sep 22, 2025
@Isalia20
Collaborator Author

Yes, it changes the shape returned by getitem on BlockMask; I updated the other tests.

Contributor

@drisspg drisspg left a comment


Thank you!

@drisspg drisspg added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 22, 2025
@Isalia20
Collaborator Author

I wonder if I should merge this directly or add some warning first and then merge the change. It's somewhat BC-breaking since the shapes returned by BlockMask's getitem have changed. @drisspg wdyt?

Contributor

@drisspg drisspg left a comment


FWIW the main usage for this AFAIK is in gpt-fast: https://github.com/meta-pytorch/gpt-fast/blob/6ecad9b5b6b987d17ac4303965545873d0192086/generate.py#L74

and this uses a tensor as an index, so we keep the dim. The slicing operation feels a little weird tbh; I would prefer that users manually edit the bits and then create a new BlockMask via from_kv_blocks, and we wouldn't run into the problem of not setting a score_mod.

This seems to be a pretty big footgun for what is essentially syntactic sugar. So IMO I think it's okay to land and call it a bug fix.

Can you also update the PR with a description of why this fixes the issue? AFAIK this looks like essentially a bad interaction where the sliced mask_mod with one less dim is interpretable even though it shouldn't be, and so the kernel is reading bogus values.
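
For reference, a rough sketch of the two approaches mentioned above (the tensor-index path gpt-fast takes, and rebuilding via `from_kv_blocks`); this assumes the public `BlockMask` API, and names like `input_pos` are just for illustration, not code from this PR:

```python
import torch
from torch.nn.attention.flex_attention import BlockMask, create_block_mask

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

bm = create_block_mask(causal, B=1, H=None, Q_LEN=1024, KV_LEN=1024, device="cpu")

# gpt-fast style: index the Q-tile axis with a tensor, which keeps the dim intact.
input_pos = torch.tensor([3])
decode_bm = bm[:, :, input_pos]

# Alternative suggested above: select/edit the block tables directly and rebuild a
# BlockMask with from_kv_blocks, passing mask_mod explicitly so it isn't dropped.
rebuilt = BlockMask.from_kv_blocks(
    bm.kv_num_blocks[:, :, input_pos],
    bm.kv_indices[:, :, input_pos],
    bm.full_kv_num_blocks[:, :, input_pos] if bm.full_kv_num_blocks is not None else None,
    bm.full_kv_indices[:, :, input_pos] if bm.full_kv_indices is not None else None,
    BLOCK_SIZE=bm.BLOCK_SIZE,
    mask_mod=bm.mask_mod,
)
```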

@Isalia20
Collaborator Author

Updated description

@Isalia20
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
Fixes part of pytorch#163314

Pull Request resolved: pytorch#163426
Approved by: https://github.com/drisspg
@Camyll
Contributor

Camyll commented Oct 1, 2025

@pytorchbot cherry-pick --onto release/2.9 --c critical

pytorchbot pushed a commit that referenced this pull request Oct 1, 2025
Fixes part of #163314

Pull Request resolved: #163426
Approved by: https://github.com/drisspg

(cherry picked from commit 1a42656)
@pytorchbot
Collaborator

Cherry picking #163426

The cherry pick PR is at #164368 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:


Camyll pushed a commit that referenced this pull request Oct 1, 2025
[Flex attention] Fix flex attention head broadcast (#163426)

Fixes part of #163314

Pull Request resolved: #163426
Approved by: https://github.com/drisspg

(cherry picked from commit 1a42656)

Co-authored-by: Isalia20 <irakli.salia854@gmail.com>

Labels

ciflow/inductor, ciflow/trunk, Merged, module: flex attention, module: inductor, open source, release notes: nn, triaged
