FlexAttention support for NJT by jbschlosser · Pull Request #136792 · pytorch/pytorch · GitHub

Conversation

@jbschlosser (Contributor) commented Sep 26, 2024

Stack from ghstack (oldest at bottom):

This PR adds FlexAttention + NJT support. In particular:

  • To handle raggedness, treats the packed sequence dim of input NJTs as one giant "stacked sequence". To ensure user `score_mod` / `mask_mod` functions can still be written in the original NJT sequence space, this PR automatically converts indices in the "stacked sequence" space into sequence-relative indices (a small sketch of this conversion follows the list).
  • Provides `py_impls` for `NestedTensor` to the HOPs for flex attention forward / backward that simply wrap / unwrap NJTs appropriately
  • Adds barebones `new_empty()` support to NJT since FlexAttention uses this repeatedly; right now, only `new_empty()` with a shape of `()` is supported
  • Tests that FlexAttention with a causal mask matches causal SDPA
  • Adds a new public API for FlexAttention usage:
    • `create_nested_block_mask(mask_mod, B, H, njt, BLOCK_SIZE, _compile)` - NJT analogue for `create_block_mask()` that uses the `njt`'s ragged structure to create an appropriately-sized block mask (e.g. `(1, 1, total_seqlen, total_seqlen)`). This function handles the index conversion from "stacked sequence" space -> relative sequence space.
      • Minor note: as this is a public API, this function is purposefully named with "nested" instead of "njt" to keep the latter as an informal, mostly internal-only term.
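
For intuition, here is a minimal sketch of the kind of index conversion involved, assuming NJT-style offsets; the helper below is purely illustrative and is not the PR's actual implementation:

```python
import torch

# Offsets mark where each sequence begins in the packed dim; e.g. for
# sequence lengths [3, 5, 2], the offsets are [0, 3, 8, 10].
offsets = torch.tensor([0, 3, 8, 10])

def to_sequence_relative(stacked_idx):
    # Find which sequence each stacked index falls into, then subtract
    # that sequence's starting offset to get the sequence-relative index.
    batch = torch.searchsorted(offsets, stacked_idx, right=True) - 1
    return batch, stacked_idx - offsets[batch]

batch, rel = to_sequence_relative(torch.tensor([0, 4, 9]))
# batch = tensor([0, 1, 2]), rel = tensor([0, 1, 1])
```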

Example usage:

```python
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

query = ...  # NJT of shape (B, H, S*, D)
key = ...    # NJT of shape (B, H, S*, D)
value = ...  # NJT of shape (B, H, S*, D)
# create_nested_block_mask() automatically converts indices from "stacked sequence" space -> relative sequence space
block_mask = create_nested_block_mask(causal_mask, 1, 1, query)  # block mask conceptual shape is (B, H, sum(S*), sum(S*))
output = flex_attention(query, key, value, block_mask=block_mask)

def causal_score_mod(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

# flex_attention() automatically converts indices from "stacked sequence" space -> relative sequence space for NJT inputs
output2 = flex_attention(query, key, value, score_mod=causal_score_mod)
```
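
For reference, one way the NJT inputs above might be constructed; the dims and sizes here are illustrative assumptions, not part of the PR:

```python
import torch

B, H, D = 2, 4, 16
seq_lens = [3, 5]  # one ragged sequence length per batch element

def make_njt():
    # Build (B, S*, H, D) with the jagged layout, then swap dims 1 and 2
    # to get the (B, H, S*, D) layout that flex_attention expects.
    njt = torch.nested.nested_tensor(
        [torch.randn(s, H, D) for s in seq_lens], layout=torch.jagged
    )
    return njt.transpose(1, 2)

query, key, value = make_njt(), make_njt(), make_njt()
```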

TODO:

  • (done) Determine the right level of abstraction for public API helpers + move them alongside other helpers; still needs verification with others
  • (done) Some cleanup
  • (done) `njt_score_mod_adapter`
  • (done) Q: should `create_njt_block_mask()` call `njt_mask_mod_adapter()` so we don't need two calls?
  • Can we avoid materializing the `sum(S*)`-length `seq_idx` used for conversion from "stacked sequence" indices to sequence-relative indices? (illustrated in the sketch after this list)
    • Not for now, although future work may deepen the integration between Flex + NJT (possibly requiring custom templates). We should try to cache this, though.
  • (done) Demonstrate non-causal mask
  • Support non-contiguous NJTs with holes (booted to a future PR)
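
For reference, the materialized `seq_idx` above is conceptually just the per-position batch index; a minimal sketch, assuming sequence lengths [3, 5, 2]:

```python
import torch

lengths = torch.tensor([3, 5, 2])
seq_idx = torch.repeat_interleave(torch.arange(len(lengths)), lengths)
# tensor([0, 0, 0, 1, 1, 1, 1, 1, 2, 2]) -- one entry per stacked position
```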

cc @cpuhrsch @bhosmer @drisspg @soulitzer @davidberard98 @YuqingJ @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @rec

[ghstack-poisoned]
@pytorch-bot (bot) commented Sep 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136792

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 081e220 with merge base 239a21f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jbschlosser jbschlosser requested review from drisspg and removed request for mruberry September 26, 2024 20:50
@jbschlosser jbschlosser added the module: nestedtensor, topic: improvements, and release notes: nested tensor labels Sep 26, 2024
@jbschlosser jbschlosser marked this pull request as draft September 26, 2024 20:56
This PR adds FlexAttention + NJT support. In particular:
* Provides `py_impls` for `NestedTensor` to the HOPs for flex attention forward / backward that simply wrap / unwrap NJTs appropriately
* Adds barebones `new_empty()` support to NJT since FlexAttention utilizes this repeatedly; right now, only `new_empty()` with a shape of `()` is supported
* Fixes a NJT + autograd.Function integration issue; the autograd engine calls `zeros()` [here](https://github.com/pytorch/pytorch/blob/5789f8d5dc2b2b65dca740d3e4dd135a36f0c545/torch/csrc/autograd/python_function.cpp#L404-L408), which breaks with nested ints. This is addressed by storing the entire NJT so `zeros_like()` can be called instead (possibly a bad way to handle this, but we don't have proper factory function support yet; cc @soulitzer). A short illustration follows this list.
* Tests that FlexAttention with a causal mask matches causal SDPA
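
A rough illustration of why `zeros()` breaks here; a sketch assuming jagged-layout NJTs, not the autograd engine's actual code path:

```python
import torch

njt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
print(njt.shape)  # e.g. torch.Size([2, j1, 8]); j1 is a nested int
# A shape-based factory like torch.zeros(njt.shape) can't interpret the
# nested int, while zeros_like() mirrors the ragged structure directly.
grad = torch.zeros_like(njt)
```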

TODO:
* Determine the right level of abstraction for public API helpers + move them alongside other helpers
* Some cleanup
* Demonstrate non-causal mask?

[ghstack-poisoned]
jbschlosser added a commit that referenced this pull request Sep 27, 2024
ghstack-source-id: 2032e78
Pull Request resolved: #136792
@albanD albanD removed their request for review September 27, 2024 20:34
@zou3519 zou3519 removed their request for review September 30, 2024 14:21
@ezyang ezyang requested a review from Chillee September 30, 2024 18:29
jbschlosser added a commit that referenced this pull request Sep 30, 2024
ghstack-source-id: c61a348
Pull Request resolved: #136792
@jbschlosser jbschlosser requested a review from ani300 October 1, 2024 17:09
This PR adds FlexAttention + NJT support. In particular:
* Provides `py_impls` for `NestedTensor` to the HOPs for flex attention forward / backward that simply wrap / unwrap NJTs appropriately
* Adds barebones `new_empty()` support to NJT since FlexAttention utilizes this repeatedly; right now, only `new_empty()` with a shape of `()` is supported
* ~~Fixes a NJT + autograd.Function integration issue; the autograd engine calls `zeros()` [here](https://github.com/pytorch/pytorch/blob/5789f8d5dc2b2b65dca740d3e4dd135a36f0c545/torch/csrc/autograd/python_function.cpp#L404-L408), which breaks with nested ints. This is addressed by storing the entire NJT so `zeros_like()` can be called instead (possibly a bad way to handle this, but we don't have proper factory function support yet soulitzer)~~ Moved below in the PR stack
* Tests that FlexAttention with a causal mask matches causal SDPA

TODO:
* Determine the right level of abstraction for public API helpers + move them alongside other helpers
* Some cleanup
* Demonstrate non-causal mask?
* Support non-contiguous NJTs with holes (cc @ani300)

[ghstack-poisoned]
This PR adds FlexAttention + NJT support. In particular:
* To handle raggedness, treats the packed sequence dim of input NJTs as a giant "stacked sequence". To ensure user `score_mod` / `mask_mod` functions can still be written in the original NJT sequence space, this PR handles conversions for indices within the giant "stacked sequence" -> sequence relative indices automatically.
* Provides `py_impls` for `NestedTensor` to the HOPs for flex attention forward / backward that simply wrap / unwrap NJTs appropriately
* Adds barebones `new_empty()` support to NJT since FlexAttention utilizes this repeatedly; right now, only `new_empty()` with a shape of `()` is supported
* Tests that FlexAttention with a causal mask matches causal SDPA
* Adds a new public API for FlexAttention usage:
    * `create_njt_block_mask(mask_mod, B, H, njt, BLOCK_SIZE, _compile)` - NJT analogue for `create_block_mask()` that utilizes the `njt`'s ragged structure to create an appropriately-sized block mask (e.g. `(1, 1, total_seqlen, total_seqlen)`). This function handles the index conversion from "stacked sequence" space -> relative sequence space. 

Example usage:
```python
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

query = ... # NJT of shape (B, H, S*, D)
key = ... # NJT of shape (B, H, S*, D)
value = ... # NJT of shape (B, H, S*, D)
# create_njt_block_mask() automatically converts indices from "stacked sequence" space -> relative sequence space
block_mask = create_njt_block_mask(causal_mask, 1, 1, query)  # block mask conceptual shape is (B, H, sum(S*), sum(S*))
output = flex_attention(query, key, value, block_mask=block_mask)

def causal_score_mod(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

# flex_attention() automatically converts indices from "stacked sequence" space -> relative sequence space for NJT inputs
output2 = flex_attention(query, key, value, score_mod=causal_score_mod)
```

TODO:
* ~~Determine the right level of abstraction for public API helpers + move them alongside other helpers~~ Verify this with others though
* ~~Some cleanup~~
* ~~`njt_score_mod_adapter`~~
* ~~Q: should `create_njt_block_mask()` call `njt_mask_mod_adapter()` so we don't need two calls?~~
* Can we avoid materializing the `sum(s)` length `seq_idx` used for conversion between stacked sequence -> sequence relative indices?
    * Not for now, although future work may deepen the integration between Flex + NJT (possibly requiring custom templates). We should try to cache this though.
* ~~Demonstrate non-causal mask~~
* Support non-contiguous NJTs with holes (**booted to future PR**)

[ghstack-poisoned]
@jbschlosser (Contributor Author) commented:
@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Oct 28, 2024
@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Oct 29, 2024
Pull Request resolved: pytorch#136792
Approved by: https://github.com/drisspg
ghstack dependencies: pytorch#138841

```python
flex_attention_supported_platform = unittest.skipUnless(
    torch.cuda.is_available()
    and torch.version.hip is None
    # ... (remaining conditions and skip message elided in this excerpt)
)
```
Collaborator

Why is this disabled for AMD/ROCm?

Collaborator

Please can we re-enable ROCm testing for flex here @jbschlosser? If there are UT failures from this change that required skipping on ROCm, this should be communicated rather than just skipping all tests on ROCm. cc: @jeffdaily @jithunnair-amd

Contributor

Yeah this is a mistake
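
For reference, a plausible shape for a corrected, ROCm-inclusive guard; this is an illustrative assumption, not the actual diff from the follow-up fix:

```python
import unittest

import torch
from torch.utils._triton import has_triton

# Hypothetical sketch: require an available accelerator with Triton,
# without excluding HIP (ROCm) builds via torch.version.hip.
flex_attention_supported_platform = unittest.skipUnless(
    torch.cuda.is_available() and has_triton(),
    "flex attention requires a CUDA or ROCm device with Triton",
)
```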

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
pytorchmergebot pushed a commit that referenced this pull request Nov 6, 2024
#136792 accidentally disabled flex attention UTs on ROCm. Re-enabling.

Pull Request resolved: #139632
Approved by: https://github.com/drisspg
@github-actions github-actions bot deleted the gh/jbschlosser/180/head branch December 5, 2024 02:13
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
pytorch#136792 accidentally disabled flex attention UTs on ROCm. Re-enabling.

Pull Request resolved: pytorch#139632
Approved by: https://github.com/drisspg
@ezyang (Contributor) commented Dec 6, 2024

Were there any performance benchmarks done for this?


Labels

ciflow/inductor, ciflow/trunk, Merged, module: dynamo, module: inductor, module: nestedtensor, release notes: nested tensor, topic: improvements
