[ROCm] Tune flex-attention and decode to num_stages=1 by jataylo · Pull Request #139883 · pytorch/pytorch

Conversation

@jataylo (Collaborator) commented Nov 6, 2024

Fixes #139755 #139621

The new stream pipeliner in the AMD Triton backend enables num_stages to behave equivalently to the NVIDIA backend. This upgrade in Triton 3.2 causes OOM issues in flex attention under the num_stages=3 setting, so we have tuned this to num_stages=1, which is the best setting for flash attention kernels and avoids the shared-memory (shmem) issues.

We will follow up this PR with further config tuning on the AMD backend.
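
For context, a minimal sketch of the kind of override this PR applies, assuming Inductor-style config tuples of (BLOCK_M, BLOCK_N, num_warps, num_stages); the tile values below are illustrative, not the exact PyTorch defaults:

  import torch

  # Candidate kernel configs: (BLOCK_M, BLOCK_N, num_warps, num_stages).
  # num_stages > 1 enables software pipelining, which keeps extra tile
  # buffers in shared memory; with Triton 3.2's stream pipeliner this can
  # exceed the shared-memory budget on AMD GPUs.
  configs = [
      (128, 64, 4, 3),
      (64, 128, 4, 3),
      (128, 128, 8, 2),
  ]

  # torch.version.hip is a version string on ROCm builds and None on CUDA
  # builds, so the override only fires on AMD.
  if torch.version.hip:
      configs = [(c[0], c[1], c[2], 1) for c in configs]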

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @hongxiayang @naromero77amd @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@jataylo jataylo requested review from drisspg and yanboliang November 6, 2024 13:08
@pytorch-bot bot commented Nov 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139883

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 48340e6 with merge base 314aa26:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the ciflow/inductor, ciflow/rocm, module: inductor, and module: rocm labels Nov 6, 2024
@jataylo jataylo requested a review from bertmaher November 6, 2024 13:09
@jataylo added the release notes: rocm and topic: not user facing labels Nov 6, 2024
@jataylo (Collaborator, Author) commented Nov 7, 2024

@pytorchbot rebase

@pytorchmergebot (Collaborator):

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict.

@pytorchmergebot (Collaborator):

Successfully rebased flex-num-stages onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout flex-num-stages && git pull --rebase)

Comment on lines 873 to 875
# On ROCm convert num_stages to 1 to avoid shmem issues
configs = [(c[0], c[1], c[2], 1) for c in configs]

Contributor:

Where is this config change guarded to only apply to ROCm?

Collaborator (Author):

Ah, good find. I guarded it in flex_decode but missed it out here... it should be this:

  if torch.version.hip:
      configs = [(c[0], c[1], c[2], 1) for c in configs]
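
For illustration only, a hypothetical sketch of how such a config tuple might be consumed downstream (to_kernel_options is an assumed name, not the actual PyTorch helper):

  def to_kernel_options(config):
      # Unpack an Inductor-style config tuple into the keyword arguments
      # a Triton template launch would consume.
      block_m, block_n, num_warps, num_stages = config
      return {
          "BLOCK_M": block_m,
          "BLOCK_N": block_n,
          "num_warps": num_warps,
          "num_stages": num_stages,  # pinned to 1 on ROCm by the guard above
      }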

@bertmaher (Contributor) left a comment:

lgtm!

@bertmaher (Contributor):

@pytorchbot merge

@pytorch-bot bot added the ciflow/trunk label Nov 7, 2024
@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


pytorchmergebot pushed a commit that referenced this pull request Nov 15, 2024
Fixes #139755 #139621

Follow-up fix to #139883, which made the bulk of the required changes, but a logic error resulted in ROCm still using H100 configurations.

Pull Request resolved: #140270
Approved by: https://github.com/bertmaher
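
As a hedged illustration of the kind of logic error described above (the shape is hypothetical, not the actual #140270 diff; _h100_default_config and _rocm_default_config are assumed names):

  import torch

  # Buggy shape: an early return hands back the H100 defaults before the
  # backend check, so the ROCm branch is unreachable on AMD.
  def get_default_config(head_dim):
      if head_dim <= 128:
          return _h100_default_config  # also taken on ROCm: the bug
      if torch.version.hip:
          return _rocm_default_config
      return _h100_default_config

  # Fixed shape: decide on the backend first, then fall through to the
  # NVIDIA defaults.
  def get_default_config_fixed(head_dim):
      if torch.version.hip:
          return _rocm_default_config
      return _h100_default_config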

Labels

ciflow/inductor, ciflow/rocm, ciflow/trunk, Merged, module: inductor, module: rocm, open source, release notes: rocm, topic: not user facing


Development

Successfully merging this pull request may close these issues.

[ROCm] [Triton 3.2] OOM shmem issues on Inductor tests with new SW pipelining
