[Inductor] Improve RoPE by BoyuanFeng · Pull Request #161420 · pytorch/pytorch · GitHub

Conversation

@BoyuanFeng
Contributor

@BoyuanFeng BoyuanFeng commented Aug 25, 2025

This PR fuses RoPE from 2 kernels into 1 kernel.

Shape:

q: [B, Hq, S, D]
k: [B, Hkv, S, D]

`Hq=32, Hkv=8, D=128`, following the Llama3 setting.

<img width="980" height="624" alt="image" src="https://github.com/user-attachments/assets/652a8227-6f1d-465c-97fd-2b0af41f8ed9" />
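For context, a minimal sketch of the pattern being fused, assuming a generic rotate-half RoPE formulation (this is not the exact benchmark script; the `B`, `S`, and dtype values are illustrative, while `Hq`, `Hkv`, and `D` come from the description above). Applying the rotary embedding to `q` and to `k` are two independent elementwise computations that read the same broadcast `cos`/`sin` tensors, and previously compiled into two separate kernels:

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    # Two independent elementwise computations that share only the broadcast
    # cos/sin reads; with this PR Inductor can fuse them into a single kernel.
    q_out = q * cos + rotate_half(q) * sin
    k_out = k * cos + rotate_half(k) * sin
    return q_out, k_out

B, S, D = 2, 4096, 128   # B and S are illustrative; D=128 as in the PR
Hq, Hkv = 32, 8          # Llama3 setting from the PR
q = torch.randn(B, Hq, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, Hkv, S, D, device="cuda", dtype=torch.bfloat16)
cos = torch.randn(1, 1, S, D, device="cuda", dtype=torch.bfloat16)
sin = torch.randn(1, 1, S, D, device="cuda", dtype=torch.bfloat16)

q_out, k_out = torch.compile(apply_rope)(q, k, cos, sin)
```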

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

@pytorch-bot

pytorch-bot bot commented Aug 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161420

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c121f59 with merge base 9491d28:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@BoyuanFeng BoyuanFeng added the ciflow/trunk and ci-no-td labels Aug 26, 2025
@BoyuanFeng BoyuanFeng marked this pull request as ready for review August 26, 2025 23:12
@BoyuanFeng BoyuanFeng requested a review from eellison August 26, 2025 23:12
@BoyuanFeng BoyuanFeng marked this pull request as draft August 27, 2025 05:23

# Threshold to decide if a kernel has small memory access in bytes
# Default value is 16 MB which is arbitrarily selected.
small_memory_access_threshold: int = 16777216
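
An illustrative sketch of how a threshold like this could gate a fusion decision; this is not the actual Inductor scheduler code, and the function and variable names below are hypothetical:

```python
# Hypothetical illustration only; not the actual Inductor scheduler logic.
SMALL_MEMORY_ACCESS_THRESHOLD = 16 * 2**20  # 16 MiB, same default as above

def is_small_memory_access(total_bytes: int) -> bool:
    # Kernels whose total memory traffic falls below the threshold are treated
    # as launch-overhead-bound, so fusing them can pay off even when they
    # share relatively little data.
    return total_bytes < SMALL_MEMORY_ACCESS_THRESHOLD
```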
Contributor

Could we run some rough benchmarks on this threshold for RoPE, if you haven't? It would be good to know in general.

Contributor Author

Yes, we can see the phase-out:
(benchmark image)

@shunting314
Contributor

Sorry, I still feel a bit skeptical here; can you clarify a bit?

  1. The main motivation here is to make sure the two kernels that apply RoPE to Q and K can be fused, right? But the only memory access common to these 2 kernels is the frequency tensor (sine/cosine etc.), which is broadcast and small compared to Q/K; see the sizing sketch after this list.
  2. The benefit of launching fewer kernels can mostly be achieved by cudagraphs.
  3. Expanding ir.Node shapes in general sounds a bit tricky, and we need to be careful not to hurt perf.
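
A back-of-the-envelope sizing sketch for the shapes in the PR description; `B=2`, `S=4096`, and bf16 are assumptions, while `Hq`, `Hkv`, and `D` come from the PR:

```python
# Rough traffic estimate for the two RoPE kernels; B, S, and dtype are assumed.
B, S, D = 2, 4096, 128
Hq, Hkv = 32, 8
bytes_per_elem = 2  # bf16

q_bytes = B * Hq * S * D * bytes_per_elem    # one read plus one write each,
k_bytes = B * Hkv * S * D * bytes_per_elem   # ignoring the rotate-half pattern
freq_bytes = S * D * bytes_per_elem          # broadcast cos (or sin), read by both kernels

print(f"q traffic   ~{2 * q_bytes / 2**20:.0f} MiB")   # ~128 MiB
print(f"k traffic   ~{2 * k_bytes / 2**20:.0f} MiB")   # ~32 MiB
print(f"cos or sin  ~{freq_bytes / 2**20:.0f} MiB")    # ~1 MiB
```

Under these assumed sizes, the shared cos/sin reads are small relative to the Q/K traffic, which is the point of item 1 above.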

@shunting314
Contributor

Stamping since it's off by default. But I'm not fully convinced that saving 32 us (1 us per layer) over the whole llama3-8b inference is worth adding this complexity to the compiler. Maybe try to find out whether the optimization can be applied more broadly.

@shunting314 shunting314 self-requested a review September 5, 2025 19:11
@shunting314
Contributor

Also make sure to address Elias's comment above before turning this on by default.

@BoyuanFeng
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@ProExpertProg

ProExpertProg commented Sep 5, 2025

The benefit of launching fewer kernels can mostly be achieved by cudagraphs.

@shunting314 we have found that not to be the case in vLLM, and the extra kernel call is expensive, even with cudagraph enabled

@shunting314
Contributor

@ProExpertProg

we have found that not to be the case in vLLM, and the extra kernel call is expensive, even with cudagraph enabled

Can you elaborate a bit? Do you mean in cases where there are a lot of small kernels, or in general? Benchmarking seems to show the cost is around 1 us.
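
For reference, one rough way to arrive at a per-launch number like that; this is a sketch assuming a CUDA device and eager mode, not the benchmark referenced in this thread:

```python
import torch

# Time many back-to-back launches of a tiny elementwise op and divide by the
# launch count to approximate per-kernel launch overhead.
assert torch.cuda.is_available()
x = torch.randn(16, device="cuda")  # tiny tensor so each kernel is launch-bound

for _ in range(100):                # warmup
    x.add_(1.0)
torch.cuda.synchronize()

n_iters = 10_000
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(n_iters):
    x.add_(1.0)
end.record()
torch.cuda.synchronize()
print(f"~{start.elapsed_time(end) * 1e3 / n_iters:.2f} us per kernel")
```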

daisyden pushed a commit to daisyden/pytorch that referenced this pull request Sep 8, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
@BoyuanFeng BoyuanFeng added this to the 2.9.0 milestone Sep 19, 2025
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
@ProExpertProg

@shunting314 yeah, if you look at vllm-project/vllm#22293, you can see that the currently generated sequence of 3 Triton kernels causes significant overhead.

dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025

Labels

ci-no-td, ciflow/inductor, ciflow/trunk, Merged, module: inductor, topic: not user facing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants