[Inductor] Improve RoPE by BoyuanFeng · Pull Request #161420 · pytorch/pytorch · GitHub

Conversation

@BoyuanFeng
Contributor

@BoyuanFeng BoyuanFeng commented Aug 25, 2025

This PR fuses RoPE from 2 kernels into 1 kernel.

Shape:

q: [B, Hq, S, D]
k: [B, Hkv, S, D]

`Hq=32, Hkv=8, D=128`, following the Llama3 setting.

<img width="980" height="624" alt="image" src="https://github.com/user-attachments/assets/652a8227-6f1d-465c-97fd-2b0af41f8ed9" />
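For context, a minimal sketch of the pattern being fused, assuming a generic rotate-half RoPE formulation (this is not the exact benchmark script; the `B`, `S`, and dtype values are illustrative, while `Hq`, `Hkv`, and `D` come from the description above). Applying the rotary embedding to `q` and to `k` are two independent elementwise computations that read the same broadcast `cos`/`sin` tensors, and previously compiled into two separate kernels:

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    # Two independent elementwise computations that share only the broadcast
    # cos/sin reads; with this PR Inductor can fuse them into a single kernel.
    q_out = q * cos + rotate_half(q) * sin
    k_out = k * cos + rotate_half(k) * sin
    return q_out, k_out

B, S, D = 2, 4096, 128   # B and S are illustrative; D=128 as in the PR
Hq, Hkv = 32, 8          # Llama3 setting from the PR
q = torch.randn(B, Hq, S, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, Hkv, S, D, device="cuda", dtype=torch.bfloat16)
cos = torch.randn(1, 1, S, D, device="cuda", dtype=torch.bfloat16)
sin = torch.randn(1, 1, S, D, device="cuda", dtype=torch.bfloat16)

q_out, k_out = torch.compile(apply_rope)(q, k, cos, sin)
```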

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

@pytorch-bot

pytorch-bot bot commented Aug 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161420

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c121f59 with merge base 9491d28:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@BoyuanFeng BoyuanFeng added the ciflow/trunk and ci-no-td labels Aug 26, 2025
@BoyuanFeng BoyuanFeng marked this pull request as ready for review August 26, 2025 23:12
@BoyuanFeng BoyuanFeng requested a review from eellison August 26, 2025 23:12
@BoyuanFeng BoyuanFeng marked this pull request as draft August 27, 2025 05:23

# Threshold to decide if a kernel has small memory access in bytes
# Default value is 16 MB which is arbitrarily selected.
small_memory_access_threshold: int = 16777216
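
An illustrative sketch of how a threshold like this could gate a fusion decision; this is not the actual Inductor scheduler code, and the function and variable names below are hypothetical:

```python
# Hypothetical illustration only; not the actual Inductor scheduler logic.
SMALL_MEMORY_ACCESS_THRESHOLD = 16 * 2**20  # 16 MiB, same default as above

def is_small_memory_access(total_bytes: int) -> bool:
    # Kernels whose total memory traffic falls below the threshold are treated
    # as launch-overhead-bound, so fusing them can pay off even when they
    # share relatively little data.
    return total_bytes < SMALL_MEMORY_ACCESS_THRESHOLD
```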
Contributor

Could we run some rough benchmarks on this threshold for RoPE, if you haven't? It would be good to know in general.

Contributor Author

Yes, we can see the phase-out:
(benchmark image)

@shunting314
Contributor

Sorry, I still feel a bit skeptical here; can you clarify a bit?

  1. The main motivation here is to make sure the two kernels that apply RoPE to Q and K can be fused, right? But the only memory access common to these 2 kernels is the frequency tensor (sine/cosine etc.), which is broadcast and small compared to Q/K; see the sizing sketch after this list.
  2. The benefit of launching fewer kernels can mostly be achieved by cudagraphs.
  3. Expanding ir.Node shapes in general sounds a bit tricky, and we need to be careful not to hurt perf.
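
A back-of-the-envelope sizing sketch for the shapes in the PR description; `B=2`, `S=4096`, and bf16 are assumptions, while `Hq`, `Hkv`, and `D` come from the PR:

```python
# Rough traffic estimate for the two RoPE kernels; B, S, and dtype are assumed.
B, S, D = 2, 4096, 128
Hq, Hkv = 32, 8
bytes_per_elem = 2  # bf16

q_bytes = B * Hq * S * D * bytes_per_elem    # one read plus one write each,
k_bytes = B * Hkv * S * D * bytes_per_elem   # ignoring the rotate-half pattern
freq_bytes = S * D * bytes_per_elem          # broadcast cos (or sin), read by both kernels

print(f"q traffic   ~{2 * q_bytes / 2**20:.0f} MiB")   # ~128 MiB
print(f"k traffic   ~{2 * k_bytes / 2**20:.0f} MiB")   # ~32 MiB
print(f"cos or sin  ~{freq_bytes / 2**20:.0f} MiB")    # ~1 MiB
```

Under these assumed sizes, the shared cos/sin reads are small relative to the Q/K traffic, which is the point of item 1 above.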

@shunting314
Contributor

Stamping since it's off by default. But I'm not fully convinced that saving 32 us (1 us per layer) over the whole llama3-8b inference is worth adding this complexity to the compiler. Maybe try to find out whether the optimization can be applied more broadly.

@shunting314 shunting314 self-requested a review September 5, 2025 19:11
@shunting314
Contributor

Also make sure to address Elias's comment above before turning this on by default.

@BoyuanFeng
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@ProExpertProg

ProExpertProg commented Sep 5, 2025

The benefit of launching fewer kernels can mostly be achieved by cudagraphs.

@shunting314 we have found that not to be the case in vLLM, and the extra kernel call is expensive, even with cudagraph enabled

@shunting314
Contributor

@ProExpertProg

we have found that not to be the case in vLLM, and the extra kernel call is expensive, even with cudagraph enabled

Can you elaborate a bit? Do you mean in cases where there are a lot of small kernels, or in general? Benchmarking seems to show the cost is around 1 us.
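
For reference, one rough way to arrive at a per-launch number like that; this is a sketch assuming a CUDA device and eager mode, not the benchmark referenced in this thread:

```python
import torch

# Time many back-to-back launches of a tiny elementwise op and divide by the
# launch count to approximate per-kernel launch overhead.
assert torch.cuda.is_available()
x = torch.randn(16, device="cuda")  # tiny tensor so each kernel is launch-bound

for _ in range(100):                # warmup
    x.add_(1.0)
torch.cuda.synchronize()

n_iters = 10_000
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(n_iters):
    x.add_(1.0)
end.record()
torch.cuda.synchronize()
print(f"~{start.elapsed_time(end) * 1e3 / n_iters:.2f} us per kernel")
```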

daisyden pushed a commit to daisyden/pytorch that referenced this pull request Sep 8, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
@BoyuanFeng BoyuanFeng added this to the 2.9.0 milestone Sep 19, 2025
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
@ProExpertProg

@shunting314 yeah, if you look at vllm-project/vllm#22293, you can see that the currently generated sequence of 3 Triton kernels causes significant overhead.

dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025

Labels

ci-no-td, ciflow/inductor, ciflow/trunk, Merged, module: inductor, topic: not user facing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants