Turn on static cuda launcher in OSS #151691
I double-checked the regression for BartForConditionalGenerati and saw that the 4/19 perf run on main was an outlier in terms of execution speedup. It usually floats around 1.4x, so it's unlikely that the static CUDA launcher is slowing it down. For example, if compared to the 4/19 or 4/21 run, you get something like this:
[image: benchmark comparison screenshot]
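To make the outlier argument concrete, here is a minimal sketch of comparing a candidate run against the median of several baseline runs instead of a single run. The helper name `relative_change` and all numbers in the usage note are hypothetical and chosen for illustration only; they are not taken from the dashboard.

```python
from statistics import median


def relative_change(candidate: float, baseline_runs: list[float]) -> float:
    """Relative change of `candidate` versus the median of `baseline_runs`.

    Using the median keeps one unusually fast (or slow) baseline run from
    dominating the comparison.
    """
    if not baseline_runs:
        raise ValueError("need at least one baseline run")
    base = median(baseline_runs)
    return (candidate - base) / base


if __name__ == "__main__":
    # Hypothetical speedup values: three ordinary baseline runs near 1.4x
    # and one outlier.
    baseline = [1.41, 1.39, 1.62, 1.40]
    candidate = 1.40
    print("vs. median of baselines:", relative_change(candidate, baseline))
    print("vs. the outlier run only:", relative_change(candidate, [1.62]))
```

With these made-up values, comparing against the outlier alone suggests a large regression, while comparing against the median shows a change well under one percent.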
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Merge failed. Reason: 1 job has failed: trunk / linux-focal-rocm-py3.10 / test (default, 2, 2, linux.rocm.gpu.2)
Hi, may I suggest we disable use_static_cuda_launcher on XPU by default? Since the implementation is specific to CUDA, this PR will break XPU. I'll generalize
This reverts commit e31e2d2. Reverted pytorch#151691 on behalf of https://github.com/malfet due to This breaks tests, see https://hud.pytorch.org/hud/pytorch/pytorch/c1f51cf2c4fc8259fa48bc506320118e0e907906/1?per_page=50&name_filter=linux-focal-cuda12.6-py3.10&mergeEphemeralLF=true ([comment](pytorch#151691 (comment)))
@pytorchbot merge
Oops, just saw that comment, will do that first.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job waited more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@etaf does XPU not change the device_type listed here? I had figured this would gate the static CUDA launcher to the cuda device type only.
I'm somewhat confident that that check prevents XPU from using StaticCudaLauncher, as tests like
Great, thanks!
Thanks! Then this PR will not break XPU, please go ahead.
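For readers following the XPU discussion, the kind of device-type gate being described could look roughly like the sketch below. This is not the code from the PR: `LauncherConfig` and `should_use_static_launcher` are hypothetical names, and the only detail taken from this thread is that a flag called use_static_cuda_launcher exists and that the check is keyed on the device type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LauncherConfig:
    # Hypothetical stand-in for the real config flag named in this thread.
    use_static_cuda_launcher: bool = True


def should_use_static_launcher(device_type: str, config: LauncherConfig) -> bool:
    """Return True only when the flag is on and the kernel targets CUDA.

    Any other device type (for example "xpu" or "cpu") falls through to
    whatever launcher it used before, regardless of the flag.
    """
    return config.use_static_cuda_launcher and device_type == "cuda"


if __name__ == "__main__":
    cfg = LauncherConfig()
    for device in ("cuda", "xpu", "cpu"):
        print(device, should_use_static_launcher(device, cfg))
```

Under this assumption, turning the flag on by default changes behavior only for kernels whose device type is cuda, which matches the conclusion reached above for XPU.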
@pytorchbot merge
@pytorchbot merge -f "Workflow has been scheduled?"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
After a few small bugfixes on tests (so that we throw/catch exceptions similar to Triton's), I think we're ready to flip the switch and turn StaticCudaLauncher on by default in OSS.
The initial round of benchmarks looks good, with average compilation time going down by a few percent:
[image: compilation time benchmark]
With no change to runtime perf:
[image: runtime performance benchmark]
There are a few noisy models I want to double-check, though, so I will run some more tests before accepting review.
Full benchmark results, showing a ~5% compile-time improvement across the board:
[image: full benchmark results]
https://hud.pytorch.org/benchmark/huggingface/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Wed%2C%2016%20Apr%202025%2002%3A31%3A12%20GMT&stopTime=Wed%2C%2023%20Apr%202025%2002%3A31%3A12%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/jamesjwu/139/orig&lCommit=cc45c8667fa23dec16ca50002d9504a34688ca5c&rBranch=main&rCommit=2a9afdae81d0dde98e96d7e3c9ca840e241e5405
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov