Turn on static cuda launcher in OSS by jamesjwu · Pull Request #151691 · pytorch/pytorch

Conversation

jamesjwu
Contributor

@jamesjwu jamesjwu commented Apr 18, 2025

Stack from ghstack (oldest at bottom):

After a few small bugfixes to tests (so that we throw/catch exceptions similar to Triton's), I think we're ready to flip the switch and turn StaticCudaLauncher on by default in OSS.
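For anyone who wants to pin the behavior explicitly rather than rely on the new default, a minimal sketch (assuming the switch is the `use_static_cuda_launcher` knob on `torch._inductor.config`, which the XPU discussion below refers to):

```python
import torch
import torch._inductor.config as inductor_config

# Assumption: the switch this PR flips is the inductor config knob
# named use_static_cuda_launcher (see the XPU discussion below).
inductor_config.use_static_cuda_launcher = True   # opt in explicitly
# inductor_config.use_static_cuda_launcher = False  # fall back to Triton's launcher

@torch.compile
def f(x):
    return torch.relu(x) + 1.0

if torch.cuda.is_available():
    f(torch.randn(8, device="cuda"))
```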

The initial round of benchmarks looks good, with average compilation time going down by a few percent:
[screenshot: compile-time comparison]

With no changes to runtime perf:
[screenshot: runtime performance comparison]

There are a few noisy models I want to double-check, though, so I'll run some more tests before marking this ready for review.

Full benchmark results, showing a ~5% compile time improvement across the board:
https://hud.pytorch.org/benchmark/huggingface/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Wed%2C%2016%20Apr%202025%2002%3A31%3A12%20GMT&stopTime=Wed%2C%2023%20Apr%202025%2002%3A31%3A12%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/jamesjwu/139/orig&lCommit=cc45c8667fa23dec16ca50002d9504a34688ca5c&rBranch=main&rCommit=2a9afdae81d0dde98e96d7e3c9ca840e241e5405
[screenshot: full benchmark dashboard results]

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Apr 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151691

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 29 Pending

As of commit 01feeaf with merge base 6efc572:
💚 Looks good so far! There are no failures yet. 💚

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jamesjwu added a commit that referenced this pull request Apr 18, 2025
ghstack-source-id: 4d0d534
Pull Request resolved: #151691
@jamesjwu jamesjwu added the topic: not user facing (topic category) label Apr 18, 2025
[ghstack-poisoned]
@jamesjwu jamesjwu marked this pull request as draft April 21, 2025 15:09
[ghstack-poisoned]
jamesjwu added a commit that referenced this pull request Apr 21, 2025
ghstack-source-id: 606ac45
Pull Request resolved: #151691
[ghstack-poisoned]
jamesjwu added a commit that referenced this pull request Apr 21, 2025
ghstack-source-id: 68bcacf
Pull Request resolved: #151691
@jamesjwu jamesjwu added the ciflow/trunk (Trigger trunk jobs on your pull request) label Apr 22, 2025
[ghstack-poisoned]
[ghstack-poisoned]
jamesjwu added a commit that referenced this pull request Apr 22, 2025
ghstack-source-id: c718c76
Pull Request resolved: #151691
@jamesjwu jamesjwu changed the title Turn on static cuda launcher Turn on static cuda launcher in OSS Apr 22, 2025
@jamesjwu jamesjwu requested review from eellison, jansel and oulgen April 23, 2025 02:33
@jamesjwu
Contributor Author

jamesjwu commented Apr 23, 2025

I double-checked the regression for BartForConditionalGeneration and saw that the 4/19 perf run on main was an outlier in terms of execution speedup. It usually floats around 1.4x, so it's unlikely that static cuda launcher is slowing it down.

For example, when compared against the 4/19 or 4/21 run, you get something like this:

[screenshot: speedup comparison against earlier runs]
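To make the outlier reasoning concrete, a toy sketch with made-up speedup numbers (not the actual dashboard data):

```python
from statistics import mean, stdev

# Made-up per-day speedups for one model; `latest` plays the role of
# the 4/19-style outlier described above.
history = [1.41, 1.39, 1.42, 1.40]
latest = 1.18

mu, sigma = mean(history), stdev(history)
if abs(latest - mu) > 3 * sigma:
    print(f"{latest:.2f}x deviates from the usual ~{mu:.2f}x: likely a noisy run, not a real regression")
```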

@jamesjwu jamesjwu marked this pull request as ready for review April 23, 2025 15:24
@jamesjwu
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

[ghstack-poisoned]
jamesjwu added a commit that referenced this pull request Apr 24, 2025
ghstack-source-id: cb1c5a8
Pull Request resolved: #151691
@seemethere
Member

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: trunk / linux-focal-rocm-py3.10 / test (default, 2, 2, linux.rocm.gpu.2)

Details for Dev Infra team. Raised by workflow job.

@etaf
Collaborator

etaf commented Apr 25, 2025

Hi, may I suggest we disable use_static_cuda_launcher on XPU by default? Since the implementation is CUDA-specific, this PR will break XPU. I'll generalize use_static_cuda_launcher into use_static_gpu_launcher and turn it on for XPU. Thanks.

@jamesjwu
Contributor Author

@pytorchbot merge

@jamesjwu
Contributor Author

Oops, I just saw that comment; I'll do that first.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@jamesjwu
Contributor Author

@etaf does XPU not change the device_type listed here? I had figured this check would gate static cuda launcher to only the cuda device type.

if triton_meta.get("device_type", None) != "cuda":
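For illustration, a minimal sketch of what this guard does; the helper name is hypothetical, and only the device_type key comes from the quoted line:

```python
def can_use_static_cuda_launcher(triton_meta: dict) -> bool:
    # Hypothetical helper illustrating the quoted guard: StaticCudaLauncher
    # only understands CUDA binaries, so any other device_type falls back
    # to Triton's own launcher.
    return triton_meta.get("device_type", None) == "cuda"

assert can_use_static_cuda_launcher({"device_type": "cuda"})
assert not can_use_static_cuda_launcher({"device_type": "xpu"})  # XPU stays on the Triton path
assert not can_use_static_cuda_launcher({})
```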

@jamesjwu
Contributor Author

I'm somewhat confident that that check prevents XPU from using StaticCudaLauncher, as tests like inductor/test_xpu_basic.py pass with this PR. From this line we can see "xpu" does appear as a valid device_type:

bin_type = {"hip": "hsaco", "xpu": "spv"}.get(self.device_props.type, "cubin")
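A self-contained sketch of that lookup, assuming only what the quoted line shows (unknown device types default to "cubin", the CUDA binary format):

```python
# Device type -> Triton kernel binary format, per the quoted line.
# "cubin" is the default, i.e. the CUDA path; "hip" and "xpu" override it.
bin_types = {"hip": "hsaco", "xpu": "spv"}

for device_type in ("cuda", "hip", "xpu"):
    print(device_type, "->", bin_types.get(device_type, "cubin"))
# cuda -> cubin
# hip -> hsaco
# xpu -> spv
```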

@etaf
Collaborator

etaf commented Apr 25, 2025

> I'm somewhat confident that that check prevents XPU from using StaticCudaLauncher, as tests like inductor/test_xpu_basic.py pass with this PR. From this line we can see "xpu" does appear as a valid device_type:
>
> bin_type = {"hip": "hsaco", "xpu": "spv"}.get(self.device_props.type, "cubin")

Great, thanks!

@etaf
Collaborator

etaf commented Apr 25, 2025

> @etaf does XPU not change the device_type listed here? I had figured this check would gate static cuda launcher to only the cuda device type.
>
> if triton_meta.get("device_type", None) != "cuda":

Thanks! Then this PR will not break XPU; please go ahead.

@jamesjwu
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@malfet
Contributor

malfet commented Apr 25, 2025

@pytorchbot merge -f "Workflow has been scheduled?"

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.


Labels

ci-no-td (Do not run TD on this PR), ciflow/inductor, ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: inductor, Reverted, topic: not user facing (topic category)
