Clear CompiledTritonKernel cache after each inductor compile by jamesjwu · Pull Request #146925 · pytorch/pytorch · GitHub

Conversation

@jamesjwu
Contributor

@jamesjwu jamesjwu commented Feb 11, 2025

Stack from ghstack (oldest at bottom):

Fix a bug introduced by D69123174: because Triton kernels are now returned directly by the worker, each future created for a Triton kernel should only be used once per compile. Otherwise, a long-running process that does something like this:

```
compiled_1 = torch.compile(fn1, mode="max-autotune", fullgraph=True)
out_compiled = compiled_1(*inputs)  # run compiled_1
compiled_2 = torch.compile(fn2, mode="max-autotune", fullgraph=True)
```

where fn1 and fn2 are very similar (i.e., they would generate the same Triton kernel source code) would end up reusing the launcher from the first autotuning run: the launcher is set to None after running, so the same future/kernel would be used again without regenerating the launcher.
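
To make the failure mode concrete, here is a minimal toy sketch (illustrative names only, not Inductor's actual classes) of how a source-keyed kernel cache combined with a single-use launcher goes wrong:

```
class FakeKernel:
    """Toy stand-in for a compiled Triton kernel (hypothetical class)."""

    def __init__(self):
        self.launcher = lambda: "ran"

    def run(self):
        out = self.launcher()
        # Autotuning consumes the launcher: it is cleared after the first run.
        self.launcher = None
        return out


_kernel_cache: dict[str, FakeKernel] = {}  # keyed by generated Triton source


def get_kernel(source: str) -> FakeKernel:
    # BUG: without clearing this cache between compiles, a later compile of
    # identical source gets back a kernel whose launcher is already None.
    if source not in _kernel_cache:
        _kernel_cache[source] = FakeKernel()
    return _kernel_cache[source]


k1 = get_kernel("triton_src")
k1.run()                       # first compile: works, launcher is now None
k2 = get_kernel("triton_src")  # second compile of a similar fn: same object!
# k2.run() would raise TypeError: 'NoneType' object is not callable
```

Clearing the cache at the end of each inductor compile keeps each future scoped to the compile that created it.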

Found this bug testing internal inference models.

This does not remove @eellison's caching for prologue benchmarking, because that caching happens within the same compile: #143408

Differential Revision: D69476856

NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on Phabricator!

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

@pytorch-bot

pytorch-bot bot commented Feb 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146925

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 5 Pending, 4 Unrelated Failures

As of commit 6cf05de with merge base 30cbf13:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D69476856

jamesjwu added a commit that referenced this pull request Feb 11, 2025
@jamesjwu jamesjwu added topic: not user facing topic category ciflow/trunk Trigger trunk jobs on your pull request labels Feb 11, 2025
Contributor

@laithsakka laithsakka left a comment


Also, if you can follow up with a unit test, that would be great.

```
compiled_graph.post_compile(example_inputs, cudagraphs, constants)

log.debug("FX codegen and compilation took %.3fs", time.time() - start)
# Clear Compiled Triton Kernels per inductor compile
```

Can you add a comment explaining why this is important? Emphasize that it must be done.
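
For illustration only, the requested comment plus the clearing call might look roughly like this; the `CompiledTritonKernels.cache_clear()` name and its import path are assumptions inferred from the PR title, not a verified API:

```
from torch._inductor.async_compile import CompiledTritonKernels  # assumed location

# Clear the compiled Triton kernel cache once per inductor compile. Kernel
# futures returned by the compile worker are single-use: the first autotuning
# run consumes the launcher and sets it to None, so a kernel cached here must
# never leak into a later compile that generates identical source.
CompiledTritonKernels.cache_clear()
```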

@jamesjwu
Contributor Author

The merge base is too old here; will rebase.

@jamesjwu
Contributor Author

Having a bit of trouble isolating the exact model that breaks when this happens (it's not just a simple add_mm or something, but a complicated internal model). I may land this and then add a unit test.
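
One possible shape for such a test (a sketch only, not the test that eventually landed; it assumes a CUDA device and two functions that lower to identical Triton source):

```
import torch


def fn1(a, b):
    return (a @ b).relu()


def fn2(a, b):  # intentionally lowers to the same Triton source as fn1
    return (a @ b).relu()


a = torch.randn(256, 256, device="cuda")
b = torch.randn(256, 256, device="cuda")

compiled_1 = torch.compile(fn1, mode="max-autotune", fullgraph=True)
compiled_1(a, b)  # the first run consumes the kernel futures' launchers

# Before the fix, this second compile could pick up the stale cached kernel
# (same generated source) whose launcher had already been set to None.
compiled_2 = torch.compile(fn2, mode="max-autotune", fullgraph=True)
compiled_2(a, b)
```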

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D69476856

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D69476856

jamesjwu added a commit that referenced this pull request Feb 11, 2025
@laithsakka
Contributor

> Having a bit of trouble isolating the exact model that breaks when this happens (it's not just a simple add_mm or something, but a complicated internal model). I may land this and then add a unit test.

If it's not too much trouble; I figure that sometimes it's hard to create unit test repros.

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: inductor / unit-test / cuda12.4-py3.13-gcc9-sm86 / test (inductor, 1, 2, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team: raised by workflow job.

@huydhn
Contributor

huydhn commented Feb 12, 2025

@pytorchbot merge -i

@huydhn
Contributor

huydhn commented Feb 12, 2025

@pytorchbot merge -f 'Bypass ROCm unstable jobs'

1 similar comment

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@github-actions github-actions bot deleted the gh/jamesjwu/109/head branch March 23, 2025 02:17
