Only call triton in worker process, kick off worker processes earlier, during inductor codegen #146417

Conversation
### Big idea

This PR extends #144288 by combining calling triton in worker processes with the future cache: we kick off triton compilation in the worker processes earlier, during inductor codegen. Instead of calling async_compile.triton for the first time only after the entire code has been generated, we start compiling as soon as we know we'll need to compile the kernel. Then, when loading the generated inductor code, we can simply read from our in-memory future cache, considerably increasing parallelism.

### Implementation Overview

In total, the diff does the following:

- Converts TritonFuture to LambdaFuture, only calling triton.compile on worker processes
- Now that triton.compile() isn't called on the main process, calls TritonBundler on all compiled kernels when we get them back from workers
- Extends @eellison's future cache to a class, mostly as a refactor
- Finally, calls async_compile.triton ahead of time in Scheduler.codegen if workers are warmed up. This causes the subsequent async_compile.triton call that occurs after codegen to hit the cache on a cold start.

In the diffs after this, I will add more to CompiledTritonKernels so that TritonBundler, on a warm start, automatically populates the in-memory cache with the existing triton kernels, avoiding calling triton altogether on warm starts.

Because LambdaFutures are much faster to kick off than TritonFutures, as they don't need to load from TritonCodeCache at all, the time spent kicking off these worker jobs during inductor codegen is minimal.

### Can we split the diff for easier review?

It's best if this diff lands atomically with all of these changes, since the ahead-of-time codegen compile is only performant if we replace TritonFuture with LambdaFuture (as we don't need to load the triton kernel on the main process). However, I've made a diff stack for easier reviewing here:

- D69070048 - Run async_compile.triton ahead of time in Scheduler.codegen
- D68633454 - Only call triton in worker process

Differential Revision: [D69123174](https://our.internmc.facebook.com/intern/diff/D69123174/)
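The core pattern here is just an in-memory map from kernel source to a compilation future, populated during codegen and read back at load time. Here is a minimal standalone sketch of that idea; the names (`FutureCache`, `compile_kernel`) are illustrative, not the actual inductor APIs:

```
# Hypothetical sketch of the future-cache pattern (not the real inductor API).
from concurrent.futures import Future, ProcessPoolExecutor


def compile_kernel(source: str) -> str:
    # Stand-in for triton.compile() running in a worker process.
    return f"compiled({source})"


class FutureCache:
    """In-memory cache mapping kernel source -> compilation future."""

    def __init__(self, pool: ProcessPoolExecutor) -> None:
        self._pool = pool
        self._futures: dict[str, Future] = {}

    def kick_off(self, source: str) -> None:
        # Called during codegen, as soon as we know the kernel is needed.
        if source not in self._futures:
            self._futures[source] = self._pool.submit(compile_kernel, source)

    def get(self, source: str) -> str:
        # Called when loading the generated code; if codegen already kicked
        # off this compile, we only wait for the result instead of starting it.
        self.kick_off(source)
        return self._futures[source].result()


if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        cache = FutureCache(pool)
        cache.kick_off("kernel_a")    # started early, during "codegen"
        print(cache.get("kernel_a"))  # the later load is a cache hit
```

The parallelism win comes from the gap between `kick_off` and `get`: compilation overlaps with the rest of codegen instead of starting only after all code has been generated.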
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146417. Note: links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending, as of commit c9299aa with merge base 2a55311. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D69123174
@pytorchbot merge -f

❌ 🤖 pytorchbot command failed: Try

@pytorchbot merge -f 'Pending unstable ROCm jobs'

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Hi! I think this PR breaks our code. Here's a reproducer:

```
#!/usr/bin/env python3
import torch
import torch.nn.attention.flex_attention

torch.set_default_device("cuda")

N_CTX = 4096
SLIDING_WINDOW = 128


def sliding_window_causal(b, h, q_idx, kv_idx):
    causal_mask = q_idx >= kv_idx
    window_mask = q_idx - kv_idx < SLIDING_WINDOW
    return causal_mask & window_mask


def rand_qkv(n_batch: int, n_head: int, n_ctx: int, d_qk: int, d_v: int):
    qk_shape = (n_batch, n_head, n_ctx, d_qk)
    v_shape = (n_batch, n_head, n_ctx, d_v)
    return (torch.randn(qk_shape), torch.randn(qk_shape), torch.randn(v_shape))


n_batch = 1
n_head = 1
local_bm = torch.nn.attention.flex_attention.create_block_mask(
    sliding_window_causal, B=None, H=None, Q_LEN=N_CTX, KV_LEN=N_CTX
)
flex_attention = torch.compile(torch.nn.attention.flex_attention.flex_attention)
flex_attention(*rand_qkv(n_batch, n_head, N_CTX, d_qk=16, d_v=16), return_lse=True, block_mask=local_bm)
```

Here is the error we get:

We did find that sometimes the function does get cached and after that we don't see the bug, so you might want to run the reproducer with TORCHINDUCTOR_FORCE_DISABLE_CACHES=1.
Ah hmm, taking a look now!

@saagarjha what's your triton version? Are you using triton from PyTorch's bundle?

@saagarjha I've run this test with Triton version 3.2 (the one pinned to pytorch's nightly bundle) and TORCHINDUCTOR_FORCE_DISABLE_CACHES=1, and cannot reproduce the issue. I also tried triton 3.1, but it still doesn't seem to reproduce. Could you check your triton version for me?

Ah, sorry for not mentioning it. I'm using the latest PyTorch/Triton commits (that is, not the pinned versions).

Ah, it's possible that there are specific fields in JITFunction in bleeding-edge triton that aren't compatible with this PR. I'll look into making the pickling setup a bit less brittle to those cases. As a workaround, you could use the pinned triton version; I'll file an issue around this, as it would block us from updating past the current triton version.

Yep, that sounds reasonable. We're not actually expecting you to track Triton main, so we do have workarounds :) Just thought I'd let you know so you can fix this in advance of bumping your pinned version.
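For context, one way to make the worker handoff robust to new JITFunction fields is to ship plain data rather than the JITFunction object itself. Below is a minimal sketch of that idea; all names are hypothetical and this is not the actual fix, just an illustration of the principle:

```
# Hypothetical sketch: serialize a plain-data compile task, not JITFunction.
import pickle
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class KernelCompileTask:
    # Plain-data description of a compile job; pickling this never touches
    # JITFunction internals, so new attributes in bleeding-edge triton
    # cannot break the main-process-to-worker handoff.
    kernel_name: str
    source_code: str  # source of the @triton.jit function
    compile_kwargs: Dict[str, Any] = field(default_factory=dict)


task = KernelCompileTask(
    kernel_name="my_kernel",
    source_code="def my_kernel(...): ...",
    compile_kwargs={"num_warps": 4},
)
payload = pickle.dumps(task)  # safe to ship to a worker process
```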
Fix a bug introduced by D69123174: because triton kernels are now returned directly by the worker, each future created for a triton kernel should only be used once per compile. Otherwise, a long-running process that does something like this:
```
compiled_1 = torch.compile(fn1, mode="max-autotune", fullgraph=True)
out_compiled = compiled_1(*inputs)  # run compiled_1
compiled_2 = torch.compile(fn2, mode="max-autotune", fullgraph=True)
```
where fn1 and fn2 are very similar (i.e., they would generate the same triton kernel source code) would end up using the launcher from the first autotuning run, setting the launcher to None after running, and then using the same future/kernel again without regenerating the launcher.
Found this bug testing internal inference models.
This does not remove @eellison's caching support for prologue benchmarking, because that happens under the same compile: #143408
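For intuition, here is a minimal sketch of the one-use-per-compile invariant (hypothetical class and method names, not the actual change): consuming the future with pop() means a later compile that generates identical kernel source starts a fresh compilation instead of reusing a kernel whose launcher was already cleared.

```
# Hypothetical sketch of one-shot future consumption (not the real inductor code).
from concurrent.futures import Future
from typing import Dict, Optional


class CompiledTritonKernelCache:
    """In-memory cache of compilation futures, keyed by kernel source."""

    _futures: Dict[str, Future] = {}

    @classmethod
    def put(cls, source: str, future: Future) -> None:
        cls._futures[source] = future

    @classmethod
    def take(cls, source: str) -> Optional[Future]:
        # pop() instead of get(): each future is handed to exactly one
        # compile, so a second compile with identical kernel source must
        # enqueue a fresh compilation and regenerate its own launcher.
        return cls._futures.pop(source, None)
```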
Differential Revision: [D69476856](https://our.internmc.facebook.com/intern/diff/D69476856/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D69476856/)!
Pull Request resolved: #146925
Approved by: https://github.com/laithsakka, https://github.com/jansel
ghstack dependencies: #146417
PyTorch issue: pytorch/pytorch#146945. Functionality in PyTorch that started relying on serializability of `JITFunction`: pytorch/pytorch#146417. I suppose there are different ways to solve this problem, but at least the current lambdas are not necessary and can be easily rewritten. Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @desertfire @chauhang @aakhundov