[inductor] Parallelize Max Autotune step 2: Use all GPUs #107983
Conversation
Summary: Step 2 in revamping subprocess autotune to support multiple GPUs: use a pool of subprocesses and distribute benchmark calls across them.

Test Plan:
`python test/inductor/test_max_autotune.py`
`TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`
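For readers skimming the change, here is a rough sketch of the pool pattern described above — a queue of warm benchmark subprocesses, one per GPU, with each benchmark request checked out to whichever worker is free. The names (`benchmark_all`, the workers' `.benchmark()` method) are hypothetical and not the PR's actual API.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue


def benchmark_all(choices, workers):
    # One warm subprocess wrapper per visible GPU goes into the pool.
    processes: Queue = Queue()
    for w in workers:
        processes.put(w)

    def bench_one(choice):
        proc = processes.get()          # block until some worker is free
        try:
            return choice, proc.benchmark(choice)
        finally:
            processes.put(proc)         # hand the worker back to the pool

    # A thread per in-flight request is fine: each thread mostly waits on a worker.
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        return dict(pool.map(bench_one, choices))
```

Blocking on the queue naturally load-balances across GPUs: fast kernels free their worker sooner, so the next request lands there.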
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/107983. Note: links to docs will display an error until the docs builds have completed. ❗ There is 1 currently active SEV; if your PR is affected, please view it below. ✅ You can merge normally (1 unrelated failure) as of commit ff6a940 with merge base 0f88d93. UNSTABLE: the following job failed but was likely due to flakiness present on trunk and has been marked as unstable.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Looks like you have test failures. We might need to turn this off by default for now, or enable it only during AOT compilation; we can do that in a subsequent PR.
@@ -1,10 +1,15 @@
from __future__ import annotations
This is to support the type hint `Queue[TuningProcess]` in the 3.8 builds, since a subscriptable `Queue` type was apparently only added later.
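As a minimal sketch of why that import matters on 3.8 (the class bodies here are simplified stand-ins, not the real ones):

```python
from __future__ import annotations  # postpone evaluation of annotations

from queue import Queue


class TuningProcess:
    """Stand-in for the real benchmark-subprocess wrapper."""


class TuningProcessPool:
    # Class-level annotations are normally evaluated at import time. On Python 3.8,
    # queue.Queue is not subscriptable yet (Queue[...] arrived in 3.9), so without
    # the __future__ import the line below would raise TypeError. With it, the
    # annotation stays an unevaluated string and the 3.8 build imports cleanly.
    processes: Queue[TuningProcess]
```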
env=env,
)

# register the exit handler for the parent process so it will terminate
This moves to the new TuningProcessPool class below.
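A hedged sketch of the pattern being described — an exit handler owned by the pool rather than registered per subprocess. The class body is simplified and is not the PR's actual `TuningProcessPool`:

```python
import atexit
import subprocess
import sys
from typing import List


class TuningProcessPool:
    """Simplified stand-in for the pool that owns the benchmark subprocesses."""

    def __init__(self, count: int) -> None:
        self.processes: List[subprocess.Popen] = [
            subprocess.Popen(
                [sys.executable, "-c", "import sys; sys.stdin.read()"],
                stdin=subprocess.PIPE,
            )
            for _ in range(count)
        ]
        # Registered on the parent so the children are torn down even if the
        # caller never shuts the pool down explicitly.
        atexit.register(self.terminate)

    def terminate(self) -> None:
        for p in self.processes:
            if p.poll() is None:  # still running
                p.terminate()
                p.wait()
```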
| print(f"{len(choices)} tuning requests:") | ||
|
|
||
| def benchmark_in_current_process(choice): | ||
| if DEBUG: |
The diff looks a little weird here: the debug_str() definition moved up from below, and these were the first few lines of the previous impl. But start_ts was never actually used, so I removed those two lines.
torch.cuda.synchronize()  # shake out any CUDA errors
return result

def benchmark_in_sub_process(choice):
The diff looks a little weird here too; the benchmark_in_sub_process impl is below.
Cool! Tagging @shunting314, who did a lot of the initial impl.
torch/_inductor/autotune_process.py (Outdated)
return

count = count or torch.cuda.device_count()
assert count > 0
We should cap count so that it is no greater than torch.cuda.device_count().
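Something along these lines; the helper name `resolve_pool_size` is illustrative, not from the PR:

```python
from typing import Optional

import torch


def resolve_pool_size(requested: Optional[int] = None) -> int:
    """Decide how many benchmark subprocesses to launch."""
    available = torch.cuda.device_count()
    assert available > 0, "multi-device autotuning needs at least one CUDA device"
    if requested is None:
        return available
    # Never exceed the number of visible devices, and never go below one.
    return max(1, min(requested, available))
```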
Can you also share how much speedup we get by leveraging multiple GPUs? I actually had a similar PR a while back (https://github.com/pytorch/pytorch/pull/96807/files) to extend the sub-process autotuning to multi-process/multi-GPU autotuning. We didn't push forward at that time because people thought a global cache was more promising.
torch/_inductor/select_algorithm.py (Outdated)
def benchmark_in_current_process(options):
    timings = {}
    for choice in options:
nit: should we uniformly call it 'choice' and 'choices'?
@shunting314, sure, I'll change it. That's what I had initially, but as it turns out this is a nested function and the outer method also has a parameter called 'choices', so I didn't love the ambiguity. It would be easy to misspell 'options', for example, and be none the wiser because we'd silently pick up 'choices' from the outer scope. I guess this is one of the reasons I personally don't love nested functions.
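To make the hazard concrete, here is a standalone illustration (not PyTorch code) of how a typo'd name inside a nested function silently resolves to the enclosing scope:

```python
def autotune_select_algorithm(choices):
    def benchmark_in_current_process(options):
        timings = {}
        # Typo: 'choices' instead of 'options'. Python resolves it to the
        # enclosing parameter, so this runs without any error -- just on the
        # wrong variable.
        for choice in choices:
            timings[choice] = 0.0
        return timings

    # Asking to benchmark a subset silently benchmarks *all* choices instead.
    return benchmark_in_current_process(choices[:1])


print(autotune_select_algorithm(["a", "b", "c"]))
# {'a': 0.0, 'b': 0.0, 'c': 0.0}
```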
@shunting314: here's a side-by-side comparison using the hf_Bart benchmark:
… all GPUs" Summary: Step 2 in revamping subprocess autotune to support multiple GPUs: use a pool of subprocesses and distribute benchmark calls across them. Test Plan: `python test/inductor/test_max_autotune.py` `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart` [ghstack-poisoned]
@pytorchbot merge

Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: Command … Details for Dev Infra team: raised by workflow job.
@pytorchbot merge

Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
# Launch the child processes and push a msg to "warm up"
self.processes = Queue()
for device in range(count):
    p = TuningProcess(device=device if config.autotune_multi_device else None)
@masnesral @eellison I believe there is a subtlety here worth revisiting.

Above we iterate over range(count) and pass the device counter variable downstream, which is then set as the value of the CUDA_VISIBLE_DEVICES env variable when launching the subprocess. Now if the user has set CUDA_VISIBLE_DEVICES=3,4,5 for the parent process, torch.cuda.device_count() above will be 3, and we'll initialize the 3 subprocesses with CUDA_VISIBLE_DEVICES=0, =1, and =2. I guess this is not what the user would expect?

In AITemplate, we actually parse the CUDA_VISIBLE_DEVICES env variable (fetched by target.dev_select_flag() here) to detect the user-specified devices. Perhaps we should do something similar here, too? Or maybe there is a torch.cuda.<something> to detect the visible device IDs (I didn't find one from a quick check)?
@aakhundov, ah sure, that sounds like better behavior. Let me work on a follow-up.
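A hedged sketch of what that follow-up could look like; `visible_cuda_devices` is a hypothetical helper, not code from this PR:

```python
import os
from typing import List

import torch


def visible_cuda_devices() -> List[str]:
    """Return the device IDs the parent process is actually allowed to use."""
    env = os.environ.get("CUDA_VISIBLE_DEVICES")
    if env:
        # e.g. "3,4,5" -> ["3", "4", "5"]; respect the user's restriction.
        return [d.strip() for d in env.split(",") if d.strip()]
    return [str(i) for i in range(torch.cuda.device_count())]


# Each subprocess would then be launched with one entry from this list, e.g.
# sub_env["CUDA_VISIBLE_DEVICES"] = device_id, instead of the bare loop counter.
```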
process.put(choice.bmreq)
try:
    return process.get()
Should there be a timeout? Can this get stuck forever otherwise?
@aakhundov, I'm not sure. Do you have a scenario in mind that would lead to a sub-process hanging? I'd love a repro to play around with. I'm new here and my naive assumption was that these kernels are fairly constrained and we wouldn't be concerned about arbitrary code. If the process crashes or exits unexpectedly, then we handle that scenario.
I don't have an exact scenario in mind. Just, given that we're relying on STDIN/STDOUT-based communication, I'm assuming something could go wrong there and lead to indefinitely long waiting. In AIT, we have a timeout for the whole profiling session (which, to be fair, we had to increase a couple of times because e2e profiling took longer than initially expected).
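A minimal sketch of the kind of guard being discussed, assuming the response arrives on a queue-like object; the timeout constant and helper name are illustrative, not from this PR:

```python
import queue

# Illustrative bound; the real value would need tuning, as noted for AIT above.
BENCHMARK_TIMEOUT_SECONDS = 600.0


def get_result_with_timeout(response_queue: "queue.Queue", fallback=float("inf")):
    """Wait for a benchmark result, but never indefinitely."""
    try:
        return response_queue.get(timeout=BENCHMARK_TIMEOUT_SECONDS)
    except queue.Empty:
        # Treat a timeout like a crashed subprocess: return an "infinite" timing
        # so this choice simply loses the autotune instead of hanging compilation.
        return fallback
```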
@pytorchbot revert -m "fbcode failures"

❌ 🤖 pytorchbot command failed.

@pytorchbot revert -m "fbcode failures" -c nosignal

@pytorchbot successfully started a revert job. Check the current status here.

@masnesral your PR has been successfully reverted.
…7983)" This reverts commit 2c61313. Reverted #107983 on behalf of https://github.com/masnesral due to fbcode failures ([comment](#107983 (comment)))
[inductor] Parallelize Max Autotune step 2: Use all GPUs (#107983)

Summary: Step 2 in revamping subprocess autotune to support multiple GPUs: use a pool of subprocesses and distribute benchmark calls across them.

Pull Request resolved: pytorch#107983
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: pytorch#107982
Stack from ghstack (oldest at bottom):
Summary: Step 2 in revamping subprocess autotune to support multiple GPUs: use a pool of subprocesses and distribute benchmark calls across them.
Test Plan:
`python test/inductor/test_max_autotune.py`
`TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`
`TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --inference --only hf_Bart`

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov