[inductor] Set CUDA_VISIBLE_DEVICES for multi-device subprocess autotuning #109500

masnesral · 2023-09-18T15:06:20Z

Stack from ghstack (oldest at bottom):

-> [inductor] Set CUDA_VISIBLE_DEVICES for multi-device subprocess autotuning #109500

Summary: The current parallel autotune implementation sets the CUDA_VISIBLE_DEVICES env var too late -- after the benchmarking subprocess has started -- and the torch libraries don't recognize the change. Since the multiprocessing library doesn't support providing an environment for the subprocess, temporarily set CUDA_VISIBLE_DEVICES in the parent process so that the change is inherited by the subprocess.
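For readers skimming the thread, here is a rough sketch of the "temporarily set in the parent" idea as a context manager. It is not the code from this PR: the name `visible_devices` and the restore logic are assumptions made for illustration, and only the "if device is None, don't manipulate the environment" behavior comes from the diff under review.

```python
import contextlib
import os
from typing import Iterator, Optional

CUDA_VISIBLE_DEVICES = "CUDA_VISIBLE_DEVICES"


@contextlib.contextmanager
def visible_devices(device: Optional[int]) -> Iterator[None]:
    """Temporarily expose only `device` through CUDA_VISIBLE_DEVICES.

    Hypothetical helper: if `device` is None, the environment is left alone.
    """
    if device is None:
        yield
        return

    # Remember the parent's current setting so it can be restored afterwards.
    previous = os.environ.get(CUDA_VISIBLE_DEVICES)
    os.environ[CUDA_VISIBLE_DEVICES] = str(device)
    try:
        # A child process started inside this block inherits the new value.
        yield
    finally:
        if previous is None:
            os.environ.pop(CUDA_VISIBLE_DEVICES, None)
        else:
            os.environ[CUDA_VISIBLE_DEVICES] = previous
```

The subprocess would be started inside the `with visible_devices(i):` block, so the override only lasts for the duration of the process launch and the parent's own setting is back in place afterwards.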

Test Plan:

New unit test to verify the env var is set in the sub-process and fail the benchmark if it's not.
Ran multiprocess autotuning and looked at the output from nvidia-smi pmon to make sure that all GPUs were assigned processes.

Snippet:

    1    3442314     C     2     1     -     -   python
    2    3442318     C     2     1     -     -   python
    3    3442320     C     8     2     -     -   python
    4    3442323     C     9     4     -     -   python
    5    3442325     C    10     4     -     -   python
    6    3442327     C    10     4     -     -   python
    7    3442329     C     2     0     -     -   python
    0    3434906     C     0     0     -     -   python
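A standalone way to sanity-check the inheritance behavior that the test plan relies on might look like the sketch below. It is an illustration only: it launches a plain Python child with `subprocess` rather than the multiprocessing-based tuning process used in this PR, and the helper names `child_visible_devices` and `child_sees_device` are made up.

```python
import os
import subprocess
import sys

ENV_VAR = "CUDA_VISIBLE_DEVICES"


def child_visible_devices() -> str:
    """Start a fresh Python process and return the value it sees ('None' if unset)."""
    code = f"import os; print(os.environ.get({ENV_VAR!r}))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def child_sees_device(device: int) -> bool:
    """Set the variable in the parent, start a child, and compare what it saw."""
    previous = os.environ.get(ENV_VAR)
    os.environ[ENV_VAR] = str(device)
    try:
        # No explicit env is passed, so the child inherits the parent's environment.
        return child_visible_devices() == str(device)
    finally:
        # Put the parent's environment back the way it was.
        if previous is None:
            os.environ.pop(ENV_VAR, None)
        else:
            os.environ[ENV_VAR] = previous
```

Calling `child_sees_device(5)` should return `True` on any machine, with or without GPUs, since it only inspects the environment variable and never touches CUDA.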

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov

pytorch-bot · 2023-09-18T15:06:23Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109500

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e997a24 with merge base d0cc623:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

aakhundov · 2023-09-18T15:36:03Z

torch/_inductor/autotune_process.py

+    specified single device. If device is None, don't manipulate the environment.
+    """
+    if device is None:
+        yield


To double check my understanding: if the user sets CUDA_VISIBLE_DEVICES=7 on the parent python process running torch.compile and multi-device autotuning is disabled, we follow this code path, but the env var value should propagate to the child process. Is this correct?

Yeah, lemme spell out the different scenarios since it might not be completely straightforward and I want to make sure the behavior is what we'd want:

    ##
    ## Multi-device disabled; CUDA_VISIBLE_DEVICES not set --> also unset in (single) child process
    ##
    TORCH_LOGS=+torch._inductor.autotune_process TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python ~/tune.py 2>&1 | grep "Entering TuningProcess child"
    [2023-09-18 09:34:43,449] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = None

    ##
    ## Multi-device disabled; CUDA_VISIBLE_DEVICES=3 --> (single) child process also sees CUDA_VISIBLE_DEVICES=3
    ##
    TORCH_LOGS=+torch._inductor.autotune_process TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 CUDA_VISIBLE_DEVICES=3 python ~/tune.py 2>&1 | grep "Entering TuningProcess child"
    [2023-09-18 09:35:15,198] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 3

    ##
    ## Multi-device ENABLED; CUDA_VISIBLE_DEVICES not set --> Use all devices; one per child process
    ##
    TORCH_LOGS=+torch._inductor.autotune_process TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 python ~/tune.py 2>&1 | grep "Entering TuningProcess child"
    [2023-09-18 09:35:46,227] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 1
    [2023-09-18 09:35:46,301] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 2
    [2023-09-18 09:35:46,443] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 0
    [2023-09-18 09:35:46,552] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 7
    [2023-09-18 09:35:46,604] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 3
    [2023-09-18 09:35:46,623] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 5
    [2023-09-18 09:35:46,657] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 4
    [2023-09-18 09:35:46,659] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 6

    ##
    ## Multi-device ENABLED; CUDA_VISIBLE_DEVICES=0,1,2 --> Use only the specified devices, one per child process
    ##
    TORCH_LOGS=+torch._inductor.autotune_process TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 CUDA_VISIBLE_DEVICES=0,1,2 python ~/tune.py 2>&1 | grep "Entering TuningProcess child"
    [2023-09-18 09:36:17,310] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 1
    [2023-09-18 09:36:17,371] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 2
    [2023-09-18 09:36:17,405] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 0
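The four scenarios above boil down to a small decision table. The sketch below restates it in code, purely as an illustration: `devices_for_children` is a hypothetical function, not the PR's implementation, the device count is passed in rather than queried from torch, and it assumes CUDA_VISIBLE_DEVICES holds plain integer indices (UUID-style entries are not handled).

```python
from typing import List, Optional


def devices_for_children(
    multi_device: bool,
    visible: Optional[str],
    device_count: int,
) -> List[Optional[int]]:
    """Return one entry per child process; None means 'leave the env untouched'."""
    if not multi_device:
        # Single child; it simply inherits whatever the parent has (set or unset).
        return [None]
    if visible is not None:
        # Respect the user's restriction: one child per listed device.
        return [int(d) for d in visible.split(",") if d.strip()]
    # No restriction: one child per device on the machine.
    return list(range(device_count))


# The four scenarios from the comment above, on a hypothetical 8-GPU host.
print(devices_for_children(False, None, 8))
print(devices_for_children(False, "3", 8))
print(devices_for_children(True, None, 8))
print(devices_for_children(True, "0,1,2", 8))
```

The order in which children report their device in the logs can differ from the list order, since the processes start concurrently.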

Thanks for the detailed explanation! This looks correct.

eellison

Looks good!

eellison · 2023-09-18T22:15:05Z

test/inductor/test_max_autotune.py

+
+            tuning_pool.terminate()
+
+    def test_tuning_pool_multiple_devices(self):


You might need a requires_multigpu decorator here

masnesral · 2023-09-18T23:34:58Z

torch/testing/_internal/inductor_utils.py

 from torch._dynamo.backends.registry import register_backend
 from torch._inductor.compile_fx import compile_fx, count_bytes_inner

+import functools


@eellison, mind taking a quick look at the new changes to this file?

masnesral · 2023-09-20T22:27:31Z

@pytorchbot merge

pytorchmergebot · 2023-09-20T22:29:08Z

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team

Raised by workflow job

masnesral · 2023-09-20T22:36:31Z

@pytorchbot merge

pytorchmergebot · 2023-09-20T22:38:22Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2023-09-21T04:36:59Z

The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.

masnesral · 2023-09-21T14:27:42Z

@pytorchbot merge

pytorchmergebot · 2023-09-21T14:29:23Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

github-actions bot added module: inductor ciflow/inductor labels Sep 18, 2023

masnesral mentioned this pull request Sep 18, 2023

[inductor] Parallelize Max Autotune step 2: Use multiple GPUs #109127

Closed

masnesral requested review from aakhundov, eellison and shunting314 September 18, 2023 15:08

aakhundov reviewed Sep 18, 2023

eellison approved these changes Sep 18, 2023

masnesral changed the title ~~[RFC][inductor] Set CUDA_VISIBLE_DEVICES for multi-device subprocess autotuning~~ [inductor] Set CUDA_VISIBLE_DEVICES for multi-device subprocess autotuning Sep 18, 2023

masnesral requested a review from eellison September 18, 2023 23:34

masnesral commented Sep 18, 2023

shunting314 approved these changes Sep 19, 2023

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 20, 2023

pytorchmergebot added the merging label Sep 20, 2023

pytorchmergebot removed the merging label Sep 20, 2023

masnesral added the topic: not user facing topic category label Sep 20, 2023

pytorchmergebot added the merging label Sep 20, 2023

pytorchmergebot added Merged and removed merging labels Sep 21, 2023

pytorchmergebot closed this in 4eada25 Sep 21, 2023

facebook-github-bot deleted the gh/masnesral/14/head branch September 25, 2023 14:25


5 participants
