[inductor] Set CUDA_VISIBLE_DEVICES for multi-device subprocess autotuning by masnesral · Pull Request #109500 · pytorch/pytorch · GitHub

Conversation

@masnesral
Contributor

@masnesral masnesral commented Sep 18, 2023

Stack from ghstack (oldest at bottom):

Summary: The current parallel autotune implementation sets the CUDA_VISIBLE_DEVICES env var too late -- after the benchmarking subprocess has started -- and the torch libraries don't recognize the change. Since the multiprocessing library doesn't support providing an environment for the subprocess, temporarily set CUDA_VISIBLE_DEVICES in the parent process so that the change is inherited by the subprocess.
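
The fix, in rough outline, is a context manager that sets CUDA_VISIBLE_DEVICES in the parent around subprocess creation and restores the previous value afterwards; the subprocess is started inside the `with` block, so the restriction is already in place by the time torch initializes CUDA in the child. A minimal sketch of that pattern (names and details here are illustrative, not necessarily the exact implementation in this PR):

```python
import contextlib
import os
from typing import Optional

ENV_VAR = "CUDA_VISIBLE_DEVICES"  # illustrative constant name


@contextlib.contextmanager
def set_cuda_visible_device(device: Optional[int]):
    """Temporarily restrict the parent's visible devices to a single device.

    Anything spawned inside the `with` block inherits the value. If device is
    None, don't manipulate the environment.
    """
    if device is None:
        yield
        return

    previous = os.environ.get(ENV_VAR)
    os.environ[ENV_VAR] = str(device)
    try:
        yield
    finally:
        # Restore the parent's original environment.
        if previous is None:
            del os.environ[ENV_VAR]
        else:
            os.environ[ENV_VAR] = previous
```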

Test Plan:

  • New unit test that verifies the env var is set in the subprocess and fails the benchmark if it isn't.
  • Ran multiprocess autotuning and checked the output of `nvidia-smi pmon` to make sure that all GPUs were assigned processes.

Snippet:

    1    3442314     C     2     1     -     -   python
    2    3442318     C     2     1     -     -   python
    3    3442320     C     8     2     -     -   python
    4    3442323     C     9     4     -     -   python
    5    3442325     C    10     4     -     -   python
    6    3442327     C    10     4     -     -   python
    7    3442329     C     2     0     -     -   python
    0    3434906     C     0     0     -     -   python

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov

@pytorch-bot

pytorch-bot bot commented Sep 18, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109500

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e997a24 with merge base d0cc623:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

specified single device. If device is None, don't manipulate the environment.
"""
if device is None:
yield
Contributor

To double check my understanding: if the user sets CUDA_VISIBLE_DEVICES=7 on the parent python process running torch.compile and multi-device autotuning is disabled, we follow this code path, but the env var value should propagate to the child process. Is this correct?

Contributor Author

Yeah, lemme spell out the different scenarios since it might not be completely straightforward and I want to make sure the behavior is what we'd want:

##
## Multi-device disabled; CUDA_VISIBLE_DEVICES not set --> also unset in (single) child process
##
TORCH_LOGS=+torch._inductor.autotune_process TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 python ~/tune.py 2>&1 | grep "Entering TuningProcess child"

[2023-09-18 09:34:43,449] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = None

##
## Multi-device disabled; CUDA_VISIBLE_DEVICES=3 --> (single) child process also sees CUDA_VISIBLE_DEVICES=3
##
TORCH_LOGS=+torch._inductor.autotune_process TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 CUDA_VISIBLE_DEVICES=3 python ~/tune.py 2>&1 | grep "Entering TuningProcess child"

[2023-09-18 09:35:15,198] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 3

##
## Multi-device ENABLED; CUDA_VISIBLE_DEVICES not set --> Use all devices; one per child process
##
TORCH_LOGS=+torch._inductor.autotune_process TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 python ~/tune.py 2>&1 | grep "Entering TuningProcess child"

[2023-09-18 09:35:46,227] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 1
[2023-09-18 09:35:46,301] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 2
[2023-09-18 09:35:46,443] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 0
[2023-09-18 09:35:46,552] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 7
[2023-09-18 09:35:46,604] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 3
[2023-09-18 09:35:46,623] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 5
[2023-09-18 09:35:46,657] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 4
[2023-09-18 09:35:46,659] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 6

##
## Multi-device ENABLED; CUDA_VISIBLE_DEVICES=0,1,2 --> Use only the specified devices, one per child process
##
TORCH_LOGS=+torch._inductor.autotune_process TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_AUTOTUNE_MULTI_DEVICE=1 CUDA_VISIBLE_DEVICES=0,1,2 python ~/tune.py 2>&1 | grep "Entering TuningProcess child"

[2023-09-18 09:36:17,310] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 1
[2023-09-18 09:36:17,371] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 2
[2023-09-18 09:36:17,405] torch._inductor.autotune_process: [DEBUG] Entering TuningProcess child. Visible devices = 0
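
To make these scenarios concrete outside of inductor, here is a small standalone sketch (plain Python, not the PR's code) showing that a child started via the multiprocessing spawn context inherits whatever CUDA_VISIBLE_DEVICES the parent has at Process.start() time:

```python
import multiprocessing
import os


def _report_visible_devices(expected, queue):
    # Runs in the child process: report whether it inherited the expected value.
    queue.put(os.environ.get("CUDA_VISIBLE_DEVICES") == expected)


if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")
    for device in ("0", "1", "2"):  # hypothetical device ids
        # Set the variable in the parent *before* starting the child; with
        # spawn there is no env= argument, so inheritance is the only hook.
        os.environ["CUDA_VISIBLE_DEVICES"] = device
        queue = ctx.Queue()
        proc = ctx.Process(target=_report_visible_devices, args=(device, queue))
        proc.start()
        proc.join()
        print(f"device {device} inherited correctly: {queue.get()}")
```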

Contributor

Thanks for the detailed explanation! This looks correct.

Contributor

@eellison eellison left a comment

looks good !


tuning_pool.terminate()

def test_tuning_pool_multiple_devices(self):
Contributor

You might need the requires_multigpu decorator here.
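
For reference, a generic guard of that sort could look like the sketch below; the reviewer is referring to a helper that already exists in the PyTorch test code, so this standalone stand-in is only illustrative:

```python
import unittest

import torch

# Generic stand-in for the requires_multigpu decorator mentioned above:
# skip unless at least two CUDA devices are visible to this process.
requires_multigpu = unittest.skipIf(
    not torch.cuda.is_available() or torch.cuda.device_count() < 2,
    "requires multiple CUDA devices",
)


class TestTuningPool(unittest.TestCase):  # hypothetical test class
    @requires_multigpu
    def test_tuning_pool_multiple_devices(self):
        ...  # exercise subprocess autotuning across all visible devices
```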

@masnesral masnesral changed the title [RFC][inductor] Set CUDA_VISIBLE_DEVICES for multi-device subprocess autotuning [inductor] Set CUDA_VISIBLE_DEVICES for multi-device subprocess autotuning Sep 18, 2023
masnesral added a commit that referenced this pull request Sep 18, 2023
@masnesral masnesral requested a review from eellison September 18, 2023 23:34
from torch._dynamo.backends.registry import register_backend
from torch._inductor.compile_fx import compile_fx, count_bytes_inner

import functools
Contributor Author

@eellison, mind taking a quick look at the new changes to this file?

masnesral added a commit that referenced this pull request Sep 19, 2023
masnesral added a commit that referenced this pull request Sep 20, 2023
@masnesral
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 20, 2023
@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


@masnesral
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

The merge job was canceled. If you believe this is a mistake, then you can re-trigger it through pytorch-bot.

@masnesral
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced debugging: check the merge workflow status here.

@facebook-github-bot facebook-github-bot deleted the gh/masnesral/14/head branch September 25, 2023 14:25