[c10d] Fix extra CUDA context created by barrier by kwen2501 · Pull Request #152834 · pytorch/pytorch · GitHub

Conversation

@kwen2501 (Contributor) commented on May 5, 2025

Fixes #149119.

In `ProcessGroup.hpp`, we create a dummy tensor for dispatching. This requires a correct device index. This PR uses the `device_id` given by the user when calling `init_process_group`.

This PR also uses `torch._C._get_accelerator()` to determine the device type.

ghstack-source-id: 96c32b9565794d995c26bd1794856d1ef7961652
Pull Request resolved: #149144

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k
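
As a rough illustration of the intended behavior (a sketch only; the helper name and the `bound_device_id` parameter below are placeholders, while the actual change spans `barrier()` on the Python side and the dispatch path in `ProcessGroup.hpp`):

import torch

def _device_for_barrier(bound_device_id=None):
    # Detect the accelerator type on the machine; falls back to a CPU device
    # if no accelerator is available.
    device = torch._C._get_accelerator()
    if bound_device_id is not None:
        # Prefer the device the user bound to this rank via
        # init_process_group(device_id=...), so the dummy dispatch tensor does
        # not implicitly create a context on device 0.
        device = bound_device_id
    return device

The device chosen here is what gets assigned to `opts.device` before dispatch, as the review snippets below show.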

@pytorch-bot bot commented on May 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152834

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure, 4 Cancelled Jobs, 1 Unrelated Failure

As of commit 99138ee with merge base 924a247:

NEW FAILURE - The following job has failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels on May 5, 2025
@ngimel (Collaborator) left a comment

Please add a test that no extra contexts are created?
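
One possible shape for such a check (a sketch; it assumes the internal helper `torch._C._cuda_hasPrimaryContext` used by PyTorch's own tests, and the function name here is illustrative):

import torch

def assert_context_only_on(my_index: int):
    # After init_process_group(device_id=...) and a dist.barrier() call, only
    # this rank's device should hold a primary CUDA context; a context on any
    # other device would indicate the dummy dispatch tensor landed there.
    for i in range(torch.cuda.device_count()):
        has_ctx = torch._C._cuda_hasPrimaryContext(i)
        if i == my_index:
            assert has_ctx, f"expected a primary context on cuda:{i}"
        else:
            assert not has_ctx, f"unexpected extra CUDA context on cuda:{i}"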

)
# Detect the accelerator on the machine. If no accelerator is available, it
# returns CPU.
device = torch._C._get_accelerator()
Collaborator:
`_get_accelerator` poisons the context on the current device; to just get the accelerator on the machine, it's better to use `_accelerator_getAccelerator`.

Collaborator:
Per @albanD, `torch.accelerator.current_accelerator()` is also non-poisoning.
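
For reference, a minimal sketch of the non-poisoning query mentioned here (assuming a PyTorch build where `torch.accelerator` is available; it returns `None` when no accelerator is present):

import torch

acc = torch.accelerator.current_accelerator()  # torch.device or None
device = acc if acc is not None else torch.device("cpu")
# Unlike the call above, this query is described in this thread as not
# initializing ("poisoning") a context just to learn the device type.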

# may use default device 0, causing issues like hang or all processes
# creating context on device 0.
opts.device = device
warnings.warn( # warn only once
Collaborator:
Does it actually warn only once by default?
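
For reference, CPython's standard filters do report a given `warnings.warn` call only once per call site in a process, so the comment holds under default filters (a small illustration; test runners may install different filters):

import warnings

def warn_missing_device():
    # Hypothetical message, for illustration only.
    warnings.warn("No device id bound; falling back to the detected accelerator")

warn_missing_device()
warn_missing_device()  # suppressed: the "default" action shows each distinct
                       # (message, category, call site) warning only once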


@atalman merged commit 1214198 into release/2.7 on May 27, 2025
179 of 187 checks passed
@github-actions bot deleted the bcon_2.7 branch on June 27, 2025 02:19
