[C10D] Document barrier interaction with device_id by wconstab · Pull Request #159389 · pytorch/pytorch · GitHub

Conversation


@wconstab wconstab commented Jul 29, 2025

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels on Jul 29, 2025

pytorch-bot bot commented Jul 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159389

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (2 Unrelated Failures)

As of commit 211b052 with merge base 31b3b38:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Jul 29, 2025
Addresses #159262

ghstack-source-id: 04dca5a
Pull Request resolved: #159389
.. note:: `ProcessGroupNCCL` now blocks the CPU thread until the completion of the barrier collective.
.. warning:: `ProcessGroupNCCL` implements barrier as an all_gather of a 1-element tensor. This tensor will be
allocated on the device specified by the 'device_ids' arg if specified, or the device set with `torch.cuda.set_device`.
Contributor

device_ids sounds like a plural, i.e. it could be a list/tuple. Do you want to clarify that it will be allocated on the first (or one of the) devices listed in device_ids?

"the device set with torch.cuda.set_device" - this sounds a bit confusing to me, especially for multithreaded apps, where each thread can have its own default device. Maybe it should say something like "or on the current device, which for a given thread can be queried using torch.cuda.current_device or altered using the torch.cuda.set_device API or the torch.device context manager".
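
For illustration, a minimal sketch (not part of the PR; assumes a torchrun-style launch where LOCAL_RANK is set) of the two ways the barrier device can be chosen under the wording above:

```python
import os
import torch
import torch.distributed as dist

# Assumes a torchrun-style launch (RANK/WORLD_SIZE/LOCAL_RANK env vars set).
local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group(backend="nccl")

# Option 1: pass the device index explicitly; the 1-element tensor used to
# implement barrier is allocated on this device.
dist.barrier(device_ids=[local_rank])

# Option 2: rely on the thread's current CUDA device, which can be queried
# with torch.cuda.current_device() and changed with torch.cuda.set_device().
torch.cuda.set_device(local_rank)
dist.barrier()

dist.destroy_process_group()
```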

Contributor Author

Yeah, to be honest I have no idea what is going on with that argument, specifically why it is a list in the first place.

@kwen2501 do you know?

Contributor Author

Turns out I was incorrect about my understanding of how barrier selects the device; it has been improved since I last saw it. I updated the doc to reflect what I see in ProcessGroupNCCL::guessDeviceId.
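
For reference, a minimal sketch (not from this PR, assuming a torchrun-style launch and that a device bound at init time is honored by the selection logic) of pinning the barrier device explicitly instead of relying on guessing:

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])

# Binding a device to the process group at init time gives ProcessGroupNCCL
# an explicit device to use for collectives such as barrier.
dist.init_process_group(
    backend="nccl",
    device_id=torch.device("cuda", local_rank),
)

dist.barrier()
dist.destroy_process_group()
```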

Member

I think this is a vestige of when we used to support multi-GPU collectives (one process using multiple GPUs), e.g. #85961. But now our main assumption is 1 process = 1 GPU.

Addresses #159262

cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k pragupta

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Jul 31, 2025
Addresses #159262

ghstack-source-id: 424ead1
Pull Request resolved: #159389
Addresses #159262

cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k pragupta

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Jul 31, 2025
Addresses #159262

ghstack-source-id: 76ecd94
Pull Request resolved: #159389
Member

@H-Huang H-Huang left a comment

docs look good!

Contributor

@kwen2501 kwen2501 left a comment

The comment looks good to me!


wconstab commented Aug 1, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label on Aug 1, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.


wconstab commented Aug 1, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@github-actions github-actions bot deleted the gh/wconstab/433/head branch September 1, 2025 02:19