[C10D] Document barrier interaction with device_id #159389
Conversation
Addresses #159262 [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159389
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ You can merge normally! (2 Unrelated Failures) As of commit 211b052 with merge base 31b3b38:
BROKEN TRUNK - The following job failed but was present on the merge base: 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
.. note:: `ProcessGroupNCCL` now blocks the cpu thread till the completion of the barrier collective.
.. warning:: `ProcessGroupNCCL` implements barrier as an all_gather of a 1-element tensor. This tensor will be allocated on the device specified by the 'device_ids' arg if specified, or the device set with `torch.cuda.set_device`.
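Since this note is the crux of the review, a minimal sketch of the documented behavior may help. It assumes a torchrun-style launch (with `LOCAL_RANK` and the rendezvous environment variables set), the NCCL backend, and one GPU per process; the three options and their comments are illustrative, not text from the PR:

```python
# Minimal sketch (not from the PR): ways to control which device
# ProcessGroupNCCL allocates the 1-element barrier tensor on.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # assumes a torchrun-style launch

# 1) Bind the process group to a device at init time; per this PR,
#    barrier() can pick up the bound device.
dist.init_process_group(
    backend="nccl",
    device_id=torch.device(f"cuda:{local_rank}"),
)
dist.barrier()

# 2) Rely on the current CUDA device of the calling process.
torch.cuda.set_device(local_rank)
dist.barrier()

# 3) Pass the device index explicitly for this one call.
dist.barrier(device_ids=[local_rank])

dist.destroy_process_group()
```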
`device_ids` sounds like a plural, i.e. it could be a list/tuple. Do you want to clarify that the tensor will be allocated on the first (or one) of the devices listed in `device_ids`?

"the device set with `torch.cuda.set_device`" sounds a bit confusing to me, especially for multithreaded apps, where each thread can have its own default device. Maybe it should say something like "or on the current device, which for a given thread can be queried using `torch.cuda.current_device` or altered using the `torch.cuda.set_device` API or the `torch.device` context manager".
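To make the suggested wording concrete, here is a small illustration (mine, not from the thread) of the APIs for reading and changing the current device:

```python
import torch

# Query the current CUDA device index for the calling thread.
idx = torch.cuda.current_device()

# Change the current CUDA device for subsequent work.
torch.cuda.set_device(idx)

# Scoped change of the current CUDA device.
with torch.cuda.device(idx):
    t = torch.empty(1, device="cuda")  # lands on device `idx`

# torch.device as a context manager sets the default device for
# factory functions called inside the block (PyTorch 2.0+).
with torch.device(f"cuda:{idx}"):
    u = torch.empty(1)  # created on cuda:<idx>
```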
Yeah, to be honest I have no idea what is going on with that argument. Specifically, why it is a list in the first place.
@kwen2501 do you know?
Turns out I was incorrect about my understanding of how barrier selects the device; it has been improved since I last saw it. I updated the doc to reflect what I see in ProcessGroupNCCL::guessDeviceId.
I think this is a vestige of when we used to support multi-GPU collectives (one process using multiple GPUs), e.g. #85961. But now, our main assumption is 1 process = 1 GPU.
Addresses #159262 cc H-Huang awgu wanchaol fegin fduwjj wz337 d4l3k pragupta [ghstack-poisoned]
docs look good!
The comment looks good to me!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
Addresses #159262
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k @pragupta