# [c10d] Support optional backend if device_id provided #140963
## Conversation
Dr. CI: ✅ No failures as of commit af71423 with merge base ffb9790.
```python
# >>> init_process_group()
# we set it to `undefined` and rely on lazy init.
if backend is None:
    backend = "undefined"
```
What does `backend = "undefined"` do? Is it going to throw an error inside `Backend` below, or does it somehow find a default one later?
It later gets translated into `"cuda:nccl,cpu:gloo"`, IIRC.

I guess I can make it go away? This PR just didn't touch that aspect.
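For context, a minimal sketch (not the actual c10d implementation) of how a `"device:backend"` mapping string like `"cuda:nccl,cpu:gloo"` could be parsed into per-device-type backends:

```python
# Hypothetical parser for a "device:backend,device:backend" string;
# illustrates the mapping format only, not c10d's real resolution logic.
def parse_backend_str(backend_str: str) -> dict:
    mapping = {}
    for pair in backend_str.split(","):
        device_type, backend = pair.split(":")
        mapping[device_type.strip()] = backend.strip()
    return mapping

print(parse_backend_str("cuda:nccl,cpu:gloo"))  # {'cuda': 'nccl', 'cpu': 'gloo'}
```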
```python
backend_list = [UNDEFINED, GLOO, NCCL, UCC, MPI]
```

```python
# 3rd-party devices can register the default backend support here
```
Do you expect users to modify this dict directly, or will we also add register/unregister APIs for third-party devices and backends? There are multiple dicts here, and I guess register/unregister APIs could keep them consistent without users' awareness, and would also be clearer.
We also need to consider supporting the privateuse1 device. cc @shink
Yeah, good idea. We can provide registration APIs, which would ideally be called when a third-party module is imported so that no user involvement is needed. Let me add that in a follow-up PR.
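A hypothetical sketch of what such a registration API could look like; the names (`default_device_backend_map`, `register_default_backend`) are assumptions, not the actual API promised for the follow-up PR:

```python
# Hypothetical registration API; names and shapes are assumptions.
# The real follow-up PR may differ.
default_device_backend_map = {
    "cpu": "gloo",
    "cuda": "nccl",
}

def register_default_backend(device_type: str, backend_name: str) -> None:
    """Let a third-party device module declare its default backend,
    ideally called at import time so users need no extra code."""
    default_device_backend_map[device_type] = backend_name

# A third-party package could then call, from its __init__.py:
# register_default_backend("xpu", "xccl")
```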
LGTM

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Citing @malfet's [comment](pytorch#136343 (review)) in pytorch#136343:

> It would be great if users do not have to modify their programs for every new backend, but rather use `with torch.device('xpu'):` and keep the rest of the code unchanged.

This PR makes the backend specification (`"nccl"`, `"gloo"`) optional when the user provides a `device_id` to `init_process_group` (accepting `device_id` was previously supported for the purpose of eager init).

New user experience:

```python
device = torch.device(device_type, rank % device_count)
dist.init_process_group(device_id=device)
```

The `device = torch.device(...)` line is needed anyway, because the user would use it for tensor creation etc.

Pull Request resolved: pytorch#140963
Approved by: https://github.com/wconstab
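A fuller, runnable sketch of this new user experience, assuming a CUDA machine and the `RANK`/`MASTER_ADDR` environment variables that `torchrun` provides; note that `backend=` is deliberately omitted:

```python
# Launch with: torchrun --nproc-per-node=2 demo.py
# Sketch only; assumes CUDA and torchrun-provided env vars.
import os
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
device = torch.device("cuda", rank % torch.cuda.device_count())

# No backend string: it is inferred from the device (nccl for cuda).
dist.init_process_group(device_id=device)

t = torch.ones(1, device=device)
dist.all_reduce(t)  # sums across ranks
print(f"rank {rank}: {t.item()}")
dist.destroy_process_group()
```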
Stack from ghstack (oldest at bottom):

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @zhangxiaoli73 @Chao1Han @jgong5