API to retrieve default distributed backend from device #140536
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140536
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ No failures as of commit 7741dfa with merge base 740d1eb.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@kwen2501: can you please help with the review? Thanks.
Force-pushed from 46a8fbb to 8c73b47.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased |
Force-pushed from 8c73b47 to 770b809.
Force-pushed from 770b809 to 4b9a106.
@kwen2501: can you please help with the review and approval?
On it, sorry.
Overall looks good to me. To follow up, we can think of a way to register the device's default backend. This is the UX we want to get to:
One easy way to do registration is to add an entry here: pytorch/torch/distributed/distributed_c10d.py, lines 254 to 258 (at 7ced49d).
That is, in your package's … We can later add a formal registration API too. Wdyt?
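For illustration, a minimal sketch of what such an entry could look like from an out-of-tree device package's `__init__.py`; the `"hpu"`/`"hccl"` pair is only a stand-in for whatever device type and backend name the package actually provides:

```python
# Hypothetical out-of-tree package __init__.py -- a sketch, not the actual
# vendor integration. "hpu"/"hccl" are placeholder names.
import torch.distributed as dist

if dist.is_available():
    # Map the device type to its default distributed backend so that generic
    # code can look the backend up instead of hard-coding its name.
    dist.Backend.default_device_backend_map["hpu"] = "hccl"
```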
Overall lgtm. Please see comments above on how this can be integrated with the new c10d capability.
Force-pushed from 3e02e2b to c4e7727.
Thanks @kwen2501 for your comment; your change in #140963 will be helpful. Regarding your question on HPU registration, it is done by calling the …
I have modified the code accordingly to get the backend string directly using: … Let me know your views.
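For context, a minimal sketch of the lookup this enables, assuming the new API is exposed as `torch.distributed.get_default_backend_for_device` (the name referenced in the follow-up commits below):

```python
# A sketch of retrieving the backend string from the device instead of
# hard-coding "nccl"/"gloo"; assumes get_default_backend_for_device accepts
# a device string or torch.device.
import torch
import torch.distributed as dist

device = "cuda" if torch.cuda.is_available() else "cpu"
backend = dist.get_default_backend_for_device(device)
print(f"default distributed backend for {device}: {backend}")  # e.g. "nccl" or "gloo"
```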
Force-pushed from c4e7727 to f33bca2.
Force-pushed from f33bca2 to bd4775d.
@kwen2501: can you please help with the approval? Thanks.
LGTM.
Force-pushed from bd4775d to d66ebaf.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. The first few are: … Dig deeper by viewing the failures on hud.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased |
Force-pushed from d66ebaf to 7741dfa.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
# Motivation
The distributed APIs rely on backend names for the creation of a process group. To abstract out references to these names from PG creation, an API is added to get the default distributed backend for a device. The device code would need to register its device and backend via `torch.distributed.Backend.register_backend`, or update the map with `torch.distributed.Backend.default_device_backend_map["device"] = "distributed_backend"`, prior to using the API. An example of use is added in the test file (which can be used to check the abstracted APIs).

Pull Request resolved: pytorch#140536
Approved by: https://github.com/kwen2501
In this series of PRs we intend to refactor the distributed test cases to be completely device agnostic. These changes include the following approaches:
- Allowing for multiple device types using instantiate_device_type_tests
- Replacing calls to the CUDA stream with torch.get_device_module(device) wherever it applies
- Skipping the setup steps required by MultiProcessTestCase by using DistributedTestBase (#138216) wherever applicable
- Replacing explicit references to distributed backends (NCCL, HCCL, etc.) with get_default_backend_for_device (#140536)

This should result in a significant improvement in usability for all devices.

Pull Request resolved: #145222
Approved by: https://github.com/kwen2501
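A hedged sketch of the device-agnostic pattern this commit message describes; the single-process, env-style rendezvous below is illustrative and not a copy of the actual test refactor:

```python
# Device-agnostic process-group setup: the backend comes from the device type
# instead of a hard-coded "nccl"/"hccl"/"gloo" string. Illustrative only.
import os
import torch
import torch.distributed as dist

def run(device_type: str) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    # Resolve the default backend for this device instead of naming it explicitly.
    backend = dist.get_default_backend_for_device(device_type)
    dist.init_process_group(backend=backend, rank=0, world_size=1)

    # Use the generic device module rather than torch.cuda directly.
    mod = torch.get_device_module(device_type)
    if hasattr(mod, "current_stream"):
        _ = mod.current_stream()

    dist.destroy_process_group()

if __name__ == "__main__":
    run("cuda" if torch.cuda.is_available() else "cpu")
```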
Motivation
The distributed APIs rely on backend names for the creation of a process group.
To abstract out references to these names from PG creation, an API is added to get the default distributed backend for a device.
The device code would need to register its device and backend via
`torch.distributed.Backend.register_backend`, or update the map with `torch.distributed.Backend.default_device_backend_map["device"] = "distributed_backend"`, prior to using the API. An example of use is added in the test file (which can be used to check the abstracted APIs).
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
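The test file itself is not reproduced here, but a hedged sketch of the kind of check it enables might look like the following, assuming the API simply resolves the entry registered in `Backend.default_device_backend_map`:

```python
# Illustrative check that the abstracted API resolves a backend for the
# device types available on the current machine; not the actual test code.
import torch
import torch.distributed as dist

def check_default_backend(device_type: str) -> None:
    backend = dist.get_default_backend_for_device(device_type)
    # The returned string should match the registered default for that device.
    assert backend == dist.Backend.default_device_backend_map[device_type]

check_default_backend("cpu")       # typically "gloo"
if torch.cuda.is_available():
    check_default_backend("cuda")  # typically "nccl"
```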