API to retrieve default distributed backend from device by ankurneog · Pull Request #140536 · pytorch/pytorch

Conversation

@ankurneog

@ankurneog ankurneog commented Nov 13, 2024

Motivation

The distributed APIs rely on backend names for the creation of process groups.
To abstract these names away from PG creation, this PR adds an API to get the default distributed backend for a device.
Device code needs to register its device and backend via torch.distributed.Backend.register_backend, or update the map torch.distributed.Backend.default_device_backend_map["device"] = "distributed_backend", before using the API.

A usage example is added in the test file (which can also be used to check the abstracted APIs); see the sketch below.
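For illustration, a minimal sketch of the intended flow. The API name get_default_backend_for_device and its placement under torch.distributed are assumptions based on later references in this thread, and "hpu"/"hccl" are purely illustrative placeholders.

```python
# Minimal sketch, assuming the new API is exposed as
# torch.distributed.get_default_backend_for_device (name taken from later
# references in this thread); "hpu"/"hccl" are illustrative placeholders.
import torch.distributed as dist

# A device package records its default backend, either through
# Backend.register_backend(...) or by updating the map directly:
dist.Backend.default_device_backend_map["hpu"] = "hccl"

# Process-group creation code no longer needs a hard-coded backend string:
backend = dist.get_default_backend_for_device("hpu")
print(backend)  # -> "hccl"
```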

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot

pytorch-bot bot commented Nov 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140536

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 7741dfa with merge base 740d1eb:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed (Add this issue/PR to distributed oncall triage queue) and release notes: distributed (c10d) (release notes category) labels Nov 13, 2024
@ankurneog
Author

@kwen2501: can you please help with the review? Thanks.

@ezyang ezyang requested a review from kwen2501 November 14, 2024 03:22
@ezyang ezyang added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Nov 14, 2024
@ankurneog
Author

@pytorchbot rebase

@ankurneog ankurneog requested a review from guangyey November 18, 2024 04:05
@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased distributed_api onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout distributed_api && git pull --rebase)

@ankurneog ankurneog changed the title API to retrieve default backend from device API to retrieve default distributed backend from device Nov 19, 2024
@ankurneog
Author

@kwen2501: can you please help with the review and approval?

@kwen2501
Contributor

On it, sorry

@kwen2501
Contributor

kwen2501 commented Nov 19, 2024

Overall looks good to me.
Relatedly, we also want to waive the need for backend specification when calling init_process_group. Here is a PR enabling that: #140963.

To follow up, we can think of a way to register the "hpu": "hccl" mapping with torch c10d. Wdyt?
To kick-start that, it would be nice if you could share 1) where ProcessGroupHCCL is packaged today, and 2) how your users import that package. Ideally, we want the registration to happen automatically during the import so that the user does not need to get involved.

This is the UX we want to get to:

device = torch.device("hpu", rank % device_count)
dist.init_process_group(device_id=device)
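A slightly fuller, hedged version of that sketch might look as follows; the RANK environment variable, the device-count lookup, and the backend-free init_process_group call (which relies on #140963) are illustrative assumptions rather than part of this PR.

```python
# Hedged sketch of the target UX; "hpu" stands in for any accelerator whose
# backend has been registered with c10d, and the rank/device-count plumbing
# is illustrative only.
import os
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
device_count = torch.get_device_module("hpu").device_count()  # assumes the hpu module is registered

device = torch.device("hpu", rank % device_count)
dist.init_process_group(device_id=device)  # backend inferred from the device (per #140963)
```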

@kwen2501
Contributor

kwen2501 commented Nov 19, 2024

One easy way to do registration is to add an entry here:

    # 3rd-party devices can register the default backend support here
    default_device_backend_map: Dict[str, str] = {
        "cpu": GLOO,
        "cuda": NCCL,
    }

That is, in your package's __init__.py, you can add a line like this:

    torch.distributed.Backend.default_device_backend_map["hpu"] = "hccl"

We can later add a formal registration API too. wdyt?

Contributor

@kwen2501 kwen2501 left a comment

Overall LGTM. Please see the comments above on how this can be integrated with the new c10d capability.

@ankurneog ankurneog force-pushed the distributed_api branch 2 times, most recently from 3e02e2b to c4e7727 November 20, 2024 04:13
@ankurneog
Author

ankurneog commented Nov 20, 2024

Thanks @kwen2501 for your comment; your change in #140963 will be helpful.

Regarding your question on HPU registration, it is done by calling the register_backend API as follows:

    torch.distributed.Backend.register_backend("hccl", _create_process_group_hccl, devices=["hpu"], extended_api=True)

register_backend ensures that the hpu-to-hccl mapping is recorded:

    Backend.default_device_backend_map[device] = name.lower()

I have modified the code accordingly to get the backend string directly using:

    Backend.default_device_backend_map.get(device_str)

Let me know your views.
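Putting the two pieces together, a hedged sketch of the registration-plus-retrieval flow, with a hypothetical stub in place of the real ProcessGroupHCCL constructor and assuming the new API is exposed as torch.distributed.get_default_backend_for_device:

```python
import torch.distributed as dist

def _create_process_group_hccl(*args, **kwargs):
    # Placeholder: the real device package constructs and returns a
    # ProcessGroup here.
    raise NotImplementedError

# Registering the backend also records the "hpu" -> "hccl" default mapping.
dist.Backend.register_backend(
    "hccl", _create_process_group_hccl, devices=["hpu"], extended_api=True
)
assert dist.Backend.default_device_backend_map["hpu"] == "hccl"

# The new API reads that mapping back, so callers never spell out "hccl".
print(dist.get_default_backend_for_device("hpu"))  # -> "hccl"
```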

@ankurneog
Author

@kwen2501: can you please help with the approval? Thanks.

Contributor

@kwen2501 kwen2501 left a comment

LGTM.

@guangyey
Collaborator

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@guangyey
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased distributed_api onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout distributed_api && git pull --rebase)

@ankurneog
Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@ankurneog ankurneog deleted the distributed_api branch November 22, 2024 11:03
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
# Motivation
The distributed APIs rely on backend names for the creation of process groups.
To abstract these names away from PG creation, an API is added to get the default distributed backend for a device.
Device code needs to register its device and backend via ```torch.distributed.Backend.register_backend```, or update the map ```torch.distributed.Backend.default_device_backend_map["device"] = "distributed_backend"```, before using the API.

A usage example is added in the test file (which can also be used to check the abstracted APIs).

Pull Request resolved: pytorch#140536
Approved by: https://github.com/kwen2501
pytorchmergebot pushed a commit that referenced this pull request Feb 5, 2025
In this series of PRs we intend to refactor the distributed test cases to make them completely device agnostic.

These changes will include the following approaches:

- Allowing for multiple device types using instantiate_device_type_test
- Replacing calls to the CUDA stream with torch.get_device_module(device) wherever it applies
- Skipping setup steps required when using MultiProcessTestCase with DistributedTestBase (#138216) wherever applicable
- Replacing explicit calls to a distributed backend (NCCL, HCCL, etc.) with get_default_backend_for_device (#140536), as sketched below

This should result in a significant improvement in usability for all devices.

Pull Request resolved: #145222
Approved by: https://github.com/kwen2501
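For context, a hedged sketch of the device-agnostic pattern this refactor targets; the helper names come from the references above (DistributedTestBase from #138216, get_default_backend_for_device from this PR), and the surrounding plumbing is illustrative only.

```python
# Sketch only: replaces hard-coded "nccl"/"hccl" strings and direct torch.cuda
# calls with device-agnostic lookups, as described in the bullet list above.
# Assumes MASTER_ADDR/MASTER_PORT are already set in the environment.
import torch
import torch.distributed as dist

def init_for_device(device_type: str, rank: int, world_size: int) -> None:
    # Look up the backend for this device instead of hard-coding it.
    backend = dist.get_default_backend_for_device(device_type)
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)

    # Use the generic device module instead of torch.cuda directly.
    device_module = torch.get_device_module(device_type)
    device_module.set_device(rank % device_module.device_count())
```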
mori360 pushed a commit to mori360/pytorch that referenced this pull request Feb 6, 2025
…ch#145222)

In this series of PRs we intend to refactor the distributed test cases to make them completely device agnostic.

These changes will include the following approaches:

- Allowing for multiple device types using instantiate_device_type_test
- Replacing calls to the CUDA stream with torch.get_device_module(device) wherever it applies
- Skipping setup steps required when using MultiProcessTestCase with DistributedTestBase (pytorch#138216) wherever applicable
- Replacing explicit calls to a distributed backend (NCCL, HCCL, etc.) with get_default_backend_for_device (pytorch#140536)

This should result in a significant improvement in usability for all devices.

Pull Request resolved: pytorch#145222
Approved by: https://github.com/kwen2501

Labels

- ciflow/trunk (Trigger trunk jobs on your pull request)
- Merged
- oncall: distributed (Add this issue/PR to distributed oncall triage queue)
- open source
- release notes: distributed (c10d) (release notes category)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

6 participants