Detect accelerator type when backend is not specified #142216

kwen2501 · 2024-12-06T03:48:39Z

Stack from ghstack (oldest at bottom):

-> Detect accelerator type when backend is not specified #142216

Today, when a user calls `init_process_group()` without specifying `backend` or `device_id`, we auto-translate the call into `cuda:nccl,cpu:gloo`. The idea was to initialize all **default** backends to cover whatever the user might do later.

A side effect is increased initialization time and resource usage.

This PR changes that behavior: it detects the accelerator type on the machine and initializes only the backend for that accelerator.
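
To make the change concrete, here is a minimal sketch of the before and after behavior in plain Python. It is not the code in `distributed_c10d.py`: the function names, the `detected_accelerator` argument, and the fallback to `cpu` when nothing is detected are assumptions made for illustration. Only the `cuda:nccl,cpu:gloo` pairing comes from the description above.

```python
from typing import Dict, Optional

# Default device -> backend pairs named in the PR description.
DEFAULT_DEVICE_BACKEND_MAP: Dict[str, str] = {"cuda": "nccl", "cpu": "gloo"}


def old_default_backend_config() -> str:
    """Previous behavior: register every default backend up front."""
    return ",".join(
        f"{device}:{backend}" for device, backend in DEFAULT_DEVICE_BACKEND_MAP.items()
    )


def new_default_backend_config(detected_accelerator: Optional[str]) -> str:
    """Behavior described by this PR: register only the detected accelerator's backend.

    `detected_accelerator` is a hypothetical stand-in for whatever detection the
    real code performs. Falling back to "cpu" when it is None is an assumption
    of this sketch, not something the PR description states.
    """
    device = detected_accelerator or "cpu"
    try:
        backend = DEFAULT_DEVICE_BACKEND_MAP[device]
    except KeyError:
        raise ValueError(
            f"no default backend known for device type {device!r}"
        ) from None
    return f"{device}:{backend}"


if __name__ == "__main__":
    print(old_default_backend_config())
    print(new_default_backend_config("cuda"))
    print(new_default_backend_config(None))
```

Under these assumptions, a machine reporting `cuda` ends up with `cuda:nccl` only, while the old path always produced `cuda:nccl,cpu:gloo` regardless of the hardware present.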

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

[ghstack-poisoned]

pytorch-bot · 2024-12-06T03:48:42Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142216

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit bf1aec7 with merge base 61dc5e9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: f33839e Pull Request resolved: #142216

kwen2501 · 2024-12-06T07:06:52Z

@pytorchbot merge

pytorchmergebot · 2024-12-06T07:08:40Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Today, when a user calls `init_process_group()` without specifying `backend` or `device_id`, we auto-translate the call into `cuda:nccl,cpu:gloo`. The idea was to initialize all **default** backends to cover whatever the user might do later. A side effect is increased initialization time and resource usage. This PR changes that behavior: it detects the accelerator type on the machine and initializes only the backend for that accelerator. Pull Request resolved: pytorch#142216 Approved by: https://github.com/wconstab, https://github.com/XilunWu

inconsistent with the logic introduced in #162157 and modified in #142216. This update ensures the documentation matches the actual behavior of the code. Pull Request resolved: #162158 Approved by: https://github.com/wconstab


Detect accelerator type when backend is not specified

bf1aec7

[ghstack-poisoned]

kwen2501 added a commit that referenced this pull request Dec 6, 2024

Detect accelerator type when backend is not specified

76d5958

ghstack-source-id: f33839e Pull Request resolved: #142216

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Dec 6, 2024

kwen2501 requested review from H-Huang, wconstab and wz337 December 6, 2024 03:52

wconstab approved these changes Dec 6, 2024

XilunWu approved these changes Dec 6, 2024

pytorch-bot bot added the ciflow/trunk label Dec 6, 2024

pytorchmergebot added the merging label Dec 6, 2024

pytorchmergebot added the Merged label Dec 6, 2024

pytorchmergebot closed this in cc64ad6 Dec 6, 2024

pytorchmergebot removed the merging label Dec 6, 2024

huydhn mentioned this pull request Dec 9, 2024

DISABLED test_init_process_group (__main__.DeviceMeshTest) #142361

Closed

kwen2501 mentioned this pull request Dec 9, 2024

[c10d] Update backend arg documentation #142404

Closed

kwen2501 added a commit that referenced this pull request Dec 9, 2024

Update on "[c10d] Update backend arg documentation"

50987f8

Update doc to reflect change brought by #142216 cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]

kwen2501 added a commit that referenced this pull request Dec 9, 2024

Update on "[c10d] Update backend arg documentation"

cbf4131

Update doc to reflect change brought by #142216 cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]

pytorchmergebot pushed a commit that referenced this pull request Dec 9, 2024

[c10d] Update backend arg documentation (#142404)

452e1a7

Update doc to reflect change brought by #142216 Pull Request resolved: #142404 Approved by: https://github.com/XilunWu

github-actions bot deleted the gh/kwen2501/111/head branch January 6, 2025 02:08

haochen-shen mentioned this pull request Sep 4, 2025

The code comments are inconsistent with the source code logic in distributed_c10d.py #162157

Open

Codeboi007 mentioned this pull request Sep 4, 2025

Fixed comment to match logic in distributed_c10d.py #162158

Closed
