[c10d] Remove deprecated multi-gpu-per-thread APIs by kwen2501 · Pull Request #114156 · pytorch/pytorch · GitHub

Conversation

@kwen2501
Contributor

@kwen2501 kwen2501 commented Nov 20, 2023

As of today, PyTorch Distributed's preferred programming model is one device per thread, as exemplified by the APIs in its documentation. The multi-GPU functions (which assumed multiple GPUs per CPU thread) have been deprecated for three releases. This PR removes them ahead of the 2.2 release.
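The one-device-per-thread model the PR description refers to can be sketched as follows. This is a minimal, hedged illustration: `all_reduce_multigpu` is used here as a representative of the removed multi-GPU functions, and the single-process `gloo` setup (with assumed `MASTER_ADDR`/`MASTER_PORT` values) exists only to make the snippet runnable on CPU; in real use each rank is its own process driving its own device.

```python
import os
import torch
import torch.distributed as dist

# Single-process CPU setup purely for illustration; real jobs launch one
# process per device (e.g. via torchrun) with rank/world_size set for each.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

# Preferred model: one tensor (one device) per process.
# The removed style passed a list of tensors, one per local GPU,
# e.g. all_reduce_multigpu([t0, t1, ...]).
t = torch.ones(4)
dist.all_reduce(t)  # in-place sum across ranks; with world_size == 1, a no-op
print(t.tolist())

dist.destroy_process_group()
```

With more than one rank, each process would contribute its own single tensor and receive the reduced result in place.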

cc @ezyang @gchanan

@kwen2501 kwen2501 requested a review from albanD as a code owner November 20, 2023 19:25
@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Nov 20, 2023
@pytorch-bot

pytorch-bot bot commented Nov 20, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/114156

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ef380df with merge base 140c54e (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Collaborator

@albanD albanD left a comment


Public API doc change sounds good to me.

Member

@H-Huang H-Huang left a comment


Looks good. I think there is a lot of follow-up work that can be done (BE?).

We can probably comb through many of the util functions, find which ones assume multiple GPUs, and remove that logic to simplify our code. For example, in ProcessGroupNCCL, getDeviceList() only makes sense for multiple GPUs, and we have a lot of now-unnecessary logic that loops through devices.

seq_++;

// Currently, the API permits two scenarios where inputs.size() and
// Currently, the API permits one scenario where inputs.size() and
Member


Should we create a follow-up task to remove the vector arguments from all collectives (e.g. std::vector&lt;at::Tensor&gt;&amp; to at::Tensor&amp;)? The only reason the vector form was added was for the multi-GPU collectives, right?

Contributor Author


There are also the _coalesced Python APIs (the "one scenario" left that is referred to here). And yes, we'd need to think of a strategic way to remove them.

@kwen2501
Contributor Author

@H-Huang correct, there is a lot of Better Engineering work to do. This PR just removes the user-facing APIs, as a starting point. Next we could go through the backend implementation in ProcessGroupNCCL.cpp.

@kwen2501 kwen2501 added the suppress-bc-linter Suppresses the failures of API backward-compatibility linter (Lint/bc_linter) label Nov 20, 2023
@kwen2501
Contributor Author

Adding the suppress-bc-linter label because the BC break here is intentional.

nGPUs = torch.cuda.device_count()
visible_devices = range(nGPUs)

if backend == "nccl":
Contributor


Are we going to remove init_multigpu_helper completely?

Contributor Author


Likely not. It is used in a lot of places (even non-distributed tests), so I am leaving it in place; other tests may need it for other purposes.
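For context on what this helper does, the diff snippet above suggests it partitions the visible GPUs among the ranks. A minimal pure-Python sketch of that rank-to-device partitioning is below; the signature is hypothetical (the real helper derives the GPU count from torch.cuda.device_count() and takes the backend into account), so treat it as an illustration of the shape of the mapping, not the actual implementation.

```python
def init_multigpu_helper(world_size: int, n_gpus: int) -> dict:
    """Partition visible GPU ids evenly across ranks (hypothetical signature).

    Returns a mapping {rank: [gpu ids that rank may use]}.
    """
    visible_devices = list(range(n_gpus))
    per_rank = max(n_gpus // world_size, 1)
    return {
        rank: visible_devices[rank * per_rank:(rank + 1) * per_rank]
        for rank in range(world_size)
    }

print(init_multigpu_helper(world_size=2, n_gpus=4))  # {0: [0, 1], 1: [2, 3]}
```

Under the one-device-per-thread model, each rank would simply use the first (and typically only) entry of its list.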

Contributor

@fduwjj fduwjj left a comment


LGTM

@kwen2501
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 21, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@kit1980
Contributor

kit1980 commented Nov 30, 2023

This has caused multiple issues in the Meta-internal pipelines. In general, we should make sure important internal usages are updated before removal; this can be done with help from https://github.com/pytorch-labs/torchfix

@kit1980 kit1980 added module: bc-breaking Related to a BC-breaking change topic: bc breaking topic category topic: bc_breaking labels Nov 30, 2023
@github-actions github-actions bot deleted the remove_multigpu_apis branch February 19, 2024 01:59