Opportunistically use `ncclCommSplit` when creating new NCCL groups by chipturner · Pull Request #112889 · pytorch/pytorch

Conversation

@chipturner
Contributor

Currently `ncclCommInitRankConfig` is always used when creating new
communicator groups. This is wasteful: it creates non-shared pairs of
endpoint queues and costs time to re-establish communication.

This change is transparent and opportunistic; when `dist.new_group` is
called, it will use the existing, healthy world process group to
select the right ranks to include in the new process group.
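
For reference, a minimal sketch of what this looks like from the user's side (assuming a multi-process NCCL job launched with `torchrun`; the `LOCAL_RANK` handling below is the usual launcher convention, not something this PR adds). The point is that the call site does not change; the split happens, when possible, inside the backend:

```python
import os
import torch
import torch.distributed as dist

# Standard NCCL setup under torchrun; nothing here is specific to this change.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Same new_group call as before; when the backend can, it splits the existing
# world communicator instead of bootstrapping a brand-new one.
pg = dist.new_group(ranks=list(range(dist.get_world_size())))
```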

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Nov 3, 2023
@pytorch-bot

pytorch-bot bot commented Nov 3, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112889

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d90c747 with merge base 4b7f9fa:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla

linux-foundation-easycla bot commented Nov 3, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: chipturner / name: Chip Turner (d90c747)

@chipturner chipturner marked this pull request as ready for review November 3, 2023 19:22
@chipturner chipturner force-pushed the commsplit branch 2 times, most recently from 0ea2d4b to 0ceb454 Compare November 6, 2023 02:06
Comment on lines 1201 to 1203
Contributor

Synchronized between? Can the comment specify it?

Contributor Author

whoops, we don't even need this anymore. removing.

@kwen2501
Contributor

kwen2501 commented Nov 6, 2023

I wonder if we would eventually need to expose the "color" via one or some of the new_group API family (new_group, new_subgroups, etc.).

new_subgroups might be a nice candidate, because its current definition (it creates intra-machine subgroups, each containing all the ranks of a machine, based on the assumption that each machine has the same number of devices) is unsound. In general, we cannot assume two ranks are on the same machine without doing detection, and detection at the PyTorch level (especially in Python) would be too non-abstract.
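
For illustration only, a hypothetical shape such an extension could take. `new_subgroups(group_size=...)` is the existing API; the `color` keyword below does not exist today and is purely a sketch of the idea being discussed:

```python
import torch.distributed as dist

# Existing API: membership is derived from group_size, which assumes every
# machine hosts the same number of ranks.
cur_subgroup, all_subgroups = dist.new_subgroups(group_size=8)

# Hypothetical extension (NOT a real keyword today): ranks sharing the same
# "color" value would land in the same subgroup, mirroring ncclCommSplit.
# cur_subgroup, all_subgroups = dist.new_subgroups(color=my_machine_id)
```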

@chipturner chipturner force-pushed the commsplit branch 7 times, most recently from c815b98 to b0adac9 Compare November 13, 2023 21:07
Contributor

@kwen2501 kwen2501 left a comment

LGTM.
I just have 3 minor comments.

@kwen2501
Contributor

I resolved the last three comments. I believe we are good to go. Thanks!

@chipturner chipturner force-pushed the commsplit branch 2 times, most recently from 7fcf3cd to be8a570 Compare November 20, 2023 20:26
Currently `ncclCommInitRankConfig` is always used when creating new
communicator groups. This is wasteful: it creates non-shared pairs of
endpoint queues and costs time to re-establish communication.

This change is transparent and opportunistic; when `dist.new_group` is
called and the new group includes all ranks, it will use the existing,
healthy world process group to select the right ranks to include in
the process group.
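
As a rough sketch of the eligibility condition described above (placeholder names; the real check lives inside torch.distributed's c10d internals):

```python
import torch.distributed as dist

def _eligible_for_comm_split(ranks):
    # Only a NCCL-backed job with an initialized default group is a candidate,
    # and the requested group must cover every rank of the world.
    if not (dist.is_initialized() and dist.get_backend() == "nccl"):
        return False
    return sorted(ranks) == list(range(dist.get_world_size()))
```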
@chipturner
Contributor Author

@pytorchmergebot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 21, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@chipturner chipturner deleted the commsplit branch November 21, 2023 21:59
@huydhn
Contributor

huydhn commented Nov 22, 2023

@pytorchbot revert -m 'Sorry for reverting you change, but it is failing ROCm distributed jobs in trunk https://hud.pytorch.org/pytorch/pytorch/commit/4d07428edee863e7f5920f0672957a9711a9f0b5' -c nosignal

The error is 'torch._C._distributed_c10d.Options' object has no attribute 'split_from'. Please take a look and reland the change. I will add ciflow/periodic to run the test on your pull request.

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@chipturner your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Nov 22, 2023
…groups (#112889)"

This reverts commit 64a5372.

Reverted #112889 on behalf of https://github.com/huydhn due to Sorry for reverting you change, but it is failing ROCm distributed jobs in trunk https://hud.pytorch.org/pytorch/pytorch/commit/4d07428edee863e7f5920f0672957a9711a9f0b5 ([comment](#112889 (comment)))
@pytorchmergebot
Collaborator

Reverting PR 112889 failed

Reason: HTTP Error 422: Unprocessable Entity

Details for Dev Infra team (raised by workflow job)

chipturner added a commit to chipturner/pytorch that referenced this pull request Nov 22, 2023
…w NCCL groups (pytorch#112889)

Currently `ncclCommInitRankConfig` is always used when creating new
communicator groups. This is wasteful: it creates non-shared pairs of
endpoint queues and costs time to re-establish communication.

This change is transparent and opportunistic; when `dist.new_group` is
called, it will use the existing, healthy world process group to
select the right ranks to include in the process group.

Original PR: pytorch#112889

Reverted due to ROCm incompatibility
pytorchmergebot pushed a commit that referenced this pull request Nov 23, 2023
…NCCL groups) (#114385)

- [c10d] (retry) Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889)
- Guard use of `split_from` with a `hasattr` check for cases when NCCL (or RCCL) lacks `ncclCommSplit`

Fixes cause of revert of original PR

Pull Request resolved: #114385
Approved by: https://github.com/huydhn
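A minimal sketch of that guard (the surrounding variable names are placeholders, not the PR's actual code):

```python
# If the NCCL/RCCL that PyTorch was built against predates ncclCommSplit, the
# backend options object exposes no split_from attribute, so fall back to the
# regular ncclCommInitRankConfig path instead of raising AttributeError.
if parent_backend is not None and hasattr(backend_options, "split_from"):
    backend_options.split_from = parent_backend  # reuse the existing communicator
# else: create the communicator the old way, unchanged
```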
pytorchmergebot pushed a commit that referenced this pull request Nov 23, 2023
…114423)

When reopening a reverted PR, `422: Unprocessable Entity` is returned when the head branch has been deleted, for example #112889 (comment)

```
{
  "message": "Validation Failed",
  "errors": [
    {
      "resource": "PullRequest",
      "code": "custom",
      "field": "state",
      "message": "state cannot be changed. The commsplit branch has been deleted."
    }
  ],
  "documentation_url": "https://docs.github.com/rest/pulls/pulls#update-a-pull-request"
}
```

The revert itself still happens; only reopening the PR fails. I think that is ok to ignore in this case, rather than going the complicated route of having the merge bot try to restore the deleted branch.
Pull Request resolved: #114423
Approved by: https://github.com/malfet, https://github.com/kit1980
@mengpenghui

@chipturner Can you provide a pytorch test case to quantify the performance improvement brought by comm_split?
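
One rough way to quantify it (a sketch, not an official benchmark from this PR): launch with `torchrun` and compare this timing across builds or versions with and without the split path.

```python
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # torchrun sets LOCAL_RANK
dist.barrier()

start = time.perf_counter()
pg = dist.new_group(ranks=list(range(dist.get_world_size())))
dist.barrier(group=pg)  # force communicator creation; NCCL comms are otherwise lazy
elapsed = time.perf_counter() - start

if dist.get_rank() == 0:
    print(f"new_group + first collective took {elapsed:.3f}s")
```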


Labels

ciflow/trunk (Trigger trunk jobs on your pull request) · Merged · release notes: distributed (c10d) · Reverted
