[c10d] allow sub group to be eagerly inited even if default one is not #138665
Conversation
Summary: Currently, eager mode is applied either to all PGs or none of them. There are cases where we don't want to initialize the comms for the default PG but still want to initialize the comms for a sub PG. With a device_id passed to new_group, this case can now be achieved. Test Plan: newly added UT. [ghstack-poisoned]
✅ No failures as of commit 2b515fd with merge base 8aedc64. Test artifacts and rendered results: hud.pytorch.org/pr/138665
ghstack-source-id: b4e77ef. Pull Request resolved: #138665
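The behavior described in the summary can be sketched as a small decision model in plain Python. The function names below are illustrative only, not the actual c10d internals:

```python
# Toy model of the eager-init decision this PR introduces: a subgroup
# eagerly initializes its comms iff a device is bound to it, either passed
# directly to new_group or inherited from the default PG.

def resolve_bound_device(new_group_device_id, default_pg_device_id):
    """An explicit device_id wins; otherwise the subgroup inherits
    whatever the default PG was bound to (which may be None)."""
    if new_group_device_id is not None:
        return new_group_device_id
    return default_pg_device_id


def should_eager_init(bound_device_id):
    # Eager connect happens only when a concrete device is known.
    return bound_device_id is not None


# The case this PR enables: default PG stays lazy, subgroup goes eager.
assert should_eager_init(resolve_bound_device("cuda:0", None))
assert not should_eager_init(resolve_bound_device(None, None))
```

Before this change, the second assertion was the only possible outcome for a subgroup whenever the default PG was created without a device_id.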
I agree with the addition of
Does this PR need to depend on #138518?
Yes. Otherwise, eager mode is coupled with the split logic: if a subgroup is eagerly inited, it has to use split, which requires the default PG to also be eagerly inited, and that is against the intention of this PR.
For the purpose of this PR, let's keep it consistent with the naming of the init_process_group API? We could rename all of them in other PRs.
A subgroup can be eagerly inited the same way as a default group is eagerly inited, without using split. I wonder if we could use that to implement this PR here? That is:
Once added, this is a public argument, so there will be deprecation consequences if we'd like to change it in the future. But again, I am okay with either name.
Okay, let's do that.
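The trade-off discussed above (split vs. direct eager init of a subgroup) can be sketched in plain Python. The names here are hypothetical, not the real implementation:

```python
def init_subgroup_comms(default_pg_eager: bool, use_split: bool) -> str:
    """Contrast the two ways a subgroup's NCCL comms can come up."""
    if use_split:
        # ncclCommSplit path: the parent communicator must already exist,
        # so the default PG has to be eagerly inited first.
        if not default_pg_eager:
            raise RuntimeError("split requires an eagerly inited default PG")
        return "split"
    # Direct path (what this PR enables): the subgroup initializes its own
    # communicator, and the default PG can stay lazy.
    return "direct"


# The intended new behavior: eager subgroup, lazy default PG.
assert init_subgroup_comms(default_pg_eager=False, use_split=False) == "direct"
```

This is why the PR depends on decoupling eager init from the split logic: with the split path alone, the `RuntimeError` branch above would be unavoidable.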
torch/_C/_distributed_c10d.pyi (outdated)
```python
def rank(self) -> int: ...
def size(self) -> int: ...
def eager_connect_single_device(self, device: torch.device | None) -> None: ...
def is_initialized(self) -> bool: ...
```
What is this new method for? If it is for users, we can expose it; if it is just for testing, I think we should defer it.
It was intended for both, actually. It's a safer way to check whether a PG is fully initialized and ready to split. Right now we only allow splitting from an eagerly inited PG, but users can actually split a new PG from a non-eagerly-inited PG if its is_initialized is true.
```
device_id (torch.device, optional): a single, specific device
    to "bind" this process to, allowing for backend-specific
    optimizations. Only under NCCL: the communicator is immediately formed
    (calling ``ncclCommInit*`` immediately rather than the normal lazy
    call).
```
This piece of documentation seems to come from init_process_group. I wonder if we could update it now that our view of device_id is clearer? For example:

> a single, specific device to "bind" this process to. The `new_group` call will try to initialize a communication backend for the device if this field is given.
```python
if device_id is None:
    device_id = default_pg.bound_device_id
```
The current logic seems fine. Shall we also add a check here to make sure the given device_id is the same as default_pg.bound_device_id when both are not None?
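The suggested guard might look like the sketch below; `check_bound_device` is a hypothetical helper, not code from the PR:

```python
def check_bound_device(device_id, default_bound_device_id):
    """If the caller passes a device_id to new_group and the default PG
    is already bound to a device, the two must agree; otherwise fall back
    to whichever one is set (possibly None)."""
    if device_id is not None and default_bound_device_id is not None:
        if device_id != default_bound_device_id:
            raise ValueError(
                f"device_id {device_id!r} does not match the default PG's "
                f"bound device {default_bound_device_id!r}"
            )
    return device_id if device_id is not None else default_bound_device_id
```

Raising early here would turn a silent mismatch (two comms bound to different devices) into an immediate, debuggable error.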
```cpp
// whether the backend is fully initialized, e.g., for NCCL, if the NCCL comms
// are fully initialized and ready to use.
virtual bool isInitialized() {
  return false;
}
```
Could it be a bit unsafe to assume a default value here?
Maybe we should throw an unimplemented error here and force backends to implement it? But that would induce more work, including contacting third-party backends to implement this.
If this is a problem, we could move this interface to the NCCL backend only, similar to other APIs such as bound_device_id or abort.
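The design question above can be illustrated with a Python toy (the real code is C++): a base class that conservatively reports false versus a backend that overrides it once its comms are ready.

```python
class BackendBase:
    # Mirrors the default in the diff: backends that don't override this
    # report "not initialized". Safe for lazy backends, but an eagerly
    # connecting third-party backend would be misreported until it
    # provides its own override.
    def is_initialized(self) -> bool:
        return False


class ToyNCCLBackend(BackendBase):
    def __init__(self):
        self._comms_ready = False

    def eager_connect_single_device(self):
        # Stand-in for the actual NCCL communicator creation.
        self._comms_ready = True

    def is_initialized(self) -> bool:
        return self._comms_ready
```

The trade-off: the silent `False` default needs no third-party changes, while raising an unimplemented error would be stricter but forces every backend to opt in.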
ghstack-source-id: 914023b. Pull Request resolved: #138665
LGTM overall. The comments are minor.
```python
tensor = torch.full((1,), self.rank).cuda(device)
new_group = c10d.new_group([0, 1], device_id=device)
self.assertEqual(backend.comm_split_count(), 0)
```
nit: I propose we stop using comm_split_count for testing.
This one is widely used in split-related tests; any alternatives?
No alternatives. Just proposing that we stop using it.
Then not using it is basically equivalent to removing valid Python tests and safety guards against future code changes. Unless we have an alternative, e.g., fully trusted C++ tests (which would still miss the end-to-end Python test), I don't think it is a good idea to remove all of its usages.
```python
new_backend = new_group._get_backend(torch.device(device))
self.assertEqual(new_backend._is_initialized(), True)
dist.broadcast(tensor, 0, group=new_group)
self.assertEqual(new_backend.comm_split_count(), 0)
```
Same here
```python
dist.broadcast(tensor, 0, group=new_group)
self.assertEqual(new_backend.comm_split_count(), 0)
self.assertEqual(backend._is_initialized(), False)
torch.cuda.synchronize()
```
nit: is this synchronize necessary?
Yes, because things could be aborted before the collective completes.
@pytorchbot merge -f "no failures"
Merge started: your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Stack from ghstack (oldest at bottom):
Summary:
Currently, eager mode is applied either to all PGs or none of them.
There are cases where we don't want to initialize the comms for the default
PG but still want to initialize the comms for a sub PG. With a
device_id passed to new_group, this case can now be achieved.
Test Plan:
newly added UT
Tags:
Resolves #137018
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o