-
Notifications
You must be signed in to change notification settings - Fork 25.7k
[C10D] Support group_dst/group_src in c10d send/recv #140460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Addressing RFC 0042 (pytorch/rfcs#71) It's annoying that 'dst' (for send) ust be a global rank even when a group is passed in. But we can't easily change 'dst' without breaking existing cases. [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140460
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ✅ No FailuresAs of commit f558d56 with merge base de34f58 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
Addressing RFC 0042 (pytorch/rfcs#71) It's annoying that 'dst' (for send) ust be a global rank even when a group is passed in. But we can't easily change 'dst' without breaking existing cases. ghstack-source-id: 80af56e Pull Request resolved: #140460
# TODO if we want to encourage migration to 'group_dst' arg and deprecate dst arg, should we support using 'group_dst' | ||
# even when 'group' is None (for default group)? | ||
assert group is not None, "Must specify group if using group_dst" | ||
dst = get_global_rank(group, group_dst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, this is dumb. i just realized that i can't implement group_dst as a function of global rank, since part of the reason for this PR is that 'group' might not be associated with the default-group at all, so get_global_rank function may not work. We will need to invert the logic below so that 'group_dst' is the thing that we directly pass into the group api, and we convert dst into group_dst at the top if dst is passed in. this will make the change a bit more invasive but still possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a global to group map mapping: _world.pg_group_ranks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, i realize that but the point is we want to not rely on any global stuff when using group_ argument. This is to enable use cases where someone creates a ProcessGroup object directly and it's not registered with us.
Addressing RFC 0042 (pytorch/rfcs#71) It's annoying that 'dst' (for send) ust be a global rank even when a group is passed in. But we can't easily change 'dst' without breaking existing cases. cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 d4l3k c-p-i-o [ghstack-poisoned]
Addressing RFC 0042 (pytorch/rfcs#71) It's annoying that 'dst' (for send) ust be a global rank even when a group is passed in. But we can't easily change 'dst' without breaking existing cases. cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 d4l3k c-p-i-o [ghstack-poisoned]
Addressing RFC 0042 (pytorch/rfcs#71) It's annoying that 'dst' (for send) ust be a global rank even when a group is passed in. But we can't easily change 'dst' without breaking existing cases. Furthermore, requiring use of 'global' dst breaks the less common usage pattern of creating a new ProcessGroup object that is not connected to the 'default group' and thus has no logical 'global' ranks. [ghstack-poisoned]
Addressing RFC 0042 (pytorch/rfcs#71) It's annoying that 'dst' (for send) ust be a global rank even when a group is passed in. But we can't easily change 'dst' without breaking existing cases. Furthermore, requiring use of 'global' dst breaks the less common usage pattern of creating a new ProcessGroup object that is not connected to the 'default group' and thus has no logical 'global' ranks. ghstack-source-id: 72264a2 Pull Request resolved: #140460
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code looks good to me
Can we add some unit tests on this new logic?
group_src_rank = get_group_rank(pg, src) | ||
pg.recv([tensor], group_src_rank, tag).wait() | ||
return src | ||
group_src = _canonicalize_group_rank(group, src, group_src) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use _check_not_self_rank(group, group_src, "source")
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, no, we apparently can't. It causes this test to fail
python test/distributed/test_distributed_spawn.py TestDistBackendWithSpawn.test_batch_isend_irecv_nccl
The test looks kinda dumb, i'm not sure if it really even intends to send/recv to itself, but that's what it does. And I think (guess?) that nccl supports this so we probably shouldn't start asserting against it now...
lg! maybe n00b question, what could be a use case for this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks a lot.
Do you mind extending the change to isend
, irecv
, broadcast
and reduce
too?
Can be in a next PR.
Thanks!
if group_rank is not None: | ||
assert global_rank is None, "Can't specify both group_rank and global_rank" | ||
else: | ||
assert global_rank is not None, "Must specify global_rank or group_rank" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: raise ValueError instead? assert can be optimized away with -O
flag.
Well, first I wanted this when developing pipelining. We often want to send to our next pipeline rank using the pp group but we are forced to convert our pp rank to global rank which is inconvenient. Second the RFC0042 issue came up and using local ranks is important if you don't want to assume all pgs are part of the same world group. @kwen2501 yea planning to add other ops in additional PRs. |
Addressing RFC 0042 (pytorch/rfcs#71) It's annoying that 'dst' (for send) ust be a global rank even when a group is passed in. But we can't easily change 'dst' without breaking existing cases. Furthermore, requiring use of 'global' dst breaks the less common usage pattern of creating a new ProcessGroup object that is not connected to the 'default group' and thus has no logical 'global' ranks. [ghstack-poisoned]
Partly addressing RFC 0042 (pytorch/rfcs#71) It's annoying that 'dst' (for send) ust be a global rank even when a group is passed in. But we can't easily change 'dst' without breaking existing cases. Furthermore, requiring use of 'global' dst breaks the less common usage pattern of creating a new ProcessGroup object that is not connected to the 'default group' and thus has no logical 'global' ranks. [ghstack-poisoned]
Partly addressing RFC 0042 (pytorch/rfcs#71) It's annoying that 'dst' (for send) ust be a global rank even when a group is passed in. But we can't easily change 'dst' without breaking existing cases. Furthermore, requiring use of 'global' dst breaks the less common usage pattern of creating a new ProcessGroup object that is not connected to the 'default group' and thus has no logical 'global' ranks. [ghstack-poisoned]
Sorry I have more ops to suggggggest: Just a heads-up: in the implementation of these object coll functions, we should use the |
@kwen2501 |
Changes semantic of __repr__ of P2POp: s, d are now group ranks instead of global ranks. I think this is OK since I also updated the field names to make this obvious. Also add mypy annotations Partially addresses RFC 0042 (pytorch/rfcs#71) See more details/motivation in #140460 cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 d4l3k c-p-i-o [ghstack-poisoned]
Changes semantic of __repr__ of P2POp: s, d are now group ranks instead of global ranks. I think this is OK since I also updated the field names to make this obvious. Also add mypy annotations Partially addresses RFC 0042 (pytorch/rfcs#71) See more details/motivation in #140460 ghstack-source-id: 6991c41 Pull Request resolved: #141054
Changes semantic of __repr__ of P2POp: s, d are now group ranks instead of global ranks. I think this is OK since I also updated the field names to make this obvious. Also add mypy annotations Partially addresses RFC 0042 (pytorch/rfcs#71) See more details/motivation in #140460 cc H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 d4l3k c-p-i-o [ghstack-poisoned]
Changes semantic of __repr__ of P2POp: s, d are now group ranks instead of global ranks. I think this is OK since I also updated the field names to make this obvious. Also add mypy annotations Partially addresses RFC 0042 (pytorch/rfcs#71) See more details/motivation in #140460 ghstack-source-id: 6f61786 Pull Request resolved: #141054
Changes semantic of __repr__ of P2POp: s, d are now group ranks instead of global ranks. I think this is OK since I also updated the field names to make this obvious. Also add mypy annotations Partially addresses RFC 0042 (pytorch/rfcs#71) See more details/motivation in #140460 Pull Request resolved: #141054 Approved by: https://github.com/kwen2501
Partly addressing RFC 0042 (pytorch/rfcs#71) It's annoying that 'dst' (for send) ust be a global rank even when a group is passed in. But we can't easily change 'dst' without breaking existing cases. Furthermore, requiring use of 'global' dst breaks the less common usage pattern of creating a new ProcessGroup object that is not connected to the 'default group' and thus has no logical 'global' ranks. Pull Request resolved: pytorch#140460 Approved by: https://github.com/d4l3k, https://github.com/kwen2501, https://github.com/fduwjj
Avoid copypaste of send/isend and recv/irecv impl. This does change the warning issued from send to include the identifier "isend" instead of "send", but I think thats not a big deal. Pull Request resolved: pytorch#140815 Approved by: https://github.com/fegin ghstack dependencies: pytorch#140460
…40820) Faced with an annoying string of warnings like this when running tests, <img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212"> My choices seem to be (1) call destroy_process_group() at the end of each test fn, (2) do this in some wrapper, (3) do it in the base test class. Since tests in MultiProcessTestCase are responsible for calling init_process_group themselves, they should also be responsible for calling destroy (or at least method (3) would be asymmetric and may result in double-destroy). But it doesn't feel worth it to go add a destroy call manually to each test, and try/except for a possible second destroy call seems like a happy middle ground. Note: tests that want to ensure that destroy runs cleanly can and should still call destroy _inside_ the test, and this change does not affect that. Pull Request resolved: pytorch#140820 Approved by: https://github.com/fegin ghstack dependencies: pytorch#140460, pytorch#140815
) Also add missing mypy typing and a few asserts to make mypy happy Partially addresses RFC 0042 (pytorch/rfcs#71) See more details/motivation in pytorch#140460 Note: object collective version canonicalizes to global instead of group rank, simply becuase this left more of the original code intact and required less conversions overall. Pull Request resolved: pytorch#140827 Approved by: https://github.com/kwen2501
Also add mypy annotations Partially addresses RFC 0042 (pytorch/rfcs#71) See more details/motivation in pytorch#140460 Pull Request resolved: pytorch#140843 Approved by: https://github.com/kwen2501
…orch#140847) Also add mypy annotations Partially addresses RFC 0042 (pytorch/rfcs#71) See more details/motivation in pytorch#140460 Pull Request resolved: pytorch#140847 Approved by: https://github.com/H-Huang ghstack dependencies: pytorch#140843
…1054) Changes semantic of __repr__ of P2POp: s, d are now group ranks instead of global ranks. I think this is OK since I also updated the field names to make this obvious. Also add mypy annotations Partially addresses RFC 0042 (pytorch/rfcs#71) See more details/motivation in pytorch#140460 Pull Request resolved: pytorch#141054 Approved by: https://github.com/kwen2501
Addressing RFC 0042 (pytorch/rfcs#71) It's annoying that 'dst' (for send) ust be a global rank even when a group is passed in. But we can't easily change 'dst' without breaking existing cases. Furthermore, requiring use of 'global' dst breaks the less common usage pattern of creating a new ProcessGroup object that is not connected to the 'default group' and thus has no logical 'global' ranks. ghstack-source-id: 33ea136 Pull Request resolved: pytorch/pytorch#140460
Stack from ghstack (oldest at bottom):
Partly addressing RFC 0042 (pytorch/rfcs#71)
It's annoying that 'dst' (for send) ust be a global rank even when a
group is passed in. But we can't easily change 'dst' without breaking
existing cases.
Furthermore, requiring use of 'global' dst breaks the less common usage
pattern of creating a new ProcessGroup object that is not connected to
the 'default group' and thus has no logical 'global' ranks.
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @d4l3k @c-p-i-o