Add batch option for send/recv_object_list by H-Huang · Pull Request #160342 · pytorch/pytorch

Conversation

@H-Huang (Member) commented Aug 11, 2025

Stack from ghstack (oldest at bottom):

`send_object_list` and `recv_object_list` use regular `send`/`recv` P2P ops, which means they will create 2-rank NCCL communicators between ranks if those communicators have not already been initialized.

This adds a `use_batch` option which routes the send/recv through `batch_isend_irecv`, reusing the communicators already initialized for collectives in the group.
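A minimal usage sketch of the new option (the surrounding setup is illustrative, assuming an already-initialized NCCL process group; the exact final signature may differ):

```python
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has run and each rank has set its
# CUDA device; object lists are pickled into tensors under the hood.
objects = [{"step": 0, "loss": 1.23}] if dist.get_rank() == 0 else [None]

if dist.get_rank() == 0:
    # use_batch=True routes the underlying tensor sends through
    # batch_isend_irecv, reusing the group's existing communicator.
    dist.send_object_list(objects, dst=1, use_batch=True)
elif dist.get_rank() == 1:
    dist.recv_object_list(objects, src=0, use_batch=True)
```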


- Batched P2P ops create (or reuse) a communicator keyed by the device index.
- Regular P2P ops create (or reuse) a dedicated 2-rank communicator keyed by "rank1:rank2".

See the key-selection logic in ProcessGroupNCCL.cpp (https://github.com/pytorch/pytorch/blob/c8205cb35435f39d2c26f6c94b45e4adeb6dcb23/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L3980-L4008):

```cpp
  } else if (batchP2P) {
    // TODO(whc) - unclear why we special-case batchP2P to avoid this path, but
    // I preserved this existing special case.
    key = getKeyFromDevice(device);
    p2pRank = rank_;
    p2pTargetRank = peer;
    ncclComm = getNCCLComm(key);
  } else {
    // We create special 2-rank communicators for each pair of
    // send/recv ranks. This limitation exists for two reasons: (1)
    // we use a single stream per communicator, so if multiple
    // unbatched p2p operations are issued on the same communicator,
    // they would map to the same stream and thus would be serialized;
    // and (2) Nvidia NCCL does not allow multiple p2p operations to
    // be issued on the same communicator over different streams.
    TORCH_WARN_ONCE(
        "An unbatched P2P op (send/recv) was called on this ",
        "ProcessGroup with size ",
        groupRanks().size(),
        ". In lazy initialization mode, this will result in a new 2-rank",
        " NCCL communicator to be created.");
    key = getKeySendRecv(rank_, peer);
    /* if we are creating a new comm, reset the p2pRank and
     * p2pTargetRank to correspond to this new 2-process communicator */
    p2pRank = rank_ <= peer ? 0 : 1;
    p2pTargetRank = isSendRecvSelf ? 0 : 1 - p2pRank;
    ncclComm = getNCCLComm(key);
  }
```
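For contrast, here is a hedged Python-level sketch of the two branches above: an unbatched send/recv lazily creates the rank-pair communicator that the warning describes, while `batch_isend_irecv` takes the device-keyed path and reuses the group's existing communicator. The pairing pattern and tensor shapes are illustrative and assume an even world size.

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
# Pair each even rank with the next odd rank (illustrative; assumes an even world size).
peer = rank + 1 if rank % 2 == 0 else rank - 1
t = torch.ones(4, device="cuda")

# Unbatched P2P: under lazy init this hits the getKeySendRecv branch and
# creates a dedicated 2-rank communicator keyed by the rank pair.
if rank % 2 == 0:
    dist.send(t, dst=peer)
else:
    dist.recv(t, src=peer)

# Batched P2P: hits the getKeyFromDevice branch and reuses the per-device
# communicator that collectives on this group already use.
op = dist.P2POp(dist.isend if rank % 2 == 0 else dist.irecv, t, peer)
for work in dist.batch_isend_irecv([op]):
    work.wait()
```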

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

[ghstack-poisoned]
@pytorch-bot bot commented Aug 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160342

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please review it.

✅ No Failures

As of commit f77258f with merge base 0e45023:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the `oncall: distributed` and `release notes: distributed (c10d)` labels Aug 11, 2025
H-Huang added a commit that referenced this pull request Aug 11, 2025
@wconstab (Contributor) commented:

The implementation seems correct, but I'm questioning whether we want the complexity; what's the motivation? And would we get there a different way if we instead pushed harder on deprecating lazy init altogether?

@H-Huang requested review from kwen2501 and wconstab and removed the request for wconstab August 11, 2025 18:27
@H-Huang (Member, Author) commented Aug 11, 2025

@wconstab The motivation is that it gives users an option to avoid creating 2-rank NCCL communicators if they don't want them. For us in PP, since we only use `batch_isend_irecv`, this is kind of an escape hatch.

Will we be able to deprecate lazy init for regular P2P ops? I think it makes sense that when the PG is created we can establish the NCCL communicators for each rank, but regular P2P ops also use a communicator for each pair of ranks, so I assume we don't want to eagerly create all of those extra ones as well?

@wconstab (Contributor) left a review comment:

lgtm, thanks!

H-Huang added a commit that referenced this pull request Aug 29, 2025
ghstack-source-id: d01481e
Pull-Request: #160342

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Aug 29, 2025
ghstack-source-id: d01481e
Pull-Request: #160342

ghstack-source-id: 58ce5cd
Pull Request resolved: #161811
[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Aug 29, 2025
@H-Huang added the `ciflow/trunk` label Aug 29, 2025
@H-Huang (Member, Author) commented Aug 30, 2025

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025

Pull Request resolved: pytorch#160342
Approved by: https://github.com/wconstab
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025

Pull Request resolved: pytorch#160342
Approved by: https://github.com/wconstab
@github-actions github-actions bot deleted the gh/H-Huang/206/head branch September 30, 2025 02:08

Labels: ciflow/trunk, Merged, oncall: distributed, release notes: distributed (c10d)
