Add batch option for send/recv_object_list by H-Huang · Pull Request #160342 · pytorch/pytorch

Conversation

@H-Huang (Member) commented Aug 11, 2025

Stack from ghstack (oldest at bottom):

`send_object_list` and `recv_object_list` use regular `send`/`recv` P2P ops, which means they will create 2-rank NCCL communicators between ranks if those communicators have not already been initialized.

This adds a `use_batch` option which routes the send/recv through `batch_isend_irecv`, reusing the communicators already initialized for collectives in the group.
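A minimal usage sketch of the new option (the surrounding setup is illustrative, assuming an already-initialized NCCL process group; the exact final signature may differ):

```python
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has run and each rank has set its
# CUDA device; object lists are pickled into tensors under the hood.
objects = [{"step": 0, "loss": 1.23}] if dist.get_rank() == 0 else [None]

if dist.get_rank() == 0:
    # use_batch=True routes the underlying tensor sends through
    # batch_isend_irecv, reusing the group's existing communicator.
    dist.send_object_list(objects, dst=1, use_batch=True)
elif dist.get_rank() == 1:
    dist.recv_object_list(objects, src=0, use_batch=True)
```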


- Batched P2P ops create (or reuse) a communicator keyed by the device index.
- Regular P2P ops create (or reuse) a dedicated 2-rank communicator keyed by "rank1:rank2".

See the key-selection logic in ProcessGroupNCCL.cpp (https://github.com/pytorch/pytorch/blob/c8205cb35435f39d2c26f6c94b45e4adeb6dcb23/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L3980-L4008):

```cpp
  } else if (batchP2P) {
    // TODO(whc) - unclear why we special-case batchP2P to avoid this path, but
    // I preserved this existing special case.
    key = getKeyFromDevice(device);
    p2pRank = rank_;
    p2pTargetRank = peer;
    ncclComm = getNCCLComm(key);
  } else {
    // We create special 2-rank communicators for each pair of
    // send/recv ranks. This limitation exists for two reasons: (1)
    // we use a single stream per communicator, so if multiple
    // unbatched p2p operations are issued on the same communicator,
    // they would map to the same stream and thus would be serialized;
    // and (2) Nvidia NCCL does not allow multiple p2p operations to
    // be issued on the same communicator over different streams.
    TORCH_WARN_ONCE(
        "An unbatched P2P op (send/recv) was called on this ",
        "ProcessGroup with size ",
        groupRanks().size(),
        ". In lazy initialization mode, this will result in a new 2-rank",
        " NCCL communicator to be created.");
    key = getKeySendRecv(rank_, peer);
    /* if we are creating a new comm, reset the p2pRank and
     * p2pTargetRank to correspond to this new 2-process communicator */
    p2pRank = rank_ <= peer ? 0 : 1;
    p2pTargetRank = isSendRecvSelf ? 0 : 1 - p2pRank;
    ncclComm = getNCCLComm(key);
  }
```
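For contrast, here is a hedged Python-level sketch of the two branches above: an unbatched send/recv lazily creates the rank-pair communicator that the warning describes, while `batch_isend_irecv` takes the device-keyed path and reuses the group's existing communicator. The pairing pattern and tensor shapes are illustrative and assume an even world size.

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
# Pair each even rank with the next odd rank (illustrative; assumes an even world size).
peer = rank + 1 if rank % 2 == 0 else rank - 1
t = torch.ones(4, device="cuda")

# Unbatched P2P: under lazy init this hits the getKeySendRecv branch and
# creates a dedicated 2-rank communicator keyed by the rank pair.
if rank % 2 == 0:
    dist.send(t, dst=peer)
else:
    dist.recv(t, src=peer)

# Batched P2P: hits the getKeyFromDevice branch and reuses the per-device
# communicator that collectives on this group already use.
op = dist.P2POp(dist.isend if rank % 2 == 0 else dist.irecv, t, peer)
for work in dist.batch_isend_irecv([op]):
    work.wait()
```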

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

[ghstack-poisoned]
@pytorch-bot bot commented Aug 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160342

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please review it.

✅ No Failures

As of commit f77258f with merge base 0e45023:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the `oncall: distributed` and `release notes: distributed (c10d)` labels Aug 11, 2025
H-Huang added a commit that referenced this pull request Aug 11, 2025
@wconstab (Contributor) commented:

The implementation seems correct, but I'm questioning whether we want the complexity; what's the motivation? And would we get there a different way if we instead pushed harder on deprecating lazy init altogether?

@H-Huang requested review from kwen2501 and wconstab and removed the request for wconstab August 11, 2025 18:27
@H-Huang (Member, Author) commented Aug 11, 2025

@wconstab The motivation is that it gives users an option to avoid creating 2-rank NCCL communicators if they don't want them. For us in PP, since we only use `batch_isend_irecv`, this is kind of an escape hatch.

Will we be able to deprecate lazy init for regular P2P ops? I think it makes sense that when the PG is created we can establish the NCCL communicators for each rank, but regular P2P ops also use a communicator for each pair of ranks, so I assume we don't want to eagerly create all of those extra ones as well?

@wconstab (Contributor) left a review comment:

lgtm, thanks!

H-Huang added a commit that referenced this pull request Aug 29, 2025
ghstack-source-id: d01481e
Pull-Request: #160342

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Aug 29, 2025
ghstack-source-id: d01481e
Pull-Request: #160342

ghstack-source-id: 58ce5cd
Pull Request resolved: #161811
[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Aug 29, 2025
@H-Huang added the `ciflow/trunk` label Aug 29, 2025
@H-Huang (Member, Author) commented Aug 30, 2025

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025

Pull Request resolved: pytorch#160342
Approved by: https://github.com/wconstab
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025

Pull Request resolved: pytorch#160342
Approved by: https://github.com/wconstab
@github-actions github-actions bot deleted the gh/H-Huang/206/head branch September 30, 2025 02:08

Labels: ciflow/trunk, Merged, oncall: distributed, release notes: distributed (c10d)
