[PGNCCL] Fix P2P data corruption in non-blocking mode by kwen2501 · Pull Request #138860 · pytorch/pytorch · GitHub

Conversation


@kwen2501 kwen2501 commented Oct 24, 2024

Stack from ghstack (oldest at bottom):

In non-blocking mode, it seems a single ncclRecv or ncclSend call can "early return" ncclSuccess before the kernel is fully enqueued. This causes the event record below to miss the P2P kernel, leading to data corruption.

Side note: per NCCL, it is legal to call ncclSend or ncclRecv outside a group only if there is a single P2P op. This is true whether we are in blocking or non-blocking mode.

In this fix, we use ncclGroup semantics to ensure that the kernel is enqueued for single-P2P ops. The ncclGroup call itself should introduce minimal overhead.
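For illustration, here is a minimal sketch of the pattern (illustrative names only -- `buf`, `peer`, `comm`, `stream`, and `event` are assumptions, not the exact diff in this PR):

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

// Sketch: enqueue a single P2P send in non-blocking mode, then make sure
// the kernel is really enqueued before recording the stream event.
void sendOneP2P(const float* buf, size_t count, int peer,
                ncclComm_t comm, cudaStream_t stream, cudaEvent_t event) {
  ncclGroupStart();  // bumps the group nesting counter
  ncclSend(buf, count, ncclFloat, peer, comm, stream);
  ncclResult_t res = ncclGroupEnd();  // flushes the group; may return ncclInProgress
  // In non-blocking mode, poll until the queued work (connection setup +
  // kernel enqueue) has completed on the CPU side.
  while (res == ncclInProgress) {
    ncclCommGetAsyncError(comm, &res);
  }
  cudaEventRecord(event, stream);  // now guaranteed to capture the P2P kernel
}
```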

Added a test test_non_blocking_p2p.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o


pytorch-bot bot commented Oct 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138860

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f1d6128 with merge base 4c91481:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Oct 24, 2024
@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Oct 24, 2024
@kwen2501 kwen2501 added the topic: bug fixes topic category label Oct 24, 2024
```cpp
      ncclComm->getNcclComm(),
      ncclComm->getNcclCommFailureReason());
  // In non-blocking mode, we need to use ncclGroup semantics to ensure that the
  // kernel is enqueued for single-P2P ops. Otherwise, the event record below
```
Contributor
I don't understand this explanation.

  • Why is this specific to P2P ops? In non-blocking mode, do all collectives need to be in a 'group' to ensure they are correctly enqueued before the subsequent event record?

Contributor Author

I feel we would need a better contract / documentation from NCCL in terms of what ncclSuccess means for P2P calls -- i.e. whether the kernel is guaranteed to have been enqueued or not. In this instance, it seems not, so a group call is used to help.

Contributor Author

Re why this is specific to P2P ops:
P2P ops involve dynamic connection establishment. This is the part that is made non-blocking in non-blocking mode.

Contributor

Hmm, OK, that kinda makes sense.

However, I still have some concerns: this change would effectively turn all P2P ops into grouped P2P ops of size 1. Does grouping add overhead to P2P performance that could be avoided in the blocking case? (Overhead on either the CPU side or the GPU side -- does a 'group' P2P run a different kernel, for example?)

Contributor Author

@kwen2501 kwen2501 Oct 24, 2024

No difference on the GPU side.
On the CPU side, I wouldn't say there are no extra instructions, but the overhead is minimal:
ncclGroupStart() does nothing but increment a counter;
ncclGroupEnd() decrements the counter and, when it reaches 0, kicks off the queued work.
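A conceptual model of that counter behavior (a sketch only, not NCCL's actual implementation):

```cpp
#include <nccl.h>

// Conceptual model only -- NCCL's real implementation differs.
static thread_local int groupDepth = 0;

ncclResult_t conceptualGroupStart() {
  ++groupDepth;  // nothing else happens here
  return ncclSuccess;
}

ncclResult_t conceptualGroupEnd() {
  if (--groupDepth == 0) {
    // Kick off everything queued inside the group: establish P2P
    // connections and enqueue the kernels. In non-blocking mode this
    // step can return ncclInProgress instead of blocking.
  }
  return ncclSuccess;
}
```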

```cpp
  // kernel is enqueued for single-P2P ops. Otherwise, the event record below
  // may not capture the kernel, leading to data corruption.
  ncclGroupStart();
  C10D_NCCL_CHECK_NONBLOCKING(
```
Contributor
Should this macro be applied only when non-blocking mode is on? Otherwise we should just use C10D_NCCL_CHECK.

Contributor Author

I don't quite understand the question. We are in the NCCL_HAS_COMM_NONBLOCKING region, and we are using the C10D_NCCL_CHECK_NONBLOCKING macro.

Contributor

Sorry, I meant: why not use C10D_NCCL_CHECK_TIMEOUT to wait for ncclSuccess, the same as used in other collectives?

Contributor Author

Oh, the TIMEOUT check is deferred to the ncclGroupEnd() call. I think that's more efficient.

Contributor Author

i.e. we don't have a timeout concern for calls within group semantics; they only queue work until ncclGroupEnd().
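Roughly, the check placement looks like this (a simplified sketch, not the actual c10d macros):

```cpp
#include <nccl.h>
#include <cuda_runtime.h>
#include <chrono>

// Sketch: inside the group, ncclInProgress is acceptable because the op is
// only queued; the timeout-bounded wait happens once, at group end.
ncclResult_t recvOneP2P(void* buf, size_t count, int peer, ncclComm_t comm,
                        cudaStream_t stream, std::chrono::milliseconds timeout) {
  ncclGroupStart();
  ncclRecv(buf, count, ncclFloat, peer, comm, stream);  // queued, not flushed
  ncclResult_t state = ncclGroupEnd();  // flush; may return ncclInProgress
  auto deadline = std::chrono::steady_clock::now() + timeout;
  while (state == ncclInProgress) {
    ncclCommGetAsyncError(comm, &state);
    if (std::chrono::steady_clock::now() > deadline) {
      ncclCommAbort(comm);  // give up and surface an error
      return ncclInternalError;
    }
  }
  return state;
}
```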

@kwen2501 kwen2501 added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 24, 2024
@kwen2501 kwen2501 requested a review from eqy October 25, 2024 00:26

eqy commented Oct 25, 2024

CC @KaimingOuyang for comment re: https://github.com/pytorch/pytorch/pull/138860/files#r1815798309

@kwen2501
Contributor Author

@KaimingOuyang confirmed that there is a bug in NCCL regarding ncclRecv and ncclSend: if we don't poll ncclCommGetAsyncError, the kernel may not be fully enqueued -- i.e., ncclRecv can directly return ncclSuccess instead of ncclInProgress.
I personally feel that having a group call here is future-proof: even after NCCL fixes the bug, we don't have to take the group call out, and the group-call overhead is minimal.
Cc: @eqy

@kwen2501
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Oct 26, 2024
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.
![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd)

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. not passed in by the user nor set via env -- `ProcessGroupNCCL` can apply some preferred logic. And torch-level API semantics do not change whether the NCCL comm is blocking or non-blocking (this is handled within `ProcessGroupNCCL`).

### Why not make non-blocking default for lazy mode as well?
PR #137544 tried it.
Two reasons why that's not preferred today:
1. It is hard -- too big a blast radius.
2. There is no gain from doing lazy init in non-blocking mode: the very next CPU call is a collective, and we would block there waiting for the comm to be ready. So it has the same effect as blocking init, with no overlap "opening" as in eager mode.

Pull Request resolved: #138527
Approved by: https://github.com/wconstab
ghstack dependencies: #138860
pytorchmergebot pushed a commit that referenced this pull request Oct 27, 2024
@github-actions github-actions bot deleted the gh/kwen2501/84/head branch November 25, 2024 02:12