[PGNCCL] Fix P2P data corruption in non-blocking mode by kwen2501 · Pull Request #138860 · pytorch/pytorch · GitHub

Conversation


@kwen2501 kwen2501 commented Oct 24, 2024

Stack from ghstack (oldest at bottom):

In non-blocking mode, it seems a single ncclRecv or ncclSend call can "early return" ncclSuccess before the kernel is fully enqueued. This causes the event record below to miss the P2P kernel, leading to data corruption.

Side note: per NCCL, it is legal to call ncclSend or ncclRecv outside a group only if there is a single P2P op. This is true whether we are in blocking or non-blocking mode.

In this fix, we use ncclGroup semantics to ensure that the kernel is enqueued for single-P2P ops. The ncclGroup call itself should introduce minimal overhead.
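For illustration, here is a minimal sketch of the pattern (illustrative names only -- `buf`, `peer`, `comm`, `stream`, and `event` are assumptions, not the exact diff in this PR):

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

// Sketch: enqueue a single P2P send in non-blocking mode, then make sure
// the kernel is really enqueued before recording the stream event.
void sendOneP2P(const float* buf, size_t count, int peer,
                ncclComm_t comm, cudaStream_t stream, cudaEvent_t event) {
  ncclGroupStart();  // bumps the group nesting counter
  ncclSend(buf, count, ncclFloat, peer, comm, stream);
  ncclResult_t res = ncclGroupEnd();  // flushes the group; may return ncclInProgress
  // In non-blocking mode, poll until the queued work (connection setup +
  // kernel enqueue) has completed on the CPU side.
  while (res == ncclInProgress) {
    ncclCommGetAsyncError(comm, &res);
  }
  cudaEventRecord(event, stream);  // now guaranteed to capture the P2P kernel
}
```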

Added a test test_non_blocking_p2p.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o


pytorch-bot bot commented Oct 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138860

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f1d6128 with merge base 4c91481:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Oct 24, 2024
@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Oct 24, 2024
@kwen2501 kwen2501 added the topic: bug fixes topic category label Oct 24, 2024
```cpp
      ncclComm->getNcclComm(),
      ncclComm->getNcclCommFailureReason());
  // In non-blocking mode, we need to use ncclGroup semantics to ensure that the
  // kernel is enqueued for single-P2P ops. Otherwise, the event record below
```
Contributor
I don't understand this explanation.

  • Why is this specific to P2P ops? In non-blocking mode, do all collectives need to be in a 'group' to ensure they are correctly enqueued before the subsequent event record?

Contributor Author

I feel we would need a better contract / documentation from NCCL in terms of what ncclSuccess means for P2P calls -- i.e. whether the kernel is guaranteed to have been enqueued or not. In this instance, it seems not, so a group call is used to help.

Contributor Author

Re why this is specific to P2P ops:
P2P ops involve dynamic connection establishment. This is the part that is made non-blocking in non-blocking mode.

Contributor

Hmm, OK, that kinda makes sense.

However, I still have some concerns: this change would effectively turn all P2P ops into grouped P2P ops of size 1. Does grouping add overhead to P2P performance that could be avoided in the blocking case? (Overhead on either the CPU side or the GPU side -- does a 'group' P2P run a different kernel, for example?)

Contributor Author

@kwen2501 kwen2501 Oct 24, 2024

No difference on the GPU side.
On the CPU side, I wouldn't say there are no extra instructions, but the overhead is minimal:
ncclGroupStart() does nothing but increment a counter;
ncclGroupEnd() decrements the counter and, when it reaches 0, kicks off the queued work.
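A conceptual model of that counter behavior (a sketch only, not NCCL's actual implementation):

```cpp
#include <nccl.h>

// Conceptual model only -- NCCL's real implementation differs.
static thread_local int groupDepth = 0;

ncclResult_t conceptualGroupStart() {
  ++groupDepth;  // nothing else happens here
  return ncclSuccess;
}

ncclResult_t conceptualGroupEnd() {
  if (--groupDepth == 0) {
    // Kick off everything queued inside the group: establish P2P
    // connections and enqueue the kernels. In non-blocking mode this
    // step can return ncclInProgress instead of blocking.
  }
  return ncclSuccess;
}
```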

```cpp
  // kernel is enqueued for single-P2P ops. Otherwise, the event record below
  // may not capture the kernel, leading to data corruption.
  ncclGroupStart();
  C10D_NCCL_CHECK_NONBLOCKING(
```
Contributor
Should this macro be applied only when non-blocking mode is on? Otherwise we should just use C10D_NCCL_CHECK.

Contributor Author

I don't quite understand the question. We are in the NCCL_HAS_COMM_NONBLOCKING region, and we are using the C10D_NCCL_CHECK_NONBLOCKING macro.

Contributor

Sorry, I meant: why not use C10D_NCCL_CHECK_TIMEOUT to wait for ncclSuccess, the same as used in other collectives?

Contributor Author

Oh, the TIMEOUT check is deferred to the ncclGroupEnd() call. I think that's more efficient.

Contributor Author

i.e. we don't have a timeout concern for calls within group semantics; they only queue work until ncclGroupEnd().
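Roughly, the check placement looks like this (a simplified sketch, not the actual c10d macros):

```cpp
#include <nccl.h>
#include <cuda_runtime.h>
#include <chrono>

// Sketch: inside the group, ncclInProgress is acceptable because the op is
// only queued; the timeout-bounded wait happens once, at group end.
ncclResult_t recvOneP2P(void* buf, size_t count, int peer, ncclComm_t comm,
                        cudaStream_t stream, std::chrono::milliseconds timeout) {
  ncclGroupStart();
  ncclRecv(buf, count, ncclFloat, peer, comm, stream);  // queued, not flushed
  ncclResult_t state = ncclGroupEnd();  // flush; may return ncclInProgress
  auto deadline = std::chrono::steady_clock::now() + timeout;
  while (state == ncclInProgress) {
    ncclCommGetAsyncError(comm, &state);
    if (std::chrono::steady_clock::now() > deadline) {
      ncclCommAbort(comm);  // give up and surface an error
      return ncclInternalError;
    }
  }
  return state;
}
```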

@kwen2501 kwen2501 added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 24, 2024
@kwen2501 kwen2501 requested a review from eqy October 25, 2024 00:26

eqy commented Oct 25, 2024

CC @KaimingOuyang for comment re: https://github.com/pytorch/pytorch/pull/138860/files#r1815798309

@kwen2501
Contributor Author

@KaimingOuyang confirmed that there is a bug in NCCL regarding ncclRecv and ncclSend: if we don't poll ncclCommGetAsyncError, the kernel may not be fully enqueued -- i.e., ncclRecv can directly return ncclSuccess instead of ncclInProgress.
I personally feel that having a group call here is future-proof: even after NCCL fixes the bug, we don't have to take the group call out, and the group-call overhead is minimal.
Cc: @eqy

@kwen2501
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Oct 26, 2024
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.
![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd)

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. not passed in by the user nor set via env -- `ProcessGroupNCCL` can apply some preferred logic. And torch-level API semantics do not change whether the NCCL comm is blocking or non-blocking (this is handled within `ProcessGroupNCCL`).

### Why not make non-blocking default for lazy mode as well?
PR #137544 tried it.
Two reasons why that's not preferred today:
1. It is hard -- too big a blast radius.
2. There is no gain from doing lazy init in non-blocking mode: the very next CPU call is a collective, and we would block there waiting for the comm to be ready. So it has the same effect as blocking init, with no overlap "opening" as in eager mode.

Pull Request resolved: #138527
Approved by: https://github.com/wconstab
ghstack dependencies: #138860
pytorchmergebot pushed a commit that referenced this pull request Oct 27, 2024
@github-actions github-actions bot deleted the gh/kwen2501/84/head branch November 25, 2024 02:12