[PGNCCL] Fix P2P data corruption in non-blocking mode #138860
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138860
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit f1d6128 with merge base 4c91481.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```cpp
        ncclComm->getNcclComm(),
        ncclComm->getNcclCommFailureReason());
    // In non-blocking mode, we need to use ncclGroup semantics to ensure that the
    // kernel is enqueued for single-P2P ops. Otherwise, the event record below
```
I don't understand this explanation.
- Why is this specific to P2P ops? In non-blocking mode, do all collectives need to be in a 'group' to ensure they are correctly enqueued before the subsequent event record?
I feel we would need a better contract / better documentation from NCCL in terms of what ncclSuccess means for P2P calls -- i.e. whether the kernel is guaranteed to have been enqueued or not. In this instance, it seems it is not, so a group call is used to help.
Re: why this is specific to P2P ops:
P2P ops involve dynamic connection establishment. That is the part made non-blocking in non-blocking mode.
Hmm, OK, that kind of makes sense.
However, I still have some concerns: this change would effectively turn all P2P ops into grouped P2P ops of size 1. Does grouping add overhead to P2P performance that could be avoided in the blocking case? (Overhead on either the CPU side or the GPU side -- does a 'group' P2P run a different kernel, for example?)
No difference on the GPU side.
On the CPU side, there are a few extra instructions, but the overhead is minimal:
ncclGroupStart() is nothing but incrementing a counter.
ncclGroupEnd() decrements the counter; when it reaches 0, the enqueue work is kicked off.
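For concreteness, here is roughly what the change amounts to for a single send, written against the raw NCCL API (a sketch only; `buf`, `count`, `peer`, `stream`, and `endEvent` are placeholders, not the actual ProcessGroupNCCL variables):

```cpp
// Sketch of a size-1 grouped P2P op (placeholder names, not PyTorch code).
ncclGroupStart();                                      // bumps the group counter
ncclSend(buf, count, ncclFloat, peer, comm, stream);   // recorded inside the group
ncclGroupEnd();                                        // counter back to 0 ->
                                                       // NCCL kicks off the enqueue
// (In non-blocking mode, the comm's async state is additionally polled until it
//  leaves ncclInProgress before proceeding; see the discussion further down.)
cudaEventRecord(endEvent, stream);                     // now captures the send kernel
```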
```cpp
    // kernel is enqueued for single-P2P ops. Otherwise, the event record below
    // may not capture the kernel, leading to data corruption.
    ncclGroupStart();
    C10D_NCCL_CHECK_NONBLOCKING(
```
Should this macro only be applied when non-blocking mode is on? Otherwise we should just use C10D_NCCL_CHECK.
I don't quite understand the question. We are in the NCCL_HAS_COMM_NONBLOCKING region, and we are using the C10D_NCCL_CHECK_NONBLOCKING macro.
Sorry, I mean: why not use C10D_NCCL_CHECK_TIMEOUT to wait for ncclSuccess, the same as is used in other collectives?
Oh, the TIMEOUT check is deferred to the ncclGroupEnd() call. I think that's more efficient.
I.e., we don't have a timeout concern for calls within group semantics.
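Spelled out with raw NCCL calls, the division of checks looks roughly like this (a hedged sketch assuming a non-blocking communicator; these are not the actual macro bodies, and the names are placeholders):

```cpp
ncclGroupStart();

// Inside the group: a C10D_NCCL_CHECK_NONBLOCKING-style check only needs to
// accept ncclSuccess or ncclInProgress and fail fast on anything else.
ncclResult_t ret = ncclSend(buf, count, ncclFloat, peer, comm, stream);
if (ret != ncclSuccess && ret != ncclInProgress) {
  /* report error */
}

// The wait (with timeout handling) is deferred to the group end: after
// ncclGroupEnd, poll the communicator's async state until it is no longer
// ncclInProgress.
ret = ncclGroupEnd();
if (ret == ncclInProgress) {
  ncclResult_t state = ncclInProgress;
  do {
    ncclCommGetAsyncError(comm, &state);  // a real impl also checks a timeout
  } while (state == ncclInProgress);
  ret = state;
}
```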
In non-blocking mode, we need to use ncclGroup semantics to ensure that the kernel is enqueued for single-P2P ops. Otherwise, the event record below may not capture the kernel, leading to data corruption. Added a test `test_non_blocking_p2p`. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
CC @KaimingOuyang for comment re: https://github.com/pytorch/pytorch/pull/138860/files#r1815798309
@KaimingOuyang confirmed that there is a bug in NCCL re
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
### Why use non-blocking mode in eager init?

For overlapping comm init and model init, etc.

### Why can we set non-blocking as default?

If the setting is dangling -- i.e. not passed in by the user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics do not change whether the NCCL comm is blocking or non-blocking (that is handled within `ProcessGroupNCCL`).

### Why not make non-blocking default for lazy mode as well?

PR #137544 tried it. Two reasons why that's not preferred today:
1. It is hard -- too big a blast radius.
2. There is no gain from doing lazy init in non-blocking mode, because the very next CPU call is a collective, and we will block there waiting for the comm to be ready, so the effect is the same as blocking init; there is no "opening" compared to eager mode.

Pull Request resolved: #138527
Approved by: https://github.com/wconstab
ghstack dependencies: #138860
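For reference, non-blocking communicator creation at the NCCL level looks roughly like this (a sketch assuming NCCL >= 2.14, with placeholder `worldSize`, `uniqueId`, and `rank`; this is not the ProcessGroupNCCL code path itself):

```cpp
ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
config.blocking = 0;                       // request a non-blocking communicator

ncclComm_t comm;
ncclResult_t ret =
    ncclCommInitRankConfig(&comm, worldSize, uniqueId, rank, &config);
// ret is typically ncclInProgress: the call returns immediately, so the CPU can
// go on to build the model while connections are established in the background.

// Before the comm is first used, poll until init has actually finished.
ncclResult_t state = ncclInProgress;
while (state == ncclInProgress) {
  ncclCommGetAsyncError(comm, &state);
}
```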
Stack from ghstack (oldest at bottom):
In non-blocking mode, it seems a single `ncclRecv` or `ncclSend` call can "early return" `ncclSuccess` before the kernel is fully enqueued. This causes the event record below to miss the P2P kernel, leading to data corruption.

Side note: per NCCL, it is legal to call `ncclSend` or `ncclRecv` outside a group only if there is a single P2P op. This is true whether we are in blocking or non-blocking mode.

In this fix, we use ncclGroup semantics to ensure that the kernel is enqueued for single-P2P ops. The ncclGroup call itself should introduce minimal overhead.
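To make the failure mode concrete, a hypothetical sequence (placeholder names; not the actual ProcessGroupNCCL code):

```cpp
// 1. In non-blocking mode, a lone ncclSend can return ncclSuccess before the
//    send kernel is actually enqueued on ncclStream.
ncclSend(buf, count, ncclFloat, peer, comm, ncclStream);

// 2. The end-of-work event is recorded right away; if the kernel is not yet
//    enqueued, the event does not capture it.
cudaEventRecord(ncclEndEvent, ncclStream);

// 3. A consumer stream only waits on that event before reusing `buf`.
cudaStreamWaitEvent(computeStream, ncclEndEvent, 0);

// 4. NCCL enqueues the send kernel later and may read `buf` after it has been
//    overwritten -> data corruption. Wrapping the send in
//    ncclGroupStart()/ncclGroupEnd() forces the enqueue before step 2.
```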
Added a test `test_non_blocking_p2p`.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o