[c10d] simplify barrier implementation and further decouple CPU/GPU synchronization #137516
Conversation
Summary: Barrier is essentially intended to block the CPU thread (instead of GPU streams). Previously we used two stream synchronizations: (1) the current stream blocked by the NCCL stream's end event, and (2) the CPU thread blocked on the current stream. This is unnecessary, as we already have CPU-thread blocking logic in wait(). Also, adding a barrier-specific code block to the general GPU synchronize() API is intrusive and confusing. This PR cleans this up. Test Plan: CI
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137516
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 36668c7 with merge base 9b2e453.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
LGTM
// The NCCL communicator used for this work item.
std::shared_ptr<NCCLComm> ncclComm_;

// Tensors used for barrier op
nit: maybe also update the comment?
Good point!
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
LGTM.
But hold on a sec --
would kSynchronizeBusyWaitMillis = 10 have a performance impact on barrier?
Merge failed. Reason: 1 job failed: trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (default, 4, 5, linux.4xlarge.nvidia.gpu)
Usually barrier is used to sync multiple ranks with a much larger 'skew', e.g. different ranks entering a barrier seconds, or even minutes, apart in distributed scenarios. I am actually less worried about the perf implications for barrier than about its impact on normal collectives under blocking-wait mode. We can configure it to be much smaller (e.g., 1 ms, updated in the PR) for now if it is a concern.
I like this PR because it makes the barrier watchable by the main thread if a timeout is passed to wait. But OTOH, I am not sure what a good sleep interval would be for a barrier wait without a timeout. If we had a high-perf blocking-wait implementation, maybe we would be in a better position to merge the two. Before that, would temporarily keeping two paths be an option? For example: if using ...
I am still not convinced that barrier performance matters at 1 ms granularity, but I am willing to trade some minor code complexity for a potential maximum 1 ms perf boost, changing it back to a stream sync as before.
Thanks for the revision! Apologies for being indecisive.
// Abort NCCL communicators
abort();
Not related to this PR: maybe when we allow the user to take control of the abort (after wait(timeout) throws an exception), we should remove this line?
Yes, that's the plan: we provide the tools, but leave the decision to users.
} else if (isBarrierOp_ && !isCompleted()) {
  // For barrier wait when timeout is unspecified, we block the CPU thread on
  // the current stream. This is to minimize the CPU barrier wait time in the
  // healthy path.
  auto currentStream = at::cuda::getCurrentCUDAStream(device_.index());
  // CUDAStream wrapper will correctly use a DeviceGuard here
  currentStream.synchronize();
Thank you for adding it back!
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job failed: trunk / win-vs2019-cuda12.1-py3 / build
@pytorchbot merge -f "unrelated CI failure"

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
Summary:
Barrier is essentially intended to block the CPU thread (instead of GPU streams). Previously we used two stream synchronizations: (1) the current stream blocked by the NCCL stream's end event, and (2) the CPU thread blocked on the current stream. This is unnecessary, as we already have CPU-thread blocking logic in wait(). Also, adding a barrier-specific code block to the general GPU synchronize() API is intrusive and confusing.
This PR cleans this up.
Test Plan:
CI
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o