[c10d] simplify barrier implementation and further decouple CPU/GPU by shuqiangzhang · Pull Request #137516 · pytorch/pytorch · GitHub

Conversation

@shuqiangzhang
Contributor

@shuqiangzhang shuqiangzhang commented Oct 8, 2024

Stack from ghstack (oldest at bottom):

Summary:
Barrier is essentially intended to block the CPU thread (rather than GPU streams). Previously we used two stream synchronizations: (1) the current stream was blocked by an event recorded at the end of the NCCL stream, and (2) the CPU thread was blocked on the current stream. This is unnecessary, since the CPU-thread-blocking logic already exists in wait(). In addition, the barrier-specific code block inside the general GPU synchronize() API was intrusive and confusing.

This PR cleans that up.

Test Plan:
CI

Tags:

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
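
For illustration only, here is a minimal C++ sketch of the two approaches described in the summary. It is not the actual ProcessGroupNCCL code; ncclEndEvent, barrierWaitBefore, and barrierWaitAfter are hypothetical names.

// Hypothetical sketch of the "before" and "after" behavior described above;
// not the actual ProcessGroupNCCL implementation.
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/CUDAEvent.h>

// Before: two stream synchronizations.
void barrierWaitBefore(at::cuda::CUDAEvent& ncclEndEvent, c10::DeviceIndex dev) {
  auto currentStream = at::cuda::getCurrentCUDAStream(dev);
  // (1) the current stream waits on the event recorded at the end of the NCCL stream
  ncclEndEvent.block(currentStream);
  // (2) the CPU thread blocks on the whole current stream
  currentStream.synchronize();
}

// After: the CPU thread blocks via the logic already in wait(), e.g. by
// synchronizing on (or polling) the work's end event directly, with no
// barrier-specific branch inside the general GPU synchronize() path.
void barrierWaitAfter(at::cuda::CUDAEvent& ncclEndEvent) {
  ncclEndEvent.synchronize();
}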

@pytorch-bot

pytorch-bot bot commented Oct 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137516

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 36668c7 with merge base 9b2e453:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Oct 8, 2024
Contributor

@fduwjj fduwjj left a comment

LGTM

// The NCCL communicator used for this work item.
std::shared_ptr<NCCLComm> ncclComm_;

// Tensors used for barrier op
Contributor

nit: maybe also update the comment?

Contributor Author

Good point!

@shuqiangzhang
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Oct 8, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Contributor

@kwen2501 kwen2501 left a comment

LGTM.

Contributor

@kwen2501 kwen2501 left a comment

But hold on a sec -- would kSynchronizeBusyWaitMillis = 10 have a performance impact on barrier?

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (default, 4, 5, linux.4xlarge.nvidia.gpu)

Details for Dev Infra team: raised by workflow job

@shuqiangzhang
Contributor Author

shuqiangzhang commented Oct 9, 2024

But hold on a sec -- would kSynchronizeBusyWaitMillis = 10 have a performance impact on barrier?

Usually, barrier is used to sync multiple ranks with much larger 'skew', e.g. in distributed scenarios different ranks may enter the barrier seconds or even minutes apart. I am actually less worried about the perf implications for barrier than about the impact on normal collectives under blocking-wait mode. We can configure it to be much smaller (e.g. 1 ms, updated in the PR) for now if it is a concern.
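
For context, a minimal sketch of the poll-and-sleep pattern being discussed; kBusyWaitInterval and the isCompleted predicate are hypothetical stand-ins, not the exact ProcessGroupNCCL members. The sleep interval bounds the extra latency a caller can observe after the work actually completes.

#include <chrono>
#include <thread>

// Hypothetical interval; the discussion above is about choosing this value.
constexpr auto kBusyWaitInterval = std::chrono::milliseconds(1);

template <typename Predicate>
void blockingWait(Predicate isCompleted) {
  while (!isCompleted()) {
    // Yield the CPU between polls; worst-case added latency is one interval.
    std::this_thread::sleep_for(kBusyWaitInterval);
  }
}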

shuqiangzhang added a commit that referenced this pull request Oct 9, 2024

ghstack-source-id: c4e4155
Pull Request resolved: #137516
shuqiangzhang added a commit that referenced this pull request Oct 9, 2024

ghstack-source-id: 46ee46a
Pull Request resolved: #137516
@kwen2501
Contributor

kwen2501 commented Oct 9, 2024

I like this PR because it makes the barrier watchable by the main thread, if a timeout is passed to wait().

But otoh, I am not sure what would be a good sleep interval for a barrier waited on without a timeout.

If we had a high-perf blocking-wait implementation, we would be in a better position to merge the two paths.

Before that, would temporarily keeping two paths be an option? For example:

synchronizeStream();

if (blockingWait_ || timeout != kNoTimeout)  // this has higher priority
  query + sleep yield;

else if (isBarrierOp_)  // this has second priority
  stream.synchronize() or event.synchronize() **

** If using event.synchronize(), the event is better created with the cudaEventBlockingSync flag. See the doc here.
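
A minimal sketch of that footnote, assuming plain CUDA runtime API calls (the stream argument and function name are hypothetical): creating the event with cudaEventBlockingSync makes cudaEventSynchronize() put the calling CPU thread to sleep instead of spin-waiting.

#include <cuda_runtime.h>

// Error handling omitted for brevity.
void barrierBlockOnEvent(cudaStream_t ncclStream) {
  cudaEvent_t ev;
  // Blocking-sync flag: the waiting CPU thread yields rather than busy-polls.
  cudaEventCreateWithFlags(&ev, cudaEventBlockingSync | cudaEventDisableTiming);
  cudaEventRecord(ev, ncclStream);  // mark the end of the barrier's GPU work
  cudaEventSynchronize(ev);         // CPU thread sleeps until the GPU reaches the event
  cudaEventDestroy(ev);
}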

@shuqiangzhang
Contributor Author

shuqiangzhang commented Oct 9, 2024

I am still not convinced that barrier performance matters at 1 ms granularity, but I am willing to trade some minor code complexity for a potential perf gain of at most 1 ms, so I changed it back to a stream sync as before.

Contributor

@kwen2501 kwen2501 left a comment

Thanks for the revision! Apologies for not being more decisive.

Comment on lines +709 to +710
// Abort NCCL communicators
abort();
Contributor

Not related to this PR: maybe when we allow users to take control of the abort (after wait(timeout) throws an exception), we should remove this line?

Contributor Author

Yes, that's the plan: we provide the tools but leave the decision to users.

Comment on lines +698 to +704
} else if (isBarrierOp_ && !isCompleted()) {
// For a barrier wait when no timeout is specified, we block the CPU thread on
// the current stream. This minimizes the CPU-side barrier wait time on the
// healthy path.
auto currentStream = at::cuda::getCurrentCUDAStream(device_.index());
// The CUDAStream wrapper will correctly use a DeviceGuard here.
currentStream.synchronize();
Contributor

Thank you for adding it back!

shuqiangzhang added a commit that referenced this pull request Oct 9, 2024

ghstack-source-id: d3f3a55
Pull Request resolved: #137516
@shuqiangzhang
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2019-cuda12.1-py3 / build

Details for Dev Infra team: raised by workflow job

@shuqiangzhang
Contributor Author

@pytorchbot merge -f "unrelated CI failure"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@github-actions github-actions bot deleted the gh/shuqiangzhang/47/head branch November 10, 2024 02:06
