[c10d] simplify barrier implementation and further decouple CPU/GPU by shuqiangzhang · Pull Request #137516 · pytorch/pytorch · GitHub

Conversation

@shuqiangzhang
Contributor

@shuqiangzhang shuqiangzhang commented Oct 8, 2024

Stack from ghstack (oldest at bottom):

Summary:
Barrier is essentially intended to block the CPU thread (rather than GPU streams). Previously we used two stream synchronizations: (1) the current stream was blocked by an event recorded at the end of the NCCL stream, and (2) the CPU thread was blocked on the current stream. This is unnecessary, since the CPU-thread-blocking logic already exists in wait(). In addition, the barrier-specific code block inside the general GPU synchronize() API was intrusive and confusing.

This PR cleans that up.

Test Plan:
CI

Tags:

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
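
For illustration only, here is a minimal C++ sketch of the two approaches described in the summary. It is not the actual ProcessGroupNCCL code; ncclEndEvent, barrierWaitBefore, and barrierWaitAfter are hypothetical names.

// Hypothetical sketch of the "before" and "after" behavior described above;
// not the actual ProcessGroupNCCL implementation.
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/CUDAEvent.h>

// Before: two stream synchronizations.
void barrierWaitBefore(at::cuda::CUDAEvent& ncclEndEvent, c10::DeviceIndex dev) {
  auto currentStream = at::cuda::getCurrentCUDAStream(dev);
  // (1) the current stream waits on the event recorded at the end of the NCCL stream
  ncclEndEvent.block(currentStream);
  // (2) the CPU thread blocks on the whole current stream
  currentStream.synchronize();
}

// After: the CPU thread blocks via the logic already in wait(), e.g. by
// synchronizing on (or polling) the work's end event directly, with no
// barrier-specific branch inside the general GPU synchronize() path.
void barrierWaitAfter(at::cuda::CUDAEvent& ncclEndEvent) {
  ncclEndEvent.synchronize();
}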

@pytorch-bot

pytorch-bot bot commented Oct 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137516

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 36668c7 with merge base 9b2e453:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Oct 8, 2024
Contributor

@fduwjj fduwjj left a comment

LGTM

// The NCCL communicator used for this work item.
std::shared_ptr<NCCLComm> ncclComm_;

// Tensors used for barrier op
Contributor

nit: maybe also update the comment?

Contributor Author

Good point!

@shuqiangzhang
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Oct 8, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Contributor

@kwen2501 kwen2501 left a comment

LGTM.

Contributor

@kwen2501 kwen2501 left a comment

But hold on a sec -- would kSynchronizeBusyWaitMillis = 10 have a performance impact on barrier?

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-cuda12.4-py3.10-gcc9-experimental-split-build-test / test (default, 4, 5, linux.4xlarge.nvidia.gpu)

Details for Dev Infra team: raised by workflow job

@shuqiangzhang
Contributor Author

shuqiangzhang commented Oct 9, 2024

But hold on a sec -- would kSynchronizeBusyWaitMillis = 10 have a performance impact on barrier?

Usually, barrier is used to sync multiple ranks with much larger 'skew', e.g. in distributed scenarios different ranks may enter the barrier seconds or even minutes apart. I am actually less worried about the perf implications for barrier than about the impact on normal collectives under blocking-wait mode. We can configure it to be much smaller (e.g. 1 ms, updated in the PR) for now if it is a concern.
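
For context, a minimal sketch of the poll-and-sleep pattern being discussed; kBusyWaitInterval and the isCompleted predicate are hypothetical stand-ins, not the exact ProcessGroupNCCL members. The sleep interval bounds the extra latency a caller can observe after the work actually completes.

#include <chrono>
#include <thread>

// Hypothetical interval; the discussion above is about choosing this value.
constexpr auto kBusyWaitInterval = std::chrono::milliseconds(1);

template <typename Predicate>
void blockingWait(Predicate isCompleted) {
  while (!isCompleted()) {
    // Yield the CPU between polls; worst-case added latency is one interval.
    std::this_thread::sleep_for(kBusyWaitInterval);
  }
}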

shuqiangzhang added a commit that referenced this pull request Oct 9, 2024

ghstack-source-id: c4e4155
Pull Request resolved: #137516
shuqiangzhang added a commit that referenced this pull request Oct 9, 2024

ghstack-source-id: 46ee46a
Pull Request resolved: #137516
@kwen2501
Contributor

kwen2501 commented Oct 9, 2024

I like this PR because it makes the barrier watchable by the main thread, if a timeout is passed to wait().

But otoh, I am not sure what would be a good sleep interval for a barrier waited on without a timeout.

If we had a high-perf blocking-wait implementation, we would be in a better position to merge the two paths.

Before that, would temporarily keeping two paths be an option? For example:

synchronizeStream();

if (blockingWait_ || timeout != kNoTimeout)  // this has higher priority
  query + sleep yield;

else if (isBarrierOp_)  // this has second priority
  stream.synchronize() or event.synchronize() **

** If using event.synchronize(), the event is better created with the cudaEventBlockingSync flag. See the doc here.
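
A minimal sketch of that footnote, assuming plain CUDA runtime API calls (the stream argument and function name are hypothetical): creating the event with cudaEventBlockingSync makes cudaEventSynchronize() put the calling CPU thread to sleep instead of spin-waiting.

#include <cuda_runtime.h>

// Error handling omitted for brevity.
void barrierBlockOnEvent(cudaStream_t ncclStream) {
  cudaEvent_t ev;
  // Blocking-sync flag: the waiting CPU thread yields rather than busy-polls.
  cudaEventCreateWithFlags(&ev, cudaEventBlockingSync | cudaEventDisableTiming);
  cudaEventRecord(ev, ncclStream);  // mark the end of the barrier's GPU work
  cudaEventSynchronize(ev);         // CPU thread sleeps until the GPU reaches the event
  cudaEventDestroy(ev);
}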

@shuqiangzhang
Contributor Author

shuqiangzhang commented Oct 9, 2024

I am still not convinced that barrier performance matters at 1 ms granularity, but I am willing to trade some minor code complexity for a potential perf gain of at most 1 ms, so I changed it back to a stream sync as before.

Contributor

@kwen2501 kwen2501 left a comment

Thanks for the revision! Apologies for not being more decisive.

Comment on lines +709 to +710
// Abort NCCL communicators
abort();
Contributor

Not related to this PR: maybe when we allow users to take control of the abort (after wait(timeout) throws an exception), we should remove this line?

Contributor Author

Yes, that's the plan: we provide the tools but leave the decision to users.

Comment on lines +698 to +704
} else if (isBarrierOp_ && !isCompleted()) {
// For a barrier wait when no timeout is specified, we block the CPU thread on
// the current stream. This minimizes the CPU-side barrier wait time on the
// healthy path.
auto currentStream = at::cuda::getCurrentCUDAStream(device_.index());
// The CUDAStream wrapper will correctly use a DeviceGuard here.
currentStream.synchronize();
Contributor

Thank you for adding it back!

shuqiangzhang added a commit that referenced this pull request Oct 9, 2024

ghstack-source-id: d3f3a55
Pull Request resolved: #137516
@shuqiangzhang
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / win-vs2019-cuda12.1-py3 / build

Details for Dev Infra team: raised by workflow job

@shuqiangzhang
Contributor Author

@pytorchbot merge -f "unrelated CI failure"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@github-actions github-actions bot deleted the gh/shuqiangzhang/47/head branch November 10, 2024 02:06
