[PGNCCL] Use long/short wait for different non-blocking calls #142291

kwen2501 · 2024-12-07T02:27:29Z

Stack from ghstack (oldest at bottom):

-> [PGNCCL] Use long/short wait for different non-blocking calls #142291

In nonblocking mode, we always check whether the NCCL communicator is ready between issuing commands to it.
Today this is done by the waitReady() function.
Unfortunately, waitReady() is tied to C10D_NCCL_CHECK_TIMEOUT_SLEEP, so it sleeps for that interval between two consecutive checks.
While this is fine when waiting for comm init or finalize, it degrades the performance of collective calls (which would almost certainly return success immediately).

This PR adds a bool longInterval argument to waitReady and lets the call site determine whether a long wait is likely; if not, waitReady uses sched_yield() to check for readiness more eagerly.
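
To make the long/short wait distinction concrete, here is a minimal sketch of a polling loop that takes a longInterval flag. It is not the code from this PR: `CommStatus`, `waitReadySketch`, the `poll` callback, and the default sleep and timeout values are all hypothetical stand-ins chosen for illustration, and the real communicator state query is replaced by a caller-supplied function.

```cpp
#include <sched.h>  // sched_yield (POSIX)

#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Hypothetical stand-in for the communicator's async state; not an NCCL type.
enum class CommStatus { InProgress, Success, Error };

// Hypothetical helper: polls `poll` until it stops returning InProgress.
// With longInterval == true it sleeps between checks (suited to init or
// finalize); otherwise it only yields the CPU and re-checks right away.
// Returns the number of polls performed.
inline int waitReadySketch(
    const std::function<CommStatus()>& poll,
    bool longInterval,
    std::chrono::milliseconds sleepInterval = std::chrono::milliseconds(10),
    std::chrono::milliseconds timeout = std::chrono::milliseconds(5000)) {
  const auto start = std::chrono::steady_clock::now();
  int polls = 0;
  while (true) {
    ++polls;
    const CommStatus status = poll();
    if (status == CommStatus::Success) {
      return polls;
    }
    if (status == CommStatus::Error) {
      throw std::runtime_error("communicator reported an error");
    }
    if (std::chrono::steady_clock::now() - start > timeout) {
      throw std::runtime_error("timed out waiting for communicator");
    }
    if (longInterval) {
      // Long wait expected: back off so we do not burn CPU.
      std::this_thread::sleep_for(sleepInterval);
    } else {
      // Short wait expected: give up the time slice, then check again.
      sched_yield();
    }
  }
}
```

In this sketch, a call site that expects the operation to finish almost immediately (for example, a collective) would pass `false`, while one that is waiting on initialization or finalization would pass `true`.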

Thanks @eqy for reporting that small collectives take a performance hit in nonblocking mode.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

[ghstack-poisoned]

pytorch-bot · 2024-12-07T02:27:32Z


✅ No Failures

As of commit 0265eac with merge base 0610b97 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 0b35f05 Pull Request resolved: #142291

eqy

approving but the failures look real


ghstack-source-id: 0968ae4 Pull Request resolved: #142291

kwen2501 · 2024-12-07T05:08:39Z

Thanks @eqy, yeah, just fixed them

kwen2501 · 2024-12-07T08:04:39Z

@pytorchbot merge

pytorchmergebot · 2024-12-07T08:06:20Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


pytorchmergebot · 2024-12-07T08:17:09Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 1, 3, lf.linux.g4dn.12xlarge.nvidia.gpu)


Failing merge rule: Core Maintainers

fduwjj · 2024-12-09T16:40:03Z

torch/csrc/distributed/c10d/NCCLUtils.cpp

-    waitReady();
+    // Wait with long interval if communicator is being initialized.
+    bool longInterval = !initialized_;
+    waitReady(longInterval);


can you maybe directly pass in (!initialized_)?
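
For comparison, the two call-site shapes under discussion can be sketched side by side. `CommSketch` and its members are hypothetical stand-ins, not the class from NCCLUtils.cpp; only the shape of the `waitReady` call mirrors the diff above.

```cpp
// Hypothetical stand-in class; only the call-site shape mirrors the diff.
struct CommSketch {
  bool initialized_ = false;
  bool lastLongInterval = false;  // records the argument, for illustration

  void waitReady(bool longInterval) { lastLongInterval = longInterval; }

  // Form in the diff: a named local documents the intent.
  void callWithNamedLocal() {
    bool longInterval = !initialized_;
    waitReady(longInterval);
  }

  // Suggested form: pass the expression directly; an inline comment
  // can still name the parameter.
  void callDirect() { waitReady(/*longInterval=*/!initialized_); }
};
```

Both forms pass the same value; the choice is between a named local and a shorter call with the parameter name kept in a comment.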


ghstack-source-id: a96acb6 Pull Request resolved: #142291

kwen2501 · 2024-12-09T20:04:08Z

@pytorchbot merge

pytorchmergebot · 2024-12-09T20:06:02Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@eqy

…h#142291) In nonblocking mode, we always check if the NCCL communicator is ready between issuing commands to it. Today this is done by the `waitReady()` function. Unfortunately, the `waitReady()` function is burned with `C10D_NCCL_CHECK_TIMEOUT_SLEEP` which would sleep for an interval between two consecutive checks. While this is nice when waiting for comm init or finalize, it degrades performance of collective calls (which would almost certainly return success immediately.) This PR adds a `bool longInterval` argument to `waitReady` and let call site determine whether long wait is likely; if not, `waitReady` would use `sched_yield()` to more eagerly check for readiness. Thanks @eqy for reporting an issue that small collectives has perf impact in nonblocking mode. Pull Request resolved: pytorch#142291 Approved by: https://github.com/eqy, https://github.com/fduwjj

[PGNCCL] Use long/short wait with different non-blocking calls

602434d

[ghstack-poisoned]

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Dec 7, 2024

kwen2501 added a commit that referenced this pull request Dec 7, 2024

[PGNCCL] Use long/short wait with different non-blocking calls

1b3f913

ghstack-source-id: 0b35f05 Pull Request resolved: #142291

kwen2501 requested review from eqy, fduwjj and wconstab December 7, 2024 02:27

kwen2501 added the topic: bug fixes label Dec 7, 2024

eqy approved these changes Dec 7, 2024


kwen2501 added a commit that referenced this pull request Dec 7, 2024

[PGNCCL] Use long/short wait with different non-blocking calls

befa309

ghstack-source-id: 0968ae4 Pull Request resolved: #142291

pytorch-bot bot added the ciflow/trunk label Dec 7, 2024

pytorchmergebot added the merging label Dec 7, 2024

pytorchmergebot removed the merging label Dec 7, 2024

fduwjj approved these changes Dec 9, 2024


kwen2501 added a commit that referenced this pull request Dec 9, 2024

[PGNCCL] Use long/short wait with different non-blocking calls

fe38568

ghstack-source-id: a96acb6 Pull Request resolved: #142291

kwen2501 added the keep-going label Dec 9, 2024

kwen2501 changed the title ~~[PGNCCL] Use long/short wait with different non-blocking calls~~ [PGNCCL] Use long/short wait for different non-blocking calls Dec 9, 2024

pytorchmergebot added the merging label Dec 9, 2024

pytorchmergebot added the Merged label Dec 9, 2024

pytorchmergebot closed this in cca33d5 Dec 9, 2024

pytorchmergebot removed the merging label Dec 9, 2024

github-actions bot deleted the gh/kwen2501/113/head branch January 9, 2025 02:21


4 participants
