[PGNCCL] Use long/short wait for different non-blocking calls by kwen2501 · Pull Request #142291 · pytorch/pytorch · GitHub

@kwen2501 kwen2501 commented Dec 7, 2024

Stack from ghstack (oldest at bottom):

In nonblocking mode, we always check whether the NCCL communicator is ready between issuing commands to it.
Today this is done by the waitReady() function.
Unfortunately, waitReady() is burdened with C10D_NCCL_CHECK_TIMEOUT_SLEEP, which makes it sleep for an interval between two consecutive checks.
While this is fine when waiting for comm init or finalize, it degrades the performance of collective calls (which would almost certainly return success immediately).

This PR adds a bool longInterval argument to waitReady and lets the call site determine whether a long wait is likely; if not, waitReady uses sched_yield() to check for readiness more eagerly.
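
The difference between the two wait modes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PyTorch implementation: `waitReadySketch`, `isReady`, and `kLongPollInterval` (with its 10 ms value) are hypothetical names chosen for the example, and `std::this_thread::yield()` stands in for `sched_yield()` for portability.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for C10D_NCCL_CHECK_TIMEOUT_SLEEP; the value is an assumption.
constexpr std::chrono::milliseconds kLongPollInterval{10};

// Polls `isReady` until it returns true or `timeout` elapses.
// Returns true if the condition became ready, false on timeout.
bool waitReadySketch(
    const std::function<bool()>& isReady,
    bool longInterval,
    std::chrono::milliseconds timeout) {
  assert(isReady && "isReady must be a callable predicate");
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!isReady()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;
    }
    if (longInterval) {
      // Long wait expected (e.g. init or finalize): sleep between checks.
      std::this_thread::sleep_for(kLongPollInterval);
    } else {
      // Short wait expected (e.g. a collective call): give up the time
      // slice and re-check as soon as the thread is scheduled again.
      std::this_thread::yield();
    }
  }
  return true;
}
```

In this sketch a call that is ready on the first check returns without sleeping or yielding in either mode; the modes differ only in how long the thread stays away between unsuccessful checks.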

Thanks @eqy for reporting that small collectives suffer a perf impact in nonblocking mode.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o


pytorch-bot bot commented Dec 7, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142291

✅ No Failures

As of commit 0265eac with merge base 0610b97 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Dec 7, 2024
kwen2501 added a commit that referenced this pull request Dec 7, 2024
@kwen2501 kwen2501 requested review from eqy, fduwjj and wconstab December 7, 2024 02:27
@kwen2501 kwen2501 added the topic: bug fixes label Dec 7, 2024

@eqy eqy left a comment

approving but the failures look real

kwen2501 added a commit that referenced this pull request Dec 7, 2024

kwen2501 commented Dec 7, 2024

Thanks @eqy, yeah, just fixed them.


kwen2501 commented Dec 7, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Dec 7, 2024
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Failing merge rule: Core Maintainers

- waitReady();
+ // Wait with long interval if communicator is being initialized.
+ bool longInterval = !initialized_;
+ waitReady(longInterval);

can you maybe directly pass in (!initialized_)?
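
For illustration, the form suggested here might look like the sketch below. `CommSketch`, `issueCommand`, and the `waits` log are hypothetical and exist only to make the example self-contained; flipping `initialized_` after the first command is an assumption of the sketch, not the behavior of the PyTorch class.

```cpp
#include <vector>

// Hypothetical stand-in for the communicator wrapper; not the PyTorch class.
struct CommSketch {
  bool initialized_ = false;
  std::vector<bool> waits;  // records each longInterval argument, for illustration

  // Placeholder for the real wait; it only logs which mode was requested.
  void waitReady(bool longInterval) { waits.push_back(longInterval); }

  void issueCommand() {
    // Pass the flag directly: use the long interval only while the
    // communicator is still being initialized.
    waitReady(/*longInterval=*/!initialized_);
    initialized_ = true;  // assumption: the first command completes init
  }
};
```

The inline argument comment keeps the intent readable at the call site without the extra local variable.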

kwen2501 added a commit that referenced this pull request Dec 9, 2024
@kwen2501 kwen2501 added the keep-going label Dec 9, 2024
@kwen2501 kwen2501 changed the title [PGNCCL] Use long/short wait with different non-blocking calls [PGNCCL] Use long/short wait for different non-blocking calls Dec 9, 2024

kwen2501 commented Dec 9, 2024

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

bluenote10 pushed a commit to bluenote10/pytorch that referenced this pull request Dec 14, 2024
…h#142291)

Pull Request resolved: pytorch#142291
Approved by: https://github.com/eqy, https://github.com/fduwjj
@github-actions github-actions bot deleted the gh/kwen2501/113/head branch January 9, 2025 02:21
Labels

ciflow/trunk, keep-going, Merged, oncall: distributed, release notes: distributed (c10d), topic: bug fixes

4 participants