[c10d] ProcessGroupGloo: support per operation timeouts #158128
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158128
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures)
As of commit 2862077 with merge base b4476ca:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Looks good! Were there any changes needed on the Gloo side? Or was this PR mostly to make sure the plumbing to gloo is correct?
test/distributed/test_c10d_gloo.py (Outdated)
maybe add a test case where the collective timeout takes precedence over the pg timeout that's set?
done -- I did some cleanups and exposed `set_timeout` on ProcessGroup/Backend so we could test the default operation timeout without causing issues with a short timeout during init
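For illustration, here is a minimal sketch of the kind of test being discussed, assuming the per-operation timeout is attached via `AllreduceOptions.timeout` and that the newly exposed `set_timeout` is callable from Python as shown; the exact names may differ from the merged code:

```python
# Sketch of a test where a short per-op timeout takes precedence over a long
# PG timeout. Names such as pg.set_timeout and AllreduceOptions.timeout follow
# the discussion above and are assumptions, not the verified merged API.
import datetime

import torch
import torch.distributed as dist


def collective_timeout_beats_pg_timeout(pg: dist.ProcessGroup, rank: int) -> None:
    # Give the process group a deliberately generous default timeout.
    pg.set_timeout(datetime.timedelta(minutes=5))

    opts = dist.AllreduceOptions()
    opts.timeout = datetime.timedelta(milliseconds=50)  # much shorter per-op timeout

    if rank == 0:
        # Rank 0 never joins the collective, so the other ranks should fail
        # after ~50 ms (the per-op timeout), not after the 5 minute PG timeout.
        return

    try:
        pg.allreduce([torch.ones(1)], opts).wait()
    except RuntimeError as exc:
        print(f"rank {rank} timed out as expected: {exc}")
```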
where is the timeout from context_->getTimeout() coming from?
This is coming from the overall PG timeout that's set when the PG is created. There's a global timeout on the gloo Context.
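In other words, each collective picks its effective timeout by falling back to the context-wide value when no per-operation timeout was supplied. A Python paraphrase of that selection logic (the real code lives in C++ inside ProcessGroupGloo; the sentinel value here is assumed to mirror `kUnsetTimeout`):

```python
import datetime

# Assumed sentinel meaning "no per-op timeout was provided",
# mirroring c10d's kUnsetTimeout.
UNSET_TIMEOUT = datetime.timedelta(milliseconds=-1)


def effective_timeout(
    op_timeout: datetime.timedelta,
    context_timeout: datetime.timedelta,
) -> datetime.timedelta:
    # Prefer the timeout passed with the collective's options; otherwise fall
    # back to the PG-wide timeout stored on the gloo context, i.e. what
    # context_->getTimeout() returns.
    return op_timeout if op_timeout != UNSET_TIMEOUT else context_timeout


assert effective_timeout(UNSET_TIMEOUT, datetime.timedelta(seconds=60)) == datetime.timedelta(seconds=60)
assert effective_timeout(datetime.timedelta(seconds=5), datetime.timedelta(seconds=60)) == datetime.timedelta(seconds=5)
```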
@H-Huang no Gloo-side changes required, just a matter of plumbing things correctly. I was surprised to see that this wasn't plumbed correctly before.
Force-pushed 1a20ccb to ff16b99
Makes sense, LGTM
Force-pushed ff16b99 to 8814ea0
Force-pushed 8814ea0 to 2862077
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 mandatory check(s) failed. The first few are:
Dig deeper by viewing the failures on hud.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This updates ProcessGroupGloo to support per operation timeouts. Previously the timeouts were ignored even if they were set.
- Defaults the per-operation timeout to `kUnsetTimeout` and conditionally uses the provided timeout or the default timeout from the context.
- Exposes `set_timeout` as a standard method on ProcessGroup/Backend so we can test the global timeout.

Test plan:
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab
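For reference, a minimal runnable sketch (single process, gloo backend) showing where the two timeouts enter: the PG-wide timeout passed to `init_process_group`, and a per-operation timeout attached to one collective's options. The `AllreduceOptions.timeout` field and its acceptance of a `timedelta` are assumptions based on the description above, not a guaranteed spelling of the merged API:

```python
# Single-process gloo example contrasting the PG-wide timeout with a per-op
# timeout. The per-op field (AllreduceOptions.timeout) is assumed from this
# PR's description; exact spelling may differ.
import datetime
import tempfile

import torch
import torch.distributed as dist


def main() -> None:
    # The PG-wide timeout set at creation; this becomes the gloo context's
    # global timeout (what context_->getTimeout() returns).
    rendezvous_file = tempfile.NamedTemporaryFile(delete=False)
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{rendezvous_file.name}",
        rank=0,
        world_size=1,
        timeout=datetime.timedelta(seconds=60),
    )

    t = torch.ones(4)

    # Default path: no per-op timeout, so the 60 s PG timeout applies.
    dist.all_reduce(t)

    # Per-op path: a timeout on a single collective, which this PR makes
    # ProcessGroupGloo actually honor instead of ignoring.
    pg = dist.group.WORLD
    opts = dist.AllreduceOptions()
    opts.timeout = datetime.timedelta(seconds=5)
    pg.allreduce([t], opts).wait()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```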