-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Improve messaging of ProcessGroupNCCL destructor #142297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142297
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ✅ No FailuresAs of commit db787ce with merge base 61dc5e9 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
if (ncclHeartbeatMonitorThread_.joinable()) { | ||
ncclHeartbeatMonitorThread_.join(); | ||
VLOG(2) << logPrefix() | ||
if (ncclCommWatchdogThread_.joinable()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whats the reasoning for removing the conditional on !blockingWait_
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If blockingWait_ is true, we would not have spawned the side threads, and joinable()
should return false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm. for all the threads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is a general thing. If a std::thread
is just declared but not spawned, it should not be joinable()
. We declared all these threads in ProcessGroupNCCL.hpp.
// Here we route to `abort()` explicitly to maintain the old behavior, until | ||
// we fix everything. | ||
abort(); | ||
if (terminateProcessGroup_.load()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iiuc this is supposed to be the same logic as above? (no behavior change?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, no behavior change. Just removing condition nesting.
|
||
// Note: we have rewritten `shutdown` to represent the destroy behavior | ||
// (blocking). Thus we route to `abort()` explicitly here to maintain the | ||
// old behavior. We should re-validate the choice and document the rationale |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this comment more explicit? ("the old behavior", "have rewritten shutdown" both a bit vague)
what is the TODO part, do we want to remove this abort or do we want to switch this abort to a shutdown? and why..
And removed some unnecessary conditions for calling `thread.join()` -- `thread.joinable()` should have covered it. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
@pytorchbot merge -f "CI was green; just changing comments" |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Stack from ghstack (oldest at bottom):
And removed some unnecessary conditions for calling
thread.join()
--thread.joinable()
should have covered it.cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o