KEMBAR78
Improve messaging of ProcessGroupNCCL destructor by kwen2501 · Pull Request #142297 · pytorch/pytorch · GitHub
Skip to content

Conversation

@kwen2501
Copy link
Contributor

@kwen2501 kwen2501 commented Dec 7, 2024

@pytorch-bot
Copy link

pytorch-bot bot commented Dec 7, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142297

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

✅ No Failures

As of commit db787ce with merge base 61dc5e9 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Dec 7, 2024
if (ncclHeartbeatMonitorThread_.joinable()) {
ncclHeartbeatMonitorThread_.join();
VLOG(2) << logPrefix()
if (ncclCommWatchdogThread_.joinable()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats the reasoning for removing the conditional on !blockingWait_?

Copy link
Contributor Author

@kwen2501 kwen2501 Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If blockingWait_ is true, we would not have spawned the side threads, and joinable() should return false.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm. for all the threads?

Copy link
Contributor Author

@kwen2501 kwen2501 Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a general thing. If a std::thread is just declared but not spawned, it should not be joinable(). We declared all these threads in ProcessGroupNCCL.hpp.

// Here we route to `abort()` explicitly to maintain the old behavior, until
// we fix everything.
abort();
if (terminateProcessGroup_.load())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iiuc this is supposed to be the same logic as above? (no behavior change?)

Copy link
Contributor Author

@kwen2501 kwen2501 Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, no behavior change. Just removing condition nesting.


// Note: we have rewritten `shutdown` to represent the destroy behavior
// (blocking). Thus we route to `abort()` explicitly here to maintain the
// old behavior. We should re-validate the choice and document the rationale
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this comment more explicit? ("the old behavior", "have rewritten shutdown" both a bit vague)

what is the TODO part, do we want to remove this abort or do we want to switch this abort to a shutdown? and why..

And removed some unnecessary conditions for calling `thread.join()` -- `thread.joinable()` should have covered it.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Dec 9, 2024
ghstack-source-id: 7eba555
Pull Request resolved: #142297
@kwen2501
Copy link
Contributor Author

kwen2501 commented Dec 9, 2024

@pytorchbot merge -f "CI was green; just changing comments"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@github-actions github-actions bot deleted the gh/kwen2501/114/head branch January 11, 2025 02:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants