-
Notifications
You must be signed in to change notification settings - Fork 25.7k
[PGNCCL] Rework NCCLComm dtor to avoid clash with CUDA driver shutdown #141511
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141511
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ✅ No FailuresAs of commit 46975d0 with merge base 61dc5e9 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
…r shutdown" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…r shutdown" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…r shutdown" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…r shutdown" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…r shutdown" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
// First print warning on first rank of each node | ||
if (rank_ % localDeviceCount_ == 0) { | ||
TORCH_WARN_ONCE( | ||
"WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. ", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe we should shorten this message too. I think the second half is not needed. We could also poitn users to a docpage that has more detail if we need the detail. I'd like it better if it were more like
"WARNING: destroy_process_group() was not called before shutdown, which can leak resources. For more info, see ..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #142297 for messaging improvement.
… with CUDA driver shutdown" cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
#else | ||
C10D_NCCL_ASSERT(::ncclCommDestroy(ncclComm_)); | ||
#endif | ||
TORCH_WARN_ONCE( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i like the direction of doing a warning here.
but just so i'm clear- this PR does take a destroy call away, and doesn't add a new one. Is the basis for this that if users are calling destroy_process_group or abort_process_group as recommended, then we are destroying for them, and otherwise we are OK with letting them possibly leak resources (and get a warning)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's right. Oh sorry, I missed to fill out the PR description. Adding it now.
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
And removed some unnecessary conditions for calling `thread.join()` -- `thread.joinable()` should have covered it. Pull Request resolved: #142297 Approved by: https://github.com/wconstab ghstack dependencies: #141510, #141511
Stack from ghstack (oldest at bottom):
Making CUDA or NCCL calls in object destruction can be dangerous because CUDA context may have exited before the the destructor, in which case, the CUDA calls would see a "CUDA driver shutting down" error.
this PR does take a destroy call away from NCCLComm dtor, and doesn't add a new one. If users are calling destroy_process_group or abort_process_group as recommended, then we are destroying for them, and otherwise we are OK with letting them possibly leak resources (and get a warning).
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o