[PGNCCL] Rework NCCLComm dtor to avoid clash with CUDA driver shutdown by kwen2501 · Pull Request #141511 · pytorch/pytorch · GitHub

Conversation

@kwen2501
Contributor

@kwen2501 kwen2501 commented Nov 25, 2024

Stack from ghstack (oldest at bottom):

Making CUDA or NCCL calls during object destruction can be dangerous because the CUDA context may have exited before the destructor runs, in which case the CUDA calls would see a "CUDA driver shutting down" error.

This PR takes a destroy call away from the NCCLComm dtor and does not add a new one. If users call destroy_process_group or abort_process_group as recommended, we destroy for them; otherwise, we are OK with letting them possibly leak resources (and get a warning).
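The ordering hazard and the policy this PR adopts can be sketched in a toy model. All names below are hypothetical stand-ins, not the actual PyTorch/NCCL symbols: the explicit path (analogous to destroy_process_group) calls into the driver while it is still alive, while the destructor path never touches the driver and at most warns about the leak.

```cpp
#include <cassert>

// Toy stand-in for the CUDA driver's process-exit teardown.
struct FakeDriver {
  bool alive = true;
  void shutdown() { alive = false; }
};

// Toy stand-in for NCCLComm. Real code would call ncclCommDestroy();
// here we only model *when* such a driver call is legal.
struct FakeComm {
  FakeDriver* driver;
  bool destroyed = false;
  bool warned = false;

  explicit FakeComm(FakeDriver* d) : driver(d) {}

  // Explicit path, analogous to destroy_process_group(): the caller
  // invokes it while the driver is still up, so driver calls are safe.
  void destroy() {
    assert(driver->alive && "a driver call after shutdown would fail");
    destroyed = true;
  }

  // Dtor policy after this PR: never call into the driver; if the user
  // skipped the explicit destroy, only emit a (one-time) leak warning.
  void on_dtor() {
    if (!destroyed) warned = true;  // stand-in for TORCH_WARN_ONCE
  }
};
```

The key design choice is that the destructor cannot know whether the driver is still alive, so the only universally safe action there is a warning, with actual teardown reserved for the explicit call.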

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot

pytorch-bot bot commented Nov 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141511

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 46975d0 with merge base 61dc5e9 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Nov 25, 2024
kwen2501 added a commit that referenced this pull request Nov 25, 2024
…r shutdown"

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Nov 25, 2024
kwen2501 added a commit that referenced this pull request Nov 27, 2024
kwen2501 added a commit that referenced this pull request Nov 29, 2024
kwen2501 added a commit that referenced this pull request Dec 4, 2024
@kwen2501 kwen2501 changed the title [PGNCCL] Rework destructors to avoid clash with CUDA driver shutdown [PGNCCL] Remove CUDA calls from destructors to avoid clash with CUDA driver shutdown Dec 5, 2024
@kwen2501 kwen2501 requested review from eqy, fduwjj and wconstab and removed request for wconstab December 5, 2024 20:53
// First print warning on first rank of each node
if (rank_ % localDeviceCount_ == 0) {
TORCH_WARN_ONCE(
"WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. ",
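The gating condition in the quoted hunk (warn only on the first rank of each node) can be modeled as a small predicate, assuming `localDeviceCount_` is the number of ranks per node:

```cpp
#include <cassert>

// Model of the hunk's condition: with localDeviceCount ranks per node,
// global ranks 0, localDeviceCount, 2*localDeviceCount, ... are each
// node's first rank, so only they print the leak warning.
bool warnsOnThisRank(int rank, int localDeviceCount) {
  return rank % localDeviceCount == 0;
}
```

This keeps the warning to one line per node instead of one per rank.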
Contributor

nit: maybe we should shorten this message too. I think the second half is not needed. We could also point users to a doc page that has more detail if we need the detail. I'd like it better if it were more like

"WARNING: destroy_process_group() was not called before shutdown, which can leak resources. For more info, see ..."

Contributor Author

Opened #142297 for messaging improvement.

@kwen2501 kwen2501 changed the title [PGNCCL] Remove CUDA calls from destructors to avoid clash with CUDA driver shutdown [PGNCCL] Rework NCCLComm dtor to avoid clash with CUDA driver shutdown Dec 7, 2024
@kwen2501 kwen2501 requested review from c-p-i-o and wconstab December 7, 2024 06:32
#else
C10D_NCCL_ASSERT(::ncclCommDestroy(ncclComm_));
#endif
TORCH_WARN_ONCE(
Contributor

i like the direction of doing a warning here.

but just so I'm clear: this PR does take a destroy call away and doesn't add a new one. Is the basis for this that, if users are calling destroy_process_group or abort_process_group as recommended, then we are destroying for them, and otherwise we are OK with letting them possibly leak resources (and get a warning)?

Contributor Author

Yes, that's right. Oh sorry, I forgot to fill out the PR description. Adding it now.

@kwen2501
Contributor Author

kwen2501 commented Dec 9, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 9, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Dec 10, 2024
And removed some unnecessary conditions for calling `thread.join()` -- `thread.joinable()` should have covered it.

Pull Request resolved: #142297
Approved by: https://github.com/wconstab
ghstack dependencies: #141510, #141511
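The `thread.joinable()` point in the commit above can be illustrated with plain `std::thread` (the helper name below is made up): `joinable()` is false for a default-constructed, already-joined, or moved-from thread, so it alone decides whether `join()` is needed, making extra bookkeeping conditions around the join redundant.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Join only when legal; safe to call on never-started or
// already-joined threads, so no extra flags are required.
void joinIfNeeded(std::thread& t) {
  if (t.joinable()) {
    t.join();
  }
}
```

Calling `join()` on a non-joinable thread is undefined behavior (it throws `std::system_error` in practice), which is why the check must stay, and why it is the only check needed.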
@github-actions github-actions bot deleted the gh/kwen2501/105/head branch January 9, 2025 02:21
