[PGNCCL] Rework NCCLComm dtor to avoid clash with CUDA driver shutdown by kwen2501 · Pull Request #141511 · pytorch/pytorch · GitHub

Conversation

@kwen2501
Contributor

@kwen2501 kwen2501 commented Nov 25, 2024

Stack from ghstack (oldest at bottom):

Making CUDA or NCCL calls during object destruction can be dangerous because the CUDA context may have exited before the destructor runs, in which case the CUDA calls would see a "CUDA driver shutting down" error.

This PR takes a destroy call away from the NCCLComm dtor and does not add a new one. If users call destroy_process_group or abort_process_group as recommended, we destroy for them; otherwise, we are OK with letting them possibly leak resources (and get a warning).
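The ordering hazard and the policy this PR adopts can be sketched in a toy model. All names below are hypothetical stand-ins, not the actual PyTorch/NCCL symbols: the explicit path (analogous to destroy_process_group) calls into the driver while it is still alive, while the destructor path never touches the driver and at most warns about the leak.

```cpp
#include <cassert>

// Toy stand-in for the CUDA driver's process-exit teardown.
struct FakeDriver {
  bool alive = true;
  void shutdown() { alive = false; }
};

// Toy stand-in for NCCLComm. Real code would call ncclCommDestroy();
// here we only model *when* such a driver call is legal.
struct FakeComm {
  FakeDriver* driver;
  bool destroyed = false;
  bool warned = false;

  explicit FakeComm(FakeDriver* d) : driver(d) {}

  // Explicit path, analogous to destroy_process_group(): the caller
  // invokes it while the driver is still up, so driver calls are safe.
  void destroy() {
    assert(driver->alive && "a driver call after shutdown would fail");
    destroyed = true;
  }

  // Dtor policy after this PR: never call into the driver; if the user
  // skipped the explicit destroy, only emit a (one-time) leak warning.
  void on_dtor() {
    if (!destroyed) warned = true;  // stand-in for TORCH_WARN_ONCE
  }
};
```

The key design choice is that the destructor cannot know whether the driver is still alive, so the only universally safe action there is a warning, with actual teardown reserved for the explicit call.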

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot

pytorch-bot bot commented Nov 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141511

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 46975d0 with merge base 61dc5e9 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Nov 25, 2024
kwen2501 added a commit that referenced this pull request Nov 25, 2024
…r shutdown"

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Nov 25, 2024
kwen2501 added a commit that referenced this pull request Nov 27, 2024
kwen2501 added a commit that referenced this pull request Nov 29, 2024
kwen2501 added a commit that referenced this pull request Dec 4, 2024
@kwen2501 kwen2501 changed the title [PGNCCL] Rework destructors to avoid clash with CUDA driver shutdown [PGNCCL] Remove CUDA calls from destructors to avoid clash with CUDA driver shutdown Dec 5, 2024
@kwen2501 kwen2501 requested review from eqy, fduwjj and wconstab and removed request for wconstab December 5, 2024 20:53
// First print warning on first rank of each node
if (rank_ % localDeviceCount_ == 0) {
TORCH_WARN_ONCE(
"WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. ",
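The gating condition in the quoted hunk (warn only on the first rank of each node) can be modeled as a small predicate, assuming `localDeviceCount_` is the number of ranks per node:

```cpp
#include <cassert>

// Model of the hunk's condition: with localDeviceCount ranks per node,
// global ranks 0, localDeviceCount, 2*localDeviceCount, ... are each
// node's first rank, so only they print the leak warning.
bool warnsOnThisRank(int rank, int localDeviceCount) {
  return rank % localDeviceCount == 0;
}
```

This keeps the warning to one line per node instead of one per rank.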
Contributor

nit: maybe we should shorten this message too. I think the second half is not needed. We could also point users to a doc page that has more detail if we need the detail. I'd like it better if it were more like

"WARNING: destroy_process_group() was not called before shutdown, which can leak resources. For more info, see ..."

Contributor Author

Opened #142297 for messaging improvement.

@kwen2501 kwen2501 changed the title [PGNCCL] Remove CUDA calls from destructors to avoid clash with CUDA driver shutdown [PGNCCL] Rework NCCLComm dtor to avoid clash with CUDA driver shutdown Dec 7, 2024
@kwen2501 kwen2501 requested review from c-p-i-o and wconstab December 7, 2024 06:32
#else
C10D_NCCL_ASSERT(::ncclCommDestroy(ncclComm_));
#endif
TORCH_WARN_ONCE(
Contributor

i like the direction of doing a warning here.

but just so I'm clear: this PR does take a destroy call away and doesn't add a new one. Is the basis for this that, if users are calling destroy_process_group or abort_process_group as recommended, then we are destroying for them, and otherwise we are OK with letting them possibly leak resources (and get a warning)?

Contributor Author

Yes, that's right. Oh sorry, I forgot to fill out the PR description. Adding it now.

@kwen2501
Contributor Author

kwen2501 commented Dec 9, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 9, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Dec 10, 2024
And removed some unnecessary conditions for calling `thread.join()` -- `thread.joinable()` should have covered it.

Pull Request resolved: #142297
Approved by: https://github.com/wconstab
ghstack dependencies: #141510, #141511
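The `thread.joinable()` point in the commit above can be illustrated with plain `std::thread` (the helper name below is made up): `joinable()` is false for a default-constructed, already-joined, or moved-from thread, so it alone decides whether `join()` is needed, making extra bookkeeping conditions around the join redundant.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Join only when legal; safe to call on never-started or
// already-joined threads, so no extra flags are required.
void joinIfNeeded(std::thread& t) {
  if (t.joinable()) {
    t.join();
  }
}
```

Calling `join()` on a non-joinable thread is undefined behavior (it throws `std::system_error` in practice), which is why the check must stay, and why it is the only check needed.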
@github-actions github-actions bot deleted the gh/kwen2501/105/head branch January 9, 2025 02:21
