[PGNCCL] Record device index for GPU guarding during NCCLComm method calls #141270
Conversation
🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141270

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures as of commit 90b8360 with merge base 740d1eb.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
```cpp
LOG(INFO) << "Rank " << source->rank_ << ": split from parent comm "
          << source->repr() << " with color_id " << color_id << " and rank "
          << rank;
at::cuda::OptionalCUDAGuard gpuGuard(source->deviceIndex_);
```
Will this fail if called with a device index of -1? So we would never init a comm with an uninitialized device index, correct?
Yes, all of our created comms should have a recorded device index. If this line can fail by itself, that's great (better than a silent pass-through).
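For context, a minimal sketch of the guard semantics being discussed (standalone illustration, not code from the PR; the free function and its name are hypothetical, and the `-1` failure behavior is the expectation from the exchange above, not verified here):

```cpp
#include <c10/cuda/CUDAGuard.h>

// Hypothetical helper: with a recorded (valid) device index, the guard
// switches the current CUDA device for the scope of the NCCL call and
// restores it on destruction.
void callWithGuard(at::DeviceIndex recordedIndex) {
  // Assumption (per the exchange above): recordedIndex is always >= 0 for
  // comms we created, and an uninitialized index (-1) should fail loudly
  // rather than silently skip the guard.
  at::cuda::OptionalCUDAGuard gpuGuard(recordedIndex);
  // ... call ncclCommSplit / ncclCommAbort / etc. here ...
}
```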
@pytorchbot merge -f "CI was previously green; rebased accidentally"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
```cpp
auto& devName = it.first;
auto& ncclComm = it.second;
at::cuda::OptionalCUDAGuard gpuGuard;
at::DeviceIndex deviceIndex = getIndexFromDeviceKey(devName);
```
Shall we also remove `getIndexFromDeviceKey`?
Oh, do you mean this function has no use now? If so, yeah, we should remove it.
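For illustration, a hedged sketch of the cleanup being suggested, assuming the comm now exposes its recorded index (the `getDeviceIndex()` accessor is an assumed name, and the conditional arming in the "before" half is inferred from the quoted lines, not copied from the PR):

```cpp
// Before: recover the device index by parsing the map key string, then
// arm the guard conditionally (control flow inferred).
at::cuda::OptionalCUDAGuard gpuGuard;
at::DeviceIndex deviceIndex = getIndexFromDeviceKey(devName);
if (deviceIndex >= 0) {
  gpuGuard.set_index(deviceIndex);
}

// After: the comm recorded its device index at creation, so the
// string-parsing helper becomes dead code and can be deleted.
at::cuda::OptionalCUDAGuard gpuGuard2(ncclComm->getDeviceIndex());
```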
Pull Request resolved: pytorch#141270
Approved by: https://github.com/eqy
ghstack dependencies: pytorch#141374
Stack from ghstack (oldest at bottom):

### Motivation

`ncclCommInitRank` needs a GPU guard (documented in NCCL). `ncclCommAbort`, `ncclCommFinalize` and `ncclCommDestroy` may also need a GPU guard (undocumented in NCCL); otherwise, an extra CUDA context may be created (or worse, a hang). Both effects have been seen in our tests before.

### Solution

This PR records a device index during `NCCLComm` object creation, so that we can add a GPU guard to the `NCCLComm` methods that call into the above NCCL APIs. A rough sketch of the pattern follows below.

### Note

This is not a bug fix, just a safety improvement.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
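A minimal sketch of the pattern described above (simplified: aside from `NCCLComm`, `deviceIndex_`, and the NCCL/c10 calls shown, names are assumptions; error handling and the rest of the class are omitted):

```cpp
#include <c10/cuda/CUDAGuard.h>
#include <nccl.h>

class NCCLComm {
 public:
  // Record the device index once, at comm creation time, so later method
  // calls do not have to recover it from outside state.
  NCCLComm(ncclComm_t comm, at::DeviceIndex deviceIndex)
      : ncclComm_(comm), deviceIndex_(deviceIndex) {}

  void abort() {
    // Guard the GPU for the duration of the NCCL call; without the guard,
    // the call may create an extra CUDA context (or, worse, hang).
    at::cuda::OptionalCUDAGuard gpuGuard(deviceIndex_);
    ncclCommAbort(ncclComm_);
  }

 private:
  ncclComm_t ncclComm_{nullptr};
  at::DeviceIndex deviceIndex_{-1};
};
```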