[PGNCCL] Fix behavior of destroy_process_group #141510
Conversation
```cpp
// Note: we have rewritten `shutdown` to represent the destroy behavior.
// Here we route to `abort()` explicitly to maintain the old behavior, until
// we fix everything.
abort();
```
It's good that this codepath tries to preserve legacy behavior; otherwise we would risk introducing hangs on shutdown where they didn't exist before.
On the other hand, is shutdown() considered a public API surface itself?
Should we consider:
(1) making 'wait' a flag on the existing shutdown API (e.g. shutdown(wait_on_ops=False)) to make sure we always preserve BC, or
(2) leaving shutdown alone but adding a new method for 'clean shutdown' and marking shutdown as deprecated?
shutdown is not a public API, per my understanding.
It was pybind'ed as _shutdown and then used in dist.destroy_process_group:
pytorch/torch/csrc/distributed/c10d/init.cpp, lines 2911 to 2916 in c47dae8:

```cpp
.def(
    "_shutdown",
    [](const c10::intrusive_ptr<::c10d::ProcessGroupNCCL>& self) {
      return self->shutdown();
    },
    py::call_guard<py::gil_scoped_release>())
```
(Thus the name shutdown doesn't really matter -- the fact that it gets used by destroy_process_group means it should carry the flush + destroy behavior; no destroy is possible without waiting for ops to finish.)
For behavior like wait_on_ops=False, users should be directed to abort_process_group instead, I think.
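To make the destroy-vs-abort distinction concrete, here is a minimal Python sketch. These are toy classes, not the torch.distributed API: `destroy()` drains pending work before teardown, while `abort()` drops it.

```python
import queue

class FakeProcessGroup:
    """Toy stand-in for a process group, illustrating the two teardown
    paths discussed above (names are illustrative, not real PyTorch API)."""

    def __init__(self):
        self._pending = queue.Queue()
        self.completed = []
        self.aborted = False

    def enqueue(self, op):
        # Queue a pending "collective" (here just a callable).
        self._pending.put(op)

    def destroy(self):
        # "Flush" semantics: let every enqueued op complete before teardown.
        while not self._pending.empty():
            self.completed.append(self._pending.get()())

    def abort(self):
        # "Abort" semantics: drop pending ops immediately; results are lost.
        while not self._pending.empty():
            self._pending.get()
        self.aborted = True
```

The asymmetry mirrors the comment above: only the flushing path can guarantee that results of in-flight work survive teardown.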
Today `destroy_process_group()` is implemented via `ncclCommAbort`. When the user calls it from the CPU, the risk is that a healthy NCCL kernel gets preempted, which can cause data corruption. Instead of aborting kernels, we should flush collectives in `destroy_process_group`, i.e. let them complete normally, before we tear down resources. This PR implements such "flushing" behavior using `ncclCommFinalize`, then reclaims resources via `ncclCommDestroy`.

Expected behaviors: for a bad program, a hang is expected at `destroy_process_group()`. If the PG uses non-blocking communicators, such a hang is recoverable, because we attach a timeout to the flush behavior.
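The "flush with a timeout" idea can be sketched in a few lines of Python. `FlushingShutdown` and its methods are illustrative names, not PyTorch API; the point is that teardown blocks on in-flight work but stays recoverable when that work never finishes.

```python
import threading
import time

class FlushingShutdown:
    """Sketch of the 'flush then destroy' pattern: wait for in-flight work
    up to a deadline, mirroring the timeout attached to the finalize step
    for non-blocking communicators. Illustrative only."""

    def __init__(self):
        self._done = threading.Event()

    def launch(self, fn, *args):
        # Start one in-flight "collective" on a background thread.
        def run():
            fn(*args)
            self._done.set()
        threading.Thread(target=run, daemon=True).start()

    def finalize(self, timeout_s):
        # Flush: block until the in-flight op finishes, or time out so a
        # hang in a bad program stays recoverable instead of permanent.
        if not self._done.wait(timeout_s):
            raise TimeoutError("collective did not finish before teardown")
```

A well-behaved program finalizes promptly; a stuck collective surfaces as a `TimeoutError` the caller can react to, rather than an unrecoverable hang.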
```cpp
// Difference between `abort()` and `shutdown()`:
// 1. `abort()` will signal communicators to terminate all NCCL kernels
//    immediately.
// 2. `shutdown()` will wait for all NCCL kernels to finish before destroying
```
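A small threaded model of this contrast (names illustrative, not the real ProcessGroupNCCL): `abort()` signals running work to stop immediately, while `shutdown()` blocks until it runs to completion.

```python
import threading
import time

class Comm:
    """Minimal model of the abort()/shutdown() difference described in the
    comment above. Not real NCCL; a thread stands in for a kernel."""

    def __init__(self):
        self._cancel = threading.Event()
        self._threads = []
        self.progress = 0  # steps of "kernel" work actually completed

    def collective(self, steps):
        def kernel():
            for _ in range(steps):
                if self._cancel.is_set():  # abort() preempts mid-kernel
                    return
                time.sleep(0.005)
                self.progress += 1
        t = threading.Thread(target=kernel)
        t.start()
        self._threads.append(t)

    def abort(self):
        # Signal kernels to stop immediately; partial results are possible.
        self._cancel.set()
        for t in self._threads:
            t.join()

    def shutdown(self):
        # Blocking by design: wait for kernels to run to completion.
        for t in self._threads:
            t.join()
```

This also answers the blocking question below: the whole point of `shutdown()` is that it does not return until in-flight work is done.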
Is shutdown blocking?
Yes, it is blocking on purpose.
Do you want to add a unit test for this?
@pytorchbot merge -f "CI was green previously; new change just fixes typo"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Pull Request resolved: pytorch#141510. Approved by: https://github.com/wconstab
Making CUDA or NCCL calls during object destruction can be dangerous because the CUDA context may have exited before the destructor runs, in which case the CUDA calls would fail with a "CUDA driver shutting down" error. This PR takes a destroy call out of the NCCLComm dtor and doesn't add a new one. If users call destroy_process_group or abort_process_group as recommended, we destroy for them; otherwise we are OK with letting them possibly leak resources (and get a warning). Pull Request resolved: #141511. Approved by: https://github.com/eqy, https://github.com/wconstab. ghstack dependencies: #141510
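The destructor rule from this follow-up can be sketched as: cleanup happens only on an explicit call; the destructor merely warns about the leak and never calls into a runtime that may already be shut down. All names here are illustrative, not PyTorch internals.

```python
import warnings

RUNTIME_ALIVE = True  # toy stand-in for "CUDA driver still initialized"

class Resource:
    """Sketch of the 'no runtime calls in destructors' rule: if teardown
    was not done explicitly, warn and leak rather than call into a runtime
    that may already be gone. Illustrative only."""

    def __init__(self):
        self._released = False

    def release(self):
        # Explicit path (destroy_process_group / abort_process_group):
        # safe, because the caller guarantees the runtime is still up.
        self._released = True

    def __del__(self):
        if not self._released:
            if RUNTIME_ALIVE:
                warnings.warn("Resource leaked; call release() explicitly")
            # If the runtime is already down, do nothing: a runtime call
            # here could fail with "driver shutting down".
```

The design choice is the same trade-off the PR describes: a leak plus a warning is recoverable and diagnosable; a crash during interpreter or driver shutdown is neither.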
Also removed some unnecessary conditions guarding `thread.join()` -- `thread.joinable()` already covers them. Pull Request resolved: #142297. Approved by: https://github.com/wconstab. ghstack dependencies: #141510, #141511