[PGNCCL] Fix behavior of destroy_process_group by kwen2501 · Pull Request #141510 · pytorch/pytorch · GitHub

Conversation

@kwen2501
Contributor

@kwen2501 kwen2501 commented Nov 25, 2024

Stack from ghstack (oldest at bottom):

Today destroy_process_group() is implemented via ncclCommAbort.
When a user calls it from the CPU side, the risk is that a healthy NCCL kernel gets preempted, which can cause data corruption.

Instead of aborting kernels, we should flush collectives in destroy_process_group, i.e. let them complete normally, before we tear down resources.

This PR implements such "flushing" behavior using ncclCommFinalize, then reclaims resources via ncclCommDestroy.

Expected behaviors:
For a bad program, a hang is expected at destroy_process_group(). If the PG uses non-blocking communicators, such a hang is recoverable, because we attach a timeout to the flush behavior.
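A minimal sketch of the intended user-visible flow (illustrative only, assuming a standard NCCL setup launched with torchrun; not code from this PR):

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

t = torch.ones(1 << 20, device="cuda")
dist.all_reduce(t)  # the NCCL kernel may still be running on the GPU here

# With this PR, destroy_process_group() finalizes (flushes) the communicator,
# letting the all_reduce complete normally, and only then destroys it.
# Previously the abort path could preempt a healthy kernel. A mismatched
# program would instead hang here (with a timeout when non-blocking
# communicators are used).
dist.destroy_process_group()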

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Nov 25, 2024
@pytorch-bot

pytorch-bot bot commented Nov 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141510

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ecdf475 with merge base 61dc5e9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@kwen2501 kwen2501 added keep-going Don't stop on first failure, keep running tests until the end ciflow/trunk Trigger trunk jobs on your pull request labels Nov 27, 2024
@kwen2501 kwen2501 changed the title [PGNCCL] Implement destroy behavior in shutdown() [PGNCCL] Fix behavior of destroy_process_group Dec 4, 2024
// Note: we have rewritten `shutdown` to represent the destroy behavior.
// Here we route to `abort()` explicitly to maintain the old behavior, until
// we fix everything.
abort();
Contributor

It's good that this codepath tries to preserve legacy behavior, since otherwise we would risk introducing hangs on shutdown where they didn't exist before.

On the other hand, is shutdown() considered a public API surface itself?

Should we consider one of the following (both options are sketched below)?
(1) making 'wait' a flag on the existing shutdown API (e.g. shutdown(wait_on_ops=False)) to make sure we always preserve BC
(2) just leaving shutdown alone but adding a new method for 'clean shutdown', marking shutdown as deprecated
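For illustration only, a purely hypothetical sketch of the two options above; neither signature exists in PyTorch, and the names are invented here just to show the shape of the proposal:

# Hypothetical sketch only -- not a real PyTorch API.
class _ProcessGroupNCCLProposal:
    def shutdown(self, wait_on_ops: bool = True) -> None:
        """Option (1): keep shutdown() but add a flag. wait_on_ops=True gives the
        new flush-then-destroy behavior; wait_on_ops=False keeps the legacy
        abort-style path for backward compatibility."""
        ...

    def clean_shutdown(self) -> None:
        """Option (2): leave shutdown() as-is (eventually deprecating it) and add
        a separate entry point that waits for outstanding NCCL work before
        destroying communicators."""
        ...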

Contributor Author

@kwen2501 kwen2501 Dec 4, 2024

shutdown is not a public API per my understanding.
It was pybind'ed as _shutdown, and then used in dist.destroy_process_group.

.def(
    "_shutdown",
    [](const c10::intrusive_ptr<::c10d::ProcessGroupNCCL>& self) {
      return self->shutdown();
    },
    py::call_guard<py::gil_scoped_release>())

(Thus the name shutdown doesn't really matter -- the fact it gets used by destroy_process_group means that it should carry the flush + destroy behavior -- no destroy is possible without waiting for ops to finish.)

For behavior like wait_on_ops=False, users should be directed to use abort_process_group instead, I think.
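As a rough illustration of the call path described here (names taken from the pybind snippet above; these are private bindings that may change across releases, and this is not the actual body of destroy_process_group):

import torch
import torch.distributed as dist

def _teardown_nccl_backend(pg: dist.ProcessGroup) -> None:
    # _get_backend is a private accessor returning the NCCL backend object
    # (ProcessGroupNCCL) that backs the process group.
    nccl_backend = pg._get_backend(torch.device("cuda"))
    # _shutdown is the pybind'ed wrapper around ProcessGroupNCCL::shutdown(),
    # which after this PR finalizes communicators -- waiting for enqueued
    # collectives, with a timeout for non-blocking communicators -- and then
    # destroys them. The "don't wait" escape hatch is the abort path
    # (abort_process_group) rather than a flag on shutdown.
    nccl_backend._shutdown()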

// Difference between `abort()` and `shutdown()`:
// 1. `abort()` will signal communicators to terminate all NCCL kernels
// immediately.
// 2. `shutdown()` will wait for all NCCL kernels to finish before destroying
Contributor

Is shutdown blocking?

Contributor Author

Yes, it is blocking on purpose.
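A small timing sketch of what "blocking" means here (illustrative only; assumes a working NCCL setup launched with torchrun):

import os, time
import torch
import torch.distributed as dist

# torchrun sets RANK/WORLD_SIZE/LOCAL_RANK for us.
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
dist.init_process_group("nccl")

big = torch.randn(1 << 26, device="cuda")
dist.all_reduce(big)          # returns to the host immediately; kernel may still be running

t0 = time.perf_counter()
dist.destroy_process_group()  # blocks here until the outstanding all_reduce has completed
print(f"destroy_process_group took {time.perf_counter() - t0:.3f}s (includes the flush)")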

Contributor

@fduwjj fduwjj left a comment

Do you want to add a unit test for this?

@kwen2501
Contributor Author

kwen2501 commented Dec 4, 2024

Do you want to add a unit test for this?

test_c10d_nccl.py has ~50 calls to destroy_process_group. I think we can rely on them to test this change.
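For reference, a sketch of what a targeted check could look like (illustrative only, not part of this PR; assumes two CUDA devices and an NCCL build):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    x = torch.ones(1 << 24, device="cuda")
    dist.all_reduce(x)            # enqueue a collective; do not synchronize
    dist.destroy_process_group()  # should flush (finalize) rather than abort

    # If the collective was flushed instead of aborted, the result is intact.
    expected = torch.full((1 << 24,), float(world_size))
    assert torch.equal(x.cpu(), expected), "all_reduce result was corrupted"

if __name__ == "__main__":
    world_size = 2
    mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)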

@kwen2501
Contributor Author

kwen2501 commented Dec 4, 2024

@pytorchbot merge -f "CI was green previously; new change just fixes typo"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
Pull Request resolved: pytorch#141510
Approved by: https://github.com/wconstab
AmdSampsa pushed a commit to AmdSampsa/pytorch that referenced this pull request Dec 9, 2024
Pull Request resolved: pytorch#141510
Approved by: https://github.com/wconstab
pytorchmergebot pushed a commit that referenced this pull request Dec 9, 2024

Making CUDA or NCCL calls during object destruction can be dangerous because the CUDA context may have exited before the destructor runs, in which case the CUDA calls would see a "CUDA driver shutting down" error.

This PR does take a destroy call away from the NCCLComm dtor, and doesn't add a new one. If users are calling destroy_process_group or abort_process_group as recommended, then we are destroying for them; otherwise we are OK with letting them possibly leak resources (and get a warning).

Pull Request resolved: #141511
Approved by: https://github.com/eqy, https://github.com/wconstab
ghstack dependencies: #141510
pytorchmergebot pushed a commit that referenced this pull request Dec 10, 2024
And removed some unnecessary conditions for calling `thread.join()` -- `thread.joinable()` should have covered it.

Pull Request resolved: #142297
Approved by: https://github.com/wconstab
ghstack dependencies: #141510, #141511
@github-actions github-actions bot deleted the gh/kwen2501/104/head branch January 4, 2025 02:06