[torchelastic] ensure grandchild processes are restarted correctly #113231
Conversation
When torchelastic notices that one rank has failed, it will send a SIGTERM signal to the other trainer ranks to tear them down before restarting. However, if the trainer itself launches subprocesses, or is launched by a non-Python wrapper script, then the SIGTERM is delivered only to the direct child of torchelastic and not to all of its descendants. This change opens subprocesses in a new Linux session, which starts a new process group whose pgid is the same as the trainer's pid. Then, when we send signals, we deliver them to the process group rather than just to the direct child.
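To make the described flow concrete, here is a minimal sketch of the idea (illustrative only, not the actual torchelastic code; the `my_trainer` entry point and the 30-second timeout are assumptions): the worker is launched in its own session, which makes it the leader of a new process group, and teardown signals are sent to that group so grandchildren receive them as well.

```python
import os
import signal
import subprocess
import sys

# Launch the worker in its own session (POSIX only). setsid() makes it the
# leader of a fresh process group whose pgid equals the worker's pid, and any
# grandchildren it spawns inherit that group by default.
worker = subprocess.Popen(
    [sys.executable, "-m", "my_trainer"],  # hypothetical trainer entry point
    start_new_session=True,
)

def tear_down(proc: subprocess.Popen, timeout: float = 30.0) -> None:
    """Send SIGTERM to the worker's whole process group, escalating to SIGKILL."""
    try:
        os.killpg(proc.pid, signal.SIGTERM)  # pgid == proc.pid for a session leader
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
    except ProcessLookupError:
        pass  # the group already exited
```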
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113231
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit f5c3318 with merge base dbb96ef.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
cc @kiukchung @d4l3k if you have feedback as well
```python
def _popen(self, args: Tuple, env: Dict[str, str]) -> subprocess.Popen:
    kwargs = {}
    if not IS_WINDOWS:
        kwargs['start_new_session'] = True
```
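As a standalone illustration of what `start_new_session` buys here (not part of the PR diff; the shell wrapper below stands in for a non-Python launcher script): the direct child becomes the leader of its own process group, processes it spawns stay in that group by default, so a group-directed signal also reaches the grandchildren.

```python
import os
import signal
import subprocess
import time

# The shell plays the role of a wrapper script; "sleep" is the grandchild.
child = subprocess.Popen(["sh", "-c", "sleep 300 & wait"], start_new_session=True)
time.sleep(0.5)                            # give the shell a moment to fork sleep

assert os.getpgid(child.pid) == child.pid  # new session => child leads its own group
os.killpg(child.pid, signal.SIGTERM)       # delivered to both the shell and the sleep
child.wait()
```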
If new processes are started this way, will they be detached from the parent? For example, if I run something like `torchrun --nnodes=1 --nproc-per-node=2 my_script.py` and then press Ctrl+C, will the child processes still be able to exit?
They will be in a different session than the parent, but torchrun is kind of like bash in that it is a process manager: when it gets a SIGINT, it propagates it to its children before exiting. So when you Ctrl+C torchrun you get:
[2023-11-09 10:53:46,286] torch.distributed.elastic.agent.server.api: [WARNING] Received Signals.SIGINT death signal, shutting down workers
And the workers will stop too.
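A rough sketch of that propagation pattern (illustrative only; torchrun's real signal handling in torch.distributed.elastic is more involved, and the worker commands here are placeholders): the parent traps the death signal and forwards it to each worker's process group before exiting.

```python
import os
import signal
import subprocess
import sys

# Placeholder workers; in torchrun these would be the trainer ranks.
workers = [
    subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(600)"],
        start_new_session=True,
    )
    for _ in range(2)
]

def _death_signal_handler(signum, frame):
    for w in workers:
        if w.poll() is None:          # worker still running
            os.killpg(w.pid, signum)  # forward the signal to its whole group
    sys.exit(1)

signal.signal(signal.SIGINT, _death_signal_handler)
signal.signal(signal.SIGTERM, _death_signal_handler)

for w in workers:
    w.wait()
```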
Sounds good, thanks for confirming!
I've run into a couple of cases now where max_split_size_mb has been set in projects as a workaround for fragmentation, but it ends up causing problems later, such as degraded performance from freeing empty segments. While it is a useful setting to have, expandable_segments is probably a better first resort for fixing fragmentation, since when it works it is less likely to require synchronous GPU operations to keep running. Pull Request resolved: #113481 Approved by: https://github.com/msaroufim, https://github.com/albanD ghstack dependencies: #113231
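For reference, a minimal sketch of opting into expandable segments through the allocator config environment variable (illustrative; the tensor size is arbitrary, and the variable must be set before the process initializes CUDA):

```python
import os

# Prefer expandable_segments as a first resort for fragmentation, rather than
# reaching for max_split_size_mb. Set the env var before CUDA is initialized.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402

x = torch.empty(4096, 4096, device="cuda")  # allocations now use expandable segments
```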