[torchelastic] ensure grandchild processes are restarted correctly by zdevito · Pull Request #113231 · pytorch/pytorch · GitHub

Conversation

@zdevito (Contributor) commented Nov 8, 2023

Stack from ghstack (oldest at bottom):

When torchelastic notices that one rank has failed, it sends a SIGTERM
signal to the other trainer ranks to tear them down before restarting. However,
if the trainer itself launches subprocesses, or is launched by a non-Python
wrapper script, then the SIGTERM is delivered only to the direct child of
torchelastic and not to all of its descendants. This PR opens subprocesses in a new
Linux session, which starts a new process group whose pgid is the same
as the trainer's pid. Then when we send signals, we deliver them to the
process group rather than just the direct child.

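A minimal sketch of the idea, with hypothetical names (the actual change lives in torchelastic's subprocess handling):

import os
import signal
import subprocess

# Launch the worker in a new session: setsid() runs in the child, so the
# worker becomes a process-group leader whose pgid equals its pid, and any
# grandchildren it spawns inherit that group.
proc = subprocess.Popen(["python", "trainer.py"], start_new_session=True)

# Tear down by signaling the whole process group rather than the direct
# child, so wrapper scripts and their descendants receive the signal too.
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)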

[ghstack-poisoned]

pytorch-bot bot commented Nov 8, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113231

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f5c3318 with merge base dbb96ef:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

zdevito added a commit that referenced this pull request Nov 8, 2023

ghstack-source-id: 5672a1d
Pull Request resolved: #113231
@H-Huang (Member) left a comment


cc @kiukchung @d4l3k if you have feedback as well

def _popen(self, args: Tuple, env: Dict[str, str]) -> subprocess.Popen:
    kwargs = {}
    if not IS_WINDOWS:
        # Run the child in a new session (setsid), so its pgid equals its
        # pid and signals can target the whole process group.
        kwargs['start_new_session'] = True
@H-Huang (Member) · Nov 9, 2023

if new processes are started this way, will they be detached from the parent? For example, if I run something like torchrun my_script.py --nnodes=1 --nproc-per-node=2 and then I Ctrl+C, will the child processes still be able to exit?

@zdevito (Contributor, Author)

They will be in a different session than the parent, but torchrun is kind of like bash in that it is a process manager, so when it gets a SIGINT, it will propagate that to its children before exiting. So when you Ctrl-C torchrun you get:

[2023-11-09 10:53:46,286] torch.distributed.elastic.agent.server.api: [WARNING] Received Signals.SIGINT death signal, shutting down workers

And the workers will stop too.
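A hedged sketch of that propagation pattern, with hypothetical names (the real logic lives in torchrun's elastic agent):

import os
import signal

# pgids of workers launched with start_new_session=True (hypothetical list)
worker_pgids = []

def _forward_signal(signum, frame):
    # Forward a SIGTERM to each worker's whole process group before the
    # manager itself shuts down, so grandchildren are torn down too.
    for pgid in worker_pgids:
        os.killpg(pgid, signal.SIGTERM)
    raise KeyboardInterrupt

# Ctrl+C on the manager triggers the forwarding handler.
signal.signal(signal.SIGINT, _forward_signal)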

@H-Huang (Member)

Sounds good, thanks for confirming!

@H-Huang H-Huang requested review from d4l3k and kiukchung November 9, 2023 18:12
pytorchmergebot pushed a commit that referenced this pull request Nov 19, 2023
I've run into a couple of cases now where max_split_size_mb has been set
in projects as a workaround for fragmentation, but it ends up causing problems
later, such as degraded performance from freeing empty segments. While it
is a useful setting to have, expandable_segments is probably a better first
resort for fixing fragmentation since, when it works, it is less likely to
need synchronous GPU operations to continue running.
Pull Request resolved: #113481
Approved by: https://github.com/msaroufim, https://github.com/albanD
ghstack dependencies: #113231
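For reference, a minimal illustration of the two allocator settings mentioned above (set before the first CUDA allocation; the threshold value is an example only):

import os

# Preferred first resort for fragmentation: let segments grow in place.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# The workaround this commit cautions against would instead look like:
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"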
@facebook-github-bot facebook-github-bot deleted the gh/zdevito/252/head branch November 22, 2023 15:30
