[torchelastic] ensure grandchild processes are restarted correctly #113231
Conversation
When torchelastic notices that one rank has failed, it will send a SIGTERM signal to the other trainer ranks to tear them down before restarting. However, if the trainer itself launches subprocesses, or is launched by a non-Python wrapper script, then the SIGTERM is delivered only to the direct child of torchelastic and not to all of its descendants. This change opens subprocesses in a new Linux session, which starts a new process group whose pgid is the same as the trainer's pid. Then, when we send signals, we deliver them to the process group rather than just to the direct child.
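To make the described flow concrete, here is a minimal sketch of the idea (illustrative only, not the actual torchelastic code; the `my_trainer` entry point and the 30-second timeout are assumptions): the worker is launched in its own session, which makes it the leader of a new process group, and teardown signals are sent to that group so grandchildren receive them as well.

```python
import os
import signal
import subprocess
import sys

# Launch the worker in its own session (POSIX only). setsid() makes it the
# leader of a fresh process group whose pgid equals the worker's pid, and any
# grandchildren it spawns inherit that group by default.
worker = subprocess.Popen(
    [sys.executable, "-m", "my_trainer"],  # hypothetical trainer entry point
    start_new_session=True,
)

def tear_down(proc: subprocess.Popen, timeout: float = 30.0) -> None:
    """Send SIGTERM to the worker's whole process group, escalating to SIGKILL."""
    try:
        os.killpg(proc.pid, signal.SIGTERM)  # pgid == proc.pid for a session leader
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
    except ProcessLookupError:
        pass  # the group already exited
```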
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113231
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit f5c3318 with merge base dbb96ef.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
cc @kiukchung @d4l3k if you have feedback as well
```python
def _popen(self, args: Tuple, env: Dict[str, str]) -> subprocess.Popen:
    kwargs = {}
    if not IS_WINDOWS:
        kwargs['start_new_session'] = True
```
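As a standalone illustration of what `start_new_session` buys here (not part of the PR diff; the shell wrapper below stands in for a non-Python launcher script): the direct child becomes the leader of its own process group, processes it spawns stay in that group by default, so a group-directed signal also reaches the grandchildren.

```python
import os
import signal
import subprocess
import time

# The shell plays the role of a wrapper script; "sleep" is the grandchild.
child = subprocess.Popen(["sh", "-c", "sleep 300 & wait"], start_new_session=True)
time.sleep(0.5)                            # give the shell a moment to fork sleep

assert os.getpgid(child.pid) == child.pid  # new session => child leads its own group
os.killpg(child.pid, signal.SIGTERM)       # delivered to both the shell and the sleep
child.wait()
```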
If new processes are started this way, will they be detached from the parent? For example, if I run something like `torchrun --nnodes=1 --nproc-per-node=2 my_script.py` and then press Ctrl+C, will the child processes still be able to exit?
They will be in a different session than the parent, but torchrun is kind of like bash in that it is a process manager: when it gets a SIGINT, it propagates it to its children before exiting. So when you Ctrl+C torchrun you get:
[2023-11-09 10:53:46,286] torch.distributed.elastic.agent.server.api: [WARNING] Received Signals.SIGINT death signal, shutting down workers
And the workers will stop too.
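A rough sketch of that propagation pattern (illustrative only; torchrun's real signal handling in torch.distributed.elastic is more involved, and the worker commands here are placeholders): the parent traps the death signal and forwards it to each worker's process group before exiting.

```python
import os
import signal
import subprocess
import sys

# Placeholder workers; in torchrun these would be the trainer ranks.
workers = [
    subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(600)"],
        start_new_session=True,
    )
    for _ in range(2)
]

def _death_signal_handler(signum, frame):
    for w in workers:
        if w.poll() is None:          # worker still running
            os.killpg(w.pid, signum)  # forward the signal to its whole group
    sys.exit(1)

signal.signal(signal.SIGINT, _death_signal_handler)
signal.signal(signal.SIGTERM, _death_signal_handler)

for w in workers:
    w.wait()
```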
Sounds good, thanks for confirming!
I've run into a couple of cases now where max_split_size_mb has been set in projects as a workaround for fragmentation, but it ends up causing problems later, such as degraded performance from freeing empty segments. While it is a useful setting to have, expandable_segments is probably a better first resort for fixing fragmentation, since when it works it is less likely to require synchronous GPU operations to keep running. Pull Request resolved: #113481 Approved by: https://github.com/msaroufim, https://github.com/albanD ghstack dependencies: #113231
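For reference, a minimal sketch of opting into expandable segments through the allocator config environment variable (illustrative; the tensor size is arbitrary, and the variable must be set before the process initializes CUDA):

```python
import os

# Prefer expandable_segments as a first resort for fragmentation, rather than
# reaching for max_split_size_mb. Set the env var before CUDA is initialized.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402

x = torch.empty(4096, 4096, device="cuda")  # allocations now use expandable segments
```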