-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Signal handling in DataLoader workers; Timeout option #3474
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good. Can you add an automated test:
- That the main process exits if the child segfaults
- That the main process exits if the get times out (use a very small timeout)
You can probably trigger a segfault in the child with ctypes.
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
Does it address the possible hangs of |
|
@vadimkantorov It addresses it in two cases:
|
|
@ssnl Regarding (2): There seem at least two places where the loader can hang: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned about the safety and interactions of this patch. Signals are a process-wide mechanism and might interact both with Python runtime, and mix if you have multiple data loaders. Handling SIGCHLD makes sense (if only it doesn't break multiprocessing), but I'd rather not use SIGALRM
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/Module.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
There're two kinds of possible solutions for cross-platform SIGALARM.
|
|
One of the link @peterjc123 mentioned (https://eli.thegreenplace.net/2011/08/22/how-not-to-set-a-timeout-on-a-computation-in-python) makes me wonder if our |
|
@vadimkantorov Sorry I missed your comment earlier. Why should |
|
@ssnl Indeed, only the main proc puts to that queue and it may lock during doing so. It will lock if DataLoader puts more than 64Kb (SimpleQueue is implemented by default using system pipes, if I understand correctly). This may happen in the extreme case when index is some serialized object or a very long Python array. You can easily repro this by creating a SimpleQueue and putting Then, if for some reason the reason the workers are blocked too, no-one will read from the other side of I tried to showcase this scenario here but hit another problem first: #2474 (comment) Having the timeout option cover this case will be a great safeguard. |
|
@vadimkantorov Thanks for the explanation! Unfortunately this PR won't solve that issue. Hopefully such an extreme issue won't come up often :) |
|
@ssnl This would happen even if indices object is small, but the workers for some reason are very slow or locked (it just would take some time to fill 64Kb queue). Can we detect if the queue contains a near-critical number of bytes before putting on it? and showing a UserWarning the first time it happens Though it would sure complicate the logic. In practice this situation happens when |
|
@vadimkantorov I see. But that is not easy to do in this PR, especially now that I moved away from using SIGALRM due to concerns raised by @apaszke . I think adding another thread to data loader will solve it, but I'm not sure if it is necessary. :) I'll keep this in mind, thanks for bringing it up! |
|
Addressing the comments by @colesbury and @apaszke , I updated PR:
|
torch/csrc/jit/ir.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/DataLoader.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
@ssnl I saw you used queue.Queue (instead of multiprocessing.SimpleQueue for data_queue if timeout > 0: https://github.com/SsnL/pytorch/blob/55158acd5c8c51e6ebad13f52a53c108580d8caf/torch/utils/data/dataloader.py#L176-L177 Could you use queue.Queue for index_queue in that case as well? put method also has an optional timeout argument And probably queue.Queue has a larger message size compared to multiprocessing.SimpleQueue. |
|
Yes I used that. But Queue is already there for interthread communication.
I just extended its usage. The inter process communication is still via
SimpleQueue. Adding one for index_queue would mean adding another thread
just for timeout. I’m not sure that is what we want. Maybe @colesbury or
@apaszke has more thoughts on this?
…On Mon, Nov 6, 2017 at 19:42 Vadim Kantorov ***@***.***> wrote:
@ssnl <https://github.com/ssnl> I saw you used queue.Queue (instead of
multiprocessing.SimpleQueue for data_queue if timeout > 0:
https://github.com/SsnL/pytorch/blob/55158acd5c8c51e6ebad13f52a53c108580d8caf/torch/utils/data/dataloader.py#L176-L177
Could you use queue.Queue for index_queue in that case as well? put method
also has an optional timeout argument
And probably queue.Queue has a larger message size compared to
multiprocessing.SimpleQueue.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3474 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AFaWZZriAcCMSiL_yvJ5DuoxE1AHvIqaks5sz6dngaJpZM4QRtCP>
.
|
torch/csrc/DataLoader.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/DataLoader.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/DataLoader.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
test/test_dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/DataLoader.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/DataLoader.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/DataLoader.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/DataLoader.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
6715db9 to
afe8bf1
Compare
|
Updated PR for @colesbury 's and @apaszke 's comments:
|
torch/csrc/DataLoader.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
aed0c3b to
d37c716
Compare
|
After thinking about this, I realize that we can safely use Python's signal module to register a pure Python side SIGCHLD handler because SIGCHLD is not an error and thus not executing in C is not an issue. In the Python SIGCHLD handler, I call a C side function which then access a map of id(DataLoader)->set_of_worker_pids. The function checks each worker of each loader, and throws a runtime error if any of them errored. |
torch/csrc/DataLoader.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
If you're going to move the handler to Python, why can't you move all the data structures to Python too? |
|
@apaszke Because |
|
Oh no these gloo changes again D: How did you solve these last time @soumith ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks much better. I have a few minor comments, but should be good to go after that
torch/csrc/DataLoader.cpp
Outdated
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/csrc/DataLoader.cpp
Outdated
| (infop.si_code == CLD_KILLED) || | ||
| (infop.si_code == CLD_DUMPED)) { | ||
| std::ostringstream oss; | ||
| oss << "DataLoader worker (pid " << pid << ") exited unexpectedly."; |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
| try: | ||
| signal.signal(signal.SIGCHLD, handler) | ||
| except ValueError as _: | ||
| return # Windows doesn't support this |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
| def handler(signum, frame): | ||
| _error_if_any_worker_fails() | ||
| if callable(previous_handler): | ||
| previous_handler(signum, frame) |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
|
|
||
|
|
||
| _use_shared_memory = False | ||
| _SIGCHLD_handler_set = False |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
| # if all workers hang, no None is sent to worker_manager_thread, we | ||
| # put None to let worker_manager_thread exit | ||
| # empty check prevents put from hanging | ||
| if self.worker_result_queue.empty(): |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
| self.done_event.set() | ||
| # if worker_manager_thread is waiting to put | ||
| if not self.data_queue.empty(): | ||
| self.data_queue.get() |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
torch/utils/data/dataloader.py
Outdated
|
|
||
| def __del__(self): | ||
| if self.num_workers > 0: | ||
| self._remove_worker_pids_information() |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
| r = index_queue.get() | ||
| if r is None: | ||
| data_queue.put(None) | ||
| break |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. The only thing I thought about now is that I think WNOWAIT is a Linux-only thing. It would be good to verify that this diff doesn't break OS X builds.
|
The set was not properly cleared. This is fixed in new commit. @apaszke I tested on OSX and they all work just as in CentOS except that the main process can't get what signal kills the worker process. The exit code != 0 case is still fine. I don't think it is a big issue. Here is an output comparison: OSX: CentOS |
|
Sorry I forgot error checking for reset signal handler in worker handler. I also shouldn't use the nonportable signal(2). Fixed in new commit. |
| if (sigemptyset(&sa.sa_mask) != 0 || sigaction(SIGNAL, &sa, NULL) != 0) { \ | ||
| _exit(EXIT_FAILURE); \ | ||
| } else { \ | ||
| raise(SIGNAL); \ |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
| sa.sa_flags = SA_RESTART|SA_SIGINFO|SA_NOCLDSTOP; | ||
| sigemptyset(&sa.sa_mask); | ||
| if (sigaction(signal, &sa, old_sa_ptr) != 0) { | ||
| sa.sa_flags = SA_RESTART|SA_SIGINFO|SA_NOCLDSTOP|SA_NODEFER; |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
Test plan (modified from script by @fmassa in #1595 ):
script:
output:
script:
output:
Since this is deep into multiprocessing and low level signal handling, there might be cases/things I haven't considered. I'll also leave some comment below.
cc @apaszke @colesbury