KEMBAR78
notebook: don’t filter polled instances by PID by wchargin · Pull Request #4407 · tensorflow/tensorboard · GitHub
Skip to content

Conversation

@wchargin
Copy link
Contributor

@wchargin wchargin commented Nov 30, 2020

Summary:
When the %tensorboard cell magic is invoked, we compute a cache key
for the “hermetic environment”, primarily args to %tensorboard and the
working directory. We first check whether any running TensorBoard
instances match that cache key, and launch a new instance if none do.
But then, while polling for the new instance to have launched, we had a
different matching criterion, checking for a process ID match instead of
a cache key match.

The idea was that “is this TensorBoard instance’s PID equal to the PID
of the subprocess that we just spawned?” would be a more reliable check.
But on Windows ((╯°□°)╯︵ ┻━┻) this is not the case, presumably because
the tensorboard console script has some kind of wrapper process in
certain versions of Python. This manifested as “%tensorboard always
times out on the first invocation, but works immediately when I invoke
it again”, since invoking it again triggers the cache key check rather
than the PID check. So we now just check by cache key in all cases, and
the logic is consistent, if a bit less precise overall.

Fixes #4300.

Test Plan:
Still works for me on Linux, with both new and existing TensorBoard
processes across multiple (concurrent) cache keys. @stephanwlee can
repro the bug and fix on Windows with Python 3.8.

wchargin-branch: notebook-poll-no-pid-filter

Summary:
When the `%tensorboard` cell magic is invoked, we compute a cache key
for the “hermetic environment”, primarily args to `%tensorboard` and the
working directory. We first check whether any running TensorBoard
instances match that cache key, and launch a new instance if none do.
But then, while polling for the new instance to have launched, we had a
different matching criterion, checking for a process ID match instead of
a cache key match.

The idea was that “is this TensorBoard instance’s PID equal to the PID
of the subprocess that we just spawned?” would be a more reliable check.
But on Windows ((╯°□°)╯︵ ┻━┻) this is not the case, presumably because
the `tensorboard` console script has some kind of wrapper process in
certain versions of Python. This manifested as “`%tensorboard` always
times out on the first invocation, but works immediately when I invoke
it again”, since invoking it again triggers the cache key check rather
than the PID check. So we now just check by cache key in all cases, and
the logic is consistent, if a bit less precise overall.

Test Plan:
Still works for me on Linux, with both new and existing TensorBoard
processes across multiple (concurrent) cache keys. Would be great for a
Windows user to confirm repro and fix…

wchargin-branch: notebook-poll-no-pid-filter
wchargin-source: 85ff3ae7dd6ecab3f6580ef0c0611a1e09ef0630
Copy link
Contributor

@stephanwlee stephanwlee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was able to confirm the fix on Windows (Python 3.8) with the patch. Thanks for the fix.

@wchargin
Copy link
Contributor Author

wchargin commented Dec 1, 2020

Awesome; thanks very much for testing!

@wchargin wchargin merged commit 546d1b6 into master Dec 1, 2020
@wchargin wchargin deleted the wchargin-notebook-poll-no-pid-filter branch December 1, 2020 18:47
wchargin added a commit that referenced this pull request Dec 1, 2020
Summary:
CI skew between #4409 and #4407 that I didn’t realize.

Test Plan:
GitHub CI suffices.

wchargin-branch: black-reformat-manager
wchargin-source: 78c9403c871588dc6c4d4f93783e1f332a577164
wchargin added a commit that referenced this pull request Dec 2, 2020
Summary:
CI skew between #4409 and #4407 that I didn’t realize.

Test Plan:
GitHub CI suffices.

wchargin-branch: black-reformat-manager
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Starting tensorboard inline within a Jupyter notebook consistently times out

2 participants