Fix EventFileWriter deadlock on exception in background thread by crassirostris · Pull Request #6168 · tensorflow/tensorboard · GitHub
@crassirostris
Contributor

Motivation for features / changes

To address #6167

Technical description of changes

This is a bug fix for a possible deadlock when writing events through EventFileWriter. The PR adds logic in _AsyncWriterThread to catch an exception so that it can be propagated to the calling thread, and adds logic to _AsyncWriter to propagate an exception raised in _AsyncWriterThread.
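
To illustrate the general pattern described above, here is a minimal sketch of a background writer that records its exception and lets the foreground thread re-raise it. This is not the code from this PR: `AsyncWriterSketch`, `write_fn`, and `_raise_if_failed` are hypothetical names, and the sketch keeps draining the queue after a failure (discarding items), which is an assumption made to keep the example short rather than the PR's actual mechanism.

```python
import queue
import threading


class AsyncWriterSketch:
    """Hypothetical stand-in for an async writer; not TensorBoard's code."""

    def __init__(self, write_fn, max_queue_size=2):
        self._write_fn = write_fn
        self._queue = queue.Queue(max_queue_size)
        self._exception = None  # set by the worker thread on failure
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                # After a failure, keep consuming items but discard them,
                # so producers blocked in put() and callers blocked in
                # join() are always released.
                if self._exception is None:
                    self._write_fn(item)
            except Exception as e:
                self._exception = e
            finally:
                self._queue.task_done()

    def _raise_if_failed(self):
        if self._exception is not None:
            raise RuntimeError("background writer failed") from self._exception

    def write(self, item):
        # Surface a background failure instead of silently queueing more data.
        self._raise_if_failed()
        self._queue.put(item)

    def flush(self):
        # join() returns because the worker marks every item as done,
        # even after it has stopped writing.
        self._queue.join()
        self._raise_if_failed()
```

Without the error handling in `_run`, a worker that dies on its first item would leave the bounded queue full, and a later `write` or `flush` call could block forever, which is the kind of deadlock the description refers to.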

Detailed steps to verify changes work correctly (as executed by you)

Added a new unit test that fails on master

Alternate designs / implementations considered

  • Instead of popping an item from the queue on exception, it's possible to make wait/flush methods re-check the status periodically
  • Instead of raising an exception in the foreground thread, it's possible to ignore the raised exception altogether and just start dropping events
  • It's possible to drop the data after it cannot be added to the queue for a certain period of time
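
For comparison, the third alternative (dropping data that cannot be enqueued within some period) could look roughly like the sketch below. The helper name `put_or_drop` and the `timeout_secs` parameter are hypothetical and only illustrate the idea; this design was considered but is not what the PR implements.

```python
import queue


def put_or_drop(q, item, timeout_secs=1.0):
    """Try to enqueue `item`; return False if it was dropped.

    Hypothetical helper: waits up to `timeout_secs` for free space
    in the bounded queue, then gives up instead of blocking forever.
    """
    try:
        q.put(item, timeout=timeout_secs)
        return True
    except queue.Full:
        return False
```

A caller would still need some way to learn that data was lost (for example, by counting `False` results), which is one reason raising in the foreground thread can be preferable.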

@crassirostris
Contributor Author

@groszewn the build issue seems to be unrelated, and I fixed the linter warning. Could you please re-run the workflow?

@crassirostris
Contributor Author

@groszewn sorry, now fixed the linter warning for real:

$ black --check --diff .
Skipping .ipynb files as Jupyter dependencies are not installed.
You can fix this by running ``pip install black[jupyter]``
All done! ✨ 🍰 ✨
366 files would be left unchanged.

Could you please restart CI once again?

Contributor

@groszewn groszewn left a comment

Hi @crassirostris, really appreciate the well-documented issue and PR! Left a few comments.

Signed-off-by: Mik Vyatskov <vmik@meta.com>
Contributor Author

@crassirostris crassirostris left a comment

Thanks a lot for the thorough review, @groszewn! Fixed behavior for flush, could you please take another look?

Contributor

@groszewn groszewn left a comment

Appreciate the fix!

@groszewn groszewn merged commit b1ac492 into tensorflow:master Feb 3, 2023
yatbear pushed a commit to yatbear/tensorboard that referenced this pull request Mar 27, 2023
dna2github pushed a commit to dna2fork/tensorboard that referenced this pull request May 1, 2023