text: batch HTML sanitization by wchargin · Pull Request #3529 · tensorflow/tensorboard

Conversation

@wchargin (Contributor)

Summary:
By design, we only expose an API for converting Markdown to *safe* HTML.
But clients who call this method many times end up performing HTML
sanitization many times, which is expensive: about an order of magnitude
more expensive than Markdown conversion itself. This patch introduces a
new API that still only emits safe HTML but enables clients to combine
multiple input documents with only one round of sanitization.

Test Plan:
The `/data/plugin/text/text` route sees a 40–60% speedup: on my machine,

  - from 0.38 ± 0.04 seconds to 0.211 ± 0.005 seconds on the “higher
    order tensors” text demo downsampled to 10 steps; and
  - from 5.3 ± 0.9 seconds to 2.1 ± 0.2 seconds on a Google-internal
    dataset with 32 Markdown cells per step downsampled to 100 steps.

wchargin-branch: text-batch-bleach
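
The new entry point takes the list of Markdown documents plus a `combine`
callback that merges the intermediate HTML fragments before the single
sanitization pass. A minimal usage sketch, mirroring the `"".join`-style
combiner used in the test added later in this thread:

```python
from tensorboard import plugin_util

documents = ["**step 0**", "**step 1**", "**step 2**"]

# Batched: convert every document, merge the unsanitized fragments with
# `combine`, then sanitize the merged result once.
html = plugin_util.markdowns_to_safe_html(documents, "".join)

# Per-document equivalent, paying for sanitization once per document.
html_slow = "".join(
    plugin_util.markdown_to_safe_html(d) for d in documents
)
```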

@wchargin added the `plugin:text` and `theme:performance` labels on Apr 17, 2020
@wchargin requested a review from nfelt on Apr 17, 2020
Comment on lines +98 to +100
also be less precise: if one of the input documents has unsafe HTML
that is sanitized away, that sanitization might affect other
documents, even if those documents are safe.
@wchargin (Contributor Author) · Apr 17, 2020

I can’t actually find a way to trigger this. What I mean is that if you
could find input strings `x`, `y`, and `z` such that

  - `markdown.markdown(x) == '<a target="*chuckles* '`
  - `markdown.markdown(y) == "I'm in danger"`
  - `markdown.markdown(z) == '"></a>'`

then `markdowns_to_safe_html([x, y, z], "".join)` would lose the content
of `y` even though it was itself safe and didn’t “deserve” to get
bleached. But I can’t find any strings `x` that would create open tags
like that (though admittedly I didn’t try very hard).

If `x == "<div>hmm"` then `markdown.markdown(x) == x`, but the same is
not true if `x == "<a>hmm"`, and `<div>` is not on our Bleach whitelist.

Given that the output is always safe and is unreasonably behaved (but
still safe) only when the input is unsafe, this seems reasonable to me.
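
A quick way to see that distinction — a minimal check against the
`markdown` package directly (the output shown for the inline case is
approximate and may vary by version):

```python
import markdown

# Block-level raw HTML passes through Python-Markdown untouched,
# matching the `<div>hmm` observation above.
print(markdown.markdown("<div>hmm"))  # -> <div>hmm

# Inline-level raw HTML gets wrapped in a paragraph instead, so the
# conversion step itself doesn't leave a dangling open tag.
print(markdown.markdown("<a>hmm"))  # -> e.g. <p><a>hmm</a></p>
```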

Contributor (reviewer):

Happy to oblige.

---
 tensorboard/plugin_util_test.py | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/tensorboard/plugin_util_test.py b/tensorboard/plugin_util_test.py
index c0fb6798..e8658772 100644
--- a/tensorboard/plugin_util_test.py
+++ b/tensorboard/plugin_util_test.py
@@ -157,6 +157,13 @@ class MarkdownsToSafeHTMLTest(tb_test.TestCase):
         expected = "&lt;script&gt;alert('unsafe!')&lt;/script&gt;<p>safe</p>"
         self.assertEqual(actual, expected)

+    def test_sanitization_can_have_collateral_damage(self):
+        inputs = ["<table title=\"chuckles", "I'm in danger", "<table>\">"]
+        combine = lambda xs: "".join(xs)
+        actual = plugin_util.markdowns_to_safe_html(inputs, combine)
+        expected = "<table></table>"
+        self.assertEqual(actual, expected)
+

 class ExperimentIdTest(tb_test.TestCase):
     """Tests for `plugin_util.experiment_id`."""
--

…but yes, I agree this isn’t a concern in practice.

@wchargin (Contributor Author):

Wow, perfect. Thank you! Added test.

"""
unsafe_htmls = []
total_null_bytes = 0
import contextlib
Contributor (reviewer):

I've got a fever and the only cure is moaaar contextlib

@wchargin (Contributor Author):

Oops, yes, thanks. (There was a `nullcontext` in here for testing…)
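
For context, the quoted fragment above hints at the shape of the
batching helper. A minimal sketch under stated assumptions — the
`markdown` + `bleach` pipeline is what the module already uses for
single documents, but the whitelist constants here are hypothetical,
abbreviated stand-ins for the module's real, larger lists:

```python
import bleach
import markdown

# Hypothetical, abbreviated whitelists; stand-ins for the module's own.
_ALLOWED_TAGS = ["a", "em", "strong", "code", "p", "pre", "table"]
_ALLOWED_ATTRIBUTES = {"a": ["href", "title"], "table": ["title"]}


def markdowns_to_safe_html(markdown_strings, combine):
    """Convert many Markdown documents to safe HTML, sanitizing once."""
    unsafe_htmls = []
    total_null_bytes = 0
    for source in markdown_strings:
        # Count and strip null bytes so a warning can be surfaced, as
        # the single-document path does.
        total_null_bytes += source.count("\x00")
        unsafe_htmls.append(markdown.markdown(source.replace("\x00", "")))
    # Merge the *unsanitized* fragments, then pay for Bleach only once.
    combined = combine(unsafe_htmls)
    safe = bleach.clean(
        combined, tags=_ALLOWED_TAGS, attributes=_ALLOWED_ATTRIBUTES
    )
    if total_null_bytes:
        safe = "<!-- null bytes stripped from input -->" + safe
    return safe
```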


@wchargin merged commit 0388130 into master on Apr 18, 2020
@wchargin deleted the wchargin-text-batch-bleach branch on Apr 18, 2020
caisq pushed a commit to caisq/tensorboard that referenced this pull request on May 19, 2020
caisq pushed a commit that referenced this pull request on May 27, 2020