text: batch HTML sanitization by wchargin · Pull Request #3529 · tensorflow/tensorboard

Conversation

@wchargin (Contributor)

Summary:
By design, we only expose an API for converting Markdown to *safe* HTML.
But clients who call this method many times end up performing HTML
sanitization many times, which is expensive: about an order of magnitude
more expensive than Markdown conversion itself. This patch introduces a
new API that still only emits safe HTML but enables clients to combine
multiple input documents with only one round of sanitization.

Test Plan:
The `/data/plugin/text/text` route sees a 40–60% speedup: on my machine,

  - from 0.38 ± 0.04 seconds to 0.211 ± 0.005 seconds on the “higher
    order tensors” text demo downsampled to 10 steps; and
  - from 5.3 ± 0.9 seconds to 2.1 ± 0.2 seconds on a Google-internal
    dataset with 32 Markdown cells per step downsampled to 100 steps.

wchargin-branch: text-batch-bleach
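
The new entry point takes the list of Markdown documents plus a `combine`
callback that merges the intermediate HTML fragments before the single
sanitization pass. A minimal usage sketch, mirroring the `"".join`-style
combiner used in the test added later in this thread:

```python
from tensorboard import plugin_util

documents = ["**step 0**", "**step 1**", "**step 2**"]

# Batched: convert every document, merge the unsanitized fragments with
# `combine`, then sanitize the merged result once.
html = plugin_util.markdowns_to_safe_html(documents, "".join)

# Per-document equivalent, paying for sanitization once per document.
html_slow = "".join(
    plugin_util.markdown_to_safe_html(d) for d in documents
)
```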

@wchargin added the `plugin:text` and `theme:performance` labels on Apr 17, 2020
@wchargin requested a review from nfelt on Apr 17, 2020
Comment on lines +98 to +100
also be less precise: if one of the input documents has unsafe HTML
that is sanitized away, that sanitization might affect other
documents, even if those documents are safe.
@wchargin (Contributor Author) · Apr 17, 2020

I can’t actually find a way to trigger this. What I mean is that if you
could find input strings `x`, `y`, and `z` such that

  - `markdown.markdown(x) == '<a target="*chuckles* '`
  - `markdown.markdown(y) == "I'm in danger"`
  - `markdown.markdown(z) == '"></a>'`

then `markdowns_to_safe_html([x, y, z], "".join)` would lose the content
of `y` even though it was itself safe and didn’t “deserve” to get
bleached. But I can’t find any strings `x` that would create open tags
like that (though admittedly I didn’t try very hard).

If `x == "<div>hmm"` then `markdown.markdown(x) == x`, but the same is
not true if `x == "<a>hmm"`, and `<div>` is not on our Bleach whitelist.

Given that the output is always safe and is unreasonably behaved (but
still safe) only when the input is unsafe, this seems reasonable to me.
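
A quick way to see that distinction — a minimal check against the
`markdown` package directly (the output shown for the inline case is
approximate and may vary by version):

```python
import markdown

# Block-level raw HTML passes through Python-Markdown untouched,
# matching the `<div>hmm` observation above.
print(markdown.markdown("<div>hmm"))  # -> <div>hmm

# Inline-level raw HTML gets wrapped in a paragraph instead, so the
# conversion step itself doesn't leave a dangling open tag.
print(markdown.markdown("<a>hmm"))  # -> e.g. <p><a>hmm</a></p>
```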

Contributor (reviewer):

Happy to oblige.

---
 tensorboard/plugin_util_test.py | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/tensorboard/plugin_util_test.py b/tensorboard/plugin_util_test.py
index c0fb6798..e8658772 100644
--- a/tensorboard/plugin_util_test.py
+++ b/tensorboard/plugin_util_test.py
@@ -157,6 +157,13 @@ class MarkdownsToSafeHTMLTest(tb_test.TestCase):
         expected = "&lt;script&gt;alert('unsafe!')&lt;/script&gt;<p>safe</p>"
         self.assertEqual(actual, expected)

+    def test_sanitization_can_have_collateral_damage(self):
+        inputs = ["<table title=\"chuckles", "I'm in danger", "<table>\">"]
+        combine = lambda xs: "".join(xs)
+        actual = plugin_util.markdowns_to_safe_html(inputs, combine)
+        expected = "<table></table>"
+        self.assertEqual(actual, expected)
+

 class ExperimentIdTest(tb_test.TestCase):
     """Tests for `plugin_util.experiment_id`."""
--

…but yes, I agree this isn’t a concern in practice.

@wchargin (Contributor Author):

Wow, perfect. Thank you! Added test.

"""
unsafe_htmls = []
total_null_bytes = 0
import contextlib
Contributor (reviewer):

I've got a fever and the only cure is moaaar contextlib

@wchargin (Contributor Author):

Oops, yes, thanks. (There was a `nullcontext` in here for testing…)
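
For context, the quoted fragment above hints at the shape of the
batching helper. A minimal sketch under stated assumptions — the
`markdown` + `bleach` pipeline is what the module already uses for
single documents, but the whitelist constants here are hypothetical,
abbreviated stand-ins for the module's real, larger lists:

```python
import bleach
import markdown

# Hypothetical, abbreviated whitelists; stand-ins for the module's own.
_ALLOWED_TAGS = ["a", "em", "strong", "code", "p", "pre", "table"]
_ALLOWED_ATTRIBUTES = {"a": ["href", "title"], "table": ["title"]}


def markdowns_to_safe_html(markdown_strings, combine):
    """Convert many Markdown documents to safe HTML, sanitizing once."""
    unsafe_htmls = []
    total_null_bytes = 0
    for source in markdown_strings:
        # Count and strip null bytes so a warning can be surfaced, as
        # the single-document path does.
        total_null_bytes += source.count("\x00")
        unsafe_htmls.append(markdown.markdown(source.replace("\x00", "")))
    # Merge the *unsanitized* fragments, then pay for Bleach only once.
    combined = combine(unsafe_htmls)
    safe = bleach.clean(
        combined, tags=_ALLOWED_TAGS, attributes=_ALLOWED_ATTRIBUTES
    )
    if total_null_bytes:
        safe = "<!-- null bytes stripped from input -->" + safe
    return safe
```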


@wchargin merged commit 0388130 into master on Apr 18, 2020
@wchargin deleted the wchargin-text-batch-bleach branch on Apr 18, 2020
caisq pushed a commit to caisq/tensorboard that referenced this pull request on May 19, 2020
caisq pushed a commit that referenced this pull request on May 27, 2020