text: batch HTML sanitization #3529
Conversation
Summary:
By design, we only expose an API for converting Markdown to *safe* HTML.
But clients who call this method many times end up performing HTML
sanitization many times, which is expensive: about an order of magnitude
more expensive than Markdown conversion itself. This patch introduces a
new API that still only emits safe HTML but enables clients to combine
multiple input documents with only one round of sanitization.
Test Plan:
The `/data/plugin/text/text` route sees 40–60% speedup: on my machine,
- from 0.38 ± 0.04 seconds to 0.211 ± 0.005 seconds on the “higher
order tensors” text demo downsampled to 10 steps; and
- from 5.3 ± 0.9 seconds to 2.1 ± 0.2 seconds on a Google-internal
dataset with 32 Markdown cells per step downsampled to 100 steps.
wchargin-branch: text-batch-bleach
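The combine-then-sanitize shape described above can be sketched with standard-library stand-ins; `render` and `sanitize` below are toy placeholders for `markdown.markdown` and `bleach.clean` (names and behavior assumed purely for illustration), not the real implementations:

```python
import re

def render(md):
    # Toy stand-in for markdown.markdown: Markdown conversion is cheap
    # relative to sanitization, which is why batching the latter pays off.
    return "<p>%s</p>" % md

def sanitize(unsafe_html):
    # Toy stand-in for bleach.clean with a tag whitelist: here we just
    # drop <script> elements to illustrate one pass over the whole output.
    return re.sub(r"<script>.*?</script>", "", unsafe_html, flags=re.S)

def markdowns_to_safe_html(markdowns, combine):
    # Render every document first, merge with the caller's `combine`,
    # then sanitize exactly once on the merged result.
    unsafe = [render(md) for md in markdowns]
    return sanitize(combine(unsafe))

# One sanitization pass regardless of how many cells the client has:
print(markdowns_to_safe_html(["safe", "<script>alert(1)</script>"], "".join))
# → <p>safe</p><p></p>
```

The key design point is that `combine` runs on *unsafe* intermediate HTML, so the only thing the API ever returns is the output of the single sanitization pass.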
> also be less precise: if one of the input documents has unsafe HTML
> that is sanitized away, that sanitization might affect other
> documents, even if those documents are safe.
I can’t actually find a way to trigger this. What I mean is that if you
could find input strings `x`, `y`, and `z` such that

```python
markdown.markdown(x) == '<a target="*chuckles* '
markdown.markdown(y) == "I'm in danger"
markdown.markdown(z) == '"></a>'
```

then `markdowns_to_safe_html([x, y, z], "".join)` would lose the content
of `y` even though it was itself safe and didn’t “deserve” to get
bleached. But I can’t find any strings `x` that would create open tags
like that (though admittedly I didn’t try very hard).
If `x == "<div>hmm"` then `markdown.markdown(x) == x`, but the same is
not true if `x == "<a>hmm"`, and `<div>` is not on our Bleach whitelist.
Given that the output is always safe, and is unreasonably behaved (but
still safe) only when the input is unsafe, this seems reasonable to me.
Happy to oblige.
---
tensorboard/plugin_util_test.py | 7 +++++++
1 file changed, 7 insertions(+)

```diff
diff --git a/tensorboard/plugin_util_test.py b/tensorboard/plugin_util_test.py
index c0fb6798..e8658772 100644
--- a/tensorboard/plugin_util_test.py
+++ b/tensorboard/plugin_util_test.py
@@ -157,6 +157,13 @@ class MarkdownsToSafeHTMLTest(tb_test.TestCase):
         expected = "<script>alert('unsafe!')</script><p>safe</p>"
         self.assertEqual(actual, expected)
 
+    def test_sanitization_can_have_collateral_damage(self):
+        inputs = ["<table title=\"chuckles", "I'm in danger", "<table>\">"]
+        combine = lambda xs: "".join(xs)
+        actual = plugin_util.markdowns_to_safe_html(inputs, combine)
+        expected = "<table></table>"
+        self.assertEqual(actual, expected)
+
 class ExperimentIdTest(tb_test.TestCase):
     """Tests for `plugin_util.experiment_id`."""
```
…but yes, I agree this isn't a concern in practice.
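The mechanism behind that test case can be sketched with Python's standard-library HTML parser (a toy stand-in for illustration, not Bleach's actual implementation): once the fragments are concatenated, the middle document falls inside the unterminated `title="…"` attribute opened by the first fragment, so a parser sees no text there at all.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects only the text nodes the parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def visible_text(html_str):
    collector = TextCollector()
    collector.feed(html_str)
    collector.close()
    return "".join(collector.chunks)

fragments = ['<table title="chuckles ', "I'm in danger", ' <table>">']

# Alone, the middle fragment is ordinary text:
assert visible_text(fragments[1]) == "I'm in danger"

# Concatenated, it is swallowed by the still-open title="..." attribute:
assert visible_text("".join(fragments)) == ""
```

A whitelist sanitizer run over the concatenation faces the same ambiguity: tag and attribute boundaries are decided over the whole string, so content that would survive on its own can disappear.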
Wow, perfect. Thank you! Added test.
tensorboard/plugin_util.py (outdated)

> """
> unsafe_htmls = []
> total_null_bytes = 0
> import contextlib
I've got a fever and the only cure is moaaar contextlib
Oops, yes, thanks. (There was a nullcontext in here for testing…)