KEMBAR78
Several improvements / simplifications in Regex by stephentoub · Pull Request #100315 · dotnet/runtime · GitHub
Skip to content

Conversation

@stephentoub
Copy link
Member

This started out as a small improvement for one thing and grew to be something else.

Initially, my intent was just to improve how SearchValues<char> applies to character classes with subtraction. Character class subtraction isn't frequently used, but it is a convenient way to express removing subsets of ranges, e.g. all ASCII other than digits [\u0000-\u007F-[0-9]]. Currently when we go to enumerate the characters in a char class, for perf reasons we only do the enumeration if we can enumerate sets and up to the max space provided, in order to keep the time down. We immediately give up if the char class has subtraction, but given that we've already limited how many values we're enumerating, if there is subtraction we can afford to query for just those chars that would otherwise pass in order to enable the subtraction. So, with this PR, we can now support using SearchValues in this manner: this means that whereas previously we would have generated an IndexOfAny for any of the ASCII characters or anything non-ASCII, then with a fallback for if we hit something non-ASCII, now we'll just create an IndexOfAny for the full set.

However, that triggered a (then defunct) assert which led me to see that we have a bunch of duplicated logic around asserts: we'd frequently be checking to see if a set contained at most 5 chars (in support of a time when we didn't have SearchValues and only optimized IndexOfAny for up to 5 chars) and then subsequently would see if it contained only ASCII. We no longer need that separation, especially since SearchValues will now both vectorize probabilistic map searches and will first do a search for the ASCII portion (or anything non-ASCII). This then means we can delete a variety of duplicated code while also expanding what we recognize for use with SearchValues.

This then lead to seeing that in a variety of places we compute the set of chars in a set and then check whether it could instead be satisfied just as a range but not if the set of chars is small. The former check is more expensive than the latter, but we were doing the first one first presumably in order to be able to do the set size check as part of the latter. However, we don't need it for that, as a single subtraction gives us the size of the range, so we can just do the range check first and skip the more expensive set check if it's not needed.

That then led to seeing that we're not using range-based searching in the interpreter or non-backtracking engines. This adds that support, such that the interpreter/non-backtracking engines will now search for the next starting location using IndexOfAny{Except}InRange if appropriate..

This started out as a small improvement for one thing and grew to be something else.

Initially, my intent was just to improve how `SearchValues<char>` applies to character classes with subtraction. Character class subtraction isn't frequently used, but it is a convenient way to express removing subsets of ranges, e.g. all ASCII other than digits `[\u0000-\u007F-[0-9]]`. Currently when we go to enumerate the characters in a char class, for perf reasons we only do the enumeration if we can enumerate sets and up to the max space provided, in order to keep the time down.  We immediately give up if the char class has subtraction, but given that we've already limited how many values we're enumerating, if there is subtraction we can afford to query for just those chars that would otherwise pass in order to enable the subtraction. So, with this PR, we can now support using SearchValues in this manner: **this means that whereas previously we would have generated an IndexOfAny for any of the ASCII characters or anything non-ASCII, then with a fallback for if we hit something non-ASCII, now we'll just create an IndexOfAny for the full set**.

However, that triggered a (then defunct) assert which led me to see that we have a bunch of duplicated logic around asserts: we'd frequently be checking to see if a set contained at most 5 chars (in support of a time when we didn't have SearchValues and only optimized IndexOfAny for up to 5 chars) and then subsequently would see if it contained only ASCII. We no longer need that separation, especially since SearchValues will now both vectorize probabilistic map searches and will first do a search for the ASCII portion (or anything non-ASCII).  **This then means we can delete a variety of duplicated code while also expanding what we recognize for use with SearchValues.**

This then lead to seeing that in a variety of places we compute the set of chars in a set and then check whether it could instead be satisfied just as a range but not if the set of chars is small. The former check is more expensive than the latter, but we were doing the first one first presumably in order to be able to do the set size check as part of the latter. However, we don't need it for that, as a single subtraction gives us the size of the range, **so we can just do the range check first and skip the more expensive set check if it's not needed.**

That then led to seeing that we're not using range-based searching in the interpreter or non-backtracking engines. **This adds that support, such that the interpreter/non-backtracking engines will now search for the next starting location using IndexOfAny{Except}InRange if appropriate.**.
Copy link
Member

@MihaZupan MihaZupan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

since SearchValues will now both vectorize probabilistic map searches and will first do a search for the ASCII portion

The probabilistic map currently only has a vectorized path for IndexOfAny.
We could add LastIndexOfAny pretty easily (the interesting code would be shared with IndexOfAny). I take it this change might make that more important?

We don't, however, have a vectorized fallback for AnyExcept after falling off the Ascii fast path (no "inverse bloom filter").
Currently, we'll fall back to an O(i * m) loop for that. We could make it O(i), but it'd still be char-by-char.
Is that a concern here?

…r.Emitter.cs

Co-authored-by: Miha Zupan <mihazupan.zupan1@gmail.com>
@stephentoub
Copy link
Member Author

We could add LastIndexOfAny pretty easily (the interesting code would be shared with IndexOfAny). I take it this change might make that more important?

Yeah, that'd be useful to add in, in particular if we can do it without adding in much code.

Is that a concern here?

Why would it be O(i * m)? Don't we have some O(1) implementation for determining whether a character is in the set represented by the SearchValues, which would make it O(i)? The alternative to not using SearchValues here is falling back to a linear walk, so I'm not too concerned about it as long as the overheads compared to that aren't meaningful.

@MihaZupan
Copy link
Member

Don't we have some O(1) implementation for determining whether a character is in the set represented by the SearchValues, which would make it O(i)

The single-character check when using the probabilistic map is currently O(needle) when the character is in the set (which is the common case for -Except operations). We could of course switch to a HashSet/lookup table instead.

@stephentoub
Copy link
Member Author

Don't we have some O(1) implementation for determining whether a character is in the set represented by the SearchValues, which would make it O(i)

The single-character check when using the probabilistic map is currently O(needle) when the character is in the set (which is the common case for -Except operations). We could of course switch to a HashSet/lookup table instead.

Ok, I hadn't realized that (or maybe I knew it at one point and forgot). It'd be good to switch it to using a set lookup.

@stephentoub stephentoub merged commit b040ed6 into dotnet:main Apr 3, 2024
@stephentoub stephentoub deleted the regextweaks branch April 3, 2024 16:05
@TonyValenti
Copy link

I just wanted y'all to know I love reading commit notes like this!

@matouskozak
Copy link
Member

Possible related Mono regressions:

@stephentoub
Copy link
Member Author

stephentoub commented Apr 10, 2024

The regressions mostly look to be places we're now using SearchValues<char> more. For example, the mono regression is because code previously generated to be:

while ((uint)iteration < (uint)slice.Length && ((ch = slice[iteration]) < 128 ? ("\0\0怀Ͽ\ufffe\ufffe\u07ff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\0\f\0-/0:A[_`a{KÅ")))
{
    iteration++;
}

is now generated to be:

int iteration = slice.IndexOfAnyExcept(Utilities.s_nonAscii_2D303132333435363738394142434445464748494A4B4C4D4E4F505152535455565758595A6162636465666768696A6B6C6D6E6F707172737475767778797AE284AA);
if (iteration < 0)
{
    iteration = slice.Length;
}
...
internal static readonly SearchValues<char> s_nonAscii_2D303132333435363738394142434445464748494A4B4C4D4E4F505152535455565758595A6162636465666768696A6B6C6D6E6F707172737475767778797AE284AA = SearchValues.Create("-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK");

@MihaZupan, will this hit problematic paths? (Note the non-ASCII Kelvin sign at the end there.)

@MihaZupan
Copy link
Member

MihaZupan commented Apr 10, 2024

Yes, IndexOfAnyExcept here could hit the slow O(i * m) fallback.

If have have explicit support for ISAs (IndexOfAnyAsciiSearcher.IsVectorizationSupported) we'll start with a vectorized ASCII scan, but as soon as we hit any non-ascii char, we'll go back to that same slow fallback.

I've been playing around with a few options here (e.g. perfect hash / lookup table ...), I'll still have to throw a few more hours at it though.

@MihaZupan
Copy link
Member

MihaZupan commented Apr 11, 2024

E.g. for "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK" if you hit a non-ASCII char:
(current vs. O(1) checks)

Method Toolchain Length TextHasNonAscii Mean Ratio
IndexOfAnyExcept main 1000 True 4,020.8 ns 1.00
IndexOfAnyExcept pr 1000 True 661.3 ns 0.16

@stephentoub
Copy link
Member Author

E.g. for "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK" if you hit a non-ASCII char: (current vs. O(1) checks)

Method Toolchain Length TextHasNonAscii Mean Ratio
IndexOfAnyExcept main 1000 True 4,020.8 ns 1.00
IndexOfAnyExcept pr 1000 True 661.3 ns 0.16

"pr" here is with your upcoming fixes?

@MihaZupan
Copy link
Member

Yeah, should be on that order of magnitude

matouskozak pushed a commit to matouskozak/runtime that referenced this pull request Apr 30, 2024
* Several improvements / simplifications in Regex

This started out as a small improvement for one thing and grew to be something else.

Initially, my intent was just to improve how `SearchValues<char>` applies to character classes with subtraction. Character class subtraction isn't frequently used, but it is a convenient way to express removing subsets of ranges, e.g. all ASCII other than digits `[\u0000-\u007F-[0-9]]`. Currently when we go to enumerate the characters in a char class, for perf reasons we only do the enumeration if we can enumerate sets and up to the max space provided, in order to keep the time down.  We immediately give up if the char class has subtraction, but given that we've already limited how many values we're enumerating, if there is subtraction we can afford to query for just those chars that would otherwise pass in order to enable the subtraction. So, with this PR, we can now support using SearchValues in this manner: **this means that whereas previously we would have generated an IndexOfAny for any of the ASCII characters or anything non-ASCII, then with a fallback for if we hit something non-ASCII, now we'll just create an IndexOfAny for the full set**.

However, that triggered a (then defunct) assert which led me to see that we have a bunch of duplicated logic around asserts: we'd frequently be checking to see if a set contained at most 5 chars (in support of a time when we didn't have SearchValues and only optimized IndexOfAny for up to 5 chars) and then subsequently would see if it contained only ASCII. We no longer need that separation, especially since SearchValues will now both vectorize probabilistic map searches and will first do a search for the ASCII portion (or anything non-ASCII).  **This then means we can delete a variety of duplicated code while also expanding what we recognize for use with SearchValues.**

This then lead to seeing that in a variety of places we compute the set of chars in a set and then check whether it could instead be satisfied just as a range but not if the set of chars is small. The former check is more expensive than the latter, but we were doing the first one first presumably in order to be able to do the set size check as part of the latter. However, we don't need it for that, as a single subtraction gives us the size of the range, **so we can just do the range check first and skip the more expensive set check if it's not needed.**

That then led to seeing that we're not using range-based searching in the interpreter or non-backtracking engines. **This adds that support, such that the interpreter/non-backtracking engines will now search for the next starting location using IndexOfAny{Except}InRange if appropriate.**.

* Update src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs

Co-authored-by: Miha Zupan <mihazupan.zupan1@gmail.com>

---------

Co-authored-by: Miha Zupan <mihazupan.zupan1@gmail.com>
@github-actions github-actions bot locked and limited conversation to collaborators May 12, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants