KEMBAR78
Improve choice of IndexOfXx routine for some TryFindNextStartingPosition implementations by stephentoub · Pull Request #89099 · dotnet/runtime · GitHub
Skip to content

Conversation

@stephentoub
Copy link
Member

@stephentoub stephentoub commented Jul 18, 2023

Earlier in .NET 8, we updated the Regex compiler and source generator to be able to vectorize a search for any set, not just simple ones. When one of the main routines couldn't be used, we emit a specialized IndexOfAny helper that uses SearchValues to search for any matching ASCII character or a Unicode character, and if it encounters a Unicode character, it falls back to a linear scan. This meant that a bunch of sets that wouldn't previously have taken these paths now do, but some of those sets have more efficient means of searching; for example, for the set [^aA] that searches case-insensitive for anything other than an 'A', with these scheme we'll emit a whole routine that uses SearchValues with a fallback, but we could just use IndexOfAnyExcept('A', 'a'). This fixes the compiler / source generator to prefer such helpers instead when available.

For example, previously [GeneratedRegex(@"[^Aa]")] would result in this being emitted:

        /// <summary>Finds the next index of any character that matches a character in the set [^Aa].</summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static int IndexOfNonAsciiOrAny_43339B0AA38B69F44701E535D8179738784663016A5CB17A6B1AEB2FB5F9D08F(this ReadOnlySpan<char> span)
        {
            int i = span.IndexOfAnyExcept(Utilities.s_ascii_200000002000000);
            if ((uint)i < (uint)span.Length)
            {
                if (char.IsAscii(span[i]))
                {
                    return i;
                }
        
                do
                {
                    if (((span[i] | 0x20) != 'a'))
                    {
                        return i;
                    }
                    i++;
                }
                while ((uint)i < (uint)span.Length);
            }
        
            return -1;
        }
        
        /// <summary>Supports searching for characters in or not in "Aa".</summary>
        internal static readonly SearchValues<char> s_ascii_200000002000000 = SearchValues.Create("Aa");

which is then used like:

int i = inputSpan.Slice(pos).IndexOfNonAsciiOrAny_43339B0AA38B69F44701E535D8179738784663016A5CB17A6B1AEB2FB5F9D08F();

Now, that method isn't emitted, and the usage ends up just being:

int i = inputSpan.Slice(pos).IndexOfAnyExcept('A', 'a');

Fixes #84150

…ion implementations

Earlier in .NET 8, we updated the Regex compiler and source generator to be able to vectorize a search for any set, not just simple ones.  When one of the main routines couldn't be used, we emit a specialized IndexOfAny helper that uses SearchValues to search for any matching ASCII character or a Unicode character, and if it encounters a Unicode character, it falls back to a linear scan.  This meant that a bunch of sets that wouldn't previously have taken these paths now do, but some of those sets have more efficient means of searching; for example, for the set `[^aA]` that searches case-insensitive for anything other than an 'A', with these scheme we'll emit a whole routine that uses SearchValues with a fallback, but we could just use IndexOfAnyExcept('A', 'a').  This fixes the compiler / source generator to prefer such helpers instead when available.
@stephentoub stephentoub added this to the 8.0.0 milestone Jul 18, 2023
@ghost ghost assigned stephentoub Jul 18, 2023
@ghost
Copy link

ghost commented Jul 18, 2023

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Earlier in .NET 8, we updated the Regex compiler and source generator to be able to vectorize a search for any set, not just simple ones. When one of the main routines couldn't be used, we emit a specialized IndexOfAny helper that uses SearchValues to search for any matching ASCII character or a Unicode character, and if it encounters a Unicode character, it falls back to a linear scan. This meant that a bunch of sets that wouldn't previously have taken these paths now do, but some of those sets have more efficient means of searching; for example, for the set [^aA] that searches case-insensitive for anything other than an 'A', with these scheme we'll emit a whole routine that uses SearchValues with a fallback, but we could just use IndexOfAnyExcept('A', 'a'). This fixes the compiler / source generator to prefer such helpers instead when available.

For example, previously [GeneratedRegex(@"[^Aa]")] would result in this being emitted:

        /// <summary>Finds the next index of any character that matches a character in the set [^Aa].</summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static int IndexOfNonAsciiOrAny_43339B0AA38B69F44701E535D8179738784663016A5CB17A6B1AEB2FB5F9D08F(this ReadOnlySpan<char> span)
        {
            int i = span.IndexOfAnyExcept(Utilities.s_ascii_200000002000000);
            if ((uint)i < (uint)span.Length)
            {
                if (char.IsAscii(span[i]))
                {
                    return i;
                }
        
                do
                {
                    if (((span[i] | 0x20) != 'a'))
                    {
                        return i;
                    }
                    i++;
                }
                while ((uint)i < (uint)span.Length);
            }
        
            return -1;
        }
        
        /// <summary>Supports searching for characters in or not in "Aa".</summary>
        internal static readonly SearchValues<char> s_ascii_200000002000000 = SearchValues.Create("Aa");

which is then used like:

int i = inputSpan.Slice(pos).IndexOfNonAsciiOrAny_43339B0AA38B69F44701E535D8179738784663016A5CB17A6B1AEB2FB5F9D08F();

Now, that method isn't emitted, and the usage ends up just being:

int i = inputSpan.Slice(pos).IndexOfAnyExcept('A', 'a');
Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

Copy link
Member

@MihaZupan MihaZupan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

I appreciate the extra comments :)

@stephentoub stephentoub merged commit 74d69fd into dotnet:main Jul 18, 2023
@stephentoub stephentoub deleted the regexsearch branch July 18, 2023 21:38
@ghost ghost locked as resolved and limited conversation to collaborators Aug 18, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Regex should explicitly target IndexOfAnyExcept(char {, char, char}) when handling non-ASCII vectorized sets

2 participants