KEMBAR78
Use SearchValues for non-ASCII Regex fallback by stephentoub · Pull Request #89140 · dotnet/runtime · GitHub
Skip to content

Conversation

@stephentoub
Copy link
Member

When we encounter a set for which we can't use one of the IndexOfXx variants, we use IndexOfAny with SearchValues. If the whole set is ASCII, this is easy. But if there are any non-ASCII characters in the set, we currently emit a helper method that first does a search for all of the ASCII values or anything that's non-ASCII, and then if something non-ASCII is found, it proceeds to do a scalar scan matching every character.

Now that SearchValues has a vectorized implementation of a probabilistic map, we can instead fall back to that when the set isn't too large and thus can be fully enumerated into a SearchValues instance. We still do this as a two-step process, as searching for the ASCII subset is measurably faster than the probabilistic map search. However, if we see that there's no ASCII at all in the target set, we can then skip the helper entirely and just do the IndexOfAny with the SearchValues for the non-ASCII targets.

For example, previously:

[GeneratedRegex(@"[a-z]", RegexOptions.IgnoreCase)]

would result in emitting this helper (since it's not only inclusive of A-Z and a-z, but also the Kelvin sign)

        /// <summary>Finds the next index of any character that matches a character in the set [A-Za-z\u212A].</summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static int IndexOfNonAsciiOrAny_0E8811570F3718B3BBF9E5001539AEB605A6A55AB8C5DCF3ECD130597FA78CEC(this ReadOnlySpan<char> span)
        {
            int i = span.IndexOfAnyExcept(Utilities.s_asciiExceptLetters);
            if ((uint)i < (uint)span.Length)
            {
                if (char.IsAscii(span[i]))
                {
                    return i;
                }
        
                char ch;
                do
                {
                    if (((ch = span[i]) < 128 ? char.IsAsciiLetter(ch) : RegexRunner.CharInClass((char)ch, "\0\u0006\0A[a{KÅ")))
                    {
                        return i;
                    }
                    i++;
                }
                while ((uint)i < (uint)span.Length);
            }
        
            return -1;
        }
        
        /// <summary>Supports searching for characters in or not in "\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&amp;'()*+,-./0123456789:;&lt;=&gt;?@[\\]^_`{|}~\u007f".</summary>
        internal static readonly SearchValues<char> s_asciiExceptLetters = SearchValues.Create("\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-./0123456789:;<=>?@[\\]^_`{|}~\u007f");

That helper is now emitted as:

        /// <summary>Finds the next index of any character that matches a character in the set [A-Za-z\u212A].</summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static int IndexOfNonAsciiOrAny_0E8811570F3718B3BBF9E5001539AEB605A6A55AB8C5DCF3ECD130597FA78CEC(this ReadOnlySpan<char> span)
        {
            // Search for the first character that's either ASCII and in the target set or non-ASCII (whether or not it's in the target set.
            int i = span.IndexOfAnyExcept(Utilities.s_asciiExceptLetters);
            if ((uint)i < (uint)span.Length)
            {
                // If the character at the found position is ASCII, it's in the target set.
                if (char.IsAscii(span[i]))
                {
                    return i;
                }
        
                // Search for the first character that's in the target set.
                int j = span.Slice(i).IndexOfAny(Utilities.s_nonAscii_326E1FD0AD567A84CAD13F2BE521A57789829F59D59ABE37F9E111D0182B6601);
                if (j >= 0)
                {
                    return i + j;
                }
            }
        
            // No match found.
            return -1;
        }
        
        /// <summary>Supports searching for characters in or not in "\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&amp;'()*+,-./0123456789:;&lt;=&gt;?@[\\]^_`{|}~\u007f".</summary>
        internal static readonly SearchValues<char> s_asciiExceptLetters = SearchValues.Create("\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-./0123456789:;<=>?@[\\]^_`{|}~\u007f");
        
        /// <summary>Supports searching for characters in or not in "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK".</summary>
        internal static readonly SearchValues<char> s_nonAscii_326E1FD0AD567A84CAD13F2BE521A57789829F59D59ABE37F9E111D0182B6601 = SearchValues.Create("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK");

@stephentoub stephentoub added this to the 8.0.0 milestone Jul 18, 2023
@ghost ghost assigned stephentoub Jul 18, 2023
@ghost
Copy link

ghost commented Jul 18, 2023

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

When we encounter a set for which we can't use one of the IndexOfXx variants, we use IndexOfAny with SearchValues. If the whole set is ASCII, this is easy. But if there are any non-ASCII characters in the set, we currently emit a helper method that first does a search for all of the ASCII values or anything that's non-ASCII, and then if something non-ASCII is found, it proceeds to do a scalar scan matching every character.

Now that SearchValues has a vectorized implementation of a probabilistic map, we can instead fall back to that when the set isn't too large and thus can be fully enumerated into a SearchValues instance. We still do this as a two-step process, as searching for the ASCII subset is measurably faster than the probabilistic map search. However, if we see that there's no ASCII at all in the target set, we can then skip the helper entirely and just do the IndexOfAny with the SearchValues for the non-ASCII targets.

For example, previously:

[GeneratedRegex(@"[a-z]", RegexOptions.IgnoreCase)]

would result in emitting this helper (since it's not only inclusive of A-Z and a-z, but also the Kelvin sign)

        /// <summary>Finds the next index of any character that matches a character in the set [A-Za-z\u212A].</summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static int IndexOfNonAsciiOrAny_0E8811570F3718B3BBF9E5001539AEB605A6A55AB8C5DCF3ECD130597FA78CEC(this ReadOnlySpan<char> span)
        {
            int i = span.IndexOfAnyExcept(Utilities.s_asciiExceptLetters);
            if ((uint)i < (uint)span.Length)
            {
                if (char.IsAscii(span[i]))
                {
                    return i;
                }
        
                char ch;
                do
                {
                    if (((ch = span[i]) < 128 ? char.IsAsciiLetter(ch) : RegexRunner.CharInClass((char)ch, "\0\u0006\0A[a{KÅ")))
                    {
                        return i;
                    }
                    i++;
                }
                while ((uint)i < (uint)span.Length);
            }
        
            return -1;
        }
        
        /// <summary>Supports searching for characters in or not in "\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&amp;'()*+,-./0123456789:;&lt;=&gt;?@[\\]^_`{|}~\u007f".</summary>
        internal static readonly SearchValues<char> s_asciiExceptLetters = SearchValues.Create("\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-./0123456789:;<=>?@[\\]^_`{|}~\u007f");

That helper is now emitted as:

        /// <summary>Finds the next index of any character that matches a character in the set [A-Za-z\u212A].</summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static int IndexOfNonAsciiOrAny_0E8811570F3718B3BBF9E5001539AEB605A6A55AB8C5DCF3ECD130597FA78CEC(this ReadOnlySpan<char> span)
        {
            // Search for the first character that's either ASCII and in the target set or non-ASCII (whether or not it's in the target set.
            int i = span.IndexOfAnyExcept(Utilities.s_asciiExceptLetters);
            if ((uint)i < (uint)span.Length)
            {
                // If the character at the found position is ASCII, it's in the target set.
                if (char.IsAscii(span[i]))
                {
                    return i;
                }
        
                // Search for the first character that's in the target set.
                int j = span.Slice(i).IndexOfAny(Utilities.s_nonAscii_326E1FD0AD567A84CAD13F2BE521A57789829F59D59ABE37F9E111D0182B6601);
                if (j >= 0)
                {
                    return i + j;
                }
            }
        
            // No match found.
            return -1;
        }
        
        /// <summary>Supports searching for characters in or not in "\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&amp;'()*+,-./0123456789:;&lt;=&gt;?@[\\]^_`{|}~\u007f".</summary>
        internal static readonly SearchValues<char> s_asciiExceptLetters = SearchValues.Create("\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-./0123456789:;<=>?@[\\]^_`{|}~\u007f");
        
        /// <summary>Supports searching for characters in or not in "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK".</summary>
        internal static readonly SearchValues<char> s_nonAscii_326E1FD0AD567A84CAD13F2BE521A57789829F59D59ABE37F9E111D0182B6601 = SearchValues.Create("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK");
Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: 8.0.0

When we encounter a set for which we can't use one of the IndexOfXx variants, we use IndexOfAny with SearchValues.  If the whole set is ASCII, this is easy.  But if there are any non-ASCII characters in the set, we currently emit a helper method that first does a search for all of the ASCII values or anything that's non-ASCII, and then if something non-ASCII is found, it proceeds to do a scalar scan matching every character.

Now that SearchValues has a vectorized implementation of a probabilistic map, we can instead fall back to that when the set isn't too large and thus can be fully enumerated into a SearchValues instance.  We still do this as a two-step process, as searching for the ASCII subset is measurably faster than the probabilistic map search.  However, if we see that there's no ASCII at all in the target set, we can then skip the helper entirely and just do the IndexOfAny with the SearchValues for the non-ASCII targets.
@stephentoub
Copy link
Member Author

Closing in favor of #89205 now that #89155 has been merged.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 19, 2023
@stephentoub stephentoub deleted the searchvaluesfallback branch January 8, 2024 14:59
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant