Description
This issue breaks out one part of the original proposal: #59629
cc: @stephentoub
Background and motivation
We have an ongoing Regex investment for .NET 7 that adds span-based matching APIs, enabling allocation-free operations. The APIs being worked on in #59629 cover only the IsMatch overloads, where the caller only cares whether the input is a match and doesn't need the Match object back. We can't really provide APIs that work over ReadOnlySpan<char>, are alloc-free, and return a Match object: the Match object holds the string that was matched along with the captures, and because Match is a class (a reference type), it can't hold a span as a field.
EnumerateMatches would not permit access to the full list of groups and captures, only the index/offset of the top-level capture. In exchange, these methods can become amortized zero-alloc: the enumerator is a ref struct, no objects are yielded, the input is a span, and the matching engine can reuse the internal Match object (and its supporting arrays), just as IsMatch does today. Callers who still need the full details can either start from strings and use the existing Match or Matches methods, or (for some patterns, e.g. ones that don't have anchors, lookaheads, or lookbehinds that might look beyond the match boundaries) re-run the engine with Match(matchString) on just the string representing the area of the input that matched. (The trouble with adding Regex.Match/Matches overloads for spans is that the Match and MatchCollection types can't store a span, so parts of the surface area on these types, such as NextMatch, couldn't function with spans. If we were to accept that, we could add span-based methods for those as well, but the result would likely be confusing and inconsistent.)
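To illustrate why the re-run fallback only suits some patterns, here is a minimal sketch that uses only the existing string-based API. The input text and the pattern are assumptions chosen for illustration: the pattern relies on a lookbehind, so matching against just the matched substring loses the context the pattern needs.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input and pattern, chosen so the match depends on text before it.
string input = "cost $42";
Regex regex = new Regex(@"(?<=\$)\d+");

// Matching against the whole input succeeds: "42" is preceded by '$'.
Match full = regex.Match(input);

// Re-running on only the matched substring drops the '$',
// so the lookbehind can no longer be satisfied.
string slice = input.Substring(full.Index, full.Length);
Match rerun = regex.Match(slice);

Console.WriteLine(full.Success);  // True
Console.WriteLine(rerun.Success); // False
```

For patterns without such context-dependent constructs, the re-run yields the same top-level match and additionally exposes the groups and captures.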
API Proposal
namespace System.Text.RegularExpressions
{
+    public readonly ref struct ValueMatch
+    {
+        public Range Range { get { throw null; } }
+    }
     public class Regex : ISerializable
     {
+        public ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input) { throw null; }
+        public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, [StringSyntax(StringSyntaxAttribute.Regex)] string pattern) { throw null; }
+        public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, [StringSyntax(StringSyntaxAttribute.Regex)] string pattern, RegexOptions options) { throw null; }
+        public static ValueMatchEnumerator EnumerateMatches(ReadOnlySpan<char> input, [StringSyntax(StringSyntaxAttribute.Regex)] string pattern, RegexOptions options, TimeSpan matchTimeout) { throw null; }
+        public ref struct ValueMatchEnumerator
+        {
+            public readonly ValueMatchEnumerator GetEnumerator() { throw null; }
+            public bool MoveNext() { throw null; }
+            public readonly ValueMatch Current { get { throw null; } }
+        }
     }
}
API Usage
Regex regex = new Regex(@"\b\w+\b");
ReadOnlySpan<char> span = loremIpsum.AsSpan();
int lowerCaseWords = 0;
foreach (ValueMatch word in regex.EnumerateMatches(span))
{
    if (span[word.Range][0] >= 'a' && span[word.Range][0] <= 'z')
    {
        lowerCaseWords++;
    }
}
Alternative Designs
One topic for discussion is that the lack of access to capture data might confuse consumers: it is not obvious that the intention of this method is to be allocation free, so consumers might expect to get an enumerable of Match objects back. One idea, suggested by @stephentoub, is to also provide ref struct versions of Match, Group, and Capture that reference the ReadOnlySpan<char>, and to return those from the Enumerate method instead. That would make the return value more intuitive and closer to what consumers might expect.
Risks
We want to make sure that introducing this new API doesn't create confusion about whether it should be used over an existing API such as Matches, which already returns an enumerable of Match objects. We have to make it obvious when one should be used over the other.
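For contrast with the snippet under API Usage, the same lowercase-word count can be sketched with the existing string-based Matches API. The sample text is an assumption for illustration. This version yields a Match object per word, which is the allocation cost the span-based enumeration is meant to avoid, but it also gives access to groups and captures.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample text standing in for loremIpsum.
string text = "Lorem ipsum dolor sit Amet";
Regex regex = new Regex(@"\b\w+\b");

int lowerCaseWords = 0;
foreach (Match word in regex.Matches(text))
{
    // Each Match is a heap object that carries the full match details.
    char first = word.Value[0];
    if (first >= 'a' && first <= 'z')
    {
        lowerCaseWords++;
    }
}

Console.WriteLine(lowerCaseWords); // 3
```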