KEMBAR78
Augment Regex extensibility point for better perf and span-based matching · Issue #59629 · dotnet/runtime · GitHub
Skip to content

Augment Regex extensibility point for better perf and span-based matching #59629

@stephentoub

Description

@stephentoub

Background and motivation

There are several related Regex efforts planned for .NET 7:

  1. Adding a regex source generator (merged into runtime in Add initial regex source generator #59186)
  2. Adding a DFA-based engine (merged into runtime in Add RegexOptions.NonBacktracking #60607)

On top of that, there are several long-standing issues to be addressed:

  1. Add support for matching with ReadOnlySpan<char> inputs in addition to string inputs.
  2. Allow regex matching to be zero-alloc. Today, finding the locations of matches requires allocating Match objects (and supporting data structures) for every match (on my 64-bit machine, a successful Match(string) with no subcaptures allocates three objects for a total of ~170b).
  3. Continual push to improve performance of matching. Regex optimization opportunities #1349)

Each of these brings with it some difficulties that, altogether, suggest we augment how Regex supports extensibility. Regex enables compilation of regular expressions via an extensibility point (such compilation includes in-memory with RegexOptions.Compiled, Regex.CompileToAssembly on .NET Framework, and now the RegexGenerator in .NET 7). Regex exposes several protected fields (!), one of which is of type RegexRunnerFactory. When the public APIs on Regex need to perform a match, they do so via a RegexRunner, and if no RegexRunner is available (e.g. we’ve not performed a match yet), that RegexRunnerFactory is used to create one. RegexRunner then has several abstract methods, the most important ones being bool FindFirstChar() and void Go(). Thus, when we compile a regex, we need to produce a RegexRunner-derived type that overrides FindFirstChar/Go, a RegexRunnerFactory-derived type that overrides CreateInstance(), and a Regex-derived type that sets the protected factory field to an instance of this RegexRunnerFactory. This is exactly what the source generator spits out for a given use of [RegexGenerator(…)].

Now, there are a few relevant issues here:

  1. We can’t store a span onto a RegexRunner, and as FindFirstChar/Go are parameterless, we don’t have any way to pass a span into them. Thus, we currently lack a means for having the code generated in overrides of these from operating on ReadOnlySpan<char>. To match an input ReadOnlySpan<char>, we could convert the input ReadOnlySpan<char> into a string, but obviously that defeats the point.
  2. The original split between FindFirstChar and Go was likely predicated on the notion that there was no benefit to combining them because the employed matching algorithm logically splits them in this way: find a place that could match, try to match, find the next place, try to match, etc. But that is not how a DFA-based engine operates, where every character moves to the next state, whether that character is ultimately part of the match or not. This split is thus very artificial for a DFA-based matcher, which essentially has all of the logic in Go, with FindFirstChar effectively being a nop.
  3. With our current engine, it’s effectively zero cost to determine the bounds of a match once we know that there is a match. With the DFA-based engine, we don’t know the starting position of a match when we determine there is a match, only the ending position, and thus we need to do more work to determine the bounds of the match. These costs could be avoided for IsMatch operations, where the starting position isn’t relevant, but whereas the internal loop today that invokes FindFirstChar/Go knows whether this is IsMatch or not, that information isn’t available to overrides of FindFirstChar/Go, which means today it’s unavailable to source generated code.
  4. Even for the existing implementation, the split between FindFirstChar and Go leads to various inefficiencies. For example, every time we find a possible match location, we need to make two virtual calls, one to FindFirstChar and one to Go. Each of those calls needs to load the relevant state from fields. We end up duplicating some matching logic because FindFirstChar wants to validate some things about the match before handing it off to Go in order to avoid those exact costs through the outer bump loop. Etc.

These issues could all be addressed internally to the library, except that doing so can’t support the source generator, as doing so would require internal APIs that the code it generates can’t see. We thus need just enough new surface area to support this. We also want to address this in the same release as shipping the source generator initially; otherwise, any implementations generated based on .NET 7 would lack the support necessary to enable these things, and all such libraries would need to be recompiled to benefit, e.g. we could make an IsMatch(ReadOnlySpan<char>) work without recompilation, but it would involving converting the span to a string.

This issue mostly replaces #23602. However, that issue proposes additional span-based Split, Replace, and {Un}Escape methods, all of which could be built once this functionality is in place.

API Proposal

We should:

  1. Expose a new virtual RegexRunner.Scan method:
namespace System.Text.RegularExpressions
{
    public class RegexRunner
    {
+        protected virtual void Scan(ReadOnlySpan<char> text);
    }
}

The base implementation would effectively contain the existing bumpalong loop that invokes both FindFirstChar and Go. It would compare the provided span against runtext and set runtext to span.ToString() only if they differ (as noted below, this base implementation will rarely be used and will be there purely for compat).

  1. Override this Scan method on all our implementations and in the source generator-generated code: the only time you’d get the base implementation would be if you were using an old .NET Framework-generated CompileToAssembly assembly, and the only time it would lead to inefficiencies is if you then used those in combination with the new span-based public APIs. The implementations of FindFirstChar/Go for the interpreter, RegexOptions.Compiled, and source generator would all be combined into this single method. NonBacktracking effectively already has its combined.

  2. Make the existing FindFirstChar/Go/InitTrackCount methods virtual instead of abstract. The first two would throw NotImplementedExceptions; the third would return 0. None of our implementations would need to override these methods anymore, just Scan.

namespace System.Text.RegularExpressions
{
  public class RegexRunner
  {
-    protected abstract bool FindFirstChar();
-    protected abstract void Go();
-    protected abstract void InitTrackCount();
+    protected virtual bool FindFirstChar() => throw new NotImplementedException(); //default implementation
+    protected virtual void Go() => throw new NotImplementedException(); //default implementation
+    protected virtual void InitTrackCount() => return 0; //default implementation
  }
}
  1. Add new public IsMatch methods to Regex (these are the same set of overloads as for string, but with a ReadOnlySpan<char> input):
namespace System.Text.RegularExpressions
{
    public class Regex
    {
+        public bool IsMatch(ReadOnlySpan<char> input);
+        public static bool IsMatch(ReadOnlySpan<char> input, string pattern);
+        public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options);
+        public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options, TimeSpan matchTimeout);
    }
}

API Usage

bool matched = Regex.IsMatch(spanInput, "pattern");

Risks

No response

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions