Augment Regex extensibility point for better perf and span-based matching

### Background and motivation

There are several related `Regex` efforts planned for .NET 7:
1. Adding a regex source generator (merged into runtime in #59186)
2. Adding a DFA-based engine (merged into runtime in https://github.com/dotnet/runtime/pull/60607)
 
On top of that, there are several long-standing issues to be addressed:

3. Add support for matching with `ReadOnlySpan<char>` inputs in addition to string inputs.
4. Allow regex matching to be zero-alloc.  Today, finding the locations of matches requires allocating `Match` objects (and supporting data structures) for every match (on my 64-bit machine, a successful Match(string) with no subcaptures allocates three objects for a total of ~170b).
5. Continual push to improve performance of matching. #1349)
 
Each of these brings with it some difficulties that, altogether, suggest we augment how `Regex` supports extensibility. `Regex` enables compilation of regular expressions via an extensibility point (such compilation includes in-memory with `RegexOptions.Compiled`, `Regex.CompileToAssembly` on .NET Framework, and now the `RegexGenerator` in .NET 7).  Regex exposes several protected fields (!), one of which is of type `RegexRunnerFactory`.  When the public APIs on `Regex` need to perform a match, they do so via a `RegexRunner`, and if no `RegexRunner` is available (e.g. we’ve not performed a match yet), that `RegexRunnerFactory` is used to create one.  `RegexRunner` then has several abstract methods, the most important ones being `bool FindFirstChar()` and `void Go()`.  Thus, when we compile a regex, we need to produce a `RegexRunner`-derived type that overrides `FindFirstChar`/`Go`, a `RegexRunnerFactory`-derived type that overrides `CreateInstance()`, and a Regex-derived type that sets the protected factory field to an instance of this `RegexRunnerFactory`.  This is exactly what the source generator spits out for a given use of `[RegexGenerator(…)]`.
 
Now, there are a few relevant issues here:
1. We can’t store a span onto a `RegexRunner`, and as `FindFirstChar`/`Go` are parameterless, we don’t have any way to pass a span into them.  Thus, we currently lack a means for having the code generated in overrides of these from operating on `ReadOnlySpan<char>`.  To match an input `ReadOnlySpan<char>`, we could convert the input `ReadOnlySpan<char>` into a string, but obviously that defeats the point.
2. The original split between `FindFirstChar` and `Go` was likely predicated on the notion that there was no benefit to combining them because the employed matching algorithm logically splits them in this way: find a place that could match, try to match, find the next place, try to match, etc.  But that is not how a DFA-based engine operates, where every character moves to the next state, whether that character is ultimately part of the match or not.  This split is thus very artificial for a DFA-based matcher, which essentially has all of the logic in Go, with FindFirstChar effectively being a nop.
3. With our current engine, it’s effectively zero cost to determine the bounds of a match once we know that there is a match.  With the DFA-based engine, we don’t know the starting position of a match when we determine there is a match, only the ending position, and thus we need to do more work to determine the bounds of the match.  These costs could be avoided for IsMatch operations, where the starting position isn’t relevant, but whereas the internal loop today that invokes FindFirstChar/Go knows whether this is IsMatch or not, that information isn’t available to overrides of FindFirstChar/Go, which means today it’s unavailable to source generated code.
4. Even for the existing implementation, the split between FindFirstChar and Go leads to various inefficiencies.  For example, every time we find a possible match location, we need to make two virtual calls, one to FindFirstChar and one to Go.  Each of those calls needs to load the relevant state from fields.  We end up duplicating some matching logic because FindFirstChar wants to validate some things about the match before handing it off to Go in order to avoid those exact costs through the outer bump loop.  Etc.
 
These issues could all be addressed internally to the library, except that doing so can’t support the source generator, as doing so would require internal APIs that the code it generates can’t see.  We thus need just enough new surface area to support this.  We also want to address this in the same release as shipping the source generator initially; otherwise, any implementations generated based on .NET 7 would lack the support necessary to enable these things, and all such libraries would need to be recompiled to benefit, e.g. we could make an `IsMatch(ReadOnlySpan<char>)` work without recompilation, but it would involving converting the span to a string.

This issue mostly replaces https://github.com/dotnet/runtime/issues/23602.  However, that issue proposes additional span-based Split, Replace, and {Un}Escape methods, all of which could be built once this functionality is in place.

### API Proposal

We should:
 
1. Expose a new virtual RegexRunner.Scan method:
```diff
namespace System.Text.RegularExpressions
{
    public class RegexRunner
    {
+        protected virtual void Scan(ReadOnlySpan<char> text);
    }
}
```
The base implementation would effectively contain the existing bumpalong loop that invokes both `FindFirstChar` and `Go`.  It would compare the provided span against `runtext` and set `runtext` to `span.ToString()` only if they differ (as noted below, this base implementation will rarely be used and will be there purely for compat).
 
2. Override this `Scan` method on all our implementations and in the source generator-generated code: the only time you’d get the base implementation would be if you were using an old .NET Framework-generated `CompileToAssembly` assembly, and the only time it would lead to inefficiencies is if you then used those in combination with the new span-based public APIs.  The implementations of `FindFirstChar`/`Go` for the interpreter, `RegexOptions.Compiled`, and source generator would all be combined into this single method.  `NonBacktracking` effectively already has its combined.
 
3. Make the existing `FindFirstChar`/`Go`/`InitTrackCount` methods virtual instead of abstract. The first two would throw `NotImplementedExceptions`; the third would return `0`.  None of our implementations would need to override these methods anymore, just `Scan`.
```diff
namespace System.Text.RegularExpressions
{
  public class RegexRunner
  {
-    protected abstract bool FindFirstChar();
-    protected abstract void Go();
-    protected abstract void InitTrackCount();
+    protected virtual bool FindFirstChar() => throw new NotImplementedException(); //default implementation
+    protected virtual void Go() => throw new NotImplementedException(); //default implementation
+    protected virtual void InitTrackCount() => return 0; //default implementation
  }
}
```
 
4. Add new public `IsMatch` methods to `Regex` (these are the same set of overloads as for string, but with a `ReadOnlySpan<char>` input):
```diff
namespace System.Text.RegularExpressions
{
    public class Regex
    {
+        public bool IsMatch(ReadOnlySpan<char> input);
+        public static bool IsMatch(ReadOnlySpan<char> input, string pattern);
+        public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options);
+        public static bool IsMatch(ReadOnlySpan<char> input, string pattern, RegexOptions options, TimeSpan matchTimeout);
    }
}
```

### API Usage

```C#
bool matched = Regex.IsMatch(spanInput, "pattern");
```

### Risks

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Augment Regex extensibility point for better perf and span-based matching #59629

Background and motivation

API Proposal

API Usage

Risks

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Augment Regex extensibility point for better perf and span-based matching #59629

Description

Background and motivation

API Proposal

API Usage

Risks

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions