KEMBAR78
Improve literal-after-loop regex optimization by stephentoub · Pull Request #93190 · dotnet/runtime · GitHub
Skip to content

Conversation

@stephentoub
Copy link
Member

Regex currently has an optimization that looks to see whether the pattern begins with a set loop followed by some literal, in which case it can optimize the search for matches by searching for the literal and then walking backwards through the starting set. However, it's missing a handful of cases we can easily support:

  • It currently gives up if the set loop is wrapped in an atomic and/or a capture.
  • It currently gives up if the literal is a set that's wrapped in an atomic, capture, concatenate, loop, or lazy loop.
  • If the set loop is followed by an ignore-case string, it currently only searches for the starting set of that string, rather than more of it.
  • If the literal is a set, we'd only examine it if it was exactly one iteration (RegexNodeKind.Set) rather than a loop with at least one iteration.

This fixes all of those issues, such that the optimization extends to more patterns. In our regex database, there are currently 189 patterns that lead to using this optimization. With this change, that increases to 331.

Based on benchmark from https://github.com/BurntSushi/rebar#ruff-noqa:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("start", "end", "Error", "StdDev", "Job")]
public partial class Tests
{
    private static readonly Regex s_regex = new Regex("(\\s*)((?i:# noqa)(?::\\s?(([A-Z]+[0-9]+(?:[,\\s]+)?)+))?)", RegexOptions.Compiled);
    private static string s_haystack = new HttpClient().GetStringAsync("https://raw.githubusercontent.com/BurntSushi/rebar/master/benchmarks/haystacks/wild/cpython-226484e4.py").Result;

    [Benchmark]
    public int LineByLine()
    {
        int total = 0;
        foreach (ReadOnlySpan<char> line in s_haystack.AsSpan().EnumerateLines())
        {
            total += s_regex.Count(line);
        }
        return total;
    }

    [Benchmark]
    public int All() => s_regex.Count(s_haystack);
}
Method Toolchain Mean Ratio
LineByLine \main\corerun.exe 308.76 ms 1.00
LineByLine \pr\corerun.exe 66.82 ms 0.22
All \main\corerun.exe 278.16 ms 1.00
All \pr\corerun.exe 23.11 ms 0.08

Regex currently has an optimization that looks to see whether the pattern begins with a set loop followed by some literal, in which case it can optimize the search for matches by searching for the literal and then walking backwards through the starting set.  However, it's missing a handful of cases we can easily support:
- It currently gives up if the set loop is wrapped in an atomic and/or a capture.
- It currently gives up if the literal is a set that's wrapped in an atomic, capture, concatenate, loop, or lazy loop.
- If the set loop is followed by an ignore-case string, it currently only searches for the starting set of that string, rather than more of it.
- If the literal is a set, we'd only examine it if it was exactly one iteration (RegexNodeKind.Set) rather than a loop with at least one iteration.

This fixes all of those issues, such that the optimization extends to more patterns.
@ghost
Copy link

ghost commented Oct 8, 2023

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Regex currently has an optimization that looks to see whether the pattern begins with a set loop followed by some literal, in which case it can optimize the search for matches by searching for the literal and then walking backwards through the starting set. However, it's missing a handful of cases we can easily support:

  • It currently gives up if the set loop is wrapped in an atomic and/or a capture.
  • It currently gives up if the literal is a set that's wrapped in an atomic, capture, concatenate, loop, or lazy loop.
  • If the set loop is followed by an ignore-case string, it currently only searches for the starting set of that string, rather than more of it.
  • If the literal is a set, we'd only examine it if it was exactly one iteration (RegexNodeKind.Set) rather than a loop with at least one iteration.

This fixes all of those issues, such that the optimization extends to more patterns. In our regex database, there are currently 189 patterns that lead to using this optimization. With this change, that increases to 331.

Based on benchmark from https://github.com/BurntSushi/rebar#ruff-noqa:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[HideColumns("start", "end", "Error", "StdDev", "Job")]
public partial class Tests
{
    private static readonly Regex s_regex = new Regex("(\\s*)((?i:# noqa)(?::\\s?(([A-Z]+[0-9]+(?:[,\\s]+)?)+))?)", RegexOptions.Compiled);
    private static string s_haystack = new HttpClient().GetStringAsync("https://raw.githubusercontent.com/BurntSushi/rebar/master/benchmarks/haystacks/wild/cpython-226484e4.py").Result;

    [Benchmark]
    public int LineByLine()
    {
        int total = 0;
        foreach (ReadOnlySpan<char> line in s_haystack.AsSpan().EnumerateLines())
        {
            total += s_regex.Count(line);
        }
        return total;
    }

    [Benchmark]
    public int All() => s_regex.Count(s_haystack);
}
Method Toolchain Mean Ratio
LineByLine \main\corerun.exe 308.76 ms 1.00
LineByLine \pr\corerun.exe 66.82 ms 0.22
All \main\corerun.exe 278.16 ms 1.00
All \pr\corerun.exe 23.11 ms 0.08
Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions, tenet-performance

Milestone: 9.0.0

Copy link
Contributor

@buyaa-n buyaa-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good at my level

@stephentoub stephentoub merged commit 0cd1774 into dotnet:main Oct 19, 2023
@stephentoub stephentoub deleted the improveliteralafterloopopt branch October 19, 2023 01:25
@ghost ghost locked as resolved and limited conversation to collaborators Nov 18, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants