KEMBAR78
`<regex>`: Avoid generating unnecessary single-branch `_N_if` nodes by muellerj2 · Pull Request #5539 · microsoft/STL · GitHub
Skip to content

Conversation

muellerj2
Copy link
Contributor

@muellerj2 muellerj2 commented May 24, 2025

The parser likes to generate single-branch _N_if nodes in the NFA for no good reason. These nodes have a noticeable runtime cost in the matcher. This PR avoids generating these nodes.

This change has side effects:

  • It strictly reduces the necessary recursion in the matcher to match a specific input string to a specific regex. This means that some strings that ran into a stack overflow or an error_stack exception before will match successfully after this PR.
  • Conversely, this means that some _Matcher2::_Do_rep() calls can now take the place previously occupied by recursive calls of _Matcher2::_Do_if(). _Matcher2::_Do_rep() takes more bytes from the stack (I didn't check recently, but I remember it was about 100 bytes per call?), so there will be a very small set of input strings that will run into an actual stack overflow now instead of an error_stack exception.

Without further changes, there would be more side effects because some loops would get classified as simple now. However, I preempted this reclassification by adjusting _Parser2::_Calculate_loop_simplicity(). There are two reasons for this.

  • The simple loop optimization, as implemented, reduces the required amount of stack space but in exchange can square the running time when matching some input strings to a regex. I think, though, that we could take this risk because the input strings couldn't be longer than a few hundred characters anyway before running into a stack overflow, so it's unlikely to make a major runtime difference in practice when matching didn't run into a stack overflow or error_stack exception before, and it's more important to make more regexes resistant against stack overflow. For example, if we were to allow the reclassification, this would fix the stack overflow for the regex in <regex>: Grouping within repetition causes regex stack error #997.
  • The more important reason is that I'm not sure yet what the best subset of simple loops is for a non-recursive matcher, so I would like to retain more wiggle room here until the requirements become clearer while rewriting the matcher.

I added no new tests because (a) there is no obvious additional set of regexes to test and (b) this change affects the generated NFAs for all regexes without exception (all of them contained at least one single-branch _N_if node before), so the whole existing test suite serves as test coverage for this change.

Benchmark

Benchmark Before After Speedup
bm_lorem_search/"bibe"/2 31145 ns 28878 ns 1.08
bm_lorem_search/"bibe"/3 61384 ns 57199 ns 1.07
bm_lorem_search/"bibe"/4 122768 ns 114397 ns 1.07
bm_lorem_search/"(bibe)"/2 57812 ns 43945 ns 1.32
bm_lorem_search/"(bibe)"/3 141246 ns 92072 ns 1.53
bm_lorem_search/"(bibe)"/4 284934 ns 187976 ns 1.52
bm_lorem_search/"(bibe)+"/2 92072 ns 62779 ns 1.47
bm_lorem_search/"(bibe)+"/3 179983 ns 122768 ns 1.47
bm_lorem_search/"(bibe)+"/4 376607 ns 256319 ns 1.47
bm_lorem_search/"(?:bibe)+"/2 55804 ns 48652 ns 1.15
bm_lorem_search/"(?:bibe)+"/3 112305 ns 96257 ns 1.17
bm_lorem_search/"(?:bibe)+"/4 224609 ns 196725 ns 1.14

@muellerj2 muellerj2 requested a review from a team as a code owner May 24, 2025 12:57
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews May 24, 2025
@muellerj2 muellerj2 force-pushed the regex-avoid-generating-single-branch-ifs branch from 8314d2a to db37000 Compare May 24, 2025 13:00
@StephanTLavavej StephanTLavavej self-assigned this May 27, 2025
@StephanTLavavej StephanTLavavej added performance Must go faster regex meow is a substring of homeowner labels May 27, 2025
case _N_capture:
// TRANSITION, requires more research to decide on the subset of loops that we can make simple:
// - Simple mode can square the running time when matching a regex to an input string in the current matcher
// - The optimal subset of simple loops for a non-recursive rewrite of the matcher aren't clear yet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should say "The optimal subset ... isn't clear yet" but it isn't worth resetting testing.

@StephanTLavavej StephanTLavavej removed their assignment May 28, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews May 28, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews May 28, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 1b1f5b0 into microsoft:main May 28, 2025
48 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews May 28, 2025
@StephanTLavavej
Copy link
Member

🚀 ⏱️ 💚

@muellerj2 muellerj2 deleted the regex-avoid-generating-single-branch-ifs branch May 31, 2025 21:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Must go faster regex meow is a substring of homeowner

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

2 participants