`<regex>`: Process positive lookahead assertions non-recursively by muellerj2 · Pull Request #5714 · microsoft/STL · GitHub


@muellerj2
Contributor

@muellerj2 muellerj2 commented Sep 10, 2025

Towards #997 and #1528. While the PR title describes the main observable effect, the actual main change is the implementation of manual unwinding of stack frames in _Match_pat. But at least one recursive call had to be replaced by the manual stack management to validate the main change, and positive assertions turned out to be the easiest option.

Previously, _Match_pat mainly consisted of an NFA interpreter loop. After this PR, _Match_pat consists of two main loops: An NFA interpreter loop and a stack unwinding loop (joined together by a loop surrounding both). As before this PR, the interpreter loop in _Match_pat processes the nodes in the NFA. When this loop is done because some final node of the NFA has been reached or matching along a trajectory failed, a second loop is entered that manually unwinds the explicit state stack. If that second loop leads to some interpretable position in the NFA again (i.e., if _Nx becomes not null), the unwinding loop is exited and the interpreter loop is engaged again. If not and the stack has been fully unwound, _Match_pat is exited.

Because the NFA node following the node _Nx in the interpreter loop might have to be some node other than _Nx->_Next from now on, a new local variable _Next now stores the next node to process. It defaults to _Nx->_Next, and the switch assigns the correct next node for each node type where that default is wrong. In this PR specifically, this is used to make static_cast<_Node_assert*>(_Nx)->_Child follow a node of type _N_assert in the interpreter loop. (Additionally, final nodes in the interpreter loop that do not result in failure now assign nullptr to _Next rather than to _Nx.)
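
The two-loop structure and the role of the next-node variable can be illustrated with a minimal sketch. It assumes a toy NFA that only knows literal characters and positive lookahead; `Kind`, `Node`, `Frame`, and `match_pat` are hypothetical names for illustration, not the STL's internal types. The local `next` plays the role of `_Next`: it defaults to the node's successor and is overridden where needed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy NFA; these are not the STL's internal node types.
enum class Kind { Char, Assert, EndAssert, End };

struct Node {
    Kind kind;
    char ch;           // used by Char
    const Node* next;  // default successor
    const Node* child; // used by Assert: first node of the asserted sub-pattern
};

struct Frame {
    std::size_t saved_pos; // input position to restore once the assertion held
    const Node* resume;    // node to continue with after the assertion
};

// Returns true if the pattern matches a prefix of `input`.
bool match_pat(const Node* start, const std::string& input) {
    std::vector<Frame> stack; // explicit state stack
    std::size_t pos = 0;
    bool failed = false;
    const Node* nx = start;

    for (;;) { // loop joining the interpreter and unwinding loops
        // Interpreter loop: process nodes until a final node or a failure.
        while (nx != nullptr) {
            const Node* next = nx->next; // default successor
            switch (nx->kind) {
            case Kind::Char:
                if (pos < input.size() && input[pos] == nx->ch) {
                    ++pos;
                } else {
                    failed = true;
                    next = nullptr;
                }
                break;
            case Kind::Assert:
                // Push a frame instead of recursing, then enter the child.
                stack.push_back({pos, nx->next});
                next = nx->child;
                break;
            case Kind::EndAssert:
            case Kind::End:
                next = nullptr; // final node: hand over to the unwinding loop
                break;
            }
            nx = next;
        }

        // Unwinding loop: pop frames until one yields a node to resume at.
        while (nx == nullptr && !stack.empty()) {
            const Frame frame = stack.back();
            stack.pop_back();
            if (!failed) {
                pos = frame.saved_pos; // lookahead consumes no input
                nx = frame.resume;
            }
        }

        if (nx == nullptr) {
            return !failed; // stack fully unwound
        }
    }
}
```

Because this toy has no alternation or repetition, a failure is final and the remaining frames are simply discarded; the real matcher has more cases to handle.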

The stack unwinding uses its own set of operation codes, which are interpreted by the unwinding loop. The operation codes are stored in the stack frames at the time the new frame is pushed to the stack. (After this PR, there is only one code, but there will soon be more.)
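
As a rough illustration of frames carrying their own operation code, the sketch below dispatches on a code stored in each frame at push time. All names (`UnwindOp`, `Frame`, `Unwound`, `unwind`) are hypothetical, nodes are reduced to integer indices, and `placeholder_nop` is an invented second code that exists only to show the dispatch; the PR itself describes a single code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical operation codes interpreted by the unwinding loop.
enum class UnwindOp {
    after_assert,   // stands in for the single code the PR describes
    placeholder_nop // invented for illustration only
};

struct Frame {
    UnwindOp op;           // recorded when the frame is pushed
    std::size_t saved_pos; // input position captured at push time
    int resume_node;       // index of the node to continue at
};

struct Unwound {
    int node;        // -1 means the stack was fully unwound
    std::size_t pos; // input position to continue from
};

// Pops frames until one yields a node to resume at or the stack is empty.
Unwound unwind(std::vector<Frame>& stack, std::size_t pos, bool failed) {
    while (!stack.empty()) {
        const Frame frame = stack.back();
        stack.pop_back();
        switch (frame.op) {
        case UnwindOp::after_assert:
            if (!failed) {
                // The assertion held: restore the position and resume.
                return {frame.resume_node, frame.saved_pos};
            }
            break; // the assertion failed: keep unwinding
        case UnwindOp::placeholder_nop:
            break; // nothing to restore
        }
    }
    return {-1, pos};
}
```

Storing the code in the frame means the unwinding loop needs no knowledge of which NFA node pushed the frame; it only interprets what the frame says.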

Because the matcher is currently semi-recursive, the stack counts as unwound in _Match_pat if it has been unwound up to its size at the time the _Match_pat call started. Further unwinding will happen in a surrounding _Match_pat call. (We can simplify this when the matcher is finally fully non-recursive.)
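
The "unwound up to the size at call start" rule can be sketched as follows. This is a deliberately reduced model, assuming a shared stack whose frames are plain integers; `Matcher`, `stack`, and the parameters of `match_pat` are hypothetical and exist only to show that each call stops unwinding at its own base.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reduced model of a semi-recursive matcher.
struct Matcher {
    std::vector<int> stack; // shared explicit stack; frame payload reduced to int

    void match_pat(int frames_to_push, int nested_calls) {
        // Remember how large the stack was when this call started.
        const std::size_t base = stack.size();

        for (int i = 0; i < frames_to_push; ++i) {
            stack.push_back(i);
        }

        // Stand-in for the parts of the matcher that still recurse.
        if (nested_calls > 0) {
            match_pat(frames_to_push, nested_calls - 1);
        }

        // Unwind only the frames this call pushed; frames below `base`
        // belong to a surrounding call and are left for it to unwind.
        while (stack.size() > base) {
            stack.pop_back();
        }
    }
};
```

Once no recursive calls remain, `base` would always be zero and the check could reduce to testing for an empty stack.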

The stack frames on the heap are now represented by objects of type _Rx_state_frame_t. Currently, this is a very inefficient structure and it will become even worse in the next few PRs before it starts getting better.

For now, I opted to preserve exactly the situations in which regex_errors with error_stack or error_complexity get thrown. But I moved the related code into separate member functions to avoid unnecessary code duplication. We can think about changing this after the matcher has been made fully non-recursive.
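
A sketch of what factoring the limit checks into member functions might look like, using only the standard `std::regex_error` and `std::regex_constants` facilities. The class `LimitChecker`, its member names, and the limits passed to it are assumptions for illustration; the STL's actual thresholds and function names are not stated in this PR description.

```cpp
#include <cstddef>
#include <regex>

// Hypothetical helper; names and limits are illustrative only.
class LimitChecker {
public:
    LimitChecker(std::size_t max_frames, std::size_t max_steps)
        : max_frames_(max_frames), max_steps_(max_steps), steps_(0) {}

    // Call with the stack size after pushing a frame.
    void check_stack(std::size_t frame_count) const {
        if (frame_count > max_frames_) {
            throw std::regex_error(std::regex_constants::error_stack);
        }
    }

    // Call once per unit of matching work.
    void count_step() {
        if (++steps_ > max_steps_) {
            throw std::regex_error(std::regex_constants::error_complexity);
        }
    }

private:
    std::size_t max_frames_;
    std::size_t max_steps_;
    std::size_t steps_;
};
```

Keeping each check in one place lets every call site throw under the same conditions without repeating the comparison.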

@muellerj2 muellerj2 requested a review from a team as a code owner September 10, 2025 22:15
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Sep 10, 2025
@muellerj2 muellerj2 force-pushed the regex-process-positive-lookahead-assertions-nonrecursively branch from 524a683 to a066a6b Compare September 10, 2025 22:22
@StephanTLavavej StephanTLavavej added bug Something isn't working regex meow is a substring of homeowner labels Sep 11, 2025
@StephanTLavavej StephanTLavavej self-assigned this Sep 11, 2025
@StephanTLavavej StephanTLavavej removed their assignment Sep 17, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Sep 17, 2025
@StephanTLavavej
Member

Thanks as always for the detailed explanation! 😻

@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Sep 19, 2025
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 3d2b494 into microsoft:main Sep 22, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Sep 22, 2025
@StephanTLavavej
Member

Thanks for taking the first step towards this long-thought-impossible goal! 😻 🏃 🎉


Labels

enhancement Something can be improved regex meow is a substring of homeowner


2 participants