KEMBAR78
`<regex>`: Process minimum number of reps in simple loops non-recursively by muellerj2 · Pull Request #5762 · microsoft/STL · GitHub
Skip to content

Conversation

@muellerj2
Copy link
Contributor

Towards #997 and #1528.

Simple loops are a subset of all loops with the following properties:

  1. In each trajectory in the NFA, the surrounding loops will not be repeated.
  2. As a corollary, simple loops can be entered at most once in each trajectory.
  3. The repeated pattern is branchless (no _Do_if node, no character class matching collating elements of various sizes).
  4. The repeated pattern always matches strings of the same length.
  5. The repeated pattern does not introduce new captures. (But I'm not sure whether I want to lift this restriction in the future, so I haven't taken advantage of this property yet.)

Properties 1-3 mean that we don't have to worry about restoring loop state: Whenever we enter the loop, we can just overwrite it because the data relates to a trajectory the processing of which is complete. Moreover, the repeated pattern is branchless, so it can't happen that we try some first branch in the repeated pattern, have to backtrack and then have to restore the loop state (especially any counters) before trying another branch in the repeated pattern.

As long as we haven't reached the minimum number of reps, we must match more repetitions, so we do not try to match the tail of the regex. Thus, there is also no branch after matching a repetition until the minimum is reached. This means that we don't have to add any stack frames with non-trivial unwinding logic that try the other branch or restore the loop state during backtracking.

For now, this results in the following implementation to match the minimum number of reps in a simple loop:

  • Matching _N_rep just sets up the loop state for the first repetition if at least one repetition is necessary. No unwinding logic is added to restore the loop state because the state relates to an outdated trajectory (or is just the initial state). If this isn't a simple loop or a simple one which doesn't have to match at least one rep, we essentially delegate to the old code.
  • When matching _N_end_rep, we first check whether this is a simple loop and whether we are still processing the minimum number of reps.
    • If this isn't a simple loop, we delegate to _Do_rep().
    • If this is a simple loop, but we aren't processing the minimum number of reps of a simple loop anymore, then this is a recursive _Match_pat(_Node->_Next) call in _Do_rep0(), so we have to keep doing what _N_end_rep did previously in this case: Set _Next to nullptr so that we exit _Match_pat().
    • If we are still processing the minimum number of reps and we just completed the first repetition, we have to check whether the first repetition matched an empty string. If so, we know that this repeated pattern always matches an empty string, so we can just stop repeating the pattern and immediately try matching the tail of the regex (non-recursively).
    • If we have to do one more repetition to reach the minimum, we reset the capturing groups, increase the loop counter and set _Next to the start of the repeated pattern.
    • If we have reached the minimum number of repetitions, we hand off the processing to the old _Do_rep0() code.

We don't have to run any stack unwinding logic during backtracking to process the minimum number of reps of a simple loop, but we still have to add a stack frame to save the match state at the time when the simple loop was entered. For this reason, we add a new opcode for doing nothing during stack unwinding. Since we can leave this stack frame to the standard unwinding logic in _Match_pat() now, we can remove all explicit _Pop_frame() calls in _Do_rep0().

Repetitions of simple loops didn't really increase the stack usage counter (other than some temporary increase by 1) prior to this PR, so to reduce complexity I have decided to not replicate the updates to the stack usage counter in the new logic. Even so, this PR replicates the updates to the complexity counter. (This PR also adds an update to the complexity counter to the logic of _Disjunction_eval_alt_always which was missing from #5745.)

@muellerj2 muellerj2 requested a review from a team as a code owner October 4, 2025 09:13
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Oct 4, 2025
@StephanTLavavej StephanTLavavej added enhancement Something can be improved regex meow is a substring of homeowner labels Oct 4, 2025
@StephanTLavavej StephanTLavavej self-assigned this Oct 4, 2025
Comment on lines +4173 to +4180
_Failed = !_Do_rep0(_Nr, (_Nr->_Flags & _Fl_greedy) != 0);
_Next = nullptr;
}
} else { // internal _Match_pat(_Node->_Next) call in _Do_rep0()
_Next = nullptr;
}
} else {
_Failed = !_Do_rep(_Nr, (_Nr->_Flags & _Fl_greedy) != 0, _Sav._Loop_idx);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible followup, no change requested for this PR: There are now two mentions of (_Nr->_Flags & _Fl_greedy) != 0. I'm assuming this doesn't change, so we should consider extracting it as bool _Greedy.

Copy link
Contributor Author

@muellerj2 muellerj2 Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, probably a good idea. I just didn't spend too much thought on making the recursive calls look pretty because I intend to remove them soon. If the logic for simple or generic loops mostly remains separate, though, this expression will still appear in two places.

@StephanTLavavej
Copy link
Member

Thank you as always for the extremely detailed description which will be helpful in understanding this refactoring later! 😻

@StephanTLavavej StephanTLavavej removed their assignment Oct 8, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Oct 8, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Oct 9, 2025
@StephanTLavavej StephanTLavavej merged commit 616c862 into microsoft:main Oct 10, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Oct 10, 2025
@StephanTLavavej
Copy link
Member

💚 🐈 🐱

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Something can be improved regex meow is a substring of homeowner

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

2 participants