-
Notifications
You must be signed in to change notification settings - Fork 1.6k
<regex>: Correctly reset capture groups for ECMAScript grammar
#5456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
StephanTLavavej
merged 13 commits into
microsoft:main
from
muellerj2:regex-fix-capture-group-reset
May 10, 2025
Merged
<regex>: Correctly reset capture groups for ECMAScript grammar
#5456
StephanTLavavej
merged 13 commits into
microsoft:main
from
muellerj2:regex-fix-capture-group-reset
May 10, 2025
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…its uses by `_St._Cur`
…re groups before each repetition
b86b415 to
90f84c2
Compare
<regex>: Correctly reset capture groups for ECMAScript grammars<regex>: Correctly reset capture groups for ECMAScript grammar
muellerj2
commented
May 1, 2025
StephanTLavavej
approved these changes
May 8, 2025
|
Thanks, this is great! 😻 I pushed a conflict-free merge with |
|
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
StephanTLavavej
added a commit
to StephanTLavavej/STL
that referenced
this pull request
May 9, 2025
|
Thanks for resetting all of these bugs! ⏪ 🐞 😹 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #5365 and does some preparatory work for #174.
ECMAScript requires that all capture groups inside a loop are reset to "unmatched" at the start of each repetition. To do this correctly, we have to know the first capture group that appears inside the loop. We actually do not have to know the last one to reset the capture groups correctly: If a loop is repeated, then all capture groups between the first capture group inside the loop and the last one in the regex appear inside this loop or are already unmatched. (Knowing the last capture group inside the loop would avoid some unnecessary work, but given that the number of capture groups is usually low in practice it's not clear that this additional work is worth it when traded against more NFA traversal at the time of matching. This would require some benchmarking.)
To find the first capture group, the matcher now traverses the NFA until it encounters a capture group or has visited all of the nodes inside a loop. For POSIX regexes, the traversal is skipped; instead the first capture group is set to the number of capture groups in the whole expression, which means that no capture groups will be reset when processing loops as required by POSIX semantics.
To avoid this NFA traversal for each repetition, the result of this traversal is now cached in the loop state by extending the type
_Loop_vals_t. But this means the type_Loop_vals_tand_Matcherhave to be renamed for binary compatibility. At this opportunity, I also make some minor cleanups.I split this PR into several commits to make reviewing easier:
_Matcherto_Matcher2and add a template parameter for an allocator to prepare for <regex>: Should regex_match() use the match_results allocator? #174 as suggested by STL. For now, I'm passingvoidas the allocator argument because it's obviously not an allocator, so we don't have to worry about ABI when allocator support is actually added._Ncapmember variable: It should beunsigned int, notint._Get_ncap(): It has become superfluous because the type of_Ncaphas been corrected._Firstin favor of_Begin(to match_End)._Cur_iterin_Do_repto minimally reduce stack consumption.<regex>: Implementation divergence for capture group behavior #5365:_Loop_vals_tand rename it to_Loop_vals_v2_t._Find_first_inner_capture_groupthat recursively traverses the NFA to find the first capture group. The recursive traversal is protected against stack overflow like all other recursive calls in the matcher._Do_repinto_Do_rep(doing the main work for any repetition) and_Do_rep_first(doing the preparation before the first repetition)._Do_repto reset capture groups._Do_rep0to reset capture groups in one for loop. (They already are reset correctly in the other for loop.)_Do_rep_firstto perform the NFA traversal if necessary and update the loop state accordingly.