-
Notifications
You must be signed in to change notification settings - Fork 1.6k
<regex>
: Clean up parsing logic for quantifiers
#5253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
<regex>
: Clean up parsing logic for quantifiers
#5253
Conversation
We weren't in the habit of changing |
Thanks for the improvement, detailed PR description, and test coverage! 😻 I especially appreciate the careful attention to preserving ABI during coordinated member function changes. I pushed a conflict-free merge with |
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
When I merge this, I will resolve a stealth merge conflict where #5165 added a call to |
Closing and reopening to (hopefully) wake up the CLA Bot. |
Thanks for simplifying the implementation! 😻 ⭐ 🎉 |
The parser uses some confusing control flow to generate the NFA for quantifiers. The control flow goes like this: The parser sets the
_Fl_final
flag on the last node generated before processing the quantifier (by calling_Builder::_Mark_final()
in_Parser::_Quantifier()
). Then it is checked in_Builder::_Add_rep()
whether the last generated node is a node that matches a string with more than one character. If so, the function removes the last character from this node and then adds the character to the NFA again by calling_Builder::_Add_char()
. Because the_Fl_final
flag is set on the last string-matching node (that we just removed a character from),_Builder::_Add_char()
doesn't just re-add the character to the string-matching node again, but instead creates a new string node that matches this one character. Finally, the repetition is added to the NFA.There are no other usages of the
_Fl_final
in<regex>
, so supporting this confusing parsing and NFA generation logic seems to be its only purpose.This PR cleans up this logic and removes any usage of the
_Fl_final
flag: When the last generated node is a string-matching node with more than one character,_Builder::_Add_rep()
now explicitly calls_Builder::_Add_str_node()
to generate a new string-matching node (so no more confusing checking of the_Fl_final
flag) and then moves the last character from the previous string-matching node to the newly created one (by calling_Buf::_Del()
and_Builder::_Add_char2()
)._Builder::_Add_char()
and_Builder::_Add_rep()
are renamed, otherwise regular expressions could be miscompiled when old and new functions are mixed. The_Builder::_Mark_final()
function is no longer used and thus removed. The new tests make sure we don't accidentally miscompile quantifiers (e.g., by not moving the last character from a preceding string-matching node to a new one, or by incorporating too few or too many nodes in the repetition).The
_Fl_final
flag is no longer set on nodes, but I think this flag was never used by the matcher. I can't come up with any reasonable use for it when matching (the flag marks the last node in a repetition, except when it marks the last node before a repetition) and I also couldn't find any usage of the flag in the matcher in the oldest toolset 14.16.27023 (VS 2017) I still had on one of my machines. But it would prudent to check the very first VS 2015 toolset as well.Alternatively, we can also avoid the source code archaeology and just insert the following line at the very beginning of
_Builder::_Add_rep2()
:_Current->_Flags |= _Fl_final; // TRANSITION, ABI
Then the flag gets set on the same nodes before and after this PR.