-
Notifications
You must be signed in to change notification settings - Fork 1.6k
<regex>: Make wregex correctly match negated character classes
#5403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
<regex>: Make wregex correctly match negated character classes
#5403
Conversation
I think we should go with the first solution. Simpler is less risky and more maintainable, and changing buggy behavior in the event of mixing is fine. |
|
Thanks, this is great! 😻 I pushed tiny changes and expanded the test coverage slightly. |
|
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
|
I resolved adjacent-edit conflicts in
|
|
Thanks for figuring out how to fix the unfixable! 🧠 💡 😻 |
Fixes #992 in a binary-compatible way by repurposing some unused NFA node flag bits. Also makes a little bit of progress towards #995.
Even if the solution in this PR is born out of the need to maintain binary compatibility, I think now that it's probably the best one for these negated classes anyway (at least as long as we keep the representation the same for
std::regex_traitsand custom traits classes -- we could save one more bit forstd::regex_traitsspecifically, because we know how the character classes \w, \s and \d intersect).Alternative solutions
char_class_typevalue in the NFA node to represent all specified negated character classes. Sounds like a good idea until you realize that De Morgan's law applies here: For example, we have(not d) or (not w) = not (w and d)for character class[\W\D], so we would have to store the value ofw and d. While the standard guarantees that or'ing character class types makes sense ([re.grammar]/9, there is no such wording for intersecting/and'ing character class types, so and'ing them is allowed to fail for custom traits classes. (One can still make it work forstd::regex_traitsthough because we know how \W, \S and \D relate and control the associatedchar_class_typevalues.) See[\W\D]fails to match alphabetic characters boostorg/regex#241 and [libc++]<regex>: Character class[\W\D]fails to match alphabetic characters llvm/llvm-project#131516 for bug reports related to this.char_class_typein the NFA node for each specified negated character class. Works, but it's quite wasteful.Note that we could choose these solutions as well: We could create a class derived from
_Node_classthat stores thesechar_class_typevalues and repurpose a flag bit to mark that this is the extended version of the_Node_class. But this is a more complicated approach with no clear advantage. (We would have to implement the second alternative if we had to support users adding their own character class escapes like Boost.)What about #5243?
We could apply the same solution for #5243 as well, but it would have an unfortunate side effect: If new parser and old matcher were to be mixed, one buggy behavior would be replaced by another buggy behavior. Currently,
[\w\s]matches alphanumeric characters but fails to match spaces at code points >= 256. If we applied this PR's solution to #5243 as well and the old matcher were to be picked up,[\w\s]would match spaces but fail to match alphanumeric characters at code points >= 256.We can also keep the old buggy behavior in all cases by basically implementing alternative solution 1 above, so a more complicated alternative approach does have a clear advantage here.
So the follow-up question is: Do we go with this solution for #5243 as well, changing the buggy behavior when mixing new parser and old matcher as a side effect, or do we go for one of the more complex alternative solutions that has the advantage that it remains bug-compatible?
Why does this PR also set bits on the root node?
These bits aren't used yet, but the idea is that they tell the matcher to look up these character classes during initialization so that these lookups don't have to happen each time an attempt is made to match a squared character class with a negated character class. We should start making use of these flags when the matcher is renamed. (I think this isn't too far in the future because an efficient fix for #5365 also requires a change to some internal data structures of the matcher.)
Even if we don't use them yet, setting them now has the advantage that we won't have to handle the case in the future that the negated class flags are set on a character class node but are not on the root node.
Why did you leave a small gap between old and new node flags?
I just wanted to leave some room for new flags common to many node types. We will most likely need at least one such flag: One that marks extended versions of NFA nodes. I also defined the flag bits twice to make it clear that they are specific to two node types (_N_begin and _N_class). But feel free to change this if you prefer to do things differently.