-
Notifications
You must be signed in to change notification settings - Fork 1.6k
<regex>: Remove usage of non-standard _Uelem from parser
#5592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
<regex>: Remove usage of non-standard _Uelem from parser
#5592
Conversation
|
Thanks as always for the extremely detailed writeup and careful changes! 😻 I pushed a conflict-free merge with |
|
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
|
I had to push an additional commit to disable the test for |
|
Ah, the test is trying to exert lots of different paths in the parser with custom types to make sure that the parser doesn't unexpectedly crash. The regex in the test includes an equivalence, but I remember now that we had to disable If you think it's worth it, I will create a follow-up PR that replaces |
|
Thanks for the explanation. I don't think it's worth the effort - if the impact is limited to the test alone, I don't care about |
|
Thanks again for working towards the elimination of this old non-Standard requirement! 😻 🐱 🐈 |
Towards #995. A second PR in the future will remove
_Uelemfrom the matcher.Requirements on the character type
_Elemin the standardEffectively, the standard currently spells out the following guarantees for the character type:
string_typeas abasic_stringin the regex traits class requirements, thetraits_typeof thestring_type(which should bechar_traits<_Elem>) must support all operations in [char.traits.require]._Elemmust be a non-array trivially copyable standard-layout and trivially default-constructible type.Alas, this is woefully underspecified because
regexhas to convert and compare code points (integers) to characters (see also LWG-3835). There is a de facto requirement thatregexmust be able to convert and compare integers to characters, otherwise regexes couldn't be parsed or line endings couldn't be matched. There is also a de facto requirement thatregexmust be able to convert characters to integers again to implement [re.grammar]/12. But how these conversions and comparisons actually work is not specified at all.One idea out might be to rely on the existing
int_typein the character traits type, but the standard (a) does not actually specify any property for this type other than that it can somehow represent all characters +eof()(see [char.traits.typedefs]/1) and (b) immediately goes on to violate this only requirement in the specializations for Unicode character types (see LWG-2959). Moreover, there doesn't appear to be an actual requirement thatint_typeis an integer type -- there is only a guarantee that it can be used through the API of the character traits class. Soint_typedoes not seem helpful, rather, relying on it would just open its own can of worms.Requirements on
_Elemin theregeximplementations before and after this PRThe following fundamental requirements on
_Elemremain unchanged by this PR:lt()andeq()functions in the character traits class.c, then0, .., c-1must be valid code points of characters as well (in the sense that we can obtain a different character object for each of them through casting)._Elemtype is large enough to represent these code points._Elem{}must represent NUL (with code point 0).Requirements on
_Elemin theregeximplementation before this PRThe current implementation makes at least the following additional assumptions on convertibility and comparability:
_Elemmust be equality-comparable to itself,char,intand the internal enum type_Meta_type(maybe including some implicit conversion)._Elemtype must be implicitly convertible tointand_Meta_typemust be implicitly convertible to_Elem.char,intandunsigned intmust be explicitly convertible to_Elem._Elemmust be explicitly convertible to_Meta_type.int, i.e., conversion_Elem->int->_Elemmust yield the original character again._Uelemsuch that explicit conversion to this type yields the code point for any character, i.e., conversion_Elem->_Uelem->_Elemproduces the original character again and the natural ordering of the_Uelemvalues after conversion must be consistent with thelt()andeq()functions in the character traits class. (Note that conversion to an arbitrary but big unsigned integral type does not achieve this if the character type behaves like a signed integral type because the converted value will be sign-extended.)make_unsigned<_Elem>must be well-defined. (This mostly defeats the purpose of_Uelem, because specializingmake_unsigned<_Elem>for user-defined types is forbidden.)This list might be non-exhaustive.
Requirements on
_Elemin theregeximplementation after this PRThis PR imposes the following requirements on comparisons and conversions:
_Elemmust be equality-comparable to itself._Elemmust be explicitly convertible tounsigned charandunsigned int._Elemis an integral or enum type, it must also be explicitly convertible to and frommake_unsigned<_Elem>. Explicit conversion to this character type thus yields the code points of the characters.unsigned intmust yield the unsigned code point for a character, if the code point can fit into anunsigned int._Elemmust behave like an unsigned integer type when converted tounsigned int.char,unsigned charandunsigned intmust be explicitly convertible to_Elem.This list should be exhaustive; the new test checks that we don't do any conversions not listed above.
We cannot just drop most of the special logic for signed integral and enum types because we must support
char.But if desired, we could drop explicit convertibility of
_Elemfrom and tocharandunsigned charin favor ofunsigned int(andmake_unsigned<_Elem>for integral or enum_Elem) only. But this will mean we will have to add even more casts in<regex>.Differences in requirements
Essentially, we have the following new requirements:
_Elemmust be explicitly convertible tounsigned int._Elemmust support explicit conversion to and fromunsigned char._Elemmust convert like an unsigned integer type when explicitly converted tounsigned int, if_Elemis not an integral or enum type. (This requirement is only kind of new: Before this PR,regexdidn't even compile for such types, except when entering UB territory by specializingmake_unsigned.)In exchange, the following requirements will be dropped when the changes are completed for the matcher as well:
_Elemand other integral types._Elemdoes not have to be convertible to and fromintand_Meta_type._Elem._Elemno longer have to fit into anint._Uelemin the regex traits class.make_unsigned<_Elem>for types that are not integral or enum.Changes
_Unescaped_charmember to_Parser2to represent the character represented by some kind of escape sequence.int" requirement, because we no longer have to represent such characters in the_Valmember.static_cast's added to avoid implicit conversions._Meta_typevalues, we first have to cast them tocharbefore casting them to_Elem._Meta_typevalues by temporarily saving the character from the input character sequence.unsigned char._Elem->unsigned char->_Elemyields the original character again if and only if the code point is less than 256.strchr()in_Parser2::_Trans()when the character code point fits into anunsigned char.wregex(probably by accident). It uses the fact that casting to_Meta_typefor non-ASCII code points produces meta values the parser doesn't know about, and the parser treats unknown meta values as non-special. In any case, this change is also necessary to drop the "must fit into anint" requirement.lt()function of the character traits class.unsigned intor similar.sizeof(_Elem) = 1U: The bitmap optimization already handles all characters with code point < 256, so this is just dead code.unsigned int) are done using roundtrip casting._Elemis not an integral or enum type, the small character range optimization is not performed for code points unrepresentable byunsigned int.sizeof(_Elem) > 1U.sizeof(_Elem) > 1Uwith maximal code point < 256. A more accurate check appears much more ugly to me, though, for little gain._Elemin_CharacterEscape():switchon_Elem. All switches are now performed onunsigned char(after checking that the code point is less than 256 while handling the case for code points >= 256 separately) orint(when it is on_Mchar). In the one switch on_Mchar, the value is cast tointfirst to prevent a warning that there aren't case labels for some of the enum identifiers.Test
For now, the test only checks that the parser compiles and doesn't crash on two user-defines character types. We can only check semantic correctness after adjusting the matcher similarly.
The test currently suppresses warning C6510 because #5563 hasn't been merged yet.