`<regex>`: Remove usage of non-standard `_Uelem` from parser #5592

muellerj2 · 2025-06-15T17:41:42Z

Towards #995. A second PR in the future will remove _Uelem from the matcher.

Requirements on the character type `_Elem` in the standard

Effectively, the standard currently spells out the following guarantees for the character type:

The regex traits class can handle the character type for all the operations in [re.traits].
Indirectly, through the reference to string_type as a basic_string in the regex traits class requirements, the traits_type of the string_type (which should be char_traits<_Elem>) must support all operations in [char.traits.require].
According to [strings.general]/1 referenced by [re.general]/2, _Elem must be a non-array trivially copyable standard-layout and trivially default-constructible type.

Alas, this is woefully underspecified because regex has to convert and compare code points (integers) to characters (see also LWG-3835). There is a de facto requirement that regex must be able to convert and compare integers to characters, otherwise regexes couldn't be parsed or line endings couldn't be matched. There is also a de facto requirement that regex must be able to convert characters to integers again to implement [re.grammar]/12. But how these conversions and comparisons actually work is not specified at all.

One way out might be to rely on the existing int_type in the character traits type, but the standard (a) does not actually specify any property for this type other than that it can somehow represent all characters + eof() (see [char.traits.typedefs]/1) and (b) immediately goes on to violate this only requirement in the specializations for Unicode character types (see LWG-2959). Moreover, there doesn't appear to be an actual requirement that int_type is an integer type -- there is only a guarantee that it can be used through the API of the character traits class. So int_type does not seem helpful; relying on it would just open its own can of worms.

Requirements on `_Elem` in the `regex` implementations before and after this PR

The following fundamental requirements on _Elem remain unchanged by this PR:

All integral conversions must be consistent with some implicit one-to-one mapping between unsigned integers (code points) and characters (up to lossiness and sign extension), and the natural ordering of these unsigned integers must be consistent with the lt() and eq() functions in the character traits class.
Lossy casts to and from integers are allowed and behave as one would expect for casts between integers.
There must not be any gaps between code points, i.e., if a character has code point c, then 0, ..., c-1 must be valid code points of characters as well (in the sense that we can obtain a different character object for each of them through casting).
The code points for the basic character set (to the degree they have special meaning in regexes) plus the paragraph and line separator must agree with the corresponding Unicode code points. For the paragraph and line separator, this must only be the case if the _Elem type is large enough to represent these code points.
_Elem{} must represent NUL (with code point 0).

Requirements on `_Elem` in the `regex` implementation before this PR

The current implementation makes at least the following additional assumptions on convertibility and comparability:

_Elem must be equality-comparable to itself, char, int and the internal enum type _Meta_type (maybe including some implicit conversion).
_Elem type must be implicitly convertible to int and _Meta_type must be implicitly convertible to _Elem.
char, int and unsigned int must be explicitly convertible to _Elem.
_Elem must be explicitly convertible to _Meta_type.
The code point for any character can fit into an int, i.e., conversion _Elem -> int -> _Elem must yield the original character again.
The character traits must provide an unsigned integral type _Uelem such that explicit conversion to this type yields the code point for any character, i.e., conversion _Elem -> _Uelem -> _Elem produces the original character again and the natural ordering of the _Uelem values after conversion must be consistent with the lt() and eq() functions in the character traits class. (Note that conversion to an arbitrary but big unsigned integral type does not achieve this if the character type behaves like a signed integral type because the converted value will be sign-extended.)
make_unsigned<_Elem> must be well-defined. (This mostly defeats the purpose of _Uelem, because specializing make_unsigned<_Elem> for user-defined types is forbidden.)

This list might be non-exhaustive.
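
The sign-extension caveat in the `_Uelem` item can be illustrated with plain `signed char`. This is only a sketch: the helper names `widen_directly` and `widen_via_same_width` are hypothetical and do not appear in `<regex>`, and an 8-bit `char` is assumed.

```cpp
#include <type_traits>

// Hypothetical helper (not part of <regex>): convert straight to a wide
// unsigned type. A negative signed character is sign-extended on the way.
template <class Elem>
constexpr unsigned long long widen_directly(Elem ch) {
    return static_cast<unsigned long long>(ch);
}

// Hypothetical helper: convert to the unsigned type of the same width first,
// so the result is the code point itself.
template <class Elem>
constexpr unsigned long long widen_via_same_width(Elem ch) {
    using unsigned_elem = std::make_unsigned_t<Elem>;
    return static_cast<unsigned long long>(static_cast<unsigned_elem>(ch));
}

// A signed char holding -23, i.e. bit pattern 0xE9 (assuming 8-bit char).
constexpr signed char e_acute = -23;

static_assert(widen_via_same_width(e_acute) == 0xE9u, "same-width cast yields the code point");
static_assert(widen_directly(e_acute) != 0xE9u, "direct widening is sign-extended");
```

Only the second helper produces values whose natural ordering can agree with an unsigned interpretation of the characters, which is the property the `_Uelem` requirement was after.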

Requirements on `_Elem` in the `regex` implementation after this PR

This PR imposes the following requirements on comparisons and conversions:

_Elem must be equality-comparable to itself.
_Elem must be explicitly convertible to unsigned char and unsigned int.
- If _Elem is an integral or enum type, it must also be explicitly convertible to and from make_unsigned<_Elem>. Explicit conversion to this unsigned type thus yields the code points of the characters.
- Otherwise, conversion to unsigned int must yield the unsigned code point for a character, if the code point can fit into an unsigned int.
  - This means that _Elem must behave like an unsigned integer type when converted to unsigned int.
char, unsigned char and unsigned int must be explicitly convertible to _Elem.

This list should be exhaustive; the new test checks that we don't do any conversions not listed above.
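
To make the list above concrete, here is a minimal sketch of a non-integral character type that should satisfy these requirements. The type `code_unit` and its 16-bit storage are assumptions made for illustration; it is not the type used in the PR's test.

```cpp
#include <type_traits>

// Hypothetical user-defined character type. The name and the 16-bit storage
// are illustrative assumptions; this is not the type used in the PR's test.
class code_unit {
public:
    code_unit() = default; // trivial, so code_unit{} is zero-initialized (NUL)

    // char, unsigned char and unsigned int are explicitly convertible to code_unit.
    constexpr explicit code_unit(char ch) : value_(static_cast<unsigned char>(ch)) {}
    constexpr explicit code_unit(unsigned char ch) : value_(ch) {}
    constexpr explicit code_unit(unsigned int cp) : value_(static_cast<unsigned short>(cp)) {}

    // Explicit conversions back; both behave like casts from an unsigned integer.
    constexpr explicit operator unsigned char() const { return static_cast<unsigned char>(value_); }
    constexpr explicit operator unsigned int() const { return value_; }

    // Equality comparison with itself is the only comparison provided.
    friend constexpr bool operator==(code_unit lhs, code_unit rhs) { return lhs.value_ == rhs.value_; }
    friend constexpr bool operator!=(code_unit lhs, code_unit rhs) { return !(lhs == rhs); }

private:
    unsigned short value_;
};

// Properties demanded by [strings.general]/1.
static_assert(std::is_trivially_copyable_v<code_unit> && std::is_standard_layout_v<code_unit>
                  && std::is_trivially_default_constructible_v<code_unit>,
              "code_unit must be a valid character type for basic_string");

// Explicit conversions exist in both directions...
static_assert(std::is_constructible_v<unsigned int, code_unit>, "explicit to unsigned int");
static_assert(std::is_constructible_v<unsigned char, code_unit>, "explicit to unsigned char");
static_assert(std::is_constructible_v<code_unit, char>, "explicit from char");

// ...but there are no implicit conversions and no conversion to int.
static_assert(!std::is_convertible_v<code_unit, unsigned int>, "no implicit conversion to integers");
static_assert(!std::is_convertible_v<unsigned int, code_unit>, "no implicit conversion from integers");
static_assert(!std::is_constructible_v<int, code_unit>, "not convertible to int at all");

static_assert(static_cast<unsigned int>(code_unit(0x2028u)) == 0x2028u, "yields the unsigned code point");
```

A type like this would not have compiled with the parser before this PR, since it offers neither implicit conversions nor comparisons against `char` or `int`.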

We cannot just drop most of the special logic for signed integral and enum types because we must support char.

If desired, we could drop explicit convertibility of _Elem from and to char and unsigned char in favor of unsigned int (and make_unsigned<_Elem> for integral or enum _Elem) only, but this would require even more casts in <regex>.

Differences in requirements

Essentially, we have the following new requirements:

_Elem must be explicitly convertible to unsigned int.
_Elem must support explicit conversion to and from unsigned char.
_Elem must convert like an unsigned integer type when explicitly converted to unsigned int, if _Elem is not an integral or enum type. (This requirement is only kind of new: Before this PR, regex didn't even compile for such types, except when entering UB territory by specializing make_unsigned.)

In exchange, the following requirements will be dropped when the changes are completed for the matcher as well:

No more equality-comparability between _Elem and other integral types.
_Elem does not have to be convertible to and from int and _Meta_type.
No more implicit conversions from and to _Elem.
The code points for _Elem no longer have to fit into an int.
No more _Uelem in the regex traits class.
No more reliance on make_unsigned<_Elem> for types that are not integral or enum.

Changes

Add new _Unescaped_char member to _Parser2 to represent the character represented by some kind of escape sequence.
- This allows us to drop the "must fit into an int" requirement, because we no longer have to represent such characters in the _Val member.
Added lots of static_casts to avoid implicit conversions.
- For many _Meta_type values, we first have to cast them to char before casting them to _Elem.
- In two cases, I avoided casts from _Meta_type values by temporarily saving the character from the input character sequence.
To decide whether a character can be stored in the bitmap, the test checks now whether a character code point fits into an unsigned char.
- Such tests are done by roundtrip casting: _Elem -> unsigned char -> _Elem yields the original character again if and only if the code point is less than 256.
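
The roundtrip test described above can be sketched as follows. The helper name `fits_in_unsigned_char` is hypothetical (the actual code in `<regex>` may be structured differently), and an 8-bit `char` is assumed.

```cpp
// Hypothetical helper (the name is not from <regex>): true if the code point
// of ch fits into an unsigned char, tested purely by explicit casts and ==.
template <class Elem>
constexpr bool fits_in_unsigned_char(Elem ch) {
    return static_cast<Elem>(static_cast<unsigned char>(ch)) == ch;
}

static_assert(fits_in_unsigned_char(L'\x61'), "code point 0x61 < 256");
static_assert(!fits_in_unsigned_char(L'\x2028'), "line separator does not fit");
static_assert(fits_in_unsigned_char(u'\xFF'), "255 is the largest code point that fits");
static_assert(!fits_in_unsigned_char(u'\x100'), "256 is the first code point that does not fit");

// A negative signed char has code point 256 + value, which still fits.
static_assert(fits_in_unsigned_char(static_cast<signed char>(-1)), "code point 255");
```

The check needs only the two explicit conversions and equality comparison of `_Elem` with itself, so it works the same way for signed, unsigned and user-defined character types.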
Only call strchr() in _Parser2::_Trans() when the character code point fits into an unsigned char.
- It's actually surprising that this logic worked for wregex (probably by accident). It uses the fact that casting to _Meta_type for non-ASCII code points produces meta values the parser doesn't know about, and the parser treats unknown meta values as non-special. In any case, this change is also necessary to drop the "must fit into an int" requirement.
Replace some ordering checks by calls to the lt() function of the character traits class.
- This lets us avoid some casting and case distinction between integral/enum types and other types. We can also avoid the addition of a requirement that the character code points must fit into an unsigned int or similar.
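
As a rough illustration of this point, a range check can be written against `lt()` alone. The helper `in_closed_range` is a hypothetical name, not a function in `<regex>`.

```cpp
#include <string>

// Hypothetical helper: test first <= ch <= last using only Traits::lt(),
// so no cast to an integer type is needed.
template <class Traits>
constexpr bool in_closed_range(typename Traits::char_type first,
                               typename Traits::char_type ch,
                               typename Traits::char_type last) {
    return !Traits::lt(ch, first) && !Traits::lt(last, ch);
}

using narrow_traits = std::char_traits<char>;

static_assert(in_closed_range<narrow_traits>('0', '5', '9'), "digit inside the range");
static_assert(!in_closed_range<narrow_traits>('1', '0', '9'), "digit below the range");

// char_traits<char>::lt() compares like unsigned char, so this holds even if
// plain char is signed and '\xE9' is negative.
static_assert(in_closed_range<narrow_traits>('\x61', '\xE9', '\xFF'), "ordering follows code points");
```

Because the ordering comes from the traits class, the same code serves integral, enum and user-defined character types without a case distinction.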
Rework the logic for the bitmap optimization and small character range optimization for the new requirements.
- The code for the small character range optimization is no longer generated if sizeof(_Elem) == 1U: The bitmap optimization already handles all characters with code point < 256, so this is just dead code.
- Checks whether character code points are too large (to store the character in the bitmap or the code point in unsigned int) are done using roundtrip casting.
- We have to differentiate between integral/enum types and other user-defined types for the small character range optimization, as we have to cast the characters to different unsigned integer types.
  - When _Elem is not an integral or enum type, the small character range optimization is not performed for code points unrepresentable by unsigned int.
Store the named character classes and equivalence classes if sizeof(_Elem) > 1U.
- This is sufficient (due to the requirement that code points must have no gaps), but not necessary: We could have a character type with sizeof(_Elem) > 1U with maximal code point < 256. A more accurate check looks considerably uglier to me, though, for little gain.
Revise logic to cast integer value (from escape sequence) to an _Elem in _CharacterEscape():
- Again, we have to differentiate between integral/enum types and other user-defined types to perform the correct casts.
- We check whether the integer value corresponds to a code point by performing a roundtrip cast and testing whether it yields the same integer value again.
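
The case distinction and the roundtrip check for escape values might look roughly like this. All names here (`code_point_type`, `to_char`, `to_code_point`, `is_representable`, `signed_byte_enum`, `wide_unit`) are hypothetical and chosen for illustration; the actual code in `_CharacterEscape()` may differ. An 8-bit `char` and a 16-bit `unsigned short` are assumed.

```cpp
#include <type_traits>

// Integral and enum character types are cast via make_unsigned...
template <class Elem, bool = std::is_integral_v<Elem> || std::is_enum_v<Elem>>
struct code_point_type {
    using type = std::make_unsigned_t<Elem>;
};

// ...while other user-defined types are cast via unsigned int.
template <class Elem>
struct code_point_type<Elem, false> {
    using type = unsigned int;
};

template <class Elem>
constexpr Elem to_char(unsigned int value) {
    using cp_t = typename code_point_type<Elem>::type;
    return static_cast<Elem>(static_cast<cp_t>(value)); // possibly lossy
}

template <class Elem>
constexpr unsigned int to_code_point(Elem ch) {
    using cp_t = typename code_point_type<Elem>::type;
    return static_cast<unsigned int>(static_cast<cp_t>(ch));
}

// The value is a code point of Elem if the roundtrip reproduces it.
template <class Elem>
constexpr bool is_representable(unsigned int value) {
    return to_code_point(to_char<Elem>(value)) == value;
}

// Hypothetical character types for the demonstration.
enum class signed_byte_enum : signed char {};

struct wide_unit {
    unsigned short value;
    wide_unit() = default;
    constexpr explicit wide_unit(unsigned int cp) : value(static_cast<unsigned short>(cp)) {}
    constexpr explicit operator unsigned int() const { return value; }
};

static_assert(is_representable<char>(0xE9u), "fits even if plain char is signed");
static_assert(!is_representable<char>(0x141u), "too large for char");
static_assert(is_representable<signed_byte_enum>(0xFFu), "enum goes through unsigned char");
static_assert(!is_representable<signed_byte_enum>(0x100u), "too large for the enum");
static_assert(is_representable<wide_unit>(0x2028u), "user-defined type goes through unsigned int");
static_assert(!is_representable<wide_unit>(0x10000u), "too large for 16-bit storage");
```

The partial specialization keeps `make_unsigned` from ever being instantiated for class types, which matters because using it with such types is not allowed.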
No more switch on _Elem. All switches are now performed on unsigned char (after checking that the code point is less than 256 while handling the case for code points >= 256 separately) or int (when it is on _Mchar). In the one switch on _Mchar, the value is cast to int first to prevent a warning that there aren't case labels for some of the enum identifiers.

Test

For now, the test only checks that the parser compiles and doesn't crash on two user-defined character types. We can only check semantic correctness after adjusting the matcher similarly.

The test currently suppresses warning C6510 because #5563 hasn't been merged yet.

tests/std/tests/GH_000995_regex_custom_char_types/test.cpp

StephanTLavavej · 2025-08-07T21:50:57Z

Thanks as always for the extremely detailed writeup and careful changes! 😻 I pushed a conflict-free merge with main, followed by minor nitpicks, a theoretical correctness fix for the test's custom char traits move, and a fix for what I believe was a copy-paste typoed function call - please double-check.

StephanTLavavej · 2025-08-07T22:17:22Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2025-08-08T08:24:52Z

I had to push an additional commit to disable the test for /clr:pure. It's throwing an exception and I don't know or care why.

Unhandled Exception: System.Runtime.InteropServices.SEHException: External component has thrown an exception.
   at _CxxThrowException(Void* , _s__ThrowInfo* )
   at std._Xregex_error(error_type _Code)
   at std._Builder2<enum signed_wchar_enum const *,enum signed_wchar_enum,test_regex_traits<enum signed_wchar_enum,wchar_t> >._Add_equiv(_Builder2<enum signed_wchar_enum const \*\,enum signed_wchar_enum\,test_regex_traits<enum signed_wchar_enum\,wchar_t> >* , signed_wchar_enum* A_0, signed_wchar_enum* A_1)
   at std._Parser2<enum signed_wchar_enum const *,enum signed_wchar_enum,test_regex_traits<enum signed_wchar_enum,wchar_t> >._Do_ex_class(_Parser2<enum signed_wchar_enum const \*\,enum signed_wchar_enum\,test_regex_traits<enum signed_wchar_enum\,wchar_t> >* , _Meta_type A_0)
   at std._Parser2<enum signed_wchar_enum const *,enum signed_wchar_enum,test_regex_traits<enum signed_wchar_enum,wchar_t> >._ClassAtom(_Parser2<enum signed_wchar_enum const \*\,enum signed_wchar_enum\,test_regex_traits<enum signed_wchar_enum\,wchar_t> >* , Boolean A_0)
   at std._Parser2<enum signed_wchar_enum const *,enum signed_wchar_enum,test_regex_traits<enum signed_wchar_enum,wchar_t> >._ClassRanges(_Parser2<enum signed_wchar_enum const \*\,enum signed_wchar_enum\,test_regex_traits<enum signed_wchar_enum\,wchar_t> >* )
   at std._Parser2<enum signed_wchar_enum const *,enum signed_wchar_enum,test_regex_traits<enum signed_wchar_enum,wchar_t> >._CharacterClass(_Parser2<enum signed_wchar_enum const \*\,enum signed_wchar_enum\,test_regex_traits<enum signed_wchar_enum\,wchar_t> >* )
   at std._Parser2<enum signed_wchar_enum const *,enum signed_wchar_enum,test_regex_traits<enum signed_wchar_enum,wchar_t> >._Alternative(_Parser2<enum signed_wchar_enum const \*\,enum signed_wchar_enum\,test_regex_traits<enum signed_wchar_enum\,wchar_t> >* )
   at std._Parser2<enum signed_wchar_enum const *,enum signed_wchar_enum,test_regex_traits<enum signed_wchar_enum,wchar_t> >._Disjunction(_Parser2<enum signed_wchar_enum const \*\,enum signed_wchar_enum\,test_regex_traits<enum signed_wchar_enum\,wchar_t> >* )
   at std._Parser2<enum signed_wchar_enum const *,enum signed_wchar_enum,test_regex_traits<enum signed_wchar_enum,wchar_t> >._Compile(_Parser2<enum signed_wchar_enum const \*\,enum signed_wchar_enum\,test_regex_traits<enum signed_wchar_enum\,wchar_t> >* )
   at std.basic_regex<enum signed_wchar_enum,test_regex_traits<enum signed_wchar_enum,wchar_t> >._Reset<enum signed_wchar_enum const *>(basic_regex<enum signed_wchar_enum\,test_regex_traits<enum signed_wchar_enum\,wchar_t> >* , signed_wchar_enum* _First, signed_wchar_enum* _Last, syntax_option_type _Flags)
   at std.basic_regex<enum signed_wchar_enum,test_regex_traits<enum signed_wchar_enum,wchar_t> >.{ctor}<struct std::char_traits<enum signed_wchar_enum>,class std::allocator<enum signed_wchar_enum> >(basic_regex<enum signed_wchar_enum\,test_regex_traits<enum signed_wchar_enum\,wchar_t> >* , basic_string<enum signed_wchar_enum\,std::char_traits<enum signed_wchar_enum>\,std::allocator<enum signed_wchar_enum> >* _Str, syntax_option_type _Flags)
   at test_gh_5592()
   at mainCRTStartup(String[] arguments)

muellerj2 · 2025-08-08T11:21:47Z

Ah, the test is trying to exercise lots of different paths in the parser with custom types to make sure that the parser doesn't unexpectedly crash. The regex in the test includes an equivalence class, but I remember now that we had to disable regex_traits::transform_primary for /clr:pure because of its reliance on RTTI, so parsing the regex fails for /clr:pure.

If you think it's worth it, I will create a follow-up PR that replaces transform_primary in the custom traits class by some dummy implementation that doesn't rely on regex_traits::transform_primary.

StephanTLavavej · 2025-08-08T11:31:25Z

Thanks for the explanation. I don't think it's worth the effort - if the impact is limited to the test alone, I don't care about /clr:pure coverage at all. (We're keeping that mode on life support for one legacy project.)

StephanTLavavej · 2025-08-08T17:05:49Z

Thanks again for working towards the elimination of this old non-Standard requirement! 😻 🐱 🐈

<regex>: Remove usage of non-standard _Uelem from parser

084fff7

muellerj2 requested a review from a team as a code owner June 15, 2025 17:41

github-project-automation bot added this to STL Code Reviews Jun 15, 2025

github-project-automation bot moved this to Initial Review in STL Code Reviews Jun 15, 2025

StephanTLavavej added the bug (Something isn't working) and regex (meow is a substring of homeowner) labels Jun 16, 2025

StephanTLavavej self-assigned this Jun 16, 2025

StephanTLavavej added 10 commits August 7, 2025 07:49

Merge branch 'main' into regex-remove-uelem-from-parser

92e0733

signed short => short

4e9b66c

Include more headers.

97511ac

Drop std qualification.

59625cd

Make comment pretty.

2bc8319

Use rx_traits alias, make inner private.

15a46c1

Iterate in-place.

9a7bc4b

Drop newline.

866de4f

custom_char_traits::move() should handle overlapping ranges.

ab1e9e6

test_regex_traits::transform_primary should wrap `inner.transform_p…

05bec27

…rimary`.

StephanTLavavej reviewed Aug 7, 2025

StephanTLavavej approved these changes Aug 7, 2025

StephanTLavavej removed their assignment Aug 7, 2025

StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Aug 7, 2025

StephanTLavavej mentioned this pull request Aug 7, 2025

Maintainer priorities #4700

Open

StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Aug 7, 2025

Fix /clr:pure.

ecaa06d

StephanTLavavej approved these changes Aug 8, 2025

StephanTLavavej merged commit 4525436 into microsoft:main Aug 8, 2025
39 checks passed

github-project-automation bot moved this from Merging to Done in STL Code Reviews Aug 8, 2025

StephanTLavavej mentioned this pull request Aug 11, 2025

<regex>: After _Uelem changes, Build2 asserts in the Real World Code test suite #5673

Closed

StephanTLavavej added a commit to StephanTLavavej/build2 that referenced this pull request Aug 11, 2025

Update line_char to handle microsoft/STL#5592.

facdd7f

StephanTLavavej mentioned this pull request Aug 13, 2025

Update line_char to tolerate <regex> round-trip conversions build2/build2#478

Closed

build2-bot pushed a commit to build2/build2 that referenced this pull request Aug 13, 2025

Update line_char to handle changes in microsoft/STL#5592

68f80c8
