`<regex>`: Remove non-standard `_Uelem` from matcher by muellerj2 · Pull Request #5671 · microsoft/STL · GitHub

@muellerj2
Contributor

@muellerj2 muellerj2 commented Aug 9, 2025

Resolves #995 by eliminating _Uelem completely.

This PR does not quite complete support for custom character types: The internals of regex_search still place some additional requirements on such types. But except for ADL resilience, the support should be complete for regex_match, as the passing test confirms.

<regex> changes

  • Remove usage of _Uelem from _Is_word by converting to unsigned char instead and checking that converting back to the character type yields the same character again.
    • While at it, _STD-qualify some _Is_word calls.
  • Remove usage of _Uelem from _Lookup_range by relying on the lt() function in the char traits type instead.
    • Since this is possibly called in a tight loop, I added two overloads for char and wchar_t that avoid calling the lt() function for the standard traits type.
  • Remove usage of _Uelem from _Do_class() by converting to unsigned char instead and checking that re-conversion to the character type yields the same character again.
  • Remove requirement for implicit conversions when comparing with ECMAScript line terminators by explicitly converting to unsigned int before comparing with the line terminator code points.
    • The round trip conversion check makes sure that we don't accidentally confuse some code point values larger than 0x100000000 with the line terminators (although I don't know why anyone would ever need that).
    • Signed character types are unproblematic: The code points for line terminators are positive values in all standard C++ signed types, so they will never be sign-extended when converting to unsigned int.
    • Again, I added two overloads for char and wchar_t, but in this case mainly because their logic is noticeably simpler.
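
The three techniques in the list above (round-trip conversion check, range lookup via the traits' lt(), and line terminator comparison via unsigned int) can be sketched roughly as follows. This is only an illustration and not the code in this PR: the names is_word_sketch, in_range_sketch and is_line_terminator_sketch are hypothetical, the word-character test assumes an ASCII-compatible encoding, and the line terminator check assumes a 32-bit unsigned int.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers for illustration only; these are NOT the STL's internal functions.

// Word-character test: narrow to unsigned char, then reject the character
// if converting back does not reproduce the original value.
template <class Elem>
bool is_word_sketch(Elem ch) {
    const auto uch = static_cast<unsigned char>(ch);
    if (static_cast<Elem>(uch) != ch) {
        return false; // the narrowing conversion lost information
    }
    // Assumption: ASCII-compatible encoding.
    return (uch >= '0' && uch <= '9') || (uch >= 'A' && uch <= 'Z')
        || (uch >= 'a' && uch <= 'z') || uch == '_';
}

// Range test that relies only on the ordering supplied by the char traits type.
template <class Elem, class Traits = std::char_traits<Elem>>
bool in_range_sketch(Elem ch, Elem first, Elem last) {
    return !Traits::lt(ch, first) && !Traits::lt(last, ch);
}

// ECMAScript line terminator test: convert explicitly to unsigned int and
// verify the round trip so that truncated large values are not misclassified.
template <class Elem>
bool is_line_terminator_sketch(Elem ch) {
    const auto code = static_cast<unsigned int>(ch);
    if (static_cast<Elem>(code) != ch) {
        return false; // the value does not fit in unsigned int
    }
    return code == 0x0A || code == 0x0D || code == 0x2028 || code == 0x2029;
}
```

With an unsigned long long character, a value such as 0x10000000A truncates to 0x0A when converted to a 32-bit unsigned int, and the round-trip check is what keeps it from being treated as a line feed.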

Test changes

The test coverage for custom character types largely remains rudimentary, but I added some test coverage that matching behaves as it should in places where the parser and matcher rely on potentially narrowing conversions.

The test covers the matcher changes in this PR: Word boundaries (_Is_word), single character matching (_Matcher2::_Do_class), character ranges (_Lookup_range) and line terminators (_Is_ecmascript_line_terminator). Additionally, some of the character ranges are chosen to validate that the implementation of _Builder::_Add_range remains correct for custom character types.

Beyond this, I made the following changes to tests:

  • Removed workaround for #5563 ("<xstring>: Suppress code analysis warning C6510 for basic_string").
  • Removed workarounds for #995 ("<regex>: basic_regex wants regex_traits to provide things not required by [re.req]").
  • wrapped_wchar was turned into a template wrapped_character<Elem> so that it can be used with an unsigned long long character type as well.
  • Removed operator wchar_t() from wrapped_character<Elem> and replaced it by friend functions convert_to<target_type>, which are now called by the test traits classes.
    • This tightens the test coverage for <regex> as it removes implicit conversions from the custom types that only existed to support the implementation of the test traits classes.
    • Since ADL doesn't work for function calls with explicit template parameters in C++14 and C++17, I had to move the definitions of the custom character types before the definitions of the traits classes.
  • The test char traits class was changed to accommodate an unsigned long long-like character type.
  • The implementation of transform_primary in the test regex traits was replaced by a dummy implementation to allow the test to run under /clr:pure.
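
As a rough illustration of the wrapped_character<Elem> change described above, the sketch below shows a wrapper without any implicit conversion operator plus a convert_to<target_type> function that a traits class can call explicitly. It is an assumption-laden sketch, not the test code from this PR: the friend is declared as a namespace-scope template here, and is_digit_sketch is a hypothetical stand-in for a traits member. Declaring convert_to before its first use keeps the call with explicit template arguments well-formed in C++14 and C++17, where ADL alone does not make such a call work.

```cpp
#include <cassert>
#include <type_traits>

template <class Elem>
class wrapped_character;

// Declared up front so that convert_to<Target>(wc) is found by ordinary lookup
// and does not depend on ADL.
template <class Target, class Elem>
Target convert_to(const wrapped_character<Elem>& wc);

// Sketch of a custom character type with no implicit conversions.
template <class Elem>
class wrapped_character {
public:
    explicit wrapped_character(Elem v) : value(v) {}

    // Grants convert_to access to the private value.
    template <class Target, class E>
    friend Target convert_to(const wrapped_character<E>& wc);

private:
    Elem value;
};

template <class Target, class Elem>
Target convert_to(const wrapped_character<Elem>& wc) {
    return static_cast<Target>(wc.value);
}

// The wrapper is not implicitly convertible to its underlying type.
static_assert(!std::is_convertible<wrapped_character<wchar_t>, wchar_t>::value,
    "wrapped_character must not convert implicitly");

// Hypothetical traits-style helper: every conversion is spelled out.
inline bool is_digit_sketch(const wrapped_character<wchar_t>& wc) {
    const auto c = convert_to<wchar_t>(wc);
    return c >= L'0' && c <= L'9';
}
```

Because the only way to read the wrapped value is an explicit convert_to call, any place where library code silently relies on an implicit conversion fails to compile instead of passing unnoticed.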

@muellerj2 muellerj2 requested a review from a team as a code owner August 9, 2025 16:27
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Aug 9, 2025
@StephanTLavavej StephanTLavavej added the bug (Something isn't working) and regex (meow is a substring of homeowner) labels Aug 9, 2025
@StephanTLavavej StephanTLavavej self-assigned this Aug 9, 2025
@StephanTLavavej StephanTLavavej removed their assignment Aug 11, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Aug 11, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Aug 15, 2025
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 1449cee into microsoft:main Aug 16, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Aug 16, 2025
@StephanTLavavej
Member

😻 🎉 😸
