KEMBAR78
`<regex>`: Correct character translation in `icase` and `collate` mode by muellerj2 · Pull Request #5553 · microsoft/STL · GitHub
Skip to content

Conversation

@muellerj2
Copy link
Contributor

@muellerj2 muellerj2 commented May 29, 2025

In icase and collate mode, characters are supposed to be passed through translate_nocase() and translate() of the traits class before comparing them. This PR makes sure we always apply these translations exactly once before any character or string comparison.

The test deliberately defines non-idempotent translation functions. They are very weird, but they make it easy to catch any repeated or unbalanced applications of these functions before character or string comparisons.

This PR also replaces _Cmp_cs by equal_to. While this means for now that a static call is replaced by a non-static one, there is already machinery in place to recognize when equal_to can be vectorized, and vectorization will help us significantly improve performance in one of the following PRs.

@muellerj2 muellerj2 requested a review from a team as a code owner May 29, 2025 13:26
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews May 29, 2025
@StephanTLavavej StephanTLavavej added bug Something isn't working regex meow is a substring of homeowner labels May 29, 2025
@github-project-automation github-project-automation bot moved this from Initial Review to Work In Progress in STL Code Reviews May 29, 2025
@StephanTLavavej StephanTLavavej moved this from Work In Progress to Initial Review in STL Code Reviews May 29, 2025
@StephanTLavavej StephanTLavavej self-assigned this May 29, 2025
@muellerj2
Copy link
Contributor Author

I realized later that the tests didn't cover two cases related character ranges in icase mode:

  • A non-collating range that triggers the small range optimization (right bound - left bound < 4 after translation).
  • A non-collating range that triggers neither the small range optimization nor the bitmap optimization.

For the idempotent translation functions used in the test, this means we need code points with values >= 0x200 for the lower-case variants, so I had to make use of wide strings for the two additional tests.

@jovibor
Copy link
Contributor

jovibor commented May 30, 2025

@muellerj2
The number of <regex> bug fixes you've introduced for the last few months is a bit shocking.
Is <regex> really that buggy, how did it even functionate?🤐
Anyway, good job and thanks for making the STL better for all of us.

@muellerj2
Copy link
Contributor Author

Is really that buggy, how did it even functionate?🤐

That's relatively easy to say: Most of the fixed bugs are or were specific to some subclasses of regexes (rarely used syntax options, specific escapes, wregex with specific character classes, usage of collating elements, icase or collate mode etc.), so if your regex wasn't in any of these subclasses, you weren't affected.

The bug fixed by this PR, for example, would only be observed if the traits classed provided (a) a non-trivial (or worse a non-idempotent) translate() function or (b) a non-idempotent translate_nocase() function. I'm not aware of any code that defined such translation functions. Well, except for the test code in this PR, but translation functions in the test code are deliberately weird to catch that they aren't applied exactly once. They aren't designed to do anything useful.

@StephanTLavavej
Copy link
Member

Thanks! 😻 I pushed cosmetic changes.

@StephanTLavavej StephanTLavavej removed their assignment Jun 10, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Jun 10, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Jun 11, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit b1dbdfe into microsoft:main Jun 14, 2025
48 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Jun 14, 2025
@StephanTLavavej
Copy link
Member

Thanks for fixing this bug, obscure though it may be! 🐞 🕵️ 😻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working regex meow is a substring of homeowner

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

3 participants