KEMBAR78
`<locale>`: Improve `std::collate::do_transform()`'s handling of wrongly encoded input by muellerj2 · Pull Request #5431 · microsoft/STL · GitHub
Skip to content

Conversation

@muellerj2
Copy link
Contributor

Resolves #5210.

Changes:

  • This fixes the comments in xstrxform.cpp and xwcsxfrm.cpp: The functions return SIZE_MAX on error, not INT_MAX like their UCRT counterparts.
  • In _Wcsxfrm(), the return value is now correctly set to static_cast<size_t>(-1) for one error case, bringing it in line with the other error case. Previously, it was set to INT_MAX, which is potentially a valid length for a sort key.
    • While this is clearly a bug in the code, I couldn't actually find any way to exert this error path. (I was a bit disappointed to find that my fix can't actually be observed to fix anything, but on the other hand, this means that we don't have to worry about any potential compatibility impact.)
  • I changed collate::do_transform() to return an empty string in the error case instead of throwing a confusing length_error("string too large").
    • The standard remains completely silent on what collate::do_transform() should do in case of error.
    • The empty string can only really be a valid return value for the empty string, so there isn't any potential for confusion whether the return value represents an actual sort key or an error.
    • Alternatively, we could consciously choose that the function throws an exception. I would prefer this exception to be more appropriate than length_error. Maybe runtime_error?
      • This would mean that the regex implementation would have to defend against this exception, because regex_match and regex_search really shouldn't throw just because the searched input string isn't valid UTF-8.

@muellerj2 muellerj2 requested a review from a team as a code owner April 24, 2025 18:27
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Apr 24, 2025
@StephanTLavavej StephanTLavavej added the bug Something isn't working label Apr 24, 2025
@StephanTLavavej StephanTLavavej self-assigned this Apr 24, 2025
@StephanTLavavej
Copy link
Member

Thanks! 😻 An empty string for error seems fine.

@StephanTLavavej StephanTLavavej removed their assignment Apr 24, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 24, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Apr 25, 2025
@StephanTLavavej StephanTLavavej merged commit 1b9594a into microsoft:main Apr 29, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Apr 29, 2025
@StephanTLavavej
Copy link
Member

Thanks for noticing and fixing these long-standing mistakes! 🦅 👁️ 😻

@muellerj2 muellerj2 deleted the locale-collate-handling-invalid-input branch May 31, 2025 21:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

<locale>: std::collate<_Elem>::do_transform() should behave appropriately when _LStrxfrm() fails

2 participants