`<regex>`: Correct characters not matched by special character dot by muellerj2 · Pull Request #5192 · microsoft/STL · GitHub


@muellerj2
Contributor

@muellerj2 muellerj2 commented Dec 16, 2024

This corrects the set of characters that the special character dot (`.`) does not match in a regular expression, as specified in the ECMAScript and POSIX standards, and aligns our treatment of `.` with libstdc++ and libc++.

  • Adds U+2028 Line Separator and U+2029 Paragraph Separator as characters not matched by `.` in a `wregex` in ECMAScript mode. See the definition of `.` semantics in Section 22.2.2.7 of ECMAScript 14, which removes the line terminators from the set of matched characters, and the list of line terminators in Section 12.3. (Note that this links to a newer standard, but the set of unmatched characters has not been changed since ECMAScript 3. Furthermore, the C++ standard does not modify the interpretation of `.`.)
  • In all other modes, `.` now matches all characters except NUL. This is in accordance with Section 9.3.4 and Section 9.4.4 of the POSIX standard. (I contemplated whether a newline (LF) should be excluded in addition to or instead of NUL in `grep` or `egrep` mode, as that is what grep implementations tend to do. The POSIX standard only states that regular expressions cannot match LFs due to the way grep works, but does not explicitly modify the definition of `.` or of regular expressions in general, so it is ambiguous on this question. Since libstdc++ and libc++ only exclude NUL from the set of characters matched by `.` in `grep` and `egrep` mode, I decided to align the set of unmatched characters with them.)
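
For readers who want to check the behavior described above against their own standard library, here is a minimal sketch. The helper names `dot_matches` and `wdot_matches` are hypothetical and are not part of this PR or of the STL; they only wrap `std::regex_match` with a single-`.` pattern. Results depend on the implementation and its version, so treat this as a probe rather than a specification.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (illustration only, not part of the PR):
// reports whether a lone `.` matches the whole narrow string `s`
// when the pattern is compiled with the given grammar flag.
bool dot_matches(const std::string& s,
                 std::regex_constants::syntax_option_type grammar) {
    const std::regex re(".", grammar);
    return std::regex_match(s, re);
}

// Wide-character counterpart, so that U+2028 and U+2029 can be
// passed as single wchar_t code units.
bool wdot_matches(const std::wstring& s,
                  std::regex_constants::syntax_option_type grammar) {
    const std::wregex re(L".", grammar);
    return std::regex_match(s, re);
}
```

With the rules in this description, `dot_matches("\n", std::regex::ECMAScript)` and `wdot_matches(std::wstring(1, L'\x2028'), std::regex::ECMAScript)` are expected to be false, while the same inputs under `std::regex::basic` or `std::regex::extended` are expected to be true. A one-character string holding NUL, e.g. `std::string(1, '\0')`, can be used to probe the POSIX-mode exclusion that LWG-3603 discusses.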

Note: Whether NUL should be matched in POSIX regular expressions is the subject of LWG-3603.

@muellerj2 muellerj2 requested a review from a team as a code owner December 16, 2024 14:26
@CaseyCarter CaseyCarter added the bug Something isn't working label Dec 17, 2024
@StephanTLavavej StephanTLavavej self-assigned this Dec 17, 2024
@StephanTLavavej StephanTLavavej added the regex meow is a substring of homeowner label Jan 8, 2025
@StephanTLavavej
Member

Thanks as always for the careful analysis, PR description, and changes! 😻 I pushed small stylistic changes.

@StephanTLavavej StephanTLavavej removed their assignment Jan 13, 2025
@StephanTLavavej StephanTLavavej self-assigned this Jan 13, 2025
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 247c51f into microsoft:main Jan 14, 2025
39 checks passed
@StephanTLavavej
Member

✅ 🐱 🐈

@muellerj2 muellerj2 deleted the regex-correct-dot-interpretation branch January 14, 2025 20:17
