Added XML declaration check & `Source#skip_spaces` method by naitoh · Pull Request #282 · ruby/rexml · GitHub

@naitoh
Contributor

@naitoh naitoh commented Aug 11, 2025

Why?

Added XML declaration check

  • The version attribute is required in an XML declaration.
  • Only the version, encoding, and standalone attributes are allowed in an XML declaration.
  • An XML declaration is allowed only once.

See: https://www.w3.org/TR/xml/#NT-XMLDecl
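
The three rules above can be sketched as a small standalone check. This is only an illustration under stated assumptions: `check_xml_declaration` and `ALLOWED_XML_DECL_ATTRIBUTES` are hypothetical names, the check uses plain regular expressions rather than REXML's parser, and it ignores other constraints from the specification (such as attribute order).

```ruby
# Hypothetical helper, NOT REXML's implementation: it only mirrors the three
# rules listed in this pull request description.
ALLOWED_XML_DECL_ATTRIBUTES = %w[version encoding standalone].freeze

def check_xml_declaration(xml)
  # "<?xml" must be followed by whitespace, so "<?xml-stylesheet ...?>" is ignored.
  declarations = xml.scan(/<\?xml\s(.*?)\?>/m)
  return [] if declarations.empty?

  errors = []
  errors << "XML declaration is only allowed once" if declarations.size > 1

  # Collect the names of name="value" / name='value' pairs in the first declaration.
  content = declarations.first.first
  names = content.scan(/([\w.:-]+)\s*=\s*(["'])(.*?)\2/m).map { |name, _quote, _value| name }

  errors << "version attribute is required" unless names.include?("version")
  unknown = names - ALLOWED_XML_DECL_ATTRIBUTES
  errors << "unexpected attributes: #{unknown.join(', ')}" unless unknown.empty?
  errors
end

p check_xml_declaration('<?xml version="1.0" encoding="UTF-8"?><a/>')
p check_xml_declaration('<?xml encoding="UTF-8"?><a/>')
```

Returning a list of messages instead of raising keeps the sketch easy to inspect; a real parser would more likely raise a parse error at the first violation.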

Added Source#skip_spaces method

With @source.match?(/\s+/um, true), I want reading to stop immediately when the input does not start with spaces.
Instead, it keeps reading into the buffer while looking for a match that it never finds.
As a result, it reads all the way to the end of the file.

For large XML files, drop_parsed_content then runs frequently until the buffer is cleared, which may hurt performance.
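
The behaviour described above can be reproduced with a toy model. Everything in this sketch is an assumption made for illustration: `ToySource`, its `chunk_size` parameter, and the `reads` counter are hypothetical, and the class only imitates the "read more until the pattern matches" strategy; it is not REXML's `IOSource`.

```ruby
require "stringio"
require "strscan"

# Toy model of an IO-backed source (hypothetical, not REXML's IOSource).
class ToySource
  attr_reader :reads

  def initialize(io, chunk_size: 16)
    @io = io
    @chunk_size = chunk_size
    @scanner = StringScanner.new(+"")
    @reads = 0
  end

  # Keeps pulling chunks while the pattern fails at the current position,
  # so a pattern that can never match drains the whole input.
  def match?(pattern)
    until @scanner.skip(pattern)
      return false unless read_chunk
    end
    true
  end

  # Only inspects the current position; reads again only when the buffer
  # is empty or was consumed entirely by whitespace.
  def skip_spaces
    loop do
      read_chunk if @scanner.eos?
      break unless @scanner.skip(/\s+/) # first non-space character: stop
      break unless @scanner.eos?        # spaces ended inside the buffer
    end
  end

  def rest
    @scanner.rest
  end

  private

  def read_chunk
    chunk = @io.read(@chunk_size)
    return false if chunk.nil? # end of input
    @reads += 1
    @scanner << chunk
    true
  end
end

xml = "<root>" + "x" * 1000 + "</root>"

slow = ToySource.new(StringIO.new(xml))
slow.match?(/\s+/)
puts "match?      reads: #{slow.reads}"

fast = ToySource.new(StringIO.new(xml))
fast.skip_spaces
puts "skip_spaces reads: #{fast.reads}"
```

In this model `match?` buffers the entire input before giving up, while `skip_spaces` stops after the first chunk because the first character is not whitespace.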

Benchmark

                         before       after  before(YJIT)  after(YJIT) 
                 dom     32.534      35.130        54.559       53.528 i/s -     100.000 times in 3.073715s 2.846540s 1.832883s 1.868189s
                 sax     44.785      44.089        78.303       77.842 i/s -     100.000 times in 2.232907s 2.268138s 1.277093s 1.284657s
                pull     51.750      51.105        90.819       90.658 i/s -     100.000 times in 1.932351s 1.956759s 1.101094s 1.103050s
              stream     51.427      51.444        89.820       88.971 i/s -     100.000 times in 1.944502s 1.943855s 1.113340s 1.123960s

Comparison:
                              dom
        before(YJIT):        54.6 i/s 
         after(YJIT):        53.5 i/s - 1.02x  slower
               after:        35.1 i/s - 1.55x  slower
              before:        32.5 i/s - 1.68x  slower

                              sax
        before(YJIT):        78.3 i/s 
         after(YJIT):        77.8 i/s - 1.01x  slower
              before:        44.8 i/s - 1.75x  slower
               after:        44.1 i/s - 1.78x  slower

                             pull
        before(YJIT):        90.8 i/s 
         after(YJIT):        90.7 i/s - 1.00x  slower
              before:        51.8 i/s - 1.75x  slower
               after:        51.1 i/s - 1.78x  slower

                           stream
        before(YJIT):        89.8 i/s 
         after(YJIT):        89.0 i/s - 1.01x  slower
               after:        51.4 i/s - 1.75x  slower
              before:        51.4 i/s - 1.75x  slower
  • YJIT=ON : 0.98x - 1.00x faster
  • YJIT=OFF : 0.98x - 1.07x faster
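
The summary ratios appear to be the "after" throughput divided by the "before" throughput for each parser. A minimal sketch of that arithmetic, using the i/s values copied from the table above (the `RESULTS` hash and the `speedup` helper are hypothetical names introduced here):

```ruby
# i/s values copied from the benchmark table in this pull request.
RESULTS = {
  "dom"    => { before: 32.534, after: 35.130, before_yjit: 54.559, after_yjit: 53.528 },
  "sax"    => { before: 44.785, after: 44.089, before_yjit: 78.303, after_yjit: 77.842 },
  "pull"   => { before: 51.750, after: 51.105, before_yjit: 90.819, after_yjit: 90.658 },
  "stream" => { before: 51.427, after: 51.444, before_yjit: 89.820, after_yjit: 88.971 },
}.freeze

# A ratio above 1.0 means the patched code completed more iterations per second.
def speedup(before, after)
  after / before
end

RESULTS.each do |name, r|
  off = speedup(r[:before], r[:after])
  on  = speedup(r[:before_yjit], r[:after_yjit])
  puts format("%-6s  YJIT=OFF %.3fx  YJIT=ON %.3fx", name, off, on)
end
```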

@naitoh naitoh marked this pull request as ready for review August 11, 2025 14:23
@naitoh naitoh requested a review from kou August 11, 2025 14:23
@kou
Member

kou commented Aug 12, 2025

IOSource#match? read_keep param added

In the case of @source.match?(/\s+/um, true), if there are no spaces at the beginning, I want to stop reading immediately. However, it continues to read the buffer until it finds a match, but it never finds a match. As a result, it continues reading until the end of the file.

In the case of large XML files, drop_parsed_content occur frequently until the buffer is cleared, which may affect performance.

How about adding `Source#skip_spaces` or something similar instead of adding a new argument to `Source#match?`?
I feel that adding the read_keep parameter reduces readability...

@naitoh naitoh force-pushed the check_XMLDecl branch 2 times, most recently from 3bab763 to 9ededdc Compare August 12, 2025 15:00
@naitoh naitoh changed the title Added XML declaration check & IOSource#match? read_keep param added Added XML declaration check & Source#?skip_spaces method Aug 12, 2025
@naitoh naitoh changed the title Added XML declaration check & Source#?skip_spaces method Added XML declaration check & Source#skip_spaces? method Aug 12, 2025
@naitoh
Contributor Author

naitoh commented Aug 12, 2025

I changed it to use `Source#skip_spaces?`.

@naitoh naitoh changed the title Added XML declaration check & Source#skip_spaces? method Added XML declaration check & Source#skip_spaces? method Aug 12, 2025
@naitoh naitoh changed the title Added XML declaration check & Source#skip_spaces? method Added XML declaration check & Source#skip_spaces method Aug 13, 2025
@naitoh naitoh requested a review from kou August 13, 2025 12:56
end
name = parse_name(base_error_message)
@source.match?(/\s*/um, true) # skip spaces
@source.match?(Private::CUT_SPACES, true)
Member

Can we use @source.skip_spaces here too?

Contributor Author

Oh, it is cool!
Thanks!

naitoh added 2 commits August 14, 2025 21:54
## Why?
In the case of `@source.match?(/\s+/um, true)`, if there are no spaces at the beginning, I want to stop reading immediately.
However, it continues to read the buffer until it finds a match, but it never finds a match.
As a result, it continues reading until the end of the file.

In the case of large XML files, drop_parsed_content occurs frequently until the buffer is cleared, which may affect performance.

## Why?

- The version attribute is required in XML declaration.
- Only version attribute, encoding attribute, and standalone attribute are allowed in XML declaration.
- XML declaration is only allowed once.

See: https://www.w3.org/TR/xml/#NT-XMLDecl
@naitoh naitoh requested review from kou and tompng August 14, 2025 13:00
Member

@kou kou left a comment

+1

@naitoh
Contributor Author

naitoh commented Aug 22, 2025

Thank you for your reviews.

@naitoh naitoh merged commit 5859bde into ruby:master Aug 22, 2025
66 of 67 checks passed
@naitoh naitoh deleted the check_XMLDecl branch August 22, 2025 23:12