Move `OsStr::slice_encoded_bytes` validation to platform modules by blyxxyz · Pull Request #118569 · rust-lang/rust · GitHub

Conversation

@blyxxyz
Contributor

@blyxxyz blyxxyz commented Dec 3, 2023

This delegates OS string slicing (OsStr::slice_encoded_bytes) validation to the underlying platform implementation. For now that results in increased performance and better error messages on Windows without any changes to semantics. In the future we may want to provide different semantics for different platforms.

The existing implementation is still used on Unix and most other platforms and is now optimized a little better.

Tracking issue: #118485

cc @epage, @BurntSushi

@rustbot
Collaborator

rustbot commented Dec 3, 2023

r? @Mark-Simulacrum

(rustbot has picked a reviewer for you, use r? to override)

@rustbot added the O-unix, O-wasm, O-windows, S-waiting-on-review, and T-libs labels Dec 3, 2023
@blyxxyz blyxxyz mentioned this pull request Dec 3, 2023
4 tasks
@epage
Contributor

epage commented Dec 4, 2023

imo this doesn't feel worth it unless we do move in the direction of relaxing bounds. This creates more code paths that are effectively meant to be in sync, creating more of a burden for people trying to understand or maintain the code. In my mind, to justify this we'd need benchmarks showing the improvement, and it would have to be compelling enough.

@blyxxyz
Contributor Author

blyxxyz commented Dec 4, 2023

I felt that the general wasm improvement and better error messages tipped it over the edge. But I'm also OK with shelving it.

@Mark-Simulacrum
Member

I don't have a strong opinion either, but I'm not very familiar with this code or the direction we're headed in here. I think hearing @BurntSushi's thoughts or someone who is actively maintaining this area could convince me either way.

When you say "increased performance" -- is that a difference between "did nothing" vs. "have to scan the string", or more minor, like "compared 10 bytes" vs. "compared 15 bytes"?

@blyxxyz
Contributor Author

blyxxyz commented Dec 9, 2023

On wasm it's the difference between "do nothing" and "scan the whole string". (Not for this operation, which was an expensive O(1) and becomes checking a single byte, but for to_str, to_string_lossy, etcetera.) It's useful independently of the slicing feature. OsStr/Path may not be commonly used on wasm though, so it'd be helpful to have a reality check from someone who's more familiar with it (including use cases like Cloudflare Workers).

Maybe unsupported is used by platforms besides wasm? They could also benefit.

On Windows it's the difference between "call a full UTF-8 validation function to check a short substring" and "check 1 or 2 bytes" if you're splitting in the middle of non-ASCII text, which might be uncommon or might be vanishingly rare depending on how the method gets used in the wild.

@bors
Collaborator

bors commented Jan 13, 2024

☔ The latest upstream changes (presumably #117285) made this pull request unmergeable. Please resolve the merge conflicts.

@blyxxyz blyxxyz force-pushed the platform-os-str-slice branch from 694a789 to 7504473 Compare January 13, 2024 18:19
- check_valid_boundary(encoded_bytes, start);
- check_valid_boundary(encoded_bytes, end);

+ self.inner.check_public_boundary(start);
Member

What makes this a public boundary? Is there a private boundary?

I think it would be good to add some docs here if we're adding an extension point -- perhaps a couple of lines of comment describing what the function should (roughly) do?

Contributor Author

The public boundaries are where we let users split without panicking. The private boundaries would depend on the safety invariants for the internal encoding. For example if you split in the middle of a WTF-8 codepoint you can cause out-of-bounds reads, so that's neither a public nor a private boundary. But if you split between surrogate codepoints then that's fine as far as the implementation is concerned, we just don't allow users to do that, so that's a private boundary but not a public boundary.

I've added a comment, good call.

- None => false,
- Some(&b) => b < 128 || b >= 192,
+ None => index == slice.len(),
+ Some(&b) => (b as i8) >= -0x40,
Member

Why did this implementation change?

(In particular it seems like the behavior is no longer an unconditional true for index = 0, and also doesn't correspond with the str::is_char_boundary impl?)

The (b as i8) >= -0x40 is probably clearer as b.is_utf8_char_boundary().

Contributor Author

This is identical to the str::is_char_boundary impl. The point was to bring it in line with that and to be consistent with the new function.

I can't use b.is_utf8_char_boundary() because it's private to core. (Unless there are workarounds for that?)
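
For reference, the byte-level part of the two forms in the diff can be compared exhaustively. This is a minimal sketch that covers only the `Some(&b)` arm, not the `None` arm or any index-0 fast path. The function names are hypothetical and are not part of std.

```rust
/// Byte test as written on the removed line of the diff.
fn is_boundary_old(b: u8) -> bool {
    b < 128 || b >= 192
}

/// Byte test as written on the added line of the diff.
/// Continuation bytes 0x80..=0xBF map to -128..=-65 as i8,
/// so everything >= -0x40 (-64) is not a continuation byte.
fn is_boundary_new(b: u8) -> bool {
    (b as i8) >= -0x40
}

/// Returns true if both forms agree for every possible byte value.
fn forms_agree() -> bool {
    (0..=u8::MAX).all(|b| is_boundary_old(b) == is_boundary_new(b))
}
```

Both forms reject exactly the UTF-8 continuation bytes; the signed comparison just does it with one comparison instead of two.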

Member

Ah, so it is. We should think about exposing that as a public API, it seems consistent with the is_ascii_ functions we already expose that have similar bit-twiddling internally.

/// implementation detail.
#[track_caller]
#[inline]
pub fn check_utf8_boundary(slice: &Wtf8, index: usize) {
Member

I'm confused how this is different from the is_code_point_boundary (from just the method name/comments)?

str::is_char_boundary is documented as "is the first byte in a UTF-8 code point sequence or the end of the string", which sounds very similar to "at the edge of either a valid UTF-8 codepoint or of the whole string". Is this needed separately due to WTF-8 details perhaps?

I'm worried that it'll be easy to use the wrong function so I think some detail on when we should use each in comments would be good. It's also a bit worrying to me that we want a new function since that feels like it implies we're changing behavior rather than just optimizing here?

Contributor Author

I'm posting a longer explanation below, but the gist of it is that there are some WTF-8 codepoints that are not UTF-8 codepoints. If the string is pure UTF-8 then the boundaries are the same. I've tweaked the comment a little.

It's not a behavioral change, the old function wasn't used for this functionality to begin with. The cases in which this implementation panics should be the same as those in which the old one does.

@Mark-Simulacrum
Member

Apologies for the delay in reviewing.

> On Windows it's the difference between "call a full UTF-8 validation function to check a short substring" and "check 1 or 2 bytes" if you're splitting in the middle of non-ASCII text, which might be uncommon or might be vanishingly rare depending on how the method gets used in the wild.

I didn't look too closely yet, but I guess I'm confused by this framing. Unless this PR is changing what we expose to users (based on the description it doesn't sound like it), these two operations should be basically equivalent, no? I.e., we need to check just as many bytes. It's possible that the check itself can be cheaper if we know the bytes are wtf8...?

In general my current inclination is that this additional complexity (~500 LOC) is not worth it for what seem like either tiny or non-existent gains on Windows. For wasm, my suspicion is that avoiding the utf-8 checks isn't materially impactful to almost any programs, so I definitely don't think it's worth the complexity.

@Mark-Simulacrum added the S-waiting-on-author label and removed the S-waiting-on-review label Jan 20, 2024
On Windows and UEFI this improves performance and error messaging.

On other platforms we optimize the fast path a bit more.

This also prepares for later relaxing the checks on certain platforms.
@blyxxyz blyxxyz force-pushed the platform-os-str-slice branch from 7504473 to 51a7396 Compare January 21, 2024 19:09
@blyxxyz
Contributor Author

blyxxyz commented Jan 21, 2024

Thanks for the review! No worries about the delay.

> It's possible that the check itself can be cheaper if we know the bytes are wtf8...?

Yes, much cheaper. WTF-8 is only a slight superset of UTF-8, so most of the work is already done.

Validating a UTF-8 codepoint sequence looks something like this:

  1. Check that all the byte headers are valid and consistent (e.g. if byte 1 starts with 1110 then bytes 2 and 3 must start with 10)
  2. Check that the shortest possible encoding is used, with no more leading zeros than necessary
  3. Check that the codepoint is at most 0x10FFFF (the highest codepoint)
  4. Check that the codepoint is outside the range 0xD800..=0xDFFF (the surrogate codepoints)

This is somewhat involved even with optimizations. And the API used for the current implementation is designed to validate whole strings starting at the front, so we need to hold it in a weird way to perform the checks we want (validating just a single codepoint, sometimes starting at the back).

But if you already know you have valid UTF-8 it's possible to check a boundary with a single comparison of a single byte.
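
To make the cost of the four checks above concrete, here is a minimal sketch that validates a single sequence at the front of a byte slice. `decode_one` is a hypothetical helper written for illustration. It is not the code std uses, which validates whole strings and is more heavily optimized.

```rust
/// Decodes one UTF-8 sequence from the front of `bytes`.
/// Returns the code point and its encoded length, or `None` if invalid.
fn decode_one(bytes: &[u8]) -> Option<(u32, usize)> {
    let first = *bytes.first()?;
    // Check 1a: the header of the first byte determines the length.
    // `min` is the smallest code point that needs this many bytes.
    let (len, init, min): (usize, u32, u32) = match first {
        0x00..=0x7F => return Some((first as u32, 1)),
        0xC0..=0xDF => (2, (first & 0x1F) as u32, 0x80),
        0xE0..=0xEF => (3, (first & 0x0F) as u32, 0x800),
        0xF0..=0xF7 => (4, (first & 0x07) as u32, 0x1_0000),
        _ => return None, // stray continuation byte or invalid header
    };
    let mut cp = init;
    for i in 1..len {
        let b = *bytes.get(i)?; // truncated sequence
        // Check 1b: every following byte must start with bits 10.
        if b & 0xC0 != 0x80 {
            return None;
        }
        cp = (cp << 6) | (b & 0x3F) as u32;
    }
    // Check 2: reject overlong (non-shortest) encodings.
    if cp < min {
        return None;
    }
    // Check 3: reject values above the highest code point.
    if cp > 0x10FFFF {
        return None;
    }
    // Check 4: reject surrogate code points.
    if (0xD800..=0xDFFF).contains(&cp) {
        return None;
    }
    Some((cp, len))
}
```

By contrast, once the input is known to be valid UTF-8, `str::is_char_boundary` can answer the boundary question by looking at one byte.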

WTF-8 partially relaxes requirement 4. It's shaped exactly like UTF-8, but it can contain surrogate codepoints. So to satisfy the slice_encoded_bytes() requirements we have to:

  • Perform the same boundary check.
  • Watch out for surrogate codepoints.

This is still very cheap. The wtf8 module already had examples of how to efficiently do the second.
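
As an illustration of those two bullets, here is a sketch that works on a raw byte slice assumed to already be well-formed WTF-8. The name `is_public_boundary` and the `bool` return value are assumptions for this example; the PR's `check_utf8_boundary` takes a `&Wtf8` and panics instead. The sketch relies on a surrogate being encoded as `0xED` followed by a byte in `0xA0..=0xBF`, and it rejects an index that sits between two surrogates, matching the earlier comment that users may not split there.

```rust
/// Returns true if a user may split at `index`.
/// Assumes `bytes` is well-formed WTF-8; this is not verified here.
fn is_public_boundary(bytes: &[u8], index: usize) -> bool {
    // The start and end of the string are always allowed.
    if index == 0 || index == bytes.len() {
        return true;
    }
    let b = match bytes.get(index) {
        Some(&b) => b,
        None => return false, // past the end
    };
    // Same one-byte test as for UTF-8: continuation bytes are not boundaries.
    if (b as i8) < -0x40 {
        return false;
    }
    // A surrogate starts right after `index`.
    let surrogate_after =
        b == 0xED && matches!(bytes.get(index + 1), Some(&n) if n >= 0xA0);
    // A surrogate ends right before `index`. 0xED can only be a lead byte,
    // so finding it three bytes back means a 3-byte sequence ends here.
    let surrogate_before =
        index >= 3 && bytes[index - 3] == 0xED && bytes[index - 2] >= 0xA0;
    !(surrogate_before && surrogate_after)
}
```

For input that happens to be pure UTF-8 the surrogate branches never trigger, so the result matches `str::is_char_boundary`.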

> In general my current inclination is that this additional complexity (~500 LOC) is not worth it for what seem like either tiny or non-existent gains on Windows. For wasm, my suspicion is that avoiding the utf-8 checks isn't materially impactful to almost any programs, so I definitely don't think it's worth the complexity.

I've removed the WASM module from this PR. The Windows optimization doesn't require it. A future semantic change may require something like it, but if we go through with that we can either bring it back or do it in a way that requires less code.

@Mark-Simulacrum added the S-waiting-on-review label and removed the S-waiting-on-author label Jan 22, 2024
@Mark-Simulacrum
Member

@bors r+

Thanks for the explanations. That makes sense to me, so let's go ahead and merge this in.

@bors
Collaborator

bors commented Feb 18, 2024

📌 Commit 51a7396 has been approved by Mark-Simulacrum

It is now in the queue for this repository.

@bors added the S-waiting-on-bors label and removed the S-waiting-on-review label Feb 18, 2024
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 18, 2024
…iaskrgr

Rollup of 7 pull requests

Successful merges:

 - rust-lang#118569 (Move `OsStr::slice_encoded_bytes` validation to platform modules)
 - rust-lang#121067 (make "invalid fragment specifier" translatable)
 - rust-lang#121224 (Remove unnecessary unit binding)
 - rust-lang#121247 (Add help to `hir_analysis_unrecognized_intrinsic_function`)
 - rust-lang#121257 (remove extraneous text from example config)
 - rust-lang#121260 (Remove const_prop.rs)
 - rust-lang#121266 (Add uncontroversial syscall doc aliases to std docs)

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit 99560a4 into rust-lang:master Feb 18, 2024
@rustbot rustbot added this to the 1.78.0 milestone Feb 18, 2024
@blyxxyz blyxxyz deleted the platform-os-str-slice branch February 18, 2024 20:51
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Feb 18, 2024
Rollup merge of rust-lang#118569 - blyxxyz:platform-os-str-slice, r=Mark-Simulacrum
Noratrieb pushed a commit to Noratrieb/rust that referenced this pull request Feb 18, 2024