KEMBAR78
Expose IUtf8SpanParsable and implement it on the primitive numeric types by tannergooding · Pull Request #86875 · dotnet/runtime · GitHub
Skip to content

Conversation

tannergooding
Copy link
Member

This makes progress towards #81500

Still not covered are:

  • BigInteger
  • Boolean
  • Complex
  • Char (explicit API surface to avoid boxing)
  • DateOnly
  • DateTime
  • DateTimeOffset
  • Enum
  • Guid
  • IPAddress
  • IPNetwork
  • TimeOnly
  • TimeSpan
  • Version

@ghost
Copy link

ghost commented May 29, 2023

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@ghost ghost assigned tannergooding May 29, 2023
@ghost
Copy link

ghost commented May 29, 2023

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details

This makes progress towards #81500

Still not covered are:

  • BigInteger
  • Boolean
  • Complex
  • Char (explicit API surface to avoid boxing)
  • DateOnly
  • DateTime
  • DateTimeOffset
  • Enum
  • Guid
  • IPAddress
  • IPNetwork
  • TimeOnly
  • TimeSpan
  • Version
Author: tannergooding
Assignees: tannergooding
Labels:

area-System.Numerics, new-api-needs-documentation

Milestone: -

@tannergooding
Copy link
Member Author

summary:
better: 6, geomean: 1.067
total diff: 6

No Slower results for the provided threshold = 2% and noise filter = 0.3 ns.

| Faster                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ---------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Tests.Perf_Int64.Parse(value: "9223372036854775807")      |      1.09 |            12.27 |            11.30 |         |
| System.Tests.Perf_Int64.TryParse(value: "-9223372036854775808")  |      1.08 |            12.40 |            11.45 |         |
| System.Tests.Perf_UInt64.Parse(value: "18446744073709551615")    |      1.08 |            12.62 |            11.73 |         |
| System.Tests.Perf_UInt64.TryParse(value: "18446744073709551615") |      1.07 |            12.04 |            11.23 |         |
| System.Tests.Perf_Int64.TryParse(value: "9223372036854775807")   |      1.05 |            11.80 |            11.20 |         |
| System.Tests.Perf_Int64.Parse(value: "-9223372036854775808")     |      1.03 |            12.20 |            11.83 |         |

@tannergooding
Copy link
Member Author

WASM Interpreter reports

No differences found between the benchmark results with threshold 0%.

@tannergooding
Copy link
Member Author

CC. @stephentoub

if (throwOnFailure)
{
Number.ThrowOverflowException(Type.GetTypeCode(typeof(TUnderlying)));
ThrowHelper.ThrowOverflowException();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 959 above was modified to read Number.ThrowOverflowException<TUnderlying>();, but this line was modified to remove the type code / name entirely. Was this intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The underlying here was only hitting the "rare types" (float, double, nint, nuint, or char). None of these were handled by Number.ThrowOverflowException(TypeCode) and would have triggered asserts in debugging (since they aren't decimal) and using an incorrect string otherwise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Directly throwing OverflowException without a string is both simpler and more accurate than what we had before.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool Vector128OrdinalIgnoreCaseAscii(Vector128<byte> vec1, Vector128<byte> vec2)
{
// ASSUMPTION: Caller has validated that input values are ASCII.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible more optimized (completely untested!) implementation, based solely on codegen size.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool Vector128OrdinalIgnoreCaseAscii(Vector128<byte> vec1, Vector128<byte> vec2)
{
    // ASSUMPTION: Caller has validated that input values are ASCII.

    Vector128<sbyte> vector0x20 = Vector128.Create((sbyte)0x20);

    // Convert all characters to lowercase.
    // Some non-letter chars will also be changed; we'll deal with them later.

    Vector128<sbyte> asLower = vec1.AsSByte() | vector0x20;

    // Create a vector where all letter characters [a-z] are normalized to 0x20
    // and all non-letter characters are normalized to 0x00.

    Vector128<sbyte> letterChars = Vector128.GreaterThan(asLower + Vector128.Create((sbyte)(0x7F - 'z')), Vector128.Create((sbyte)(0x7F - 26))) & vector0x20;

    // Compute the ones-complement diff between vec1 and vec2.

    Vector128<sbyte> onesCompDiff = vec1.AsSByte() ^ vec2.AsSByte();

    // There must be no bits set in 'onesCompDiff' that aren't also set in 'letterChars'.

    return Vector128.AndNot(letterChars, onesCompDiff) == default;
}

Copy link
Member Author

@tannergooding tannergooding Jul 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not going to worry about optimizing further at this point. Its more important that we get the feature in for .NET 8.

As is, this generally matches the algorithm used by the UTF-16 path and is correct.

An interested party can always submit this as an optimization after the main support goes in.

utf16Text = utf16TextArray.AsSpan(0, textMaxCharCount);
}

int utf16TextLength = Encoding.UTF8.GetChars(utf8Text, utf16Text);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential DoS vector here. This logic cannot safely coexist with approved API #87171. It's reasonable that somebody might write code like this:

while (!utf8Span.IsEmpty)
{
    TNumber parsed = TNumber.Parse(utf8Span, NumberStyles.AllowTrailingInvalidCharacters, out int bytesConsumed);
    UseValue(parsed);
    utf8Span = utf8Span.Slice(bytesConsumed).TrimStart((byte)','); // or whatever the expected delimiter char is
}

The code above appears to have O(n) runtime, where n is the length (in bytes) of the input span. However, since there's a transcoding step taking place under the covers, and since the transcoding step consumes the entire remainder of the input span, the actual runtime is O(n^2). Since aspnet allows a 4MB request by default, this means a loop like this can result in <some constant factor> * 16 trillion units of work being performed.

(Same comment elsewhere in this file where similar code appears.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed to use Utf8.ToUtf16(utf8Text, utf16Text, out _, out int utf16TextLength, replaceInvalidSequences: false), as with other paths. It throws FormatException on Parse and returns false for TryParse, indicating the input was not in a correct format.

utf16Text = utf16TextArray.AsSpan(0, textMaxCharCount);
}

int utf16TextLength = Encoding.UTF8.GetChars(utf8Text, utf16Text);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential correctness issue here. What behavior should this routine have when the input is not well-formed UTF-8? The current behavior here is: lossily replace invalid UTF-8 sequences with well-formed UTF-16 replacement characters, then call the UTF-16 parse routine. Does this change the desired correctness of the parse routine? What about if the replacement character has special meaning to the specified provider?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed to use Utf8.ToUtf16(utf8Text, utf16Text, out _, out int utf16TextLength, replaceInvalidSequences: false), as with other paths. It throws FormatException on Parse and returns false for TryParse, indicating the input was not in a correct format.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any dev that needs to override the behavior can and should by overriding in their derived type.


if (sourceStatus != OperationStatus.Done)
{
return false;
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks Jul 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets a bit complicated. The code as-is definitively returns "not a prefix match!" - even though the inputs are invalid to the point that the question is nonsensical. It's a bit like asking "is 'dog' less than 'cat'?" It's not true and it's not false. It's just... huh?

Throwing a wrench into this even more is that ICU actually does have specialized handling for invalid UTF-16 sequences. (Basically, it treats them as opaque chars that don't match anything except for themselves.) So this means that, e.g., the sequence "\uD800\uD801".StartsWith("\uD800") would return true, even though both sequences are ill-formed. But in the logic you have here, "\xFF\xFE"u8.CultureAwareStartsWith("\xFF"u8) would return false.

Recommendation: Invalid UTF-8 sequences should throw (not return false) or should utilize existing ICU handling for invalid UTF-16 sequences.

The way I worked around this in the UTF-8 prototype was to use something akin to WTF-16 for the comparison. Basically, every time an invalid UTF-8 byte is encountered, append an invalid UTF-16 code point to the string we're building. Any invalid UTF-8 byte 0x?? would map to the char U+DF??. So for instance, the invalid byte 0x80 would map to the invalid char 0xDF80, the invalid byte 0xC0 would map to the invalid char U+DFC0, etc. The end result is that you're building up something that is mostly UTF-16, with the invalid sequences specifically chosen not to conflict with one another, which allows ICU to handle the input string appropriately.

This does not have the same algorithmic complexity constraints that I call out in IUtf8SpanParsable, since ICU's implementation for culture-aware prefix matching is O(needle.Length + haystack.Length) with no early-exit optimization. It's lamentable, but it is what it is. So that means the transcoding logic is upping some constant factor but isn't otherwise changing the algorithmic complexity itself.

(NLS, I believe, does have an early exit consideration. But that should be enough of an edge case that I'm willing to hold my nose for any algorithmic complexity changes we introduce.)

@tannergooding tannergooding merged commit 1dd0fa2 into dotnet:main Jul 12, 2023
@tannergooding tannergooding deleted the utf8-parsing branch July 12, 2023 05:12
@EgorBo
Copy link
Member

EgorBo commented Jul 18, 2023

Comment on lines +304 to +308
byteOffset += 4;
length -= 4;
}

Debug.Assert(length == 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am quite confused as to how testing was good before and is failing now, but this snippet looks suspicious. The length was 1-3, and byteOffset (though not used after this point) also would be increased by 1-3 (except for the 2 sometimes already added above)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants