-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New ASCII APIs #75012
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New ASCII APIs #75012
Conversation
# Conflicts: # src/libraries/System.Private.CoreLib/src/System/Globalization/TextInfo.cs # src/libraries/System.Private.Uri/src/System/DomainNameHelper.cs # src/libraries/System.Private.Uri/src/System/UriHelper.cs
@stephentoub the PR is ready, is there any chance you could re-review it? |
private struct ToUpperConversion { } | ||
private struct ToLowerConversion { } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could there be used static abstract interfaces instead?
src/libraries/System.Private.Uri/src/System/DomainNameHelper.cs
Outdated
Show resolved
Hide resolved
} | ||
} | ||
OperationStatus conversionStatus = Ascii.FromUtf16(new ReadOnlySpan<char>(pManaged, length), new Span<byte>(pNative, length), out _); | ||
Debug.Assert(conversionStatus == OperationStatus.Done); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR, but as a potential follow-up, if it'll be common for pNative to be non-NULL and if data suggests it'll be long enough on average, it might be worth trying a fast-path that just does FromUtf16 without first doing IsValid.
src/libraries/System.Net.HttpListener/src/System/Net/HttpListener.cs
Outdated
Show resolved
Hide resolved
if (!(fallback is EncoderReplacementFallback replacementFallback | ||
&& replacementFallback.MaxCharCount == 1 | ||
&& replacementFallback.DefaultString[0] <= 0x7F)) | ||
&& Ascii.IsValid(replacementFallback.DefaultString[0]))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we thinking about char.IsAscii(char)
vs Ascii.IsValid(char)
?
https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Char.cs,33d30f343eda0003,references
Do we need the new char
/byte
methods at all? Should we have Array.IsValid(int value)
instead of both of the char
/byte
overloads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ascii.IsValid(char)
was added mostly for completeness and having everything in once place.
Should we have Array.IsValid(int value)
I assume you meant Ascii.IsValid(int)
? On the one hand it would be more flexible (users could pass integers as inputs too), on the other I am not sure about codegen differences and whether everyone knows that they can pass byte/char to int-accepting method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ascii.IsValid(char) was added mostly for completeness and having everything in once place.
So which should we be using? Should we change all use of char.IsAscii to be Ascii.IsValid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please no :) Personally I find char.IsAscii
to be more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then should we replace all Ascii.IsValid(char) with char.IsAscii?
We "just" added char.IsAscii in .NET 6. Now in .NET 8 we're adding an identical Ascii.IsValid(char). I'm trying to rationalize why we have both. I get the consistency argument, but I don't think it's worth consistency just to have a method no one uses.
Hence my question about whether we should just have the one Ascii.IsValid(int)
, or maybe both IsValid(int)
and IsValid(uint)
. Would there be performance benefits / cons to that? We do have cases today where we check {u}ints for < 0x80, e.g.
runtime/src/libraries/System.Private.CoreLib/src/System/Globalization/IdnMapping.cs
Lines 529 to 531 in 2072f16
private static bool Basic(uint cp) => | |
// Is it in ASCII range? | |
cp < 0x80; |
public static bool IsAsciiCodePoint(uint value) => value <= 0x7Fu; |
runtime/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Transcoding.cs
Lines 647 to 648 in 57bfe47
uint firstByte = pInputBuffer[0]; | |
if (firstByte <= 0x7Fu) |
runtime/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Transcoding.cs
Line 1423 in 57bfe47
if (thisChar <= 0x7Fu) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adamsitnik, opinions?
where TTo : unmanaged, IBinaryInteger<TTo> | ||
where TCasing : struct | ||
{ | ||
if (MemoryMarshal.AsBytes(source).Overlaps(MemoryMarshal.AsBytes(destination))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are the casts to bytes necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFROM
and TTO
can be different (char and byte for example) and in theory someone could do sth like this:
runtime/src/libraries/System.Text.Encoding/tests/Ascii/CaseConversionTests.cs
Lines 28 to 29 in c0f38d1
Assert.Throws<InvalidOperationException>(() => Ascii.ToLower(byteBuffer, MemoryMarshal.Cast<byte, char>(byteBuffer), out _)); | |
Assert.Throws<InvalidOperationException>(() => Ascii.ToLower(byteBuffer, MemoryMarshal.Cast<byte, char>(byteBuffer).Slice(1, 3), out _)); |
Since everything can be casted to bytes I decided to use it and cover all possible scenarios, but to be honest I did it mostly to cover all possible unit testing scenarios rather than expecting someone to actually do it.
uint elementValue = uint.CreateTruncating(value[start]); | ||
if ((elementValue > 0x20) || ((TrimMask & (1u << ((int)elementValue - 1))) == 0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could make this branch-free with:
uint c = (ushort)(uint.CreateTruncating(value) - '\t');
if ((int)((0xF8000100U << (short)c) & (c - 32)) >= 0)
See
runtime/src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Lines 4599 to 4628 in 2480f01
// Next, handle sets where the high - low + 1 range is <= 32. In that case, we can emit | |
// a branchless lookup in a uint that does not rely on loading any objects (e.g. the string-based | |
// lookup we use later). This nicely handles common sets like [\t\r\n ]. | |
if (analysis.OnlyRanges && (analysis.UpperBoundExclusiveIfOnlyRanges - analysis.LowerBoundInclusiveIfOnlyRanges) <= 32) | |
{ | |
additionalDeclarations.Add("uint charMinusLowUInt32;"); | |
// Create the 32-bit value with 1s at indices corresponding to every character in the set, | |
// where the bit is computed to be the char value minus the lower bound starting from | |
// most significant bit downwards. | |
bool negatedClass = RegexCharClass.IsNegated(charClass); | |
uint bitmap = 0; | |
for (int i = analysis.LowerBoundInclusiveIfOnlyRanges; i < analysis.UpperBoundExclusiveIfOnlyRanges; i++) | |
{ | |
if (RegexCharClass.CharInClass((char)i, charClass) ^ negatedClass) | |
{ | |
bitmap |= 1u << (31 - (i - analysis.LowerBoundInclusiveIfOnlyRanges)); | |
} | |
} | |
// To determine whether a character is in the set, we subtract the lowest char; this subtraction happens before the result is | |
// zero-extended to uint, meaning that `charMinusLowUInt32` will always have upper 16 bits equal to 0. | |
// We then left shift the constant with this offset, and apply a bitmask that has the highest | |
// bit set (the sign bit) if and only if `chExpr` is in the [low, low + 32) range. | |
// Then we only need to check whether this final result is less than 0: this will only be | |
// the case if both `charMinusLowUInt32` was in fact the index of a set bit in the constant, and also | |
// `chExpr` was in the allowed range (this ensures that false positive bit shifts are ignored). | |
negate ^= negatedClass; | |
return $"((int)((0x{bitmap:X}U << (short)(charMinusLowUInt32 = (ushort)({chExpr} - {Literal((char)analysis.LowerBoundInclusiveIfOnlyRanges)}))) & (charMinusLowUInt32 - 32)) {(negate ? ">=" : "<")} 0)"; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If possible I would prefer to avoid adding further optimizations now, merge the API, add benchmarks to dotnet/performance and then let others tune it.
src/libraries/System.Private.Uri/src/System/DomainNameHelper.cs
Outdated
Show resolved
Hide resolved
...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs
Outdated
Show resolved
Hide resolved
Co-authored-by: Miha Zupan <mihazupan.zupan1@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed this mostly through the lens of how the new methods are used--each usage is a nice improvement in maintainability. I also appreciate the copious comments through these implementations and tests.
More regressions: dotnet/perf-autofiling-issues#11194 |
mono interpreter regression dotnet/perf-autofiling-issues#11147 |
It seems this PR still has regressions opened, e.g. these 5 (we don't have a lot of coverage for ToLower so only 5) regressed if you open "full history" tab: dotnet/perf-autofiling-issues#10226 My guess that it's because the SIMD part to change case was a bit more efficient in #78262, e.g.: static Vector128<sbyte> ToLower(Vector128<sbyte> src)
{
var lowInd = Vector128.Create((sbyte)63) + src;
var combInd = Vector128.LessThan(Vector128.Create((sbyte)-103), lowInd);
return Vector128.AndNot(Vector128.Create((sbyte)0x20), combInd) + src;
} |
@EgorBo we have an issue that is tracking it: #80245
The new methods support 4 different combinations of input and output: byte->byte, byte->char, char->char and char->byte. The generic code is most likely not as optimal as it was for char->char implementation we had: runtime/src/libraries/System.Private.CoreLib/src/System/Text/Ascii.CaseConversion.cs Lines 250 to 260 in 1c442fc
|
fixes #28230