- 
                Notifications
    You must be signed in to change notification settings 
- Fork 5.2k
Optimize Ascii.Equals when widening #87141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| return false; | ||
| } | ||
|  | ||
| (Vector256<ushort> lower, Vector256<ushort> upper) = Vector256.Widen(leftNotWidened); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is me adding value:
| (Vector256<ushort> lower, Vector256<ushort> upper) = Vector256.Widen(leftNotWidened); | |
| var (lower, upper) = Vector256.Widen(leftNotWidened); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is me adding value:
Please stop adding value. :-P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The BCL has a rule that var can only be used when the type is apparent.
It's a battle that I lost before ever joining the team 😅
        
          
                src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs
              
                Outdated
          
            Show resolved
            Hide resolved
        
      |  | ||
| Vector256<TRight> leftValues; | ||
| Vector256<TRight> rightValues; | ||
| ref TRight oneVectorAwayFromRightEnd = ref Unsafe.Add(ref currentRightSearchSpace, length - (uint)Vector256<TLeft>.Count); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not understanding why this is valid. We're subtracting from the "right" search space the number of "left" elements in a vector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works because TLeft and TRight are either the same type, or we are in the widen case where Vector<TLeft> is twice the size of Vector<TRight> and the widen code will advance twice Vector<TRight>.Count which is equal to 1 Vector<TLeft>.Count.
But it is written in a confusing way. The whole TLoader abstraction helps with code sharing but makes this part kind of yucky. Maybe if the compare method advanced the pointers it would be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. At a minimum a comment explaining would be helpful.
        
          
                src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | if (!Vector256<ushort>.AllBitsSet.Equals( | ||
| Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1)))) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should prefer the operators where possible:
| if (!Vector256<ushort>.AllBitsSet.Equals( | |
| Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1)))) | |
| if (Vector256<ushort>.AllBitsSet != (Vector256.Equals(lower, rightValues0) & Vector256.Equals(upper, rightValues1))) | 
bool Equals() in particular is the same as == for integral types, but isn't directly recognized as intrinsic and is an instance methjod. So the JIT has to inline and elide the reference taken for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be nice to just make this return Vector256<ushort>.AllBitsSet == (Vector256.Equals(lower, rightValues0) & Vector256.Equals(upper, rightValues1)) instead as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look at the "Side-note on weird codegen observed" section at the bottom of the PR description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be == if you noticed a suboptimal codegen - we'd better look and fix it in JIT instead of complicating C# code just to squeeze everything here and now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be as simple as:
return ((lower ^ rightValues0) | (upper ^ rightValues1)) == Vector256.Zero;There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, sometimes it can be fixed with return cond ? true : false;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
Vector256<ushort> equals = Vector256.Equals(lower, rightValues0) & Vector256.Equals(upper, rightValues1);
if (equals.AsByte().ExtractMostSignificantBits() != 0xffffffffu)
    return false;
return true;There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
Not needed, ((lower ^ rightValues0) | (upper ^ rightValues1)) == Vector256.Zero; is just a canonical way to do that. Also, ExtractMostSignificantBits is very expensive on arm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@EgorBo Agreed, VPMOVMSKB has no equivalent on ARM64 (see #87141 (comment)), but this method is guarded by [CompExactlyDependsOn(typeof(Avx))].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@EgorBo Agreed,
VPMOVMSKBhas no equivalent on ARM64 (see #87141 (comment)), but this method is guarded by[CompExactlyDependsOn(typeof(Avx))].
still, == Vector.Zero is expected to be lowered to MoveMask with SSE2 or to vptest with SSE41/AVX so no reason to do it by hands
| Tagging subscribers to this area: @dotnet/area-system-text-encoding Issue DetailsWhile trying to replace some custom unsafe code in Kestrel in dotnet/aspnetcore#48368, we noticed that  
 
 Code-gen for the different if conditions
 
 Much faster in the byte+char comparison case, the other two are likely improved due to removing the OR. But opening the PR now as draft to get feedback on the overall approach before expanding it. We can likely get similar gains in the  Side-note on weird codegen observedWhen doing  if condition: no if, return directly 
 | 
| See also for ARM64: Bit twiddling with Arm Neon: beating SSE movemasks, counting bits and more. | 
| rightValues = Vector256.LoadUnsafe(ref currentRightSearchSpace); | ||
|  | ||
| if (leftValues != rightValues || !AllCharsInVectorAreAscii(leftValues | rightValues)) | ||
| if (!TLoader.Compare256(ref currentLeftSearchSpace, ref currentRightSearchSpace)) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TLoader does now the comparison, so the name should be adjusted to reflect that.
| (Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(Vector128.LoadUnsafe(ref ptr)); | ||
| return Vector256.Create(lower, upper); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| (Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(Vector128.LoadUnsafe(ref ptr)); | |
| return Vector256.Create(lower, upper); | |
| return Vector256.WidenLower(Vector128.LoadUnsafe(ref ptr).ToVector256Unsafe()); | 
This results in better codegen when Avx2 is available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This suggests a set of missing System.Runtime.Intrinsics.Vector256 APIs:
public static System.Runtime.Intrinsics.Vector256<ushort> Widen (System.Runtime.Intrinsics.Vector128<byte> source); public static System.Runtime.Intrinsics.Vector256<ushort> LoadWideningUnsafe (ref byte source);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This results in better codegen when
Avx2is available.
Codegen on arm64 is pretty bad though, probably should wrap with Avx2.IsSupported.
| @BrennanConroy, when this lands, will that be enough to enable ASP.NET to switch away from its custom implementation? If so, @adamsitnik, can you help ensure this lands as soon as possible? | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall the changes LGTM, big thanks for your contribution @BrennanConroy.
I've left some comments, but they are all subjective and related only to naming. PTAL at them and either reject or apply them and mark the PR as ready for review. Then I am simply going to merge it.
        
          
                src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Equality.cs
              
                Outdated
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BrennanConroy is OOF so I applied my suggestions and added a test for BoundedMemory
Overall for cases where the inputs are equal, the perf has improved and is even faster than the ASP.NET implementation that we want to remove in dotnet/aspnetcore#48368
For cases where the inputs are not equal at first character, the perf has regressed, but it's on par with the ASP.NET implementation. It's acceptable if we want to unblock dotnet/aspnetcore#48368
Source code, results:
BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1848)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=8.0.100-preview.4.23259.14
  [Host]     : .NET 8.0.0 (8.0.23.25905), X64 RyuJIT AVX2
  Job-SJHYUI : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-GSESEP : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
LaunchCount=3 MemoryRandomization=True  | Method | Job | Size | Equal | Mean | Ratio | 
|---|---|---|---|---|---|
| SystemAscii | PR | 6 | False | 1.653 ns | 1.00 | 
| AspNet | PR | 6 | False | 3.335 ns | 2.03 | 
| SystemAscii | main | 6 | False | 1.646 ns | 1.00 | 
| SystemAscii | PR | 6 | True | 4.446 ns | 1.06 | 
| AspNet | PR | 6 | True | 4.180 ns | 0.99 | 
| SystemAscii | main | 6 | True | 4.202 ns | 1.00 | 
| SystemAscii | PR | 32 | False | 2.837 ns | 1.36 | 
| AspNet | PR | 32 | False | 2.814 ns | 1.35 | 
| SystemAscii | main | 32 | False | 2.080 ns | 1.00 | 
| SystemAscii | PR | 32 | True | 2.888 ns | 0.86 | 
| AspNet | PR | 32 | True | 3.023 ns | 0.90 | 
| SystemAscii | main | 32 | True | 3.356 ns | 1.00 | 
| SystemAscii | PR | 64 | False | 2.821 ns | 1.36 | 
| AspNet | PR | 64 | False | 2.796 ns | 1.35 | 
| SystemAscii | main | 64 | False | 2.072 ns | 1.00 | 
| SystemAscii | PR | 64 | True | 3.849 ns | 0.73 | 
| AspNet | PR | 64 | True | 4.446 ns | 0.84 | 
| SystemAscii | main | 64 | True | 5.277 ns | 1.00 | 
the CI failure is unrelated (#73040)
While trying to replace some custom unsafe code in Kestrel in dotnet/aspnetcore#48368, we noticed that
Ascii.Equalsis slower than Kestrel's hand-rolled code. Upon investigation, 3 changes were found that could improve the performance.vpor ymm0,ymm0,ymm1stringto abyte[]the bytes were widened which results in 2 vectors that are half the size of the original vector. The way the code was written made it so we would fallback toVector128comparisons in theVector256case, andVector64in theVector128case. We can refactor the code to not fallback to smaller vector sizes resulting in half the loop iterations needed for the same input.if (lower != rightValues0 || upper != rightValues1)if (!Vector256<ushort>.AllBitsSet.Equals(Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1))))Code-gen for the different if conditions
if (lower != rightValues0 || upper != rightValues1)if (!Vector256<ushort>.AllBitsSet.Equals(Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1))))Much faster in the byte+char comparison case, the other two are likely improved due to removing the OR.
We can likely get similar gains in the
EqualsIgnoreCase, and of course can be expanded to theVector128paths. But opening the PR now as draft to get feedback on the overall approach before expanding it.Side-note on weird codegen observed
When doing
return Vector256<ushort>.AllBitsSet.Equals(Vector256.BitwiseAnd(Vector256.Equals(lower, rightValues0), Vector256.Equals(upper, rightValues1)))instead of the if condition it looks like extra instructions are generatedif condition:
no if, return directly