KEMBAR78
Implement LoadPairVector64 and LoadPairVector128 by echesakov · Pull Request #64864 · dotnet/runtime · GitHub
Skip to content

Conversation

@echesakov
Copy link
Contributor

@echesakov echesakov commented Feb 6, 2022

Resolves #39243

@echesakov echesakov added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Feb 6, 2022
@echesakov echesakov self-assigned this Feb 6, 2022
@ghost
Copy link

ghost commented Feb 6, 2022

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@ghost
Copy link

ghost commented Feb 6, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: echesakovMSFT
Assignees: echesakovMSFT
Labels:

arch-arm64, area-CodeGen-coreclr

Milestone: -

…elpers.cs src/tests/JIT/HardwareIntrinsics/Arm/Shared/Helpers.tt
…mStructVal() to allow intrinsics returning a struct in importer.cpp
…alues in multiple registers in lsra.h lsraarm64.cpp lsraxarch.cpp
@echesakov
Copy link
Contributor Author

@dotnet/jit-contrib @tannergooding PTAL

unreached();
}
#elif defined(TARGET_XARCH)
return 2;
Copy link
Member

@tannergooding tannergooding Feb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would be good to explicitly cover the intrinsic IDs for xarch as well (definitely could be a separate PR).

(that is switch (intrinsicId) with a default: unreached())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, given that none are supported currently on X86 - I can replace this with unreached() and update with the switch when working on MultiplyNoFlags2 (or whatever name we decided)

case NI_AdvSimd_Arm64_LoadPairVector64:
case NI_AdvSimd_Arm64_LoadPairVector64NonTemporal:
case NI_AdvSimd_Arm64_LoadPairVector128:
case NI_AdvSimd_Arm64_LoadPairVector128NonTemporal:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll probably see later in the review, but under what conditions are these containable?

I didn't think Arm64 really had ins reg, [mem] operations outside atomic instructions; particularly for non-temporal operations...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I don't see where the containment handling is happening for this. I don't see any changes to lowering

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I marked them by mistake when experimenting with some ideas I had. Let me update this.

if (OperIsHWIntrinsic())
{
return (TypeGet() == TYP_STRUCT);
return TypeIs(TYP_STRUCT);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how well this will hold in the future? That is, are we expecting things that return TYP_STRUCT to always be multi-reg?

There are cases like say System.Half where we'll eventually need to add support and we'll need to treat it as TYP_HALF or recognize it some other way if we don't want it to be an issue.

I wonder if we should track this as a flag rather than by type just to be safe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, looks like we have a flag. Maybe we should assert it or use it here instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can use a flag here.

Comment on lines 8091 to 8115
if (OperIsHWIntrinsic())
{
assert(TypeGet() == TYP_STRUCT);
#ifdef TARGET_ARM64
const GenTreeHWIntrinsic* intrinsic = AsHWIntrinsic();
const NamedIntrinsic intrinsicId = intrinsic->GetHWIntrinsicId();

switch (intrinsicId)
{
// TODO-ARM64-NYI: Support hardware intrinsics operating on multiple contiguous registers.
case NI_AdvSimd_Arm64_LoadPairScalarVector64:
case NI_AdvSimd_Arm64_LoadPairScalarVector64NonTemporal:
case NI_AdvSimd_Arm64_LoadPairVector64:
case NI_AdvSimd_Arm64_LoadPairVector64NonTemporal:
case NI_AdvSimd_Arm64_LoadPairVector128:
case NI_AdvSimd_Arm64_LoadPairVector128NonTemporal:
return 2;

default:
unreached();
}
#elif defined(TARGET_XARCH)
return 2;
}
#endif
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to be the same logic as in gentree.cpp above. Should it be factored out into a shared helper or are there conditions where they won't/shouldn't be in sync?

else
{
assert(AsHWIntrinsic()->GetSimdSize() == 8);
return TYP_SIMD8;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any cases of "arg size is 8" but "return size is 16" or vice-versa?

I know some instructions fit that bill, I'm not sure if any of the multi-reg cases will

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see an example on Arm64 when this wouldn't hold.
ld[1-4] should be similar to ldp.

As for tbl and tbx:

TBX <Vd>.<Ta>, { <Vn>.16B, <Vn+1>.16B, <Vn+2>.16B }, <Vm>.<Ta>

the return value is going to be single-reg but the first source operand is multi-reg and composed of Vector128<byte>.

@tannergooding
Copy link
Member

Changes generally LGTM. Left some comments/questions on a couple bits on how we expect certain checks to work long term.

Would be great to see a couple diffs/codegen examples particularly for the library changes.

@SamMonoRT
Copy link
Member

cc @imhameed @fanyang-mono

@echesakov
Copy link
Contributor Author

Would be great to see a couple diffs/codegen examples particularly for the library changes.

@tannergooding There are suboptimalities in the code using LoadPairVector in the libraries.

To fix that I would need to implement #64863 and #64857.

If I apply these changes on top of this PR the code diffs would look as expected:

GetIndexOfFirstCharToEncodeAdvSimd64

@@ -69,10 +73,8 @@ G_M52035_IG03:
                                                ;; bbWeight=0.50 PerfScore 0.25
 G_M52035_IG04:
             add     x6, x1, x4, LSL #1
-            ld1     {v20.8h}, [x6]
+            ldp     q20, q21, [x6]
             sqxtun  v20.8b, v20.8h
-            add     x6, x6, #16
-            ld1     {v21.8h}, [x6]
             sqxtun2 v20.16b, v21.8h
             and     v21.16b, v20.16b, v16.16b
             tbl     v21.16b, {v19.16b}, v21.16b
@@ -87,7 +89,7 @@ G_M52035_IG04:
             add     x4, x4, #16
             cmp     x4, x5
             blo     G_M52035_IG04
-                                               ;; bbWeight=4    PerfScore 100.00
+                                               ;; bbWeight=4    PerfScore 86.00

GetIndexOfFirstNonAsciiByte_Intrinsified

@@ -208,9 +211,8 @@ G_M18966_IG11:
@@ -208,9 +211,7 @@ G_M18966_IG11:
             sub     x22, x0, #32
                                                ;; bbWeight=0.50 PerfScore 1.75
 G_M18966_IG12:
-            ld1     {v16.16b}, [x19]
-            add     x0, x19, #16
-            ld1     {v10.16b}, [x0]
+            ldp     q16, q9, [x19]
             sshr    v16.16b, v16.16b, #7
             and     v16.16b, v16.16b, v8.16b
             addp    v16.16b, v16.16b, v16.16b

Here is my plan:

  1. Undo the changes to these methods
  2. Merge this PR as is
  3. Finish the above-mentioned two PRs
  4. Follow-up with the libraries changes

I will keep the changes to BitArray:CopyTo though

@@ -413,18 +414,16 @@ G_M40488_IG18:
             zip1    v19.16b, v18.16b, v18.16b
             and     v19.16b, v19.16b, v17.16b
             umin    v19.16b, v19.16b, v16.16b
-            st1     {v19.16b}, [x3]
             zip2    v18.16b, v18.16b, v18.16b
             and     v18.16b, v18.16b, v17.16b
             umin    v18.16b, v18.16b, v16.16b
-            add     x3, x3, #16
-            st1     {v18.16b}, [x3]
+            stp     q19, q18, [x3]
             add     w1, w1, #32
             add     w3, w1, #32
             ldr     w4, [x19,#16]
             cmp     w3, w4
             bls     G_M40488_IG18

@echesakov
Copy link
Contributor Author

Can someone on Mono team to sign off on the relevant changes, please?

For context: 15e56a0 was implemented by @imhameed during my initial attempt to get these changes in #52424

@echesakov
Copy link
Contributor Author

@dotnet/jit-contrib PTAL

Copy link
Contributor

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@echesakov echesakov merged commit 7814396 into dotnet:main Feb 10, 2022
@echesakov echesakov deleted the Arm64-ASIMD-LoadPairVector64-LoadPairVector128 branch February 10, 2022 19:51
@imhameed
Copy link
Contributor

Can someone on Mono team to sign off on the relevant changes, please?

For context: 15e56a0 was implemented by @imhameed during my initial attempt to get these changes in #52424

I'm 23 hours late to this but: the changes still look good to me (although I don't know if me signing off on my own implementation is kosher)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Arm64] LoadPairVector64 and LoadPairVector128

5 participants