KEMBAR78
JIT: Unify arm64 and x64 GT_SELECT handling by jakobbotsch · Pull Request #82610 · dotnet/runtime · GitHub
Skip to content

Conversation

@jakobbotsch
Copy link
Member

@jakobbotsch jakobbotsch commented Feb 24, 2023

This unifies GT_SELECT/GT_SELECTCC handling between arm64 and x64. The arm64 backend no longer uses containment for compare chains; instead, there is a new GT_CCMP node that both produces and consumes flags, and lowering can lower GT_AND(op, relop) down to this node.
To do this, the JIT verifies invariance and reorders the cmp/ccmps to happen successively in order. For example, given C# code like:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Foo(int a, int b, int c)
{
    if (a < b & b < c & c < 10)
        return 42;

    return 13;
}

we see LIR:

N001 (  1,  1) [000000] -----------                    t0 =    LCL_VAR   int    V00 arg0         u:1 (last use) $80
N002 (  1,  1) [000001] -----------                    t1 =    LCL_VAR   int    V01 arg1         u:1 $81
                                                            ┌──▌  t0     int    
                                                            ├──▌  t1     int    
N003 (  6,  3) [000002] -----------                    t2 =   LT        int    $100
N004 (  1,  1) [000003] -----------                    t3 =    LCL_VAR   int    V01 arg1         u:1 (last use) $81
N005 (  1,  1) [000004] -----------                    t4 =    LCL_VAR   int    V02 arg2         u:1 $82
                                                            ┌──▌  t3     int    
                                                            ├──▌  t4     int    
N006 (  6,  3) [000005] -----------                    t5 =   LT        int    $101
                                                            ┌──▌  t2     int    
                                                            ├──▌  t5     int    
N007 ( 13,  7) [000006] -----------                    t6 =   AND       int    $102
N008 (  1,  1) [000007] -----------                    t7 =    LCL_VAR   int    V02 arg2         u:1 (last use) $82
N009 (  1,  2) [000008] -----------                    t8 =    CNS_INT   int    10 $44
                                                            ┌──▌  t7     int    
                                                            ├──▌  t8     int    
N010 (  6,  4) [000009] -----------                    t9 =   LT        int    $103
                                                            ┌──▌  t6     int    
                                                            ├──▌  t9     int    
N011 ( 20, 12) [000010] -----------                   t10 =   AND       int    $104
N012 (  1,  2) [000011] -----------                   t11 =    CNS_INT   int    0 $40
                                                            ┌──▌  t10    int    
                                                            ├──▌  t11    int    
N013 ( 22, 15) [000012] J------N---                   t12 =   EQ        int    $105
N014 (  1,  2) [000014] -----------                   t14 =    CNS_INT   int    13 $45
N015 (  1,  2) [000016] -----------                   t16 =    CNS_INT   int    42 $46
                                                            ┌──▌  t12    int    
                                                            ├──▌  t14    int    
                                                            ├──▌  t16    int    
N016 ( 25, 20) [000018] -----------                   t18 =   SELECT    int   
                                                            ┌──▌  t18    int    
N017 ( 26, 21) [000017] -----------                           RETURN    int    $VN.Void

When lowering the first AND ([000006]), we will notice that its operands are compares, and lower this as:

 N001 (  1,  1) [000000] -----------                    t0 =    LCL_VAR   int    V00 arg0         u:1 (last use) $80
 N002 (  1,  1) [000001] -----------                    t1 =    LCL_VAR   int    V01 arg1         u:1 $81
-                                                            ┌──▌  t0     int    
-                                                            ├──▌  t1     int    
-N003 (  6,  3) [000002] -----------                    t2 = ▌  LT        int    $100
 N004 (  1,  1) [000003] -----------                    t3 =    LCL_VAR   int    V01 arg1         u:1 (last use) $81
 N005 (  1,  1) [000004] -----------                    t4 =    LCL_VAR   int    V02 arg2         u:1 $82
+                                                            ┌──▌  t0     int    
+                                                            ├──▌  t1     int    
+N003 (  6,  3) [000002] -----------                         ▌  CMP       void  
                                                             ┌──▌  t3     int    
                                                             ├──▌  t4     int    
-N006 (  6,  3) [000005] -----------                    t5 = ▌  LT        int    $101
-                                                            ┌──▌  t2     int    
-                                                            ├──▌  t5     int    
-N007 ( 13,  7) [000006] -----------                    t6 = ▌  AND       int    $102
+N006 (  6,  3) [000005] -----------                         ▌  CCMP      void   cond=SLT flags=z
+N007 ( 13,  7) [000006] -----------                    t6 =    SETCC     int    cond=SLT
 N008 (  1,  1) [000007] -----------                    t7 =    LCL_VAR   int    V02 arg2         u:1 (last use) $82
 N009 (  1,  2) [000008] -----------                    t8 =    CNS_INT   int    10 $44
                                                             ┌──▌  t7     int    

I.e. we moved the compare [000002] forward and turned it into a CMP node, and [000005] was then turned into a CCMP node.

When we see the second AND node ([000010]) we do the same thing again: verify invariance and move the entire compare chain forward:

 N001 (  1,  1) [000000] -----------                    t0 =    LCL_VAR   int
 N002 (  1,  1) [000001] -----------                    t1 =    LCL_VAR   int    V01 arg1         u:1 $81
 N004 (  1,  1) [000003] -----------                    t3 =    LCL_VAR   int    V01 arg1         u:1 (last use) $81
 N005 (  1,  1) [000004] -----------                    t4 =    LCL_VAR   int    V02 arg2         u:1 $82
+N008 (  1,  1) [000007] -----------                    t7 =    LCL_VAR   int    V02 arg2         u:1 (last use) $82
+N009 (  1,  2) [000008] -c---------                    t8 =    CNS_INT   int    10 $44
                                                             ┌──▌  t0     int    
                                                             ├──▌  t1     int    
 N003 (  6,  3) [000002] -----------                         ▌  CMP       void  
                                                             ┌──▌  t3     int    
                                                             ├──▌  t4     int    
 N006 (  6,  3) [000005] -----------                         ▌  CCMP      void   cond=SLT flags=z
-N007 ( 13,  7) [000006] -----------                    t6 =    SETCC     int    cond=SLT
-N008 (  1,  1) [000007] -----------                    t7 =    LCL_VAR   int    V02 arg2         u:1 (last use) $82
-N009 (  1,  2) [000008] -----------                    t8 =    CNS_INT   int    10 $44
                                                             ┌──▌  t7     int    
                                                             ├──▌  t8     int    
-N010 (  6,  4) [000009] -----------                    t9 = ▌  LT        int    $103
-                                                            ┌──▌  t6     int    
-                                                            ├──▌  t9     int    
-N011 ( 20, 12) [000010] -----------                   t10 = ▌  AND       int    $104
+N010 (  6,  4) [000009] -----------                         ▌  CCMP      void   cond=SLT flags=z
+N011 ( 20, 12) [000010] -----------                   t10 =    SETCC     int    cond=SLT
 N012 (  1,  2) [000011] -----------                   t11 =    CNS_INT   int    0 $40
                                                             ┌──▌  t10    int    
                                                             ├──▌  t11    int    

When we finally get to the SELECT we do the same thing -- move all the conditional compares in front of the SELECT and then change it to SELECTCC.

cc @a74nh

This unifies GT_SELECT/GT_SELECTCC handling between arm64 and x64. The
arm64 backend no longer uses containment for compare chains; instead,
there is a new GT_CCMP node that both produces and consumes flags, and
lowering can lower GT_AND(op, relop) down to this node.
@ghost ghost assigned jakobbotsch Feb 24, 2023
{
child->AsOp()->SetContained();
GenCondition cond2 = GenCondition::FromRelop(op2);
op2->SetOper(GT_CCMP);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I know this is done elsewhere), but why is this safe? op2 is moving from a GenTreeOp to a GenTreeCCMP. The latter class is bigger than the former. What's stopping memory overflow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JIT has only two different node sizes: small nodes (of size TREE_NODE_SZ_SMALL) and large nodes (of size TREE_NODE_SZ_LARGE). Whenever a node is allocated, its size is rounded up to one of these. GenTree::InitNodeSize has various static asserts for the expected node sizes.
All the nodes here are small, so bashing between them is "safe" as far as size issues goes, though this ignores that this is not conformant C++.

Historically (before I worked on the JIT) the IR node hierarchy was not represented via inheritance, but was represented via unions instead, which made this pattern of bashing between node types less controversial. But as you've noted we still do it in various places due to the nice perf characteristics of changing nodes in place.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GenTree::SetOper also has asserts that the bashing done is safe as far as the size issue goes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, hoped there was something like that.
Looking at the static assert checks for TREE_NODE_SZ_SMALL / TREE_NODE_SZ_LARGE, there's nothing there for GenTreeConditional

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added one for GenTreeConditional with the latest commits

@jakobbotsch
Copy link
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

@azure-pipelines
Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@a74nh
Copy link
Contributor

a74nh commented Feb 27, 2023

Code looks fine to me. Are there any code diffs for this PR?

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Feb 27, 2023

Code looks fine to me. Are there any code diffs for this PR?

Yes, a few improvements: https://dev.azure.com/dnceng-public/public/_build/results?buildId=184319&view=ms.vss-build-web.run-extensions-tab

The ones I looked at are new cases where we use ccmp for standalone AND nodes. They weren't contained before because they do not pass the "strict" check for the compare for op1. Here's an example LIR diff:

@@ -22,9 +22,6 @@ N012 ( 30, 12) [000003] DA-XG------                         ▌  STORE_LCL_VAR r
                [000037] -----------                            IL_OFFSET void   INLRT @ 0x007[E-]
 N001 (  1,  1) [000004] -----------                    t4 =    LCL_VAR   ref    V01 loc0         u:2 $82
 N002 (  1,  2) [000005] -c---------                    t5 =    CNS_INT   ref    null $VN.Null
-                                                            ┌──▌  t4     ref    
-                                                            ├──▌  t5     ref    
-N003 (  6,  4) [000006] -----------                    t6 = ▌  NE        int    $240
 N004 (  1,  1) [000007] -----------                    t7 =    LCL_VAR   ref    V01 loc0         u:2 (last use) $82
                                                             ┌──▌  t7     ref    
 N006 (  3,  4) [000035] -c---------                   t35 = ▌  LEA(b+96) byref 
@@ -35,11 +32,12 @@ N009 (  6,  6) [000033] -c---------                   t33 = ▌  LEA(b+104) byre
                                                             ┌──▌  t33    byref  
 N010 (  7,  5) [000019] ---XG------                   t19 = ▌  IND       int    <l:$244, c:$245>
 N011 (  1,  2) [000012] -c---------                   t12 =    CNS_INT   int    4 $44
+                                                            ┌──▌  t4     ref    
+                                                            ├──▌  t5     ref    
+N003 (  6,  4) [000006] -----------                         ▌  CMP       void  
                                                             ┌──▌  t19    int    
                                                             ├──▌  t12    int    
-N012 ( 12,  8) [000013] ---XG------                   t13 = ▌  EQ        int    <l:$249, c:$248>
-                                                            ┌──▌  t6     int    
-                                                            ├──▌  t13    int    
-N013 ( 19, 13) [000014] ---XG------                   t14 = ▌  AND       int    <l:$24d, c:$24c>
+N012 ( 12,  8) [000013] ---XG------                         ▌  CCMP      void   cond=UNE flags=0
+N013 ( 19, 13) [000014] ---XG------                   t14 =    SETCC     int    cond=UEQ
                                                             ┌──▌  t14    int    
 N014 ( 20, 14) [000015] ---XG------                         ▌  RETURN    int    <l:$14a, c:$149>

Notice that [000006] is comparing two TYP_REF values, which were filtered out by the containment logic before, but only the operands that are actually compared by the ccmp instruction needs to have the more stringent checks.

I also just realized that the varTypeIsIntegral check should actually be varTypeIsIntegralOrI since we should allowing comparing GC pointers with ccmp, but let me leave this to a follow-up (opened #82703).


FWIW, with this change you will not see TEST_EQ/TEST_NE for the compare chains, because we will end up with EQ/NE(SETCC(cond), 0) at that point. Related to that is that we are currently not great at looking through the EQ/NE(x, 0) nodes. For example:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Foo(int a, int b, int c)
{
    if (a < b & b < c & c < 10)
        return 42;

    return 13;
}

Codegen:

            cmp     w0, w1
            ccmp    w1, w2, z, lt
            ccmp    w2, #10, z, lt
            cset    x0, lt
            mov     w1, #13
            mov     w2, #42
            cmp     w0, #0
            csel    w0, w1, w2, eq

Expected codegen is something like:

cmp w0, w1
ccmp w1, w2, z, lt
ccmp w2, #10, z, lt
csel w0, w1, w2, lt

I think the best fix would be to teach LowerCompare to lower EQ/NE(cond, 0) into (reversed) cond, where cond is SETCC/relop. We will also need to teach the backend to handle JTRUE without relop children if we do that (at least it needs to learn about SETCC children, but we can probably do this more generally and make JTRUE/JCC work the same way as SELECT/SELECTCC).
I think with this additional optimization #79283 will not need any additional backend work.

@a74nh
Copy link
Contributor

a74nh commented Feb 27, 2023

(at least it needs to learn about SETCC children, but we can probably do this more generally and make JTRUE/JCC work the same way as SELECT/SELECTCC).

Agreed this makes sense to do. Is that something your planning to do? because if so I can hold off again from rebasing #79283.

@jakobbotsch
Copy link
Member Author

Agreed this makes sense to do. Is that something your planning to do? because if so I can hold off again from rebasing #79283.

Yep, I can do that in a follow-up.

This PR should be ready, cc @dotnet/jit-contrib PTAL @kunalspathak @BruceForstall

Copy link
Contributor

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice cleanup. Added some questions/comments.

// Returns:
// A flags immediate that, if those flags were set, would cause the specified condition to be true.
//
insCflags Lowering::TruthifyingFlags(GenCondition condition)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we repurpose InsCflagsForCcmp ? Move it to a GenTree, make it static and use it from codegen as well as lower?

Copy link
Member Author

@jakobbotsch jakobbotsch Mar 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need that function in codegen anymore -- this PR is removing it from codegenarm64.cpp (it is effectively the same function, except without reversing the condition). In a follow-up I will add the ability to generate ccmp for OR(relop, relop) too which will use TruthifyingFlags again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, for some reason I didn't noticed that you have removed InsCflagsForCcmp().

}
else if (child->OperIsCmpCompare() && varTypeIsIntegral(child->gtGetOp1()) && varTypeIsIntegral(child->gtGetOp2()))

if (!op2->OperIsCmpCompare() || !varTypeIsIntegral(op2->gtGetOp1()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering why this is also not !op2->operIsCompare(), just like you have !op1->OperIsCompare() above?

Copy link
Member Author

@jakobbotsch jakobbotsch Mar 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The operands of op2 will be compared with the ccmp instruction, and we can only do that for the "basic" relops: GT_EQ, GT_NE, GT_LE, GT_LT, GT_GT, GT_GE. The remaining compare operators, GT_TEST_EQ and GT_TEST_NE (whose semantics are (a & b) == 0 and (a & b) != 0) cannot be implemented as a conditional compare -- it would need some kind of ctst instruction.

On the other hand, the first operand can be anything that has already set the condition flags in any way, including e.g. floating point compares or tst instruction (or, a previous conditional compare, which is how we end up with the chains).

Let me add a more detailed comment here about this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a more detailed comment. I also experimented with making this logic symmetric (allow ccmp for either op1 or op2), but diffs showed that it didn't matter. Might still do it as part of the OR follow up, though.

@ghost ghost added needs-author-action An issue or pull request that requires more info or actions from the author. and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Mar 1, 2023
Copy link
Contributor

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jakobbotsch jakobbotsch merged commit 9cfd11d into dotnet:main Mar 1, 2023
@jakobbotsch jakobbotsch deleted the selectcc-arm64 branch March 1, 2023 21:04
@ghost ghost locked as resolved and limited conversation to collaborators Apr 1, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants