-
Notifications
You must be signed in to change notification settings - Fork 5.2k
JIT: Support containing compares in GT_SELECT for xarch (i.e. start emitting cmov instructions) #81267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Support containing compares in GT_SELECT for xarch (i.e. start emitting cmov instructions) #81267
Conversation
This adds support for contained compares in GT_SELECT nodes for xarch. As part of this, also enables if-conversion on xarch. Fix dotnet#6749
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak Issue DetailsThis adds support for contained compares in GT_SELECT nodes for xarch. As part of this also enables if-conversion on xarch. This partially fixes #6749. For: [MethodImpl(MethodImplOptions.NoInlining)]
static long cmov(long longValue)
{
long tmp1 = longValue & 0x00000000ffffffff;
return tmp1 == 0 ? longValue : tmp1;
} Before: ; Assembly listing for method Program:sete_or_mov(bool):long
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) bool -> rcx single-def
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
; V02 tmp1 [V02,T01] ( 3, 2 ) int -> rax
;
; Lcl frame size = 0
G_M14693_IG01: ;; offset=0000H
;; size=0 bbWeight=1 PerfScore 0.00
G_M14693_IG02: ;; offset=0000H
84C9 test cl, cl
7507 jne SHORT G_M14693_IG04
;; size=4 bbWeight=1 PerfScore 1.25
G_M14693_IG03: ;; offset=0004H
B804000000 mov eax, 4
EB02 jmp SHORT G_M14693_IG05
;; size=7 bbWeight=0.50 PerfScore 1.12
G_M14693_IG04: ;; offset=000BH
33C0 xor eax, eax
;; size=2 bbWeight=0.50 PerfScore 0.12
G_M14693_IG05: ;; offset=000DH
4898 cdqe
;; size=2 bbWeight=1 PerfScore 0.25
G_M14693_IG06: ;; offset=000FH
C3 ret
;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code 16, prolog size 0, PerfScore 5.35, instruction count 7, allocated bytes for code 16 (MethodHash=bc7ac69a) for method Program:sete_or_mov(bool):long
; =========================================================== After: ; Assembly listing for method Program:sete_or_mov(bool):long
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) bool -> rcx single-def
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
; V02 tmp1 [V02,T01] ( 2, 2 ) int -> rdx
;
; Lcl frame size = 0
G_M14693_IG01: ;; offset=0000H
;; size=0 bbWeight=1 PerfScore 0.00
G_M14693_IG02: ;; offset=0000H
B804000000 mov eax, 4
33D2 xor edx, edx
84C9 test cl, cl
0F44D0 cmove edx, eax
4863C2 movsxd rax, edx
;; size=15 bbWeight=1 PerfScore 1.25
G_M14693_IG03: ;; offset=000FH
C3 ret
;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code 16, prolog size 0, PerfScore 3.85, instruction count 6, allocated bytes for code 16 (MethodHash=bc7ac69a) for method Program:sete_or_mov(bool):long
; ============================================================ It does not get the
|
Disregard this -- if-conversion works fine, I just had a flipped ifdef in my code. The code produced for cmov is: [MethodImpl(MethodImplOptions.NoInlining)]
static long cmov(long longValue) {
long tmp1 = longValue & 0x00000000ffffffff;
return tmp1 == 0 ? longValue : tmp1;
} Before: G_M64344_IG01:
;; size=0 bbWeight=1 PerfScore 0.00
G_M64344_IG02:
mov eax, ecx
test rax, rax
je SHORT G_M64344_IG04
;; size=7 bbWeight=1 PerfScore 1.50
G_M64344_IG03:
ret
;; size=1 bbWeight=0.50 PerfScore 0.50
G_M64344_IG04:
mov rax, rcx
;; size=3 bbWeight=0.50 PerfScore 0.12
G_M64344_IG05:
ret
;; size=1 bbWeight=0.50 PerfScore 0.50
; Total bytes of code 12, prolog size 0, PerfScore 3.83, instruction count 6, allocated bytes for code 12 (MethodHash=16c004a7) for method Program:cmov(long):long After: G_M64344_IG01:
;; size=0 bbWeight=1 PerfScore 0.00
G_M64344_IG02:
mov eax, ecx
test rax, rax
cmove rax, rcx
;; size=9 bbWeight=1 PerfScore 0.75
G_M64344_IG03:
ret
;; size=1 bbWeight=1 PerfScore 1.00
; Total bytes of code 10, prolog size 0, PerfScore 2.75, instruction count 4, allocated bytes for code 10 (MethodHash=16c004a7) for method Program:cmov(long):long |
Definitely still some things to improve:
@@ -12,26 +12,23 @@
;
; Lcl frame size = 0
-G_M65508_IG01: ; bbWeight=1, gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
;
; Lcl frame size = 0
-G_M65508_IG01: ; bbWeight=1, gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
+G_M65508_IG01: ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
;; size=0 bbWeight=1 PerfScore 0.00
-G_M65508_IG02: ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
+G_M65508_IG02: ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
or edx, ecx
- jne SHORT G_M65508_IG05
- ;; size=4 bbWeight=1 PerfScore 1.25
-G_M65508_IG03: ; bbWeight=0.50, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
- mov eax, 1
- ;; size=5 bbWeight=0.50 PerfScore 0.12
-G_M65508_IG04: ; bbWeight=0.50, epilog, nogc, extend
+ setne al
+ movzx rax, al
+ xor edx, edx
+ mov ecx, 1
+ test eax, eax
+ mov eax, ecx
+ cmovne eax, edx
+ ;; size=22 bbWeight=1 PerfScore 2.75
+G_M65508_IG03: ; bbWeight=1, epilog, nogc, extend
ret
- ;; size=1 bbWeight=0.50 PerfScore 0.50
-G_M65508_IG05: ; bbWeight=0.50, gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref
- xor eax, eax
- ;; size=2 bbWeight=0.50 PerfScore 0.12
-G_M65508_IG06: ; bbWeight=0.50, epilog, nogc, extend
- ret |
I turned off if-conversion for the problematic scenarios (these cases turn out to be very rare, actually). |
/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn |
Azure Pipelines successfully started running 4 pipeline(s). |
} | ||
|
||
#ifdef FEATURE_HW_INTRINSICS | ||
if (OperIs(GT_HWINTRINSIC) && emitter::DoesWriteZeroFlag(HWIntrinsicInfo::lookupIns(AsHWIntrinsic()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth noting that this isn't necessarily "correct".
lookupIns
catches the most common cases, but certainly not all cases particularly when various intrinsics represent complex helpers that are lowered or when the table tracked instruction is INS_invalid
due to it requiring additional lookups/handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I added another check to the heuristic in if-conversion for the rest of the hw intrinsic cases that this PR doesn't handle yet, so that we don't regress those.
bool GenTree::CanSetZeroFlag() | ||
{ | ||
#if defined(TARGET_XARCH) | ||
if (OperIs(GT_AND, GT_OR, GT_XOR, GT_ADD, GT_SUB, GT_NEG)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is missing many different cases, particularly those that leave ZF
in an undefined state such as imul
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just factoring the code that was already there. I can add a TODO if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(It was factored out of Lowering::OptimizeConstCompare
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. I wonder if its going to cause any problematic behavior if we're depending on this to be accurate.
I could easily see someone checking !CanSetZeroFlag()
and then assuming the zero flag can't be overrwritten. Inversely I could see someone checking CanSetZeroFlag()
and assuming that it will write some 0/1 value and that it is strictly tied to the computed result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I can rename this. The specific meaning I was intending was that this is a node that supports setting the zero flag if GTF_SET_FLAGS
is set. Maybe SupportsSettingZeroFlag()
is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure Supports
fixes the main concern around people thinking !Supports
means "does not support" (and therefore will not modify zf).
I don't have a good suggestion for a different name, however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might just be something we'll have to cover with docs and code review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed it to SupportsSettingZeroFlag
and added some more docs on it
} | ||
#endif | ||
#elif defined(TARGET_ARM64) | ||
if (OperIs(GT_AND, GT_ADD, GT_SUB)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this going to be problematic on Arm64 since we have two different instructions for each of these?
That is one that sets the flags (adds
) and one that does not (add
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think codegen just handles this via gtSetFlags()
. Anyway there shouldn't be any behavior differences for arm64 here, this is just factoring this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, the LSRA changes seem about right based on my limited experience with it.
The diffs do look good, though there are a lot of regressions; wondering if they are actual regressions or not.
This does not improve many of the IfStatements micro benchmarks, here is the full table of results:
; Assembly listing for method IfStatements.IfStatements:AndAndInner(int,int,int)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 5, 4.50) int -> rcx
; V01 arg1 [V01,T01] ( 4, 4 ) int -> rdx single-def
; V02 arg2 [V02,T02] ( 4, 4 ) int -> r8 single-def
; V03 OutArgs [V03 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
;
; Lcl frame size = 40
G_M12418_IG01:
sub rsp, 40
;; size=4 bbWeight=1 PerfScore 0.25
G_M12418_IG02:
mov r9d, ecx
and r9d, 1
mov eax, edx
and eax, 1
or r9d, eax
mov eax, r8d
and eax, 1
or r9d, eax
jne SHORT G_M12418_IG04
;; size=26 bbWeight=1 PerfScore 3.00
G_M12418_IG03:
mov ecx, 5
;; size=5 bbWeight=0.50 PerfScore 0.12
G_M12418_IG04:
xor r9d, r9d
call [IfStatements.IfStatements:Consume(int,int,int,int)]
nop
;; size=10 bbWeight=1 PerfScore 3.50
G_M12418_IG05:
add rsp, 40
ret
;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code 50, prolog size 4, PerfScore 13.13, instruction count 16, allocated bytes for code 50 (MethodHash=2b41cf7d) for method IfStatements.IfStatements:AndAndInner(int,int,int)
; ============================================================ Hopefully we can get these cases in the future. @@ -1,55 +1,55 @@
; Assembly listing for method IfStatements.IfStatements:AndArrayInner(int,int)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
-; V00 arg0 [V00,T00] ( 6, 5.50) int -> rcx
+; V00 arg0 [V00,T00] ( 7, 6 ) int -> rcx
; V01 arg1 [V01,T01] ( 5, 4 ) int -> rdx single-def
; V02 OutArgs [V02 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V03 tmp1 [V03,T02] ( 3, 6 ) ref -> r9 single-def "arr expr"
; V04 tmp2 [V04,T05] ( 2, 2 ) ref -> r8 single-def "arr expr"
; V05 cse0 [V05,T03] ( 3, 2.50) ref -> r8 "CSE - aggressive"
; V06 cse1 [V06,T04] ( 3, 2.50) int -> rax "CSE - aggressive"
;
; Lcl frame size = 40
G_M49807_IG01:
sub rsp, 40
;; size=4 bbWeight=1 PerfScore 0.25
G_M49807_IG02:
mov r8, 0xD1FFAB1E ; data for IfStatements.IfStatements:inputs
mov r8, gword ptr [r8]
mov r9, r8
mov eax, dword ptr [r9+08H]
cmp ecx, eax
jae SHORT G_M49807_IG06
mov r10d, ecx
test byte ptr [r9+4*r10+10H], 1
jne SHORT G_M49807_IG04
;; size=35 bbWeight=1 PerfScore 10.00
G_M49807_IG03:
cmp edx, eax
jae SHORT G_M49807_IG06
mov r9d, edx
+ mov eax, 5
test byte ptr [r8+4*r9+10H], 1
- jne SHORT G_M49807_IG04
- mov ecx, 5
- ;; size=20 bbWeight=0.50 PerfScore 2.88
+ cmove ecx, eax
+ ;; size=21 bbWeight=0.50 PerfScore 2.50
G_M49807_IG04:
xor r8d, r8d
xor r9d, r9d
call [IfStatements.IfStatements:Consume(int,int,int,int)]
nop
;; size=13 bbWeight=1 PerfScore 3.75
G_M49807_IG05:
add rsp, 40
ret
;; size=5 bbWeight=1 PerfScore 1.25
G_M49807_IG06:
call CORINFO_HELP_RNGCHKFAIL
int3
;; size=6 bbWeight=0 PerfScore 0.00 This one is pretty strange to me considering that the input values should be random. |
Interestingly, if I change the implementation of [Benchmark]
public void AndArray() {
for (int i = 0; i < Iterations; i++) {
AndArrayInner(i, i + 1);
}
} to [Benchmark]
public void AndArray() {
int other = Iterations - 1;
for (int i = 0; i < Iterations; i++) {
AndArrayInner(i, other);
other--;
}
} then I do start to see a benefit from cmov. Note that the code is: [MethodImpl(MethodImplOptions.NoInlining)]
public static void AndArrayInner(int op1, int op2) {
if (inputs[op1] % 2 == 0 && inputs[op2] % 2 == 0) {
op1 = 5;
}
Consume(op1, op2, 0, 0);
} In the former case, the |
/azp run runtime-coreclr jitstressregs, runtime-coreclr libraries-jitstressregs |
Azure Pipelines successfully started running 2 pipeline(s). |
cc @dotnet/jit-contrib PTAL @kunalspathak @BruceForstall (if you are available) Diffs. ARM64 diffs are due to the removal of some conditional reversing in if-conversion. Removing this benefits LSRA significantly on xarch and has an insignificant impact on arm64. Some more details in #81267 (comment) |
// guarantee that the dstReg is used only in one of falseVal/trueVal, then | ||
// we are good. | ||
// | ||
// To ensure the above we have some bespoke interference logic here on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish there was an easy way to handle this particular scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Me too, but luckily this check didn't turn out to be too ugly.
I do wonder if we could use a similar trick in the interval building for RMW opers to avoid delay free in most cases, perhaps something to look at in a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, we could but it would defeat the "linear" nature of the linear scan provided that it doesn't hurt much in common scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to understand a bit more, BuildDelayFreeUses()
doesn't achieve what you want? It doesn't delay free unconditionally, but does a check about matching interval, etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would defeat the "linear" nature of the linear scan
This does not affect the complexity since the number of ref positions created for op1/op2 is bounded by a constant (at most 2, for an indir with a contained LEA).
Just to understand a bit more, BuildDelayFreeUses() doesn't achieve what you want? It doesn't delay free unconditionally, but does a check about matching interval, etc.?
I did start out with BuildDelayFreeUses
, but the extra flexibility we have in codegen (by swapping the trueVal/falseVal and reversing the condition) means that we don't need to mark something delay-free in practically all situations. BuildDelayFreeUses
handles only situations where the other operand is a local, but for this particular node it could also be another contained indirection.
I looked at the RMW logic and I don't think it benefits from doing this since for the RMW opers we don't have the capability to contain two indirections anyway.
public static void Eq_byte_consume(byte a1, byte a2) { | ||
//ARM64-FULL-LINE: cmp {{w[0-9]+}}, {{w[0-9]+}} | ||
//ARM64-FULL-LINE-NEXT: csel {{w[0-9]+}}, {{w[0-9]+}}, {{w[0-9]+}}, eq | ||
//ARM64-FULL-LINE-NEXT: csel {{w[0-9]+}}, {{w[0-9]+}}, {{w[0-9]+}}, {{eq|ne}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does the x64 validation works for these examples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It didn't do any disasm checks on them for x64. Added that now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Co-authored-by: Bruce Forstall <brucefo@microsoft.com>
Lots of improvements on Win-x64: Also |
This adds support for contained compares in GT_SELECT nodes for xarch. As part of this also enables if-conversion on xarch.
This fixes #6749. For:
Before:
After:
(Not completely optimal, no sign extension is necessary, but that's a separate issue.)
It does not get thecmov
case from the issue, but that's because if-conversion does not handle it today (cc @a74nh -- seems like a nice opportunity).TODO: