-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Updating Compiler::impIntrinsic to always expand hardware intrinsics. #15639
Conversation
FYI. @mikedn and @4creators, since you have been active in reviewing the related PRs |
That sounds odd, any actual example? I don't see how a constant node could be transformed into a cast of a lclvar. |
I have been hitting this problem as well. Will try to create repro. |
This is a bug that should fixed. I do not think that the right fix for this bug is to expand the hardware intrinsics unconditionally, e.g. even under minopts. |
If there is a bug, I agree this is not a proper fix. However, I do think we should be expanding hardware intrinsics unconditionally, as is done with SIMD intrinsics (see https://github.com/dotnet/coreclr/blob/master/src/jit/importer.cpp#L7173). I am rebuilding to get the JitDump now and will share shortly. |
Agree with @jkotas. Unconditionally expanded intrinsic cannot work with reflection calls. |
It is definitely not a bug. Here is the JitDump_Main.txt If you follow the code, which starts at L867:
It eventually gets given a temporary register (R9) so the argument can be passed and a call to the actual method takes place. Then, when the method is jitted (see |
@fiigii, why won't they work? Testing locally, they appear to function the same as when That is, a call to the method happens and when that method itself is jitted, the intrinsic is expanded (since it is a recursive call). If the non-recursive calls (currently only In either case, when the intrinsics are not forced expanded, we need to determine how to handle arguments that are going to be required to be constants. |
Huh? This all looks bizarre. What does reflection has to do with this? Who in the right mind would call such intrinsics via reflection?
That would be hilarious. The only way to do this with would be to generate a giant switch having one case for each supported immediate value (luckily these immediate values are only 8 bit so there are at most 256 cases). |
In any case. There is not a bug, just a poor initial explanation on my part. The We should probably always expand these intrinsics and if supporting these via reflection is actually needed, we will have to come up with a way to handle arguments that are meant to be constant at that time (maybe we could attach some metadata to the GT_CALL for intrinsics to indicate the third arg was constant). |
The moment you make a call the argument is no longer a constant, there's no way around that. |
We seem to need the feature that converts literal arguments to JIT time constants (e.g., limited partial evaluation). That feature would make |
Why would that be required? I don't understand that - could pls explain in easy way |
Having language support for |
@4creators, The issue still exists for Reflection based and for intrinsics which are not always expanded. Since the JIT has no enforcement of Once the constant is passed to a method, it becomes a local and you can no longer guarantee that is/was a constant (especially in a place like Reflection where the actual invocation could be several calls down and after the value has been boxed and placed in an array). I think we may just have to say Reflection is not supported for some of these functions (and is generally frowned upon for them in any case) |
IMO leaving a reflection access to hardware intrinsics is unnecessary and best solution would to keep it closed except perhaps for inspection/reading metadata. |
It was discussed in https://github.com/dotnet/corefx/issues/16835#issuecomment-315628433 . This is not just about direct reflection.
It does not have to be a dumb as this. We can do better than that by having a implementation specific for each intrinsic.
This is implementation deficiency that we should avoid replicating to more places if possible. MinOps should be doing as little as possible. |
I don't see any convincing argument in that comment.
That sounds even more complicated than a switch. IMO there has to be a very good use case to do something like this.
AFAIK MinOpts is intended as an escape hatch in case the JIT is bugged. The proper implementation in that case would be for |
How do you propose we do this for intrinsics which take a user provided immediate? Having a switch statement to handle 256 values seems a bit overkill... I also don't see a good way to differentiate between the cases
I think we are doing the same work either way. In fact, we might end up doing more work by not forcing these intrinsics to expand since the JIT thinks we have to do additional register allocations, copying values out of XMM0 (return register), etc. I'm also not quite sure how the non-expanded form of an instruction that takes an immediate, like Sse.Shuffle, is supposed to look (assuming we could properly handle constants). The actual generated code, when not inlined, needs to be able to handle all 256 immediates, which means actually emitting a giant jump table.... |
Basically, I don't see how, when not expanding, we can avoid having
|
We had two options:
We have been working towards option 1 so far. I believe that it is easier and overall cheaper option. If we wanted to switch to option 2, we would need a plan on how to deal with the fallout and who is going to be involved in executing it.
The non-inlined implementation needs to be functionally correct for reasonable cost. It does not have to be top performance. I think that the non-inlined implementation of
If we keep using the non-inlined implementation for minopts, the test coverage for this should come from the existing minopts runs.
I do not think that IsSupported property can change its value during the lifetime of the process like that. It would be very hard to program and test against it, e.g. refacting a method into two would be dangerous change because of IsSupported can be true in some methods and false in other methods within same process. IsSupported property means that the methods within enclosing class are functional. It does not make strict guarantees about their performance. |
From the design review: https://github.com/dotnet/apireviews/blob/master/2017/08-15-Intel%20Intrinsics/README.md
My understanding of this is that the instructions are meant to always be emitted/inlined and that we should not be providing a software fallback or emulating these instructions at all. I had thought that placing the fallback burden on the user was an explicit decision of the hardware intrinsics, given their nature. |
I think that, given the nature of hardware intrinsics, their general implementation strategy in other compilers (albeit native compilers), and their target audience, option 2 is probably the better choice overall. Many of the impacted parts listed for option 2 also apply with optimizations enabled, so they will need to be updated to handle these areas anyways. I also believe the target audience of this feature will want to get somewhat accurate information, even in debug mode. If we don't expand the intrinsics always, performance may actually drop in Debug mode, as compared to a non-hwintrinsic based version of the some algorithm. Many of the algorithms that will use hw-intrinsics rely on execute these instructions in a tight loop, and if the intrinsics (with optimizations disabled) actually compile down to two |
It is not unusual to see perf with optimizations disabled to be multiple times slower, and I have never heard people complain about it. I would want to wait for some hard data that it is a problem before doing anything about it. |
Hrm, yes, that is a serious problem unfortunately.
I'm not sure why a profiler would do something like that but in any case, it's their problem. One way or another they need to ensure that they don't significantly impact the performance of the code, otherwise they're useless. They may even need to recognize these intrinsics if they want to provide reasonable numbers, attempting to treat them as normal calls may result in a mess.
That's an interesting case. An IL interpreter would treat these intrinsics as normal calls, OK. But then who's taking care of the recursive call inside the intrinsic method? It seems that the JIT will have to be invoked or that we need to provide
I think it's incorrect to equate intrinsics with optimizations. That's not how intrinsics work in C/C++ and I don't see why it would work differently in .NET. Except, of course, "ah, the debugger/JIT can't properly handle this" type of scenarios.
It's not like people have a choice. JIT's support for debugging with optimizations enabled is extremely poor. So you need to disable optimizations if you want to debug and then naturally the program runs slower. There's nothing to complain about, except perhaps about the JIT. |
I agree completely. Normal intrinsics are just functions handled specially by the compiler and that handling can vary. Hardware intrinsics are functions that are expected to emit a very particular hardware instruction (it is basically a form of inline assembly). I don't think they should be treated the same (and no other compiler, that I am aware of, does). |
As a C++ developer, I certainly would have been more comfortable with a compile time error. Anything less adds complexity. It has been asserted that HW Intrinsic usage will be extremely limited to advanced users who care about every cycle. If this is really true the generated code will be disassembled and looked at incessantly. I certainly had a lot of experience optimizing for TI C6x processors by writing C++ code and looking at the generated assembly. The experience was not perfect, but I quickly learned exactly what assembly a given C++ code sequence would produce. I cannot comment on C#. The places where these immediate must be
In general, I would prefer functionally correct code. I would prefer to never be required to run coverage analysis. Allowing non-const to work correctly without throwing seems best. |
if ((methodFlags & CORINFO_FLG_JIT_INTRINSIC) != 0) | ||
{ | ||
// The recursive calls to Jit intrinsics are must-expand by convention. | ||
mustExpand = mustExpand || gtIsRecursiveCall(method); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're now always expanding calls to HW intrinsics then isn't this comment and logic out of date?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be.
I wasn't sure if there are other JIT intrinsics, which can be recursive, but for which we do not want to always expand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We added this bit just for HW intrinsics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then should we rename the bit to indicate that it is exclusively for HWIntrinsics (CORINFO_FLG_HW_INTRINSIC) or that this bit will assume "mustExpand = true" (CORINFO_FLG_MUSTEXPAND_INTRINSIC).
Based on the current logic in the VM (https://github.com/dotnet/coreclr/blob/master/src/vm/methodtablebuilder.cpp#L5144) this bit is set for any method marked with [Intrinsic]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, 'bit' was a poor choice of words. We added this bit of logic in the importer just for HW intrinsics.
Not all [Intrinsic]
methods are must expand; some of them have perfectly viable IL implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think I get it 😄
Then my question is: Do we expect all intrinsics for which gtIsRecursiveCall()
would return true to always expand (except for indirect invocation) or do we only expect them to always expand for hardware intrinsics?
Even in the case of the former, I'm not sure how we detect the difference between a JitIntrinsic that is recursive and one that has an IL implementation on first pass..
Interesting question. Would it be correct to rephrase that as "Are there scenarios that would motivate the introduction of a call (e.g. for profiling or other tools), such that If so, I think there might be. |
And:
I would expect that intrinsics with an IL implementation would not invoke themselves recursively. |
Is there a distinction between recursion and infinite recursion? Factorial could be implemented recursively. But it would need a termination condition. Whether the distinction matters for HW Intrinsics is another question. |
My point on this one was that, the first time we go into When we return In the "first pass", I am not sure there is a good way to determine the difference between a method which has an IL implementation and a method which will be Additionally, as @sdmaclea pointed out, there may be some methods for which a partial IL implemention exists (one that does some logic and then calls itself -- we thought about doing this for the compiler fallback on immediates, but opted not to since the higher level compiler may incorrectly optimize this). |
Possibly. I think we already know there are some scenarios where we can't expand (indirect calling), but yes there may be other scenarios where some tooling may want to explicitly disable forced expansion for these. |
Also note this pattern. Not sure if it impacts anything. /// <summary>
/// __int32 _mm256_extract_epi32 (__m256i a, const int index)
/// </summary>
public static int ExtractInt32<T>(Vector256<T> value, byte index) where T : struct
{
ThrowHelper.ThrowNotSupportedExceptionIfNonNumericType<T>();
return ExtractInt32<T>(value, index);
} |
For that pattern in particular, We are already doing the |
If you are implementing an Either the HW intrinsic expansions should be conditioned on If there are odd special cases they can be handled individually; we eventually must case out for each particular intrinsic. |
For the former case, I did it the current way because it is more efficient. Otherwise we would have:
And then again (at the bottom of the file, with the rest of the named intrinsic handling)
|
Question. If we will have fallback for IMM intrinsics, why do we need to always expand HW intrinsics in debug and minopt? |
JIT importer returns |
It probably should return the throw similar to PlatformNotSupported case. |
@fiigii, as for the question on expanding under minopts. @CarolEidt stated many of the same concerns that others have raised: #15639 (comment) |
One of the proposed fallbacks for non-const args would still expand in the recursive case -- it would just expand to something more complex. Seems like there are enough combinatorics here that some sort of master guide would be useful:
|
I think the currently proposed logic is the following: Regular Intrinsics (based on existing logic)
SIMD Intrinsics (based on existing logic)
Hardware Intrinsics (based on decisions made in this thread, with notes about questions in this thread)
JIT Intrinsics (based on existing logic, with notes about questions in this thread)
|
You may get bogus results on AVX. It is broken in similar way to how it is broken under debugger.
It should do nothing special. The intrinsic should be compiled as a regular call.
Intrinsic not recognized by the JIT should do nothing special. It should be compiled as a regular call. |
Fixed this. As a note, this will result in a
We could probably fix the bogus results on AVX with the hardware intrinsic support (would require investigation, etc in the future)
There was some back and forth on this above. @CarolEidt, could you confirm that this is what we should do (this is for when directly calling a method, such as |
I believe that the current consensus for this is that:
|
Okay, so the PR is still in a good shape based on the discussion so far (AFAICT). @AndyAyersMS, did you have any other feedback or is it good to merge? |
To be clear, "the argument passed is not a constant", the non-constant arguments cannot come from users' direct calls. At least, for Intel hardware intrinsics, user-passed non-constant arguments into direct calls have to generate the throw. |
@fiigii, I believe the sentiment is that we should not force expansion in this case and let a regular GT_CALL be emitted. The code would still compile, without throwing, and would still execute correctly (the same as if it had been called indirectly). |
Yes - that's the consensus we've reached. The expectation is that, first and foremost, the expectation is that developers using these intrinsics "know what they are doing". However, it has also been suggested that analyzers should be provided to identify cases where non-constant values are being passed where an immediate is expected. |
I'm ok with it as is. |
Ok. Thanks! I'm going to merge this shortly, provided I don't see any other feedback requesting otherwise. I believe all remaining discussions will be impactful to future code and not to the changes currently being made by this PR. |
I'm good with this as well (I did one more quick review, as it's been some time, and many comments, since I looked at it). |
Issue
Hardware intrinsics are not currently expanded inline when
minopts
orcompDbgCode
is enabled.This means that, rather than the raw instruction being emitted, a call to the hardware intrinsic method is emitted instead (ex:
Sse.Shuffle
).The hardware intrinsic method itself is recursive and will be expanded when it is jitted.
Because of this, nodes that were originally
GT_CNS_*
are nowGT_LCL_VAR
and methods which require constant parameters fail codegen.Resolution
Hardware intrinsics should either always be expanded or have a software fallback implemented for the methods which require constant parameters.
This PR does the former, but it has some drawbacks.
NOTE: On the second bullet point above, it may be that the external parts will need to be updated to work with hardware intrinsics for when optimizations are enabled anyways.
The latter (software fallback/not always expanding intrinsics) has several concerns around the usability and performance of hardware intrinsics when optimizations are disabled.
Ex: While some performance degradation is normally expected, hardware intrinsics generally map to a single underlying hardware instructions. Not expanding the intrinsics will result in a call, plus stack spilling, per instruction. This causes the overhead to be significantly greater than a normal method, which will generally execute a series of instructions. This can also cause the hardware intrinsics to perform worse than a naively serial (non-vectorized) algorithm.
Impacted Intrinsics
This will end up impacting all intrinsics which require a constant parameter for codegen to complete succesfully.
A non exhaustive list includes:
Extract
,Insert
,ShuffleHigh
,ShuffleLow
,ShiftLeftLogical
,ShiftLeftLogical128BitLane
,ShiftRightLogical
,ShiftRightLogical128BitLane
,ShiftRightArithmetic
,Blend
,MultiplySumAbsoluteDifferences
,CompareImplicitLength
,CompareExplicitLength
,CompareImplicitLengthIndex
,CompareExplicitLengthIndex
,Compare
, etc...These intrinsics are spread out across most of the exposed ISAs and several of them have multiple overloads. This means we are looking at a large number of intrinsics that will be impacted.
ARM/ARM64 is adding their own intrinsics as well and will also likely be impacted. I do not currently have a list of which of their intrinsics would be impacted.