-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Initial support for zmm in .NET #80960
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak Issue DetailsThis commit includes the
|
@anthonycanino @tannergooding @BruceForstall One open I have is regarding calculating 'cost'. See change here : 47a7424. I'm not sure how the heuristic works and if |
I don't know about the cost details of the function you reference (maybe @AndyAyersMS could advise). However, don't SIMD64 and SIMD32 require similar kinds of codegen w.r.t. saving/restoring registers around a call? I don't know if there are cases where SIMD64 should be given higher (more expensive) weights that SIMD32 just because there are twice as many bits in play (so presumably load/store is more expensive?). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like good progress.
src/coreclr/jit/typelist.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding It looks like TYP_SIMD32 is enabled for Arm64, and so are a lot of code paths related to it, e.g., in GenTreeVecCon. How much of the Arm64 compiler actually uses it, given that the maximum vector size is 16 bytes? I'm worried that TYP_SIMD64 will impose an even larger burden on Arm64 for no benefit.
Maybe there's not really much of a burden, except GenTreeVecCon is now a large size node, plus a bunch of unused code paths?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also a little surprising to me that SIMD32 alignment is 16, not 32, and SIMD64 is also 16, not 64, but the comment above indicates there's a reason for that (maybe it should point at code that changes the alignment dynamically).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don’t actually use it to my knowledge and any handling should be superfluous
it likely just was never ifdefd out when Arm64 was brought online
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not fully understand the question here
- Do you mean the types
SIMD32
andSIMD64
should not be defined for Arm64 at all? The following made sense in my head since both arm and x86 use the same calls to createGenTreeVecCon
etc and ifGenTreeVecCon
is usingsimd64Val
now since it's the largest, we need to do the same here :
https://github.com/dotnet/runtime/pull/80960/files#diff-547b75c8f8dd006693805c30d7a0e2ddcdd6b10691383dfbb16e3ed0c50f236aL1431-R1431 . When you say it doesn't need to be handled, do you mean the changes insideLowering::LowerHWIntrinsicCreate
above are not needed? - Not sure about the alignment being 16 there. That was another of my opens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just saying that TYP_SIMD32/64
paths aren't ever used on arm64 and so we could ifdef them out and only have them on xarch.
I don't remember all the things that the alignment: 16
impacts. I believe it primarily impacts stack alignment and I don't believe we have the necessary support to change it to 32/64 today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @DeepakRajendrakumaran doesn't need to worry about the impact of TYP_SIMD64 on Arm64 too much, but we'll see what the throughput (TP) measurements say about the actual impact. Perhaps we do need to go back and make sure Arm64 isn't unduly impacted in TP or JIT code size by unused functionality.
One example: a lot of this PR is adjusting simd32_t to simd64_t in many places. Maybe there should be a "union" type of simd_t (or simd_data_t?) that is a union of simd64_t, simd32_t, simd16_t, etc. Except on Arm64 it wouldn't include simd32_t or simd64_t, since they aren't needed/used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One example: a lot of this PR is adjusting simd32_t to simd64_t in many places. Maybe there should be a "union" type of simd_t (or simd_data_t?) that is a union of simd64_t, simd32_t, simd16_t, etc. Except on Arm64 it wouldn't include simd32_t or simd64_t, since they aren't needed/used.
Each type is functionally that already.
That is, simd16_t
is a union of each primitive type + 2x simd8_t
simd32_t
is likewise of each primitive type + 4x simd8_t and 2x simd16_t
So by using simd64_t
you have efficient access to just the simd8/16/32
data if necessary. The downside being that it declares 48 bytes of unused data on Arm64, and 32 bytes of unused data on hardware without AVX-512 support.
ee869d2
to
3a5f180
Compare
While we wait for @AndyAyersMS to chime in, another question is re JITEEVersionIdentifier . I assume this needs to be updated as well? No 'new' methods are added but some changes were made |
bb20e5a
to
d2e795a
Compare
Yes, the change merits a JITEEVersion change. However, I would not include that in the PR, if possible, until after all other testing has occurred. That's because it will make it impossible to run superpmi-diffs and superpmi-replay pipelines and see their results. |
Given the asserts I see in superpmi pipelines:
Maybe the existing SuperPMI collections (using the old JIT-EE interface, e.g., InstructionSet* definitions) isn't going to work. I wonder if there's some way you can "hack" the change so it basically works with the "old" JIT-EE interface, just so we can leverage the superpmi pipelines (with existing superpmi collections) to measure throughput impact, especially on Arm64 or x64 without AVX-512 enabled, as well as asmdiffs on those platforms as well. |
Appreciate the suggestion. I'm struggling with some of these failures. Will try this out :) |
Another alternative is that I could create a new SuperPMI collection using just the minimum required changes to the JIT-EE interface, including the JIT-EE GUID change. Then, you could include in this PR the very same GUID change, and it would pick up the new collection. This is a sometime time-consuming option, but let me know if you need to go this direction. (This might be a little tricky as the replay job currently explicitly prevents being run if the JIT-EE GUID changes, bit it looks like the diffs job doesn't, for some reason) |
It would be great if we had an opt in job ( |
Due to security reasons, we can't do superpmi collections on PRs and update our Azure Storage location where these are usually stored. We could possibly put the generated MCH files in Azure Storage PR artifact storage, and pull them from there, but that requires a bigger change in how our scripting works. Probably do-able, but even more work. |
I was thinking just a local collection that is thrown away after the PR. I don't think it happens frequently enough that needing storage would be a concern. |
8ece98c
to
8bc9b34
Compare
@tannergooding All requested changes should be there now. |
Throughput numbers are a lot better now... However, given the numbers also are impacting Arm64, I'd guess this is due to changes in the common code paths, maybe the @BruceForstall, what are your thoughts on taking this "as-is" and me getting up a PR that moves I think the alternative is doing some more profiling "now" (maybe with Intel VTune or AMD uProf over At a glance, we might want to make Windows x64MinOpts (+0.07% to +0.34%) Windows arm64MinOpts (+0.11% to +0.25%) |
Put up #82616 to test the throughput impact of |
Put up #82624 to test the throughput impact of handling |
Restricting TYP_SIMD32 to Arm64 only represent an approx Fixing For Arm64 we'd likely see some additional gains for We'll still "regress" some things on xarch compared to not having |
So, I assume you want both changes to be merged? I imagine this will need to be modified again as I expect some merge conflicts if the other 2 PRs go in first |
Right. I think its worth doing so, at least so we can confirm no negative impact to Arm64. I can take care of resolving the conflicts, they should be relatively straightforward. |
Sounds good. Would be nice to finally have this merged :) |
With just the It would be interesting to see where the TP improvement is coming from, likely one of the suggested refactorings (maybe around using It does look like there are some x64 diffs around Windows x64MinOpts (-0.16% to -0.01%) Windows Arm64MinOpts (-0.05% to -0.00%) |
TP issues have been resolved. There is still a small up to The small diffs that have cropped up in I plan on giving this one more review pass myself, but it should generally be good and mergable after it gets sign-off from someone in @dotnet/jit-contrib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes overall LGTM.
I had left a few comments about some semi-related potential post PR cleanup that could happen. I'll work through those and get them completed.
Are we waiting on an additional review before merging this? |
Yes, it needs sign-off from someone that works on the JIT team. CC @dotnet/jit-contrib, @BruceForstall |
Bumping this since #82731 is dependent on this CC https://github.com/orgs/dotnet/teams/jit-contrib, @BruceForstall |
There were various optimization suggestions that apply to common code paths which give the improvements (namely the cases where we’re using genLog2 rather than memory lookups that span multiple cache lines) There also remain a couple common code paths doing handling of |
This commit includes the following:
Create()
andZero
using avx512