Auto-vectorize arrays swap
#4991
Can't you just throw in a |
No, unfortunately. Pragma loop ivdep also doesn't help |
Co-authored-by: Stephan T. Lavavej <stl@nuwen.net>
I want to note that |
Thanks! 🐈 I pushed fairly minor changes. Results on my 5950X:
|
A big difference in "After" column between Looks like the main reason is vector over-alignment. adding I don't think the benchmark should be modified to have that though. |
If we know that alignment has a significant effect on the results, shouldn't we control it instead of letting it vary and skew the benchmark? Note that "control" doesn't mean "pick an unrealistic value that skews the results" but either pick a value that is representative or average over a set of values that are representative. |
The x64 stack is 16-byte aligned on function entry, but for multiple variables on the stack the location of each is constrained by its own alignment and the alignment of the variable next to it. malloc also has 16-byte alignment. So 8 is maybe a good skew, to imitate being next to a pointer. |
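One way to control the alignment discussed above is to over-align the storage and then start the data at a fixed offset. The following is only a sketch under that assumption: `skewed_buffer` and its parameters are hypothetical names, not part of the benchmark in this PR.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical benchmark helper (not from this PR): the usable region starts
// exactly `Skew` bytes past a 64-byte boundary, so the alignment seen by the
// code under test is fixed instead of depending on stack or heap layout.
template <std::size_t Size, std::size_t Skew>
struct skewed_buffer {
    static_assert(Skew < 64, "skew is measured relative to a 64-byte boundary");

    alignas(64) unsigned char storage[Size + Skew];

    unsigned char* data() noexcept {
        return storage + Skew;
    }
};
```

With `Skew` set to 8, `data()` is 8-byte aligned but not 16-byte aligned, which imitates an array placed right after a pointer.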
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
🔀 🚀 😻 |
Resolves #2683
📜 The Approach
Instead of trying to call a separately compiled implementation in the import library, reimplement the specific case in headers in a way that the compiler is able to vectorize.
Do a memcpy swap by portions of a nice length, and then the tail, so that the compiler can use registers as intermediate storage, and vector registers when possible. Because the length is known at compile time, this unrolls perfectly. I didn't investigate much which length is nice; I think 64 is a good choice:
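A minimal sketch of this idea, not the code from the PR: `swap_arrays_by_portions` is a hypothetical name, and the sketch assumes trivially copyable elements and non-overlapping arrays.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical helper: swaps two non-overlapping arrays of trivially
// copyable elements in fixed-size portions, then handles the tail.
template <class T, std::size_t N>
void swap_arrays_by_portions(T (&left)[N], T (&right)[N]) noexcept {
    static_assert(std::is_trivially_copyable_v<T>, "memcpy requires trivially copyable T");

    constexpr std::size_t portion = 64; // portion size in bytes, the value suggested above
    constexpr std::size_t total   = sizeof(T) * N;
    constexpr std::size_t tail    = total % portion;

    auto* lp = reinterpret_cast<unsigned char*>(&left);
    auto* rp = reinterpret_cast<unsigned char*>(&right);

    // Full portions: every memcpy has a constant size, so the compiler is
    // free to keep `buf` in (vector) registers rather than in memory.
    for (std::size_t offset = 0; offset + portion <= total; offset += portion) {
        unsigned char buf[portion];
        std::memcpy(buf, lp + offset, portion);
        std::memcpy(lp + offset, rp + offset, portion);
        std::memcpy(rp + offset, buf, portion);
    }

    // Tail: its size is also a compile-time constant.
    if constexpr (tail != 0) {
        constexpr std::size_t offset = total - tail;
        unsigned char buf[tail];
        std::memcpy(buf, lp + offset, tail);
        std::memcpy(lp + offset, rp + offset, tail);
        std::memcpy(rp + offset, buf, tail);
    }
}
```

Whether a given compiler actually vectorizes this depends on the target and optimization flags, so the generated code is worth inspecting.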
⚖️ Self swap check
[utility.swap]/7 says:
[alg.swap]/2 says:
Therefore it is UB to swap overlapping arrays.
Added an assert for that and removed the check. @frederick-vs-ja reported LWG-4165, and the check is preserved.
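For illustration, a preserved self-swap check can be as small as an address comparison in front of the element-wise swap. This is a sketch with a hypothetical name (`checked_array_swap`), not the library's actual implementation:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch: skip the work when both references name the same
// array, so the non-overlap precondition of swap_ranges is never violated
// by a self-swap.
template <class T, std::size_t N>
void checked_array_swap(T (&left)[N], T (&right)[N]) {
    if (&left == &right) {
        return; // self-swap: nothing to do
    }
    std::swap_ranges(left, left + N, right);
}
```

Two distinct arrays of the same type cannot partially overlap, so the address comparison covers the only overlapping case reachable through this signature.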
⏱️ Benchmark results
🥇 Results interpretation