-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Vectorize is_sorted_until
#5420
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorize is_sorted_until
#5420
Conversation
|
Thanks! 😻 Pushed changes for a last small round of nitpicks. Click to expand 5950X results:
|
Interesting thing about floating types in your "Before" results. In mine, they were consistently slower, but I suspected "code alignment gremlins", rather than inherently slower instructions. The instructions have a bit longer encoding, making stepping into that a bit easier. Your result seem to confirm that, having one of floating results consistent with integer results. |
This comment was marked as outdated.
This comment was marked as outdated.
No, it has changed in tests, not benchmark. Then probably merging re-shuffled things. |
|
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
|
Thanks for sorting this out! 😹 😻 🔢 |
🗺️ Overview
Another vectorziation, apparently one of the last relatively low hanging fruits 🍎.
Uses offset load like in
remove,uniqueoradjacent_find, and comparisons fromminmaxfamily.⚖️ Less and greater
Sorting in both ascending and descending order makes equal sense, although the standard picks
lessas the default predicate. We support bothlessandgreaterhere. This is done for the first time. It wasn't done forminmaxfamily in the past, because forminmaxit is very squirrelly to reverse the predicate 🐿️.To reduce the number of functions,
lessandgreaterdistinguished by aboolparameter, which turns to a couple of offsets for the loads, making difference which of the loads are actually with a negative offset.The runtime cost of applying extra offset on each iteration is nonzero, but still I expect it to be small.
Fun observation: one could attempt to pass
less_equalorgreater_equaltoo, and expect the algorithm to be checking for strictly asceding/desceding values. But it is UB according to the sorted algorithms requirements, and there's_DEBUG_LT_PREDto alert about that 😈. The vectorization will not affect this case anyhow, as it looks for the specific predicates.🐞 Floating bugs
We support floating point types here too.
I don't expect anything bad this time. This algorithm is position-based, and we only use the intrinsics which do correct math.
One caveat I see is signaling values🚨. As this is early return algorithm, for some data it may be expected not to signal when handling elements one-by-one, and it would start signaling when working with vectors. But we already do not enable floating vector algorithms for
/fp:except, so it should be fine:STL/stl/inc/xutility
Lines 55 to 59 in 5762e6b
If anything goes wrong, the escape hatch is still there to help.
➕ Unsigned values
Like in
minmax_element, we need the*_gtintrinsics that only exist for signed values, so we have to do sign correction.Unlike in
minmax_element, we apply the correction at compile-time, having different signed and unsigned functions. The whole algorithm is faster, so the correction overhead is more noticeable.Can be changed if the benchmark results do not show the signed/unsigned difference persuasive enough. Like we have ~33.5 for
int8_tand ~39.5 foriint8_t, and in runtime correction applying I expect ~39.5 for both, but with less machine code generated, and with simpler dispatch in the header.⏱️ Benchmark results