KEMBAR78
Vectorize `is_sorted_until` by AlexGuteniev · Pull Request #5420 · microsoft/STL · GitHub
Skip to content

Conversation

AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Apr 19, 2025

🗺️ Overview

Another vectorziation, apparently one of the last relatively low hanging fruits 🍎.
Uses offset load like in remove, unique or adjacent_find, and comparisons from minmax family.

⚖️ Less and greater

Sorting in both ascending and descending order makes equal sense, although the standard picks less as the default predicate. We support both less and greater here. This is done for the first time. It wasn't done for minmax family in the past, because for minmax it is very squirrelly to reverse the predicate 🐿️.

To reduce the number of functions, less and greater distinguished by a bool parameter, which turns to a couple of offsets for the loads, making difference which of the loads are actually with a negative offset.

The runtime cost of applying extra offset on each iteration is nonzero, but still I expect it to be small.

Fun observation: one could attempt to pass less_equal or greater_equal too, and expect the algorithm to be checking for strictly asceding/desceding values. But it is UB according to the sorted algorithms requirements, and there's _DEBUG_LT_PRED to alert about that 😈. The vectorization will not affect this case anyhow, as it looks for the specific predicates.

🐞 Floating bugs

We support floating point types here too.

I don't expect anything bad this time. This algorithm is position-based, and we only use the intrinsics which do correct math.

One caveat I see is signaling values🚨. As this is early return algorithm, for some data it may be expected not to signal when handling elements one-by-one, and it would start signaling when working with vectors. But we already do not enable floating vector algorithms for /fp:except, so it should be fine:

STL/stl/inc/xutility

Lines 55 to 59 in 5762e6b

#if _USE_STD_VECTOR_ALGORITHMS && !defined(_M_FP_EXCEPT)
#define _USE_STD_VECTOR_FLOATING_ALGORITHMS 1
#else // ^^^ use vector algorithms and fast math / not use vector algorithms or not use fast math vvv
#define _USE_STD_VECTOR_FLOATING_ALGORITHMS 0
#endif // ^^^ not use vector algorithms or not use fast math ^^^

If anything goes wrong, the escape hatch is still there to help.

➕ Unsigned values

Like in minmax_element, we need the *_gt intrinsics that only exist for signed values, so we have to do sign correction.

Unlike in minmax_element, we apply the correction at compile-time, having different signed and unsigned functions. The whole algorithm is faster, so the correction overhead is more noticeable.

Can be changed if the benchmark results do not show the signed/unsigned difference persuasive enough. Like we have ~33.5 for int8_t and ~39.5 for iint8_t, and in runtime correction applying I expect ~39.5 for both, but with less machine code generated, and with simpler dispatch in the header.

⏱️ Benchmark results

Benchmark Before After Speedup
bm<std::int8_t, AlgType::Std>/3000/1800 434 ns 33.7 ns 12.88
bm<std::int8_t, AlgType::Rng>/3000/1800 434 ns 33.4 ns 12.99
bm<std::int16_t, AlgType::Std>/3000/1800 440 ns 67.8 ns 6.49
bm<std::int16_t, AlgType::Rng>/3000/1800 432 ns 66.5 ns 6.50
bm<std::int32_t, AlgType::Std>/3000/1800 432 ns 129 ns 3.35
bm<std::int32_t, AlgType::Rng>/3000/1800 431 ns 130 ns 3.32
bm<std::int64_t, AlgType::Std>/3000/1800 437 ns 238 ns 1.84
bm<std::int64_t, AlgType::Rng>/3000/1800 440 ns 241 ns 1.83
bm<std::uint8_t, AlgType::Std>/3000/1800 437 ns 39.4 ns 11.09
bm<std::uint8_t, AlgType::Rng>/3000/1800 440 ns 39.5 ns 11.14
bm<std::uint16_t, AlgType::Std>/3000/1800 438 ns 79.0 ns 5.54
bm<std::uint16_t, AlgType::Rng>/3000/1800 433 ns 80.9 ns 5.35
bm<std::uint32_t, AlgType::Std>/3000/1800 432 ns 146 ns 2.96
bm<std::uint32_t, AlgType::Rng>/3000/1800 429 ns 147 ns 2.92
bm<std::uint64_t, AlgType::Std>/3000/1800 500 ns 279 ns 1.79
bm<std::uint64_t, AlgType::Rng>/3000/1800 446 ns 280 ns 1.59
bm<float, AlgType::Std>/3000/1800 658 ns 112 ns 5.88
bm<float, AlgType::Rng>/3000/1800 670 ns 100 ns 6.70
bm<double, AlgType::Std>/3000/1800 653 ns 179 ns 3.65
bm<double, AlgType::Rng>/3000/1800 657 ns 177 ns 3.71

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner April 19, 2025 11:20
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Apr 19, 2025
@github-project-automation github-project-automation bot moved this from Initial Review to Work In Progress in STL Code Reviews Apr 22, 2025
@StephanTLavavej StephanTLavavej removed their assignment Apr 22, 2025
@StephanTLavavej StephanTLavavej moved this from Work In Progress to Initial Review in STL Code Reviews Apr 22, 2025
@StephanTLavavej StephanTLavavej self-assigned this Apr 22, 2025
@StephanTLavavej
Copy link
Member

Thanks! 😻 Pushed changes for a last small round of nitpicks.

Click to expand 5950X results:
Benchmark Before After Speedup
bm_is_sorted_until<std::int8_t, AlgType::Std>/3000/1800 389 ns 29.2 ns 13.32
bm_is_sorted_until<std::int8_t, AlgType::Rng>/3000/1800 392 ns 29.6 ns 13.24
bm_is_sorted_until<std::int16_t, AlgType::Std>/3000/1800 384 ns 57.9 ns 6.63
bm_is_sorted_until<std::int16_t, AlgType::Rng>/3000/1800 386 ns 58.4 ns 6.61
bm_is_sorted_until<std::int32_t, AlgType::Std>/3000/1800 387 ns 112 ns 3.46
bm_is_sorted_until<std::int32_t, AlgType::Rng>/3000/1800 390 ns 112 ns 3.48
bm_is_sorted_until<std::int64_t, AlgType::Std>/3000/1800 387 ns 201 ns 1.93
bm_is_sorted_until<std::int64_t, AlgType::Rng>/3000/1800 767 ns 202 ns 3.80
bm_is_sorted_until<std::uint8_t, AlgType::Std>/3000/1800 386 ns 34.1 ns 11.32
bm_is_sorted_until<std::uint8_t, AlgType::Rng>/3000/1800 387 ns 32.9 ns 11.76
bm_is_sorted_until<std::uint16_t, AlgType::Std>/3000/1800 387 ns 58.9 ns 6.57
bm_is_sorted_until<std::uint16_t, AlgType::Rng>/3000/1800 386 ns 59.7 ns 6.47
bm_is_sorted_until<std::uint32_t, AlgType::Std>/3000/1800 388 ns 122 ns 3.18
bm_is_sorted_until<std::uint32_t, AlgType::Rng>/3000/1800 388 ns 122 ns 3.18
bm_is_sorted_until<std::uint64_t, AlgType::Std>/3000/1800 389 ns 234 ns 1.66
bm_is_sorted_until<std::uint64_t, AlgType::Rng>/3000/1800 767 ns 235 ns 3.26
bm_is_sorted_until<float, AlgType::Std>/3000/1800 772 ns 112 ns 6.89
bm_is_sorted_until<float, AlgType::Rng>/3000/1800 771 ns 112 ns 6.88
bm_is_sorted_until<double, AlgType::Std>/3000/1800 770 ns 201 ns 3.83
bm_is_sorted_until<double, AlgType::Rng>/3000/1800 389 ns 201 ns 1.94

@StephanTLavavej StephanTLavavej removed their assignment Apr 23, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 23, 2025
@AlexGuteniev
Copy link
Contributor Author

AlexGuteniev commented Apr 23, 2025

5950X results

Interesting thing about floating types in your "Before" results. In mine, they were consistently slower, but I suspected "code alignment gremlins", rather than inherently slower instructions. The instructions have a bit longer encoding, making stepping into that a bit easier. Your result seem to confirm that, having one of floating results consistent with integer results.

@AlexGuteniev

This comment was marked as outdated.

@AlexGuteniev
Copy link
Contributor Author

Your result seem to confirm that, having one of floating results consistent with integer results

And this happens not because of different CPU, but because a different overload of ranges::is_sorted was used.

No, it has changed in tests, not benchmark. Then probably merging re-shuffled things.

@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Apr 23, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 4c99d05 into microsoft:main Apr 23, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Apr 23, 2025
@StephanTLavavej
Copy link
Member

Thanks for sorting this out! 😹 😻 🔢

@AlexGuteniev AlexGuteniev deleted the is_sorted_until branch April 23, 2025 22:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Must go faster

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

2 participants