Vectorize `is_sorted_until` #5420

AlexGuteniev · 2025-04-19T11:20:43Z

🗺️ Overview

Another vectorziation, apparently one of the last relatively low hanging fruits 🍎.
Uses offset load like in remove, unique or adjacent_find, and comparisons from minmax family.

⚖️ Less and greater

Sorting in both ascending and descending order makes equal sense, although the standard picks less as the default predicate. We support both less and greater here. This is done for the first time. It wasn't done for minmax family in the past, because for minmax it is very squirrelly to reverse the predicate 🐿️.

To reduce the number of functions, less and greater distinguished by a bool parameter, which turns to a couple of offsets for the loads, making difference which of the loads are actually with a negative offset.

The runtime cost of applying extra offset on each iteration is nonzero, but still I expect it to be small.

Fun observation: one could attempt to pass less_equal or greater_equal too, and expect the algorithm to be checking for strictly asceding/desceding values. But it is UB according to the sorted algorithms requirements, and there's _DEBUG_LT_PRED to alert about that 😈. The vectorization will not affect this case anyhow, as it looks for the specific predicates.

🐞 Floating bugs

We support floating point types here too.

I don't expect anything bad this time. This algorithm is position-based, and we only use the intrinsics which do correct math.

One caveat I see is signaling values🚨. As this is early return algorithm, for some data it may be expected not to signal when handling elements one-by-one, and it would start signaling when working with vectors. But we already do not enable floating vector algorithms for /fp:except, so it should be fine:

STL/stl/inc/xutility

Lines 55 to 59 in 5762e6b

    
           #if _USE_STD_VECTOR_ALGORITHMS && !defined(_M_FP_EXCEPT)
 
           #define _USE_STD_VECTOR_FLOATING_ALGORITHMS 1
 
           #else // ^^^ use vector algorithms and fast math / not use vector algorithms or not use fast math vvv
 
           #define _USE_STD_VECTOR_FLOATING_ALGORITHMS 0
 
           #endif // ^^^ not use vector algorithms or not use fast math ^^^

If anything goes wrong, the escape hatch is still there to help.

➕ Unsigned values

Like in minmax_element, we need the *_gt intrinsics that only exist for signed values, so we have to do sign correction.

Unlike in minmax_element, we apply the correction at compile-time, having different signed and unsigned functions. The whole algorithm is faster, so the correction overhead is more noticeable.

Can be changed if the benchmark results do not show the signed/unsigned difference persuasive enough. Like we have ~33.5 for int8_t and ~39.5 for iint8_t, and in runtime correction applying I expect ~39.5 for both, but with less machine code generated, and with simpler dispatch in the header.

⏱️ Benchmark results

Benchmark	Before	After	Speedup
bm<std::int8_t, AlgType::Std>/3000/1800	434 ns	33.7 ns	12.88
bm<std::int8_t, AlgType::Rng>/3000/1800	434 ns	33.4 ns	12.99
bm<std::int16_t, AlgType::Std>/3000/1800	440 ns	67.8 ns	6.49
bm<std::int16_t, AlgType::Rng>/3000/1800	432 ns	66.5 ns	6.50
bm<std::int32_t, AlgType::Std>/3000/1800	432 ns	129 ns	3.35
bm<std::int32_t, AlgType::Rng>/3000/1800	431 ns	130 ns	3.32
bm<std::int64_t, AlgType::Std>/3000/1800	437 ns	238 ns	1.84
bm<std::int64_t, AlgType::Rng>/3000/1800	440 ns	241 ns	1.83
bm<std::uint8_t, AlgType::Std>/3000/1800	437 ns	39.4 ns	11.09
bm<std::uint8_t, AlgType::Rng>/3000/1800	440 ns	39.5 ns	11.14
bm<std::uint16_t, AlgType::Std>/3000/1800	438 ns	79.0 ns	5.54
bm<std::uint16_t, AlgType::Rng>/3000/1800	433 ns	80.9 ns	5.35
bm<std::uint32_t, AlgType::Std>/3000/1800	432 ns	146 ns	2.96
bm<std::uint32_t, AlgType::Rng>/3000/1800	429 ns	147 ns	2.92
bm<std::uint64_t, AlgType::Std>/3000/1800	500 ns	279 ns	1.79
bm<std::uint64_t, AlgType::Rng>/3000/1800	446 ns	280 ns	1.59
bm<float, AlgType::Std>/3000/1800	658 ns	112 ns	5.88
bm<float, AlgType::Rng>/3000/1800	670 ns	100 ns	6.70
bm<double, AlgType::Std>/3000/1800	653 ns	179 ns	3.65
bm<double, AlgType::Rng>/3000/1800	657 ns	177 ns	3.71

…edicate

benchmarks/src/is_sorted_until.cpp

stl/inc/algorithm

tests/std/include/test_is_sorted_until_support.hpp

tests/std/tests/VSO_0000000_vector_algorithms/test.cpp

tests/std/tests/VSO_0000000_vector_algorithms_floats/test.cpp

benchmarks/inc/utility.hpp

stl/src/vector_algorithms.cpp

tests/std/include/test_is_sorted_until_support.hpp

StephanTLavavej · 2025-04-23T05:45:08Z

Thanks! 😻 Pushed changes for a last small round of nitpicks.

Click to expand 5950X results:

Benchmark	Before	After	Speedup
`bm_is_sorted_until<std::int8_t, AlgType::Std>/3000/1800`	389 ns	29.2 ns	13.32
`bm_is_sorted_until<std::int8_t, AlgType::Rng>/3000/1800`	392 ns	29.6 ns	13.24
`bm_is_sorted_until<std::int16_t, AlgType::Std>/3000/1800`	384 ns	57.9 ns	6.63
`bm_is_sorted_until<std::int16_t, AlgType::Rng>/3000/1800`	386 ns	58.4 ns	6.61
`bm_is_sorted_until<std::int32_t, AlgType::Std>/3000/1800`	387 ns	112 ns	3.46
`bm_is_sorted_until<std::int32_t, AlgType::Rng>/3000/1800`	390 ns	112 ns	3.48
`bm_is_sorted_until<std::int64_t, AlgType::Std>/3000/1800`	387 ns	201 ns	1.93
`bm_is_sorted_until<std::int64_t, AlgType::Rng>/3000/1800`	767 ns	202 ns	3.80
`bm_is_sorted_until<std::uint8_t, AlgType::Std>/3000/1800`	386 ns	34.1 ns	11.32
`bm_is_sorted_until<std::uint8_t, AlgType::Rng>/3000/1800`	387 ns	32.9 ns	11.76
`bm_is_sorted_until<std::uint16_t, AlgType::Std>/3000/1800`	387 ns	58.9 ns	6.57
`bm_is_sorted_until<std::uint16_t, AlgType::Rng>/3000/1800`	386 ns	59.7 ns	6.47
`bm_is_sorted_until<std::uint32_t, AlgType::Std>/3000/1800`	388 ns	122 ns	3.18
`bm_is_sorted_until<std::uint32_t, AlgType::Rng>/3000/1800`	388 ns	122 ns	3.18
`bm_is_sorted_until<std::uint64_t, AlgType::Std>/3000/1800`	389 ns	234 ns	1.66
`bm_is_sorted_until<std::uint64_t, AlgType::Rng>/3000/1800`	767 ns	235 ns	3.26
`bm_is_sorted_until<float, AlgType::Std>/3000/1800`	772 ns	112 ns	6.89
`bm_is_sorted_until<float, AlgType::Rng>/3000/1800`	771 ns	112 ns	6.88
`bm_is_sorted_until<double, AlgType::Std>/3000/1800`	770 ns	201 ns	3.83
`bm_is_sorted_until<double, AlgType::Rng>/3000/1800`	389 ns	201 ns	1.94

AlexGuteniev · 2025-04-23T06:02:52Z

5950X results

Interesting thing about floating types in your "Before" results. In mine, they were consistently slower, but I suspected "code alignment gremlins", rather than inherently slower instructions. The instructions have a bit longer encoding, making stepping into that a bit easier. Your result seem to confirm that, having one of floating results consistent with integer results.

AlexGuteniev · 2025-04-23T10:39:21Z

Your result seem to confirm that, having one of floating results consistent with integer results

And this happens not because of different CPU, but because a different overload of ranges::is_sorted was used.

No, it has changed in tests, not benchmark. Then probably merging re-shuffled things.

StephanTLavavej · 2025-04-23T13:53:41Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2025-04-23T21:41:49Z

Thanks for sorting this out! 😹 😻 🔢

AlexGuteniev added 7 commits March 30, 2025 19:32

test

d51e18e

Merge branch 'microsoft:main' into is_sorted_until

4521809

benchmark

e380801

Split out predicate check from other minmax checks and add greater pr…

ca3eb92

…edicate

Signed benchmark

54a4451

Unbias distribution

d0c3720

vectorization!

be4fb5c

AlexGuteniev requested a review from a team as a code owner April 19, 2025 11:20

github-project-automation bot added this to STL Code Reviews Apr 19, 2025

github-project-automation bot moved this to Initial Review in STL Code Reviews Apr 19, 2025

Make microsoft#2885 test pass

355b8f9

AlexGuteniev mentioned this pull request Apr 19, 2025

<algorithm>, <xutility>: vector algorithms may break integer class type difference iterators #5421

Closed

StephanTLavavej added the performance Must go faster label Apr 19, 2025

StephanTLavavej self-assigned this Apr 19, 2025

AlexGuteniev added 3 commits April 20, 2025 10:33

more concise predicate check

610af8f

cast here too

09e1fc0

cast here to correct type

715a4ca

StephanTLavavej requested changes Apr 22, 2025

View reviewed changes

github-project-automation bot moved this from Initial Review to Work In Progress in STL Code Reviews Apr 22, 2025

StephanTLavavej removed their assignment Apr 22, 2025

AlexGuteniev added 10 commits April 22, 2025 16:00

Meow. Vectorized.

08e2f9f

consistent std

b95b735

includes

44917c0

Avoid shadowing

2643495

drop top level const on declarations

c95cc06

Pointers are just unsigned integers

e0143d6

Preprocessor comments

a48cc69

C++20

b37d6a2

unexpect Arm

9a4d528

includes

516f8c0

AlexGuteniev added 3 commits April 22, 2025 16:20

named distribution

606f263

rename benchmark function itself

aff7b41

includes

896ced2

AlexGuteniev requested a review from StephanTLavavej April 22, 2025 13:27

StephanTLavavej moved this from Work In Progress to Initial Review in STL Code Reviews Apr 22, 2025

StephanTLavavej self-assigned this Apr 22, 2025

AlexGuteniev and others added 4 commits April 23, 2025 08:10

Merge remote-tracking branch 'upstream/main' into is_sorted_until

9605479

<memory> for allocator.

61fde12

Adjust CodeQL suppression reasons.

1f4661d

Pass input as a single range to std::ranges::is_sorted_until.

2d1385e

StephanTLavavej reviewed Apr 23, 2025

View reviewed changes

benchmarks/inc/utility.hpp Show resolved Hide resolved

stl/src/vector_algorithms.cpp Show resolved Hide resolved

tests/std/include/test_is_sorted_until_support.hpp Outdated Show resolved Hide resolved

StephanTLavavej approved these changes Apr 23, 2025

View reviewed changes

StephanTLavavej removed their assignment Apr 23, 2025

StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 23, 2025

This comment was marked as outdated.

Sign in to view

StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Apr 23, 2025

StephanTLavavej merged commit 4c99d05 into microsoft:main Apr 23, 2025
39 checks passed

github-project-automation bot moved this from Merging to Done in STL Code Reviews Apr 23, 2025

AlexGuteniev deleted the is_sorted_until branch April 23, 2025 22:30

	#if _USE_STD_VECTOR_ALGORITHMS && !defined(_M_FP_EXCEPT)
	#define _USE_STD_VECTOR_FLOATING_ALGORITHMS 1
	#else // ^^^ use vector algorithms and fast math / not use vector algorithms or not use fast math vvv
	#define _USE_STD_VECTOR_FLOATING_ALGORITHMS 0
	#endif // ^^^ not use vector algorithms or not use fast math ^^^

Vectorize is_sorted_until #5420

Vectorize is_sorted_until #5420

Uh oh!

Conversation

AlexGuteniev commented Apr 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🗺️ Overview

⚖️ Less and greater

🐞 Floating bugs

➕ Unsigned values

⏱️ Benchmark results

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

StephanTLavavej commented Apr 23, 2025

Uh oh!

AlexGuteniev commented Apr 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as outdated.

AlexGuteniev commented Apr 23, 2025

Uh oh!

StephanTLavavej commented Apr 23, 2025

Uh oh!

Uh oh!

StephanTLavavej commented Apr 23, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Vectorize `is_sorted_until` #5420

Vectorize `is_sorted_until` #5420

AlexGuteniev commented Apr 19, 2025 •

edited

Loading

AlexGuteniev commented Apr 23, 2025 •

edited

Loading