KEMBAR78
Vectorize find-like algorithms for Clang for more types by AlexGuteniev · Pull Request #5767 · microsoft/STL · GitHub
Skip to content

Conversation

@AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Oct 5, 2025

Resolves #5479.

⚙️ Optimization

  • Use _Is_same_and_builtin_trivially_equality_comparable in another place. Make that part common, as it repeats.
  • Adjust _Could_compare_equal_to_value_type to return true for the same types. This also handles std::byte concisely.
  • Adjust casts in vectorized dispatch to use _Find_arg_cast. Which only widens or truncates integers, and uses _Bit_cast for the rest. Also makes pointers non-special.

That's it!

🏁 Benchmark

The benchmark shows that:

  • The changes indeed dispatch to vectorized implementation, and
  • Clang auto-vectorization either does not apply, or is less efficient than manual vectorization

Only find_and_count ought to be enough. It is sufficient to show that the dispatch works.

To see that other algorithms (search_n, replace, remove, remove_copy) also benefit from the change, use their existing benchmark, and set CXXFLAGS=/arch:AVX2 /D_USE_STD_VECTOR_ALGORITHMS=0 to imitate non-optimization on integer types, and then set CXXFLAGS=/arch:AVX2 /D_USE_STD_VECTOR_ALGORITHMS=1 to see the optimization results. The newly vectorized types should perform the same, as corresponding integer types.

⏱️ Benchmark results

It was run with set CXXFLAGS=/arch:AVX2 to compete with Clang auto-vectorization fairly: the algorithms detect AVX2 at runtime.

Benchmark Before After Speedup
bm<point, not_highly_aligned_allocator, Op::FindSized>/8021/3056 749 ns 157 ns 4.77
bm<point, not_highly_aligned_allocator, Op::FindSized>/63/62 23.1 ns 3.66 ns 6.31
bm<point, not_highly_aligned_allocator, Op::FindSized>/31/30 7.58 ns 2.24 ns 3.38
bm<point, not_highly_aligned_allocator, Op::FindSized>/15/14 3.76 ns 1.72 ns 2.19
bm<point, not_highly_aligned_allocator, Op::FindSized>/7/6 1.81 ns 1.76 ns 1.03
bm<point, not_highly_aligned_allocator, Op::Count>/8021/3056 1174 ns 326 ns 3.60
bm<point, not_highly_aligned_allocator, Op::Count>/63/62 10.2 ns 4.03 ns 2.53
bm<point, not_highly_aligned_allocator, Op::Count>/31/30 5.58 ns 3.13 ns 1.78
bm<point, not_highly_aligned_allocator, Op::Count>/15/14 3.14 ns 2.89 ns 1.09
bm<point, not_highly_aligned_allocator, Op::Count>/7/6 2.20 ns 2.89 ns 0.76

Co-authored-by: S. B. Tam <cpplearner@outlook.com>
@StephanTLavavej StephanTLavavej added the performance Must go faster label Oct 6, 2025
@StephanTLavavej StephanTLavavej self-assigned this Oct 6, 2025
@StephanTLavavej
Copy link
Member

Thanks! 😻 I really like the removal of all of the 64-bit/32-bit special cases for pointers, and the is_same generalization that supersedes checking for byte. I pushed a conflict-free merge with main and trivial comment cleanups.

I had to think for a moment why T{'0'} and T{'1'} were cromulent for the aggregate, but decided that it didn't need code changes or DMIs.

@StephanTLavavej StephanTLavavej removed their assignment Oct 21, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Oct 21, 2025
@AlexGuteniev
Copy link
Contributor Author

I had to think for a moment why T{'0'} and T{'1'} were cromulent for the aggregate, but decided that it didn't need code changes or DMIs.

Ah, I initially added the template parameter to pass obviously-cromulent values, but in benchmark results it looked ugly and long, it made table cells multiline, so I abandoned that, preferring neat benchmark results over clarity, but didn't undo the template parameters change.

@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Oct 22, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit ea857c7 into microsoft:main Oct 22, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Oct 22, 2025
@StephanTLavavej
Copy link
Member

🎉 😻 🐈

@AlexGuteniev AlexGuteniev deleted the trivial-find branch October 22, 2025 14:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Must go faster

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Use Clang's __is_trivially_equality_comparable intrinsic to optimize more cases in vector algorithms

3 participants