Benchmark vectorized `reverse` and `reverse_copy`, use traits, optimize `reverse_copy` tail by AlexGuteniev · Pull Request #5493 · microsoft/STL · GitHub

Conversation

@AlexGuteniev
Contributor

⚙️ Changes

  • Add a benchmark. I think this was the only remaining vectorized algorithm without one.
  • Use traits to make maintenance easier. This hides the exact intrinsics, but we no longer pursue the SSE2/SSE4.2 distinction, so that's fine.
  • Use AVX2 masked operations to process tail elements. Somehow this improved only `reverse_copy`, not `reverse`, possibly because of overlapping masked stores: they do overlap, although the overlapping part is masked out in at least one of them. (The commit history preserves this attempt.)
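
For readers following along, here is a minimal sketch of the traits and masked-tail ideas from the bullets above, restricted to 4-byte elements. It is not the code from this PR: `avx2_traits_4` and `reverse_copy_u32` are hypothetical names chosen for illustration, and the way the mask and permutation indices are built is an assumption rather than the approach taken in the STL sources. When the translation unit is not compiled with AVX2 enabled, the sketch simply falls back to `std::reverse_copy`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__AVX2__)
#include <immintrin.h>

// Hypothetical traits for 4-byte elements. The loop below uses these names
// for the lane count, the tail mask, and the lane permutation, so the exact
// intrinsics behind them stay in one place.
struct avx2_traits_4 {
    static constexpr std::ptrdiff_t lanes = 8;

    static __m256i iota() noexcept {
        return _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    }

    // All-ones in lanes [0, n), zero in the remaining lanes.
    static __m256i tail_mask(std::ptrdiff_t n) noexcept {
        return _mm256_cmpgt_epi32(_mm256_set1_epi32(static_cast<int>(n)), iota());
    }

    // Moves lane (n - 1 - k) into lane k for k < n. The permute reads only
    // the low 3 bits of each index, so the unused lanes are harmless.
    static __m256i reverse_first(__m256i v, std::ptrdiff_t n) noexcept {
        const __m256i idx =
            _mm256_sub_epi32(_mm256_set1_epi32(static_cast<int>(n - 1)), iota());
        return _mm256_permutevar8x32_epi32(v, idx);
    }
};
#endif

// Hypothetical helper: copies [first, last) to dest in reverse order.
// As with std::reverse_copy, the source and destination must not overlap.
void reverse_copy_u32(const std::uint32_t* first, const std::uint32_t* last,
                      std::uint32_t* dest) {
    assert(first <= last);
#if defined(__AVX2__)
    using T = avx2_traits_4;

    // Full blocks: read from the back of the source, write to the front of dest.
    while (last - first >= T::lanes) {
        last -= T::lanes;
        const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(last));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(dest),
                            T::reverse_first(v, T::lanes));
        dest += T::lanes;
    }

    // Tail of 1..7 elements: one masked load and one masked store instead of
    // a scalar loop. Masked-out lanes are neither read nor written.
    const std::ptrdiff_t n = last - first;
    if (n != 0) {
        const __m256i mask = T::tail_mask(n);
        const __m256i v = _mm256_maskload_epi32(reinterpret_cast<const int*>(first), mask);
        _mm256_maskstore_epi32(reinterpret_cast<int*>(dest), mask,
                               T::reverse_first(v, n));
    }
#else
    // Portable fallback when AVX2 is not enabled at compile time.
    std::reverse_copy(first, last, dest);
#endif
}
```

Calling `reverse_copy_u32(src, src + n, dst)` should give the same result as `std::reverse_copy(src, src + n, dst)` for any `n`. Here the source and destination are separate ranges, so the single masked store for the tail does not overlap any other store, which may be part of why the masked tail is easier to apply to `reverse_copy` than to in-place `reverse`.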

⏱️ Benchmark results

| Benchmark | Before | After |
|---|---|---|
| r<std::uint8_t>/3449 | 51.9 ns | 46.4 ns |
| r<std::uint8_t>/63 | 8.69 ns | 8.62 ns |
| r<std::uint8_t>/31 | 8.82 ns | 8.71 ns |
| r<std::uint8_t>/15 | 3.29 ns | 3.29 ns |
| r<std::uint8_t>/7 | 2.44 ns | 2.10 ns |
| r<std::uint16_t>/3449 | 91.4 ns | 94.4 ns |
| r<std::uint16_t>/63 | 4.88 ns | 4.94 ns |
| r<std::uint16_t>/31 | 4.29 ns | 3.99 ns |
| r<std::uint16_t>/15 | 3.46 ns | 3.32 ns |
| r<std::uint16_t>/7 | 2.43 ns | 2.13 ns |
| r<std::uint32_t>/3449 | 171 ns | 173 ns |
| r<std::uint32_t>/63 | 4.49 ns | 4.47 ns |
| r<std::uint32_t>/31 | 3.51 ns | 3.62 ns |
| r<std::uint32_t>/15 | 2.81 ns | 2.83 ns |
| r<std::uint32_t>/7 | 2.11 ns | 2.12 ns |
| r<std::uint64_t>/3449 | 348 ns | 342 ns |
| r<std::uint64_t>/63 | 5.95 ns | 5.78 ns |
| r<std::uint64_t>/31 | 3.63 ns | 3.61 ns |
| r<std::uint64_t>/15 | 2.86 ns | 2.98 ns |
| r<std::uint64_t>/7 | 2.16 ns | 2.14 ns |
| rc<std::uint8_t>/3449 | 49.0 ns | 48.7 ns |
| rc<std::uint8_t>/63 | 8.66 ns | 3.31 ns |
| rc<std::uint8_t>/31 | 8.26 ns | 5.20 ns |
| rc<std::uint8_t>/15 | 4.31 ns | 4.30 ns |
| rc<std::uint8_t>/7 | 2.60 ns | 2.62 ns |
| rc<std::uint16_t>/3449 | 97.8 ns | 101 ns |
| rc<std::uint16_t>/63 | 4.76 ns | 3.39 ns |
| rc<std::uint16_t>/31 | 4.04 ns | 2.85 ns |
| rc<std::uint16_t>/15 | 3.37 ns | 3.08 ns |
| rc<std::uint16_t>/7 | 2.87 ns | 2.84 ns |
| rc<std::uint32_t>/3449 | 171 ns | 174 ns |
| rc<std::uint32_t>/63 | 4.21 ns | 3.97 ns |
| rc<std::uint32_t>/31 | 3.52 ns | 2.62 ns |
| rc<std::uint32_t>/15 | 3.27 ns | 2.31 ns |
| rc<std::uint32_t>/7 | 2.46 ns | 2.37 ns |
| rc<std::uint64_t>/3449 | 602 ns | 603 ns |
| rc<std::uint64_t>/63 | 8.50 ns | 7.02 ns |
| rc<std::uint64_t>/31 | 3.98 ns | 4.03 ns |
| rc<std::uint64_t>/15 | 3.06 ns | 2.84 ns |
| rc<std::uint64_t>/7 | 2.65 ns | 2.28 ns |

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner May 11, 2025 12:01
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews May 11, 2025
@StephanTLavavej StephanTLavavej added performance Must go faster test Related to test code labels May 11, 2025
@StephanTLavavej StephanTLavavej self-assigned this May 11, 2025
@StephanTLavavej StephanTLavavej removed their assignment May 15, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews May 15, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews May 16, 2025
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 2ce34fd into microsoft:main May 17, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews May 17, 2025
@StephanTLavavej
Member

⏱️ 🚀 🐈

@AlexGuteniev AlexGuteniev deleted the reverse branch May 17, 2025 03:59
