KEMBAR78
Vectorize `basic_string::find_last_of` by AlexGuteniev · Pull Request #4934 · microsoft/STL · GitHub
Skip to content

Conversation

AlexGuteniev
Copy link
Contributor

🧭 Overview

  • Same pcmpestri approach as in find_first_of for 1 or 2 byte element. Did not bother making anything for larger characters.
  • The existing find_first_of was a bit copypasta reduced. Still the _first_ and the _last_ don't share the implementation, it would have been very confusing to iterate in different directions in a single loop.
  • The new functions have _pos in names to indicate being position based rather than pointer based. They are used in strings only, so pointers would have been extra conversion for no gain. We might add position-based find_first_of or pointer based find_last_of someday, so the naming scheme has provision for this.
  • Also the vectorization threshold was tuned (even for 8-bytes character, you'll hate it).
  • Test coverage was also expanded to cover the bug of having basic_string<unsigned long long>.

⏱️ Benchmark results

Benchmark main this
bm<AlgType::str_member_last, char>/2/3 7.79 ns 5.56 ns
bm<AlgType::str_member_last, char>/7/4 22.1 ns 12.5 ns
bm<AlgType::str_member_last, char>/9/3 23.8 ns 11.2 ns
bm<AlgType::str_member_last, char>/22/5 37.2 ns 11.9 ns
bm<AlgType::str_member_last, char>/58/2 63.3 ns 13.2 ns
bm<AlgType::str_member_last, char>/102/4 54.0 ns 16.3 ns
bm<AlgType::str_member_last, char>/325/1 119 ns 36.8 ns
bm<AlgType::str_member_last, char>/400/50 156 ns 178 ns
bm<AlgType::str_member_last, char>/1011/11 321 ns 100 ns
bm<AlgType::str_member_last, char>/1502/23 605 ns 308 ns
bm<AlgType::str_member_last, char>/3056/7 916 ns 284 ns
bm<AlgType::str_member_last, wchar_t>/2/3 5.82 ns 5.99 ns
bm<AlgType::str_member_last, wchar_t>/7/4 11.0 ns 11.6 ns
bm<AlgType::str_member_last, wchar_t>/9/3 11.2 ns 11.9 ns
bm<AlgType::str_member_last, wchar_t>/22/5 16.1 ns 12.8 ns
bm<AlgType::str_member_last, wchar_t>/58/2 36.9 ns 17.4 ns
bm<AlgType::str_member_last, wchar_t>/102/4 58.5 ns 24.1 ns
bm<AlgType::str_member_last, wchar_t>/325/1 173 ns 63.7 ns
bm<AlgType::str_member_last, wchar_t>/400/50 191 ns 221 ns
bm<AlgType::str_member_last, wchar_t>/1011/11 396 ns 393 ns
bm<AlgType::str_member_last, wchar_t>/1502/23 658 ns 573 ns
bm<AlgType::str_member_last, wchar_t>/3056/7 1285 ns 548 ns
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/2/3 6.61 ns 5.64 ns
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/7/4 22.9 ns 22.5 ns
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/9/3 24.2 ns 12.1 ns
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/22/5 83.5 ns 13.0 ns
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/58/2 66.4 ns 17.7 ns
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/102/4 162 ns 24.6 ns
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/325/1 213 ns 63.0 ns
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/400/50 5418 ns 603 ns
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/1011/11 4257 ns 523 ns
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/1502/23 9358 ns 1073 ns
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/3056/7 6275 ns 550 ns

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner September 4, 2024 08:30
@CaseyCarter CaseyCarter added the performance Must go faster label Sep 4, 2024
@StephanTLavavej StephanTLavavej self-assigned this Sep 4, 2024
@StephanTLavavej

This comment was marked as resolved.

@StephanTLavavej

This comment was marked as resolved.

@StephanTLavavej
Copy link
Member

Thanks! I pushed small but significant changes, please double-check.

Final results on my 5950X, I see only one number of concern, not enough to block this PR.

Click to expand table:
Benchmark Before After Speedup
bm<AlgType::std_func, uint8_t>/2/3 5.59 ns 5.56 ns 1.01
bm<AlgType::std_func, uint8_t>/7/4 11.1 ns 11.1 ns 1.00
bm<AlgType::std_func, uint8_t>/9/3 11.0 ns 11.1 ns 0.99
bm<AlgType::std_func, uint8_t>/22/5 11.6 ns 11.7 ns 0.99
bm<AlgType::std_func, uint8_t>/58/2 13.0 ns 12.9 ns 1.01
bm<AlgType::std_func, uint8_t>/102/4 14.8 ns 13.2 ns 1.12
bm<AlgType::std_func, uint8_t>/325/1 8.87 ns 9.23 ns 0.96
bm<AlgType::std_func, uint8_t>/400/50 122 ns 119 ns 1.03
bm<AlgType::std_func, uint8_t>/1011/11 58.0 ns 59.0 ns 0.98
bm<AlgType::std_func, uint8_t>/1502/23 194 ns 197 ns 0.98
bm<AlgType::std_func, uint8_t>/3056/7 139 ns 141 ns 0.99
bm<AlgType::std_func, uint16_t>/2/3 6.15 ns 6.12 ns 1.00
bm<AlgType::std_func, uint16_t>/7/4 12.7 ns 12.5 ns 1.02
bm<AlgType::std_func, uint16_t>/9/3 11.8 ns 11.8 ns 1.00
bm<AlgType::std_func, uint16_t>/22/5 12.6 ns 12.3 ns 1.02
bm<AlgType::std_func, uint16_t>/58/2 15.3 ns 15.6 ns 0.98
bm<AlgType::std_func, uint16_t>/102/4 18.7 ns 18.8 ns 0.99
bm<AlgType::std_func, uint16_t>/325/1 15.3 ns 12.0 ns 1.28
bm<AlgType::std_func, uint16_t>/400/50 344 ns 385 ns 0.89
bm<AlgType::std_func, uint16_t>/1011/11 287 ns 258 ns 1.11
bm<AlgType::std_func, uint16_t>/1502/23 569 ns 577 ns 0.99
bm<AlgType::std_func, uint16_t>/3056/7 265 ns 264 ns 1.00
bm<AlgType::std_func, uint32_t>/2/3 5.64 ns 5.80 ns 0.97
bm<AlgType::std_func, uint32_t>/7/4 11.2 ns 11.3 ns 0.99
bm<AlgType::std_func, uint32_t>/9/3 9.98 ns 9.86 ns 1.01
bm<AlgType::std_func, uint32_t>/22/5 11.2 ns 11.8 ns 0.95
bm<AlgType::std_func, uint32_t>/58/2 9.66 ns 10.00 ns 0.97
bm<AlgType::std_func, uint32_t>/102/4 15.8 ns 15.2 ns 1.04
bm<AlgType::std_func, uint32_t>/325/1 23.5 ns 18.1 ns 1.30
bm<AlgType::std_func, uint32_t>/400/50 489 ns 493 ns 0.99
bm<AlgType::std_func, uint32_t>/1011/11 267 ns 264 ns 1.01
bm<AlgType::std_func, uint32_t>/1502/23 792 ns 811 ns 0.98
bm<AlgType::std_func, uint32_t>/3056/7 331 ns 335 ns 0.99
bm<AlgType::std_func, uint64_t>/2/3 6.18 ns 6.12 ns 1.01
bm<AlgType::std_func, uint64_t>/7/4 12.8 ns 12.6 ns 1.02
bm<AlgType::std_func, uint64_t>/9/3 9.45 ns 9.27 ns 1.02
bm<AlgType::std_func, uint64_t>/22/5 11.3 ns 11.2 ns 1.01
bm<AlgType::std_func, uint64_t>/58/2 14.5 ns 11.9 ns 1.22
bm<AlgType::std_func, uint64_t>/102/4 27.8 ns 27.8 ns 1.00
bm<AlgType::std_func, uint64_t>/325/1 44.4 ns 45.0 ns 0.99
bm<AlgType::std_func, uint64_t>/400/50 1059 ns 1064 ns 1.00
bm<AlgType::std_func, uint64_t>/1011/11 579 ns 577 ns 1.00
bm<AlgType::std_func, uint64_t>/1502/23 1808 ns 1811 ns 1.00
bm<AlgType::std_func, uint64_t>/3056/7 1044 ns 1044 ns 1.00
bm<AlgType::str_member_first, char>/2/3 8.16 ns 8.38 ns 0.97
bm<AlgType::str_member_first, char>/7/4 9.55 ns 9.66 ns 0.99
bm<AlgType::str_member_first, char>/9/3 10.9 ns 11.1 ns 0.98
bm<AlgType::str_member_first, char>/22/5 11.5 ns 11.7 ns 0.98
bm<AlgType::str_member_first, char>/58/2 12.8 ns 13.0 ns 0.98
bm<AlgType::str_member_first, char>/102/4 13.2 ns 14.7 ns 0.90
bm<AlgType::str_member_first, char>/325/1 23.7 ns 23.7 ns 1.00
bm<AlgType::str_member_first, char>/400/50 123 ns 112 ns 1.10
bm<AlgType::str_member_first, char>/1011/11 59.5 ns 58.3 ns 1.02
bm<AlgType::str_member_first, char>/1502/23 197 ns 194 ns 1.02
bm<AlgType::str_member_first, char>/3056/7 142 ns 141 ns 1.01
bm<AlgType::str_member_first, wchar_t>/2/3 10.5 ns 10.9 ns 0.96
bm<AlgType::str_member_first, wchar_t>/7/4 12.2 ns 12.7 ns 0.96
bm<AlgType::str_member_first, wchar_t>/9/3 13.2 ns 13.4 ns 0.99
bm<AlgType::str_member_first, wchar_t>/22/5 14.0 ns 13.9 ns 1.01
bm<AlgType::str_member_first, wchar_t>/58/2 17.2 ns 17.0 ns 1.01
bm<AlgType::str_member_first, wchar_t>/102/4 20.4 ns 20.4 ns 1.00
bm<AlgType::str_member_first, wchar_t>/325/1 39.7 ns 38.1 ns 1.04
bm<AlgType::str_member_first, wchar_t>/400/50 161 ns 161 ns 1.00
bm<AlgType::str_member_first, wchar_t>/1011/11 350 ns 261 ns 1.34
bm<AlgType::str_member_first, wchar_t>/1502/23 513 ns 514 ns 1.00
bm<AlgType::str_member_first, wchar_t>/3056/7 267 ns 267 ns 1.00
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/2/3 12.3 ns 12.6 ns 0.98
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/7/4 20.1 ns 20.8 ns 0.97
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/9/3 13.3 ns 13.3 ns 1.00
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/22/5 14.2 ns 14.0 ns 1.01
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/58/2 17.4 ns 17.2 ns 1.01
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/102/4 20.4 ns 20.4 ns 1.00
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/325/1 38.9 ns 38.4 ns 1.01
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/400/50 349 ns 386 ns 0.90
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/1011/11 300 ns 262 ns 1.15
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/1502/23 589 ns 581 ns 1.01
bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/3056/7 266 ns 267 ns 1.00
bm<AlgType::str_member_first, char32_t>/2/3 9.04 ns 11.2 ns 0.81
bm<AlgType::str_member_first, char32_t>/7/4 11.0 ns 13.0 ns 0.85
bm<AlgType::str_member_first, char32_t>/9/3 10.1 ns 12.7 ns 0.80
bm<AlgType::str_member_first, char32_t>/22/5 18.0 ns 14.6 ns 1.23
bm<AlgType::str_member_first, char32_t>/58/2 9.49 ns 12.6 ns 0.75
bm<AlgType::str_member_first, char32_t>/102/4 16.0 ns 18.0 ns 0.89
bm<AlgType::str_member_first, char32_t>/325/1 19.4 ns 22.2 ns 0.87
bm<AlgType::str_member_first, char32_t>/400/50 194 ns 208 ns 0.93
bm<AlgType::str_member_first, char32_t>/1011/11 448 ns 270 ns 1.66
bm<AlgType::str_member_first, char32_t>/1502/23 662 ns 667 ns 0.99
bm<AlgType::str_member_first, char32_t>/3056/7 1323 ns 338 ns 3.91
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/2/3 9.66 ns 11.7 ns 0.83
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/7/4 16.5 ns 17.8 ns 0.93
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/9/3 10.5 ns 12.7 ns 0.83
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/22/5 14.2 ns 14.6 ns 0.97
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/58/2 9.88 ns 12.7 ns 0.78
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/102/4 16.4 ns 18.1 ns 0.91
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/325/1 19.9 ns 21.8 ns 0.91
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/400/50 493 ns 488 ns 1.01
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/1011/11 270 ns 270 ns 1.00
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/1502/23 808 ns 813 ns 0.99
bm<AlgType::str_member_first, char32_t, U'\x03B1'>/3056/7 341 ns 334 ns 1.02
bm<AlgType::str_member_last, char>/2/3 7.52 ns 7.96 ns 0.94
bm<AlgType::str_member_last, char>/7/4 8.81 ns 10.2 ns 0.86
bm<AlgType::str_member_last, char>/9/3 9.23 ns 10.4 ns 0.89
bm<AlgType::str_member_last, char>/22/5 13.9 ns 10.8 ns 1.29
bm<AlgType::str_member_last, char>/58/2 20.9 ns 12.8 ns 1.63
bm<AlgType::str_member_last, char>/102/4 30.9 ns 13.3 ns 2.32
bm<AlgType::str_member_last, char>/325/1 96.5 ns 24.3 ns 3.97
bm<AlgType::str_member_last, char>/400/50 113 ns 195 ns 0.58
bm<AlgType::str_member_last, char>/1011/11 236 ns 60.6 ns 3.89
bm<AlgType::str_member_last, char>/1502/23 347 ns 196 ns 1.77
bm<AlgType::str_member_last, char>/3056/7 678 ns 142 ns 4.77
bm<AlgType::str_member_last, wchar_t>/2/3 7.75 ns 8.16 ns 0.95
bm<AlgType::str_member_last, wchar_t>/7/4 9.08 ns 9.85 ns 0.92
bm<AlgType::str_member_last, wchar_t>/9/3 10.3 ns 12.0 ns 0.86
bm<AlgType::str_member_last, wchar_t>/22/5 15.7 ns 12.6 ns 1.25
bm<AlgType::str_member_last, wchar_t>/58/2 30.4 ns 14.9 ns 2.04
bm<AlgType::str_member_last, wchar_t>/102/4 49.5 ns 18.4 ns 2.69
bm<AlgType::str_member_last, wchar_t>/325/1 151 ns 46.6 ns 3.24
bm<AlgType::str_member_last, wchar_t>/400/50 195 ns 192 ns 1.02
bm<AlgType::str_member_last, wchar_t>/1011/11 446 ns 440 ns 1.01
bm<AlgType::str_member_last, wchar_t>/1502/23 659 ns 659 ns 1.00
bm<AlgType::str_member_last, wchar_t>/3056/7 1318 ns 266 ns 4.95
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/2/3 8.41 ns 9.19 ns 0.92
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/7/4 15.4 ns 16.3 ns 0.94
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/9/3 15.8 ns 11.8 ns 1.34
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/22/5 40.3 ns 12.6 ns 3.20
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/58/2 56.3 ns 14.9 ns 3.78
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/102/4 156 ns 18.4 ns 8.48
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/325/1 221 ns 38.8 ns 5.70
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/400/50 4523 ns 328 ns 13.79
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/1011/11 2956 ns 264 ns 11.20
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/1502/23 8184 ns 540 ns 15.16
bm<AlgType::str_member_last, wchar_t, L'\x03B1'>/3056/7 6328 ns 266 ns 23.79

The case bm<AlgType::str_member_last, char>/400/50 is changing rom 113 ns to 195 ns, a speedup of 0.58.

@StephanTLavavej StephanTLavavej removed their assignment Oct 8, 2024
@AlexGuteniev
Copy link
Contributor Author

Thanks! I pushed small but significant changes, please double-check.

Thanks for noticing alignment problem!

The case bm<AlgType::str_member_last, char>/400/50 is changing rom 113 ns to 195 ns, a speedup of 0.58.

This case regressed in my bencnhmark too, both in original benchmark and in today retest, but not as much as for you.

This is the case where we use bitmap. Apparently after some added code the bitmap loop has got more unfortunate alignment.

For _last_of it is never correct to subtract _Start_at from something,
the range is on the other side from _Start_at (+1 because inclusive)
CaseyCarter added a commit to StephanTLavavej/STL that referenced this pull request Oct 10, 2024
... to avoid merge conflicts with microsoftGH-4934.

This partially reverts commit 8d902d9.
Indentation is a limited resource. Think of the children!
@StephanTLavavej StephanTLavavej self-assigned this Oct 11, 2024
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej
Copy link
Member

Pushed changes to fix the benchmark UB that @AlexGuteniev noticed - the iota() was overflowing char.

@StephanTLavavej StephanTLavavej merged commit 23dc7e3 into microsoft:main Oct 12, 2024
39 checks passed
@StephanTLavavej
Copy link
Member

Now streaming on Max, the new season of Find Last Of Us!

📺 🧟 🍄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Must go faster

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

3 participants