Vectorize search_n for small values of n
#5352
Conversation
Thanks! 😻 I pushed moderate changes - please double-check. 5950X results:
All good. I want you to also review the PR description and explicitly answer the question about the n=1 case.
Thanks! I think we should keep handling n=1 in the headers. Having a separate check in the separately compiled code is fine. If you want to have the check only in the headers, then a comment in the separately compiled code, noting that we're assuming the check has already been done, would be a good idea.
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
🕵️ 🔍 🔢
⚙️ The optimization
Like I mentioned in #5346, both `std::search_n` and `ranges::search_n` step by n elements and avoid going back for a good input (one with few potential matches), so for large n values vectorization wouldn't be an improvement. Still, for small n, such that the vector register width in elements is larger than n and therefore the vector step is bigger, it is possible to vectorize in a way that is faster even for an input with few matches. For an input with more matches, such vectorization has an even bigger advantage, as it does not need to go back.
The approach is to compare elements, get a bit mask, and look for a contiguous run of set bits of the proper length. @Alcaro suggested:
Turns out this is efficient enough for AVX2 for values of n up to half the AVX register size in elements. Although the ruined parallelism does seem to carry a high cost, I cannot find anything faster.
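As a rough, portable sketch of this bit magic (illustrative helper names, not the actual STL code): the comparison mask is collapsed with a shift-and-AND cascade so that a bit stays set only where a run of n set bits begins, and the previous vector's mask is kept in the lower half of a 64-bit value so runs crossing the 32-byte boundary stay contiguous:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch, not the actual STL implementation.

// Concatenate the previous iteration's 32-bit AVX comparison mask (earlier
// elements, lower bits) with the current one, so that a match run straddling
// the 32-byte vector boundary appears as one contiguous run of set bits.
uint64_t concat_masks(uint32_t prev_mask, uint32_t cur_mask) {
    return prev_mask | (static_cast<uint64_t>(cur_mask) << 32);
}

// Collapse the mask so that bit i stays set only if bits i..i+n-1 were all
// set, i.e. a run of n matches begins at element i. The shift amounts double
// (1, 2, 4, ...) with a final partial step, so only O(log n) shift-and-AND
// steps are needed. Assumes n >= 2: the n == 1 case is handled as find, which
// also lets the real code assume at least one shift happens.
uint64_t run_starts(uint64_t mask, int n) {
    int remaining = n - 1; // set bits still to be merged into each position
    int step      = 1;     // never exceeds the run length verified so far
    while (remaining > 0) {
        const int s = remaining < step ? remaining : step;
        mask &= mask >> s;
        remaining -= s;
        step <<= 1;
    }
    return mask;
}
```

For example, `run_starts(0b0111'0110, 3)` leaves only bit 4 set, because the only run of three set bits starts there.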
The shift values are computed based on n. To save one variable (a general-purpose register), we rely on n=1 being handled separately, and assume that at least one shift happens.
To deal with matches on a vector register boundary, the bit mask is concatenated with the previous one. The AVX bit mask is 32 bits for 32 bytes of AVX value; doubled, it is 64 bits, which still fits an x64 register perfectly. The alternative to concatenation could be handling the boundary case with `lzcnt`/`tzcnt`; this turned out to be not faster.

The fallback is used for tails and for too-large n values. For tails it uses `lzcnt` with an inverted carry value to have a smooth transition from a potential partial match in the vector part to the scalar part. The fallback recreates the `ranges::search_n` implementation from `<algorithm>`, with slight variation.

🥔 Down-level architectures support
An SSE4.2 version is implementable in both senses: backporting the current approach to SSE, or using `pcmpestri`. I'd expect either to be an advantage for n values up to half the SSE register size, but I feel it is not worth bothering to try.

The x86 version works the same way as x64. However, unlike many other vectorized algorithms, this one relies a lot on general-purpose 64-bit integer ops. To mitigate the impact, `__ull_rshift` is used instead of the plain shift. This intrinsic doesn't impact 64-bit code, but makes 32-bit codegen better (at the expense of not handling huge shifts, which we don't need anyway). The shift values are of `int` type to match the intrinsic's parameter type.

Still, the efficiency on x86 is questionable (see benchmark results below). Apart from the shifts taking multiple instructions, this is apparently due to the deficit of general-purpose registers. The compiler isn't being helpful here either; some register spills look superfluous.
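To illustrate why a capped shift range helps 32-bit codegen (a portable sketch of the general idea, not the actual `__ull_rshift` intrinsic): a general 64-bit right shift on a 32-bit target needs a branch or a helper call to handle counts of 32 or more, whereas a count known to be small can be done with two 32-bit shifts and an OR.

```cpp
#include <cassert>
#include <cstdint>

// Portable illustration only, not the actual __ull_rshift intrinsic.
// Assumes the shift count is in [1, 31]; with that guarantee a 32-bit target
// can use a shrd/shr-style pair instead of the general >= 32 handling.
uint64_t rshift_under_32(uint64_t value, int count) {
    const uint32_t lo = static_cast<uint32_t>(value);
    const uint32_t hi = static_cast<uint32_t>(value >> 32);
    const uint32_t new_lo = (lo >> count) | (hi << (32 - count)); // bits flowing down from hi
    const uint32_t new_hi = hi >> count;
    return new_lo | (static_cast<uint64_t>(new_hi) << 32);
}
```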
For 32-bit and 64-bit elements, it is possible to use the floating-point bit mask instead of the integer bit mask, like in #4987/#5092. This would save bit width. But apart from the mysterious "bypass delay" (a potential penalty for mixing integer and floating-point instructions), it would also make the bit magic more complicated and more dependent on the element width, and it still wouldn't reduce the bit width for 8-bit and 16-bit elements, so this doesn't seem to be worth doing.
We could just skip x86. But we don't have a precedent of having vectorization for x64 but not for x86, so I didn't want to introduce one.
1️⃣ Special n=1 case
We need to handle this case as just `find` vectorization. The `find` vectorization is more efficient than this one, and the assumption that the shift happens at least once saves a variable/register.
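The dispatch could be pictured like this (a simplified sketch with made-up names, not the actual header code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified sketch of the n == 1 dispatch (illustrative, not the actual
// <algorithm> code): a run of one element is just the first occurrence of the
// value, so delegate to find, which has its own, faster vectorization. The
// general path may then assume n >= 2, i.e. at least one shift happens.
template <class It, class T>
It search_n_sketch(It first, It last, std::ptrdiff_t count, const T& value) {
    if (count <= 0) {
        return first; // a non-positive count matches immediately
    }
    if (count == 1) {
        return std::find(first, last, value); // n == 1: reuse find vectorization
    }
    return std::search_n(first, last, count, value); // general path, n >= 2
}
```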
The question is where we should handle this:
- in the headers
- in the separately compiled code
- in both

The latter two are indistinguishable in practice, so the real question is whether we should keep the use-`find`-for-`search_n`-when-n=1 optimization from #5346 in the headers.

With removal of the n=1 case from the headers we get:

With keeping the n=1 case in the headers we get:
- (the `find` pattern) `memchr` for the corresponding type, and the disabled vectorization mode

✅ Test coverage
To cover the variety of possibilities, the randomized test should try different input lengths, different n values, and different actual match lengths (including too-long matches, too-short matches, and different gaps between matches). This requires a long run time, so it deserves a dedicated test.
The test coverage is not only useful for vectorization; it also compensates for the missing non-vectorized coverage requested in #933.

This PR still doesn't fully address #933 as asked, because:
I'm not sure how much these features are required, though. If they are required, further work to complete #933 would certainly need a different PR.
🏁 Benchmarks
In addition to the `TwoZones` case inherited from #5346, it has `DenseSmallSequences`. These two are close to the normal case and the worst case respectively.

`TwoZones` (Zones in the table below) has half of the range filled with the mismatch character and half with the match character, so the search should quickly proceed to the match part and then check the first match, which succeeds. `DenseSmallSequences` (Dense in the table below) has too-short matches of random length from 0 to n-1, each interrupted by a single mismatch character. The vectorization improvement is bigger for `DenseSmallSequences`.
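The two fillings might be generated roughly like this (an illustrative sketch with made-up helper names, not the actual benchmark code):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Illustrative sketches of the two benchmark fillings, not the actual code.

// TwoZones: first half mismatch character, second half match character, so a
// search quickly skips to the middle and succeeds on the first real check.
std::vector<char> make_two_zones(std::size_t size, char mismatch, char match) {
    std::vector<char> v(size, mismatch);
    std::fill(v.begin() + size / 2, v.end(), match);
    return v;
}

// DenseSmallSequences: runs of the match character with random length in
// [0, n-1] (always too short to match), each cut off by a single mismatch.
std::vector<char> make_dense_small_sequences(std::size_t size, std::size_t n, char mismatch, char match) {
    std::vector<char> v;
    v.reserve(size);
    std::mt19937 gen{1729}; // fixed seed for reproducibility
    std::uniform_int_distribution<std::size_t> run_len{0, n - 1};
    while (v.size() < size) {
        for (std::size_t run = run_len(gen); run != 0 && v.size() < size; --run) {
            v.push_back(match);
        }
        if (v.size() < size) {
            v.push_back(mismatch); // interrupt the run so it never reaches n
        }
    }
    return v;
}
```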
Still, we should probably care about `TwoZones` somewhat more. If the worst case is a priority, we can double the n threshold for the vectorization.

⏱️ Benchmark results
Click to expand:
🥈 Results interpretation
For x64, for the vectorized n values there is a certain improvement for Zones; for Dense the improvement is even greater.
The non-vectorized cases vary a lot. The fallback often happens to be faster than the header implementations, but not always. Among the header implementations, surprisingly, the ranges one is slower for the Zones case.
The x86 results are not very good, but not too bad either.
The table contains a lot of rows, but I don't see a reasonable way to reduce it without losing important information.