Rework ProbabilisticMap character checks in SearchValues by MihaZupan · Pull Request #101001 · dotnet/runtime · GitHub

@MihaZupan

Contributes to #100315 (comment)

The probabilistic map uses an O(i * m) fallback for the -Except methods, where i is the input length and m is the number of values: for each character in the input, it scans all the values.
We use the same O(m) helper (scan the values) to confirm potential matches in the vectorized helper.
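To make the cost concrete, here is a minimal sketch of what an O(i * m) fallback looks like. The type and method names (`NaiveExceptSketch`, `ContainsLinear`, `IndexOfAnyExceptLinear`) are hypothetical and chosen for illustration; this is not the runtime's actual code.

```csharp
using System;

// Illustrative only: hypothetical names, not the implementation in dotnet/runtime.
public static class NaiveExceptSketch
{
    // O(m) membership test: scans the values until a match is found.
    public static bool ContainsLinear(ReadOnlySpan<char> values, char c)
    {
        foreach (char v in values)
        {
            if (v == c)
            {
                return true;
            }
        }

        return false;
    }

    // O(i * m) overall: each of the i input characters may scan all m values.
    public static int IndexOfAnyExceptLinear(ReadOnlySpan<char> input, ReadOnlySpan<char> values)
    {
        for (int i = 0; i < input.Length; i++)
        {
            if (!ContainsLinear(values, input[i]))
            {
                return i;
            }
        }

        return -1;
    }
}
```

In this sketch the worst case is an input made entirely of a character that sits at the end of the value list, since every input character then scans the full list.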

This PR adds support for the probmap to use an O(1) contains check for -Except methods and match confirmations.
The check is effectively a perfect hash check, entries[value % entries.Length] == value, implemented using a variant of FastMod. These checks can be inlined into the vectorized methods without consuming too much memory.
We use this only when the probmap is computed as part of SearchValues as finding an optimal modulus is relatively expensive (see ProbabilisticMapState.FindModulus - there are probably smarter ways to go about it). When the probabilistic map is created for single-use IndexOfAny operations, we still use the same O(m) checks as before.
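As a rough illustration of the idea, the sketch below builds a collision-free table and performs the `entries[value % entries.Length] == value` check. It rests on several assumptions: the class name `PerfectHashCharSet` is hypothetical, the brute-force search for the smallest collision-free modulus is a stand-in and not the logic of `ProbabilisticMapState.FindModulus`, and a plain `%` is used where the PR uses a FastMod variant.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: hypothetical type, not the implementation in dotnet/runtime.
public sealed class PerfectHashCharSet
{
    private readonly char[] _entries;

    public PerfectHashCharSet(ReadOnlySpan<char> values)
    {
        var unique = new HashSet<char>();
        foreach (char c in values)
        {
            unique.Add(c);
        }

        if (unique.Count == 0)
        {
            throw new ArgumentException("At least one value is required.", nameof(values));
        }

        // Assumption: brute-force the smallest modulus with no collisions.
        // This always terminates, since distinct chars cannot collide modulo 65536.
        int modulus = unique.Count;
        char[] entries;
        while (!TryBuild(unique, modulus, out entries))
        {
            modulus++;
        }

        _entries = entries;
    }

    private static bool TryBuild(HashSet<char> values, int modulus, out char[] entries)
    {
        entries = new char[modulus];
        var used = new bool[modulus];

        foreach (char v in values)
        {
            int slot = v % modulus;
            if (used[slot])
            {
                return false; // Two values share a slot: not a perfect hash.
            }

            used[slot] = true;
            entries[slot] = v;
        }

        // Fill empty slots with a char that maps to a different slot,
        // so an empty slot can never compare equal to a looked-up value.
        for (int slot = 0; slot < modulus; slot++)
        {
            if (!used[slot])
            {
                entries[slot] = slot == 0 ? (char)1 : '\0';
            }
        }

        return true;
    }

    // O(1): one modulo, one load, one comparison.
    public bool Contains(char value) => _entries[value % _entries.Length] == value;
}
```

The construction cost is paid once per set, which matches the description above of reserving this approach for reusable SearchValues instances and not for single-use IndexOfAny calls.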

This PR also replaces the Latin1CharSearchValues implementation with BitmapCharSearchValues, which can use a bitmap of arbitrary size (not limited to [0, 255]). We use this when we guess that it'll be faster than the ProbabilisticMap (e.g. there are a lot of values in the set, or the set is dense and the probmap isn't vectorized).
I haven't changed the condition when we use ProbabilisticWithAsciiCharSearchValues as there are plausible cases where the ASCII fast path may still be useful even if the probabilistic path could be a bitmap instead.
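For comparison, a bitmap lookup that is not limited to [0, 255] could look roughly like the sketch below. The class name `BitmapCharSet` and its layout (32-bit words sized to the largest value) are assumptions made for illustration and are not taken from `BitmapCharSearchValues`.

```csharp
using System;

// Illustrative only: hypothetical type, not the implementation in dotnet/runtime.
public sealed class BitmapCharSet
{
    private readonly uint[] _bitmap;

    public BitmapCharSet(ReadOnlySpan<char> values)
    {
        // Size the bitmap to the largest value so it can cover any range of chars.
        int max = 0;
        foreach (char c in values)
        {
            max = Math.Max(max, c);
        }

        _bitmap = new uint[(max >> 5) + 1];

        foreach (char c in values)
        {
            _bitmap[c >> 5] |= 1u << (c & 31);
        }
    }

    // O(1): a bounds check, one load, and one bit test.
    public bool Contains(char c)
    {
        uint index = (uint)c >> 5;
        return index < (uint)_bitmap.Length
            && (_bitmap[index] & (1u << (c & 31))) != 0;
    }

    public int IndexOfAnyExcept(ReadOnlySpan<char> input)
    {
        for (int i = 0; i < input.Length; i++)
        {
            if (!Contains(input[i]))
            {
                return i;
            }
        }

        return -1;
    }
}
```

In this layout the memory use grows with the largest value in the set and not with the number of values, which is one reason a dense or large set is a better fit for a bitmap than a small set containing a single high code point.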

Improvements for early matches (cheaper confirmation step)
public class IndexOfAnyMixedAsciiNonAscii
{
    private static readonly SearchValues<char> s_charsWithNonAscii = SearchValues.Create("-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK");
    private const string Text = "űaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";

    [Benchmark] // This one will do the confirmation inside the vectorized path
    public int IndexOfAny() => Text.AsSpan().IndexOfAny(s_charsWithNonAscii);
}
| Method | Toolchain | Mean | Error | Ratio |
|---|---|---|---|---|
| IndexOfAny | main | 11.236 ns | 0.0375 ns | 1.00 |
| IndexOfAny | pr | 7.768 ns | 0.0412 ns | 0.69 |

Improvements for IndexOfAnyExcept (O(1) instead of O(m) character checks)
public class IndexOfAnyMixedAsciiNonAscii
{
    private static readonly SearchValues<char> s_charsWithNonAscii = SearchValues.Create("-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK");
    private static readonly string _textExcept = new string('K', 1000);

    [Benchmark]
    public int IndexOfAnyExcept() => _textExcept.AsSpan().IndexOfAnyExcept(s_charsWithNonAscii);
}
| Method | Toolchain | Mean | Error | Ratio |
|---|---|---|---|---|
| IndexOfAnyExcept | main | 2,832.7 ns | 10.65 ns | 1.00 |
| IndexOfAnyExcept | pr | 675.2 ns | 2.39 ns | 0.24 |

The speedup depends on how many values are in the set and on how early in the value list each input character matched under the previous implementation.
With the new implementation, the throughput is less dependent on the haystack.

The bitmap vs probmap hash O(1) checks
public class IndexOfAnyMixedAsciiNonAscii
{
    private static readonly SearchValues<char> s_bitmap = SearchValues.Create(new string('\u0080', 256) + '\u0082');
    private static readonly SearchValues<char> s_probmap = SearchValues.Create(new string('\u0080', 100) + '\uF000');
    private static readonly string TextExcept = new string('\u0080', 1000);

    [Benchmark]
    public int Bitmap() => TextExcept.AsSpan().IndexOfAnyExcept(s_bitmap);

    [Benchmark]
    public int Probmap() => TextExcept.AsSpan().IndexOfAnyExcept(s_probmap);
}
| Method | Mean | Error |
|---|---|---|
| Bitmap | 754.2 ns | 14.42 ns |
| Probmap | 868.3 ns | 0.11 ns |

This happens to bring all SearchValues<char> implementations to an O(i) worst-case (from O(i * m)).

@MihaZupan MihaZupan added this to the 9.0.0 milestone Apr 13, 2024
@MihaZupan MihaZupan requested a review from stephentoub April 13, 2024 02:11
@MihaZupan MihaZupan self-assigned this Apr 13, 2024
@dotnet-policy-service

Tagging subscribers to this area: @dotnet/area-system-buffers
See info in area-owners.md if you want to be subscribed.

@danmoseley

Are all these well covered in the perf repo?

@MihaZupan

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@MihaZupan

> Are all these well covered in the perf repo?

Yes, mainly by the ßäöüÄÖÜ values.
The BitmapCharSearchValues would be covered by platforms without explicit ISA support.

Results

| Method | Toolchain | Values | Mean | Error | Ratio |
|---|---|---|---|---|---|
| Contains | main | ßäöüÄÖÜ | 1.832 ns | 0.0125 ns | 1.00 |
| Contains | pr | ßäöüÄÖÜ | 1.734 ns | 0.0058 ns | 0.95 |
| ContainsAny | main | ßäöüÄÖÜ | 21.025 ns | 1.5866 ns | 1.01 |
| ContainsAny | pr | ßäöüÄÖÜ | 15.148 ns | 0.1088 ns | 0.73 |
| IndexOfAny | main | ßäöüÄÖÜ | 21.404 ns | 2.7379 ns | 1.03 |
| IndexOfAny | pr | ßäöüÄÖÜ | 15.867 ns | 0.5050 ns | 0.76 |
| LastIndexOfAny | main | ßäöüÄÖÜ | 124.294 ns | 0.1215 ns | 1.00 |
| LastIndexOfAny | pr | ßäöüÄÖÜ | 91.647 ns | 0.1313 ns | 0.74 |
| LastIndexOfAnyExcept | main | ßäöüÄÖÜ | 270.352 ns | 0.0708 ns | 1.00 |
| LastIndexOfAnyExcept | pr | ßäöüÄÖÜ | 91.775 ns | 0.0788 ns | 0.34 |
| IndexOfAnyExcept | main | ßäöüÄÖÜ | 273.921 ns | 0.7002 ns | 1.00 |
| IndexOfAnyExcept | pr | ßäöüÄÖÜ | 76.792 ns | 0.3625 ns | 0.28 |

@EgorBo

EgorBo commented May 23, 2024

Improvements on arm64: dotnet/perf-autofiling-issues#34817

Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024
* Rework ProbabilisticMap character checks in SearchValues

* Reduce footprint of ProbMap SearchValues

* Update misleading comment
@github-actions github-actions bot locked and limited conversation to collaborators Jun 23, 2024