Faster optimized frozen dictionary creation (5/n) by adamsitnik · Pull Request #87960 · dotnet/runtime · GitHub

Conversation


@adamsitnik adamsitnik commented Jun 23, 2023

With all of the recent performance improvements, around 50% of the time for creating frozen dictionaries/sets is now spent in FrozenHashTable.CalcNumBuckets.

[profiler screenshot: time spent in FrozenHashTable.CalcNumBuckets]

The rest is now optimal, so my last effort focused on optimizing FrozenHashTable.CalcNumBuckets.

To understand my approaches, it's crucial to know what the method is supposed to do and how it has been achieving that goal so far.

The method tries to find a prime number of buckets that provides less than a 5% collision rate for the unique hash codes.

First, it creates a hash set of integers and iterates over all hash codes to filter out duplicates. This matters because no increase in bucket count can avoid collisions between duplicate input hash codes.

Once the number of unique hash codes is known, it iterates over a precomputed array of prime numbers and finds the first prime that is at least the unique hash code count:

while ((uint)minPrimeIndexInclusive < (uint)primes.Length && uniqueCodesCount > primes[minPrimeIndexInclusive])
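The dedup-then-scan step described above can be sketched as follows. This is a hedged, self-contained sketch with a hypothetical `BucketSizing` helper and a shortened primes table, not the actual runtime code in `FrozenHashTable`:

```csharp
using System;
using System.Collections.Generic;

static class BucketSizing
{
    // Hypothetical abbreviation of the runtime's precomputed primes table.
    static readonly int[] Primes = { 3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107, 131, 163, 197 };

    public static int FindMinPrimeIndex(ReadOnlySpan<int> hashCodes, out int uniqueCodesCount)
    {
        // Duplicates can never be separated by adding buckets, so drop them first.
        var unique = new HashSet<int>();
        foreach (int code in hashCodes)
        {
            unique.Add(code);
        }
        uniqueCodesCount = unique.Count;

        // Find the index of the first precomputed prime that is >= the unique code count.
        int index = 0;
        while ((uint)index < (uint)Primes.Length && uniqueCodesCount > Primes[index])
        {
            index++;
        }
        return index;
    }
}
```

For example, four hash codes with one duplicate yield three unique codes, so the search stops at the first prime (3) at index 0.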

Then it chooses a multiplier for the maximum bucket count: 16 for small inputs (fewer than 1,000 unique hash codes) and 3 for larger ones.

// Determine the largest number of buckets we're willing to use, based on a multiple of the number of inputs.
// For smaller inputs, we allow for a larger multiplier.
int maxNumBuckets =
uniqueCodesCount *
(uniqueCodesCount >= LargeInputSizeThreshold ? MaxLargeBucketTableMultiplier : MaxSmallBucketTableMultiplier);

Then it sets the best number of collisions found so far to the unique hash code count, which means "so far, the best result found gives a 100% collision rate".

Finally, it iterates over all the prime numbers in the selected range and, in a nested loop, iterates over every unique hash code, marking each bucket as seen.

for (int primeIndex = minPrimeIndexInclusive; primeIndex < maxPrimeIndexExclusive; primeIndex++)

When it finds a bucket that was already visited, it increases the collision count.
When the collision count reaches the best known collision count, it bails out and continues the search with the next prime number.

if ((seenBuckets[bucketNum / BitsPerInt32] & (1 << (int)bucketNum)) != 0)
{
    numCollisions++;
    if (numCollisions >= bestNumCollisions)
    {
        // If we've already hit the previously known best number of collisions,
        // there's no point in continuing as worst case we'd just use that.
        break;
    }
}
When none of the prime numbers meets our 5% collision rate criterion, it returns the bucket count with the lowest collision count found across all primes.
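Putting the steps above together, the whole search loop can be sketched like this. It is a hedged sketch with a hypothetical `CollisionScan` helper, not the actual `CalcNumBuckets` implementation; note that C# masks the shift count of `1 << (int)bucketNum` to the low 5 bits, which is why no explicit `% 32` is needed for the bit position:

```csharp
using System;

static class CollisionScan
{
    // Sketch of the search described above: for each candidate prime bucket
    // count, count hash-code collisions using a bit array of "seen" buckets,
    // bail out of the inner loop once the current best is matched, and fall
    // back to the bucket count with the fewest collisions when no candidate
    // stays under the 5% collision budget.
    public static int ChooseNumBuckets(int[] uniqueCodes, int[] candidatePrimes)
    {
        const int BitsPerInt32 = 32;
        int acceptableCollisions = uniqueCodes.Length / 20; // 5% budget
        int bestNumBuckets = candidatePrimes[^1];
        int bestNumCollisions = uniqueCodes.Length;         // "100% collisions so far"

        foreach (int numBuckets in candidatePrimes)
        {
            var seenBuckets = new int[(numBuckets + BitsPerInt32 - 1) / BitsPerInt32];
            int numCollisions = 0;

            foreach (int code in uniqueCodes)
            {
                uint bucketNum = (uint)code % (uint)numBuckets;
                if ((seenBuckets[bucketNum / BitsPerInt32] & (1 << (int)bucketNum)) != 0)
                {
                    // Bucket already occupied: a collision.
                    numCollisions++;
                    if (numCollisions >= bestNumCollisions)
                    {
                        break; // can't beat the best candidate found so far
                    }
                }
                else
                {
                    seenBuckets[bucketNum / BitsPerInt32] |= 1 << (int)bucketNum;
                }
            }

            if (numCollisions < bestNumCollisions)
            {
                bestNumBuckets = numBuckets;
                bestNumCollisions = numCollisions;
                if (numCollisions <= acceptableCollisions)
                {
                    break; // good enough: within the 5% collision budget
                }
            }
        }

        return bestNumBuckets;
    }
}
```

For instance, hash codes 0 and 7 collide in 7 buckets but not in 11, so the scan moves past the first candidate and settles on 11.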

The approaches that I've tried:

  • Setting the initial best number of collisions to one more than the 5% threshold. This meant "find the first bucket count that meets the criterion" and was originally part of Faster optimized frozen dictionary creation (3/n) #87688, but I reverted the change. Why? Because for some inputs, no bucket count ever meets the criterion, and since the inner loop stops as soon as the collision count reaches the best known value, we would end up with zero fully verified bucket counts. The process would then need to be repeated with the initial best number of collisions set to the unique hash code count. I added that fallback, wrote an app that generated random sets of hash codes for an hour, and benchmarked the old vs. new approach; overall it turned out to be slower.
  • Binary search, since larger prime numbers are more likely to meet the goal. The perf gains were very nice, but again my random hash set generation app proved that binary search can miss the prime number that meets our goal.
  • Based on the same "larger primes are more likely to match" assumption, I reversed the outer loop so it started at the largest acceptable prime number and ended at the smallest. The CPU time gains were great (up to 4x faster), but at the cost of increased memory consumption (in theory up to 16x in the worst case for small inputs and 3x for large inputs).
  • Increasing the minimum bucket count to 2x the unique hash code count (from 1x). I modified my random hash code generator app to print distributions of the selected prime numbers and bucket counts. It turned out that for small inputs (fewer than 1,000 items), only 0.05% of the prime numbers below 2x the unique hash code count met the criterion; for larger inputs, it was just 0.01%. That is why I decided to narrow the range and keep everything else. On average this gives a 10% perf gain for small inputs, 20% for medium, and 30-40% for large inputs. It's a tradeoff, since 0.05% of cases will end up using larger arrays.
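The accepted change (the last item above) effectively raises the lower bound of the candidate range while keeping the existing upper bound. A hedged sketch of the bound computation, with a hypothetical `Sketch.CandidateBucketRange` helper (the constant names mirror those shown earlier in this description):

```csharp
static class Sketch
{
    // Hypothetical sketch of the narrowed candidate range from this PR:
    // candidate prime bucket counts now start at ~2x the unique hash code
    // count instead of 1x, because smaller bucket counts almost never met
    // the 5% collision budget (0.05% of cases for small inputs, 0.01% for
    // large ones).
    public static (int minBuckets, int maxBuckets) CandidateBucketRange(int uniqueCodesCount)
    {
        const int LargeInputSizeThreshold = 1_000;
        const int MaxSmallBucketTableMultiplier = 16;
        const int MaxLargeBucketTableMultiplier = 3;

        int minBuckets = uniqueCodesCount * 2; // was uniqueCodesCount * 1 before this PR
        int maxBuckets = uniqueCodesCount *
            (uniqueCodesCount >= LargeInputSizeThreshold
                ? MaxLargeBucketTableMultiplier
                : MaxSmallBucketTableMultiplier);
        return (minBuckets, maxBuckets);
    }
}
```

For example, 10 unique hash codes would scan primes between 20 and 160 buckets, while 1,000 unique codes would scan between 2,000 and 3,000.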

In the table below you can see perf numbers with #87688 as the baseline, the loop reversal mentioned above, and the proposed change:

| Method | Job | Count | Mean | Ratio | Allocated | Alloc Ratio |
|--------|-----|------:|-----:|------:|----------:|------------:|
| ToFrozenDictionary_Optimized | #87688 | 10 | 812.14 ns | 1.00 | 1432 B | 1.00 |
| ToFrozenDictionary_Optimized | reverse | 10 | 810.59 ns | 1.00 | 2296 B | 1.60 |
| ToFrozenDictionary_Optimized | #87960 | 10 | 713.51 ns | 0.88 | 1432 B | 1.00 |
| TryGetValue_True_FrozenDictionaryOptimized | #87688 | 10 | 33.76 ns | 1.00 | - | NA |
| TryGetValue_True_FrozenDictionaryOptimized | reverse | 10 | 33.89 ns | 1.01 | - | NA |
| TryGetValue_True_FrozenDictionaryOptimized | #87960 | 10 | 33.57 ns | 0.99 | - | NA |
| ToFrozenDictionary_Optimized | #87688 | 100 | 14,906.04 ns | 1.00 | 28760 B | 1.00 |
| ToFrozenDictionary_Optimized | reverse | 100 | 7,809.07 ns | 0.52 | 30920 B | 1.08 |
| ToFrozenDictionary_Optimized | #87960 | 100 | 12,006.05 ns | 0.81 | 28760 B | 1.00 |
| TryGetValue_True_FrozenDictionaryOptimized | #87688 | 100 | 1,014.49 ns | 1.00 | - | NA |
| TryGetValue_True_FrozenDictionaryOptimized | reverse | 100 | 1,015.42 ns | 1.00 | - | NA |
| TryGetValue_True_FrozenDictionaryOptimized | #87960 | 100 | 1,015.73 ns | 1.00 | - | NA |
| ToFrozenDictionary_Optimized | #87688 | 1000 | 88,396.94 ns | 1.00 | 98608 B | 1.00 |
| ToFrozenDictionary_Optimized | reverse | 1000 | 77,223.89 ns | 0.87 | 98608 B | 1.00 |
| ToFrozenDictionary_Optimized | #87960 | 1000 | 55,214.58 ns | 0.62 | 98608 B | 1.00 |
| TryGetValue_True_FrozenDictionaryOptimized | #87688 | 1000 | 10,774.77 ns | 1.00 | - | NA |
| TryGetValue_True_FrozenDictionaryOptimized | reverse | 1000 | 10,748.01 ns | 1.00 | - | NA |
| TryGetValue_True_FrozenDictionaryOptimized | #87960 | 1000 | 10,768.78 ns | 1.00 | - | NA |
| ToFrozenDictionary_Optimized | #87688 | 10000 | 1,110,513.45 ns | 1.00 | 926256 B | 1.00 |
| ToFrozenDictionary_Optimized | reverse | 10000 | 1,007,656.92 ns | 0.91 | 926521 B | 1.00 |
| ToFrozenDictionary_Optimized | #87960 | 10000 | 779,821.62 ns | 0.70 | 926745 B | 1.00 |
| TryGetValue_True_FrozenDictionaryOptimized | #87688 | 10000 | 223,001.44 ns | 1.00 | - | NA |
| TryGetValue_True_FrozenDictionaryOptimized | reverse | 10000 | 223,533.13 ns | 1.00 | - | NA |
| TryGetValue_True_FrozenDictionaryOptimized | #87960 | 10000 | 223,543.17 ns | 1.00 | - | NA |

…buckets that meets our criteria is at least twice as big as the number of unique hash codes.

+33% gain on average

ghost commented Jun 23, 2023

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

Issue Details

This change may seem very simple, but getting to it was not. I am opening the PR now to start the CI process, and will provide an explanation with numbers within half an hour.

Author: adamsitnik
Assignees: -
Labels:

area-System.Collections, tenet-performance

Milestone: -

@adamsitnik adamsitnik requested a review from stephentoub June 23, 2023 13:16
@stephentoub stephentoub merged commit 949dee3 into dotnet:main Jun 23, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jul 23, 2023