Faster optimized frozen dictionary creation (5/n) by adamsitnik · Pull Request #87960 · dotnet/runtime · GitHub

Conversation


@adamsitnik adamsitnik commented Jun 23, 2023

With all of the recent performance improvements, around 50% of the time for creating frozen dictionaries/sets is now spent in FrozenHashTable.CalcNumBuckets.

[profiler screenshot: time spent in FrozenHashTable.CalcNumBuckets]

The rest is now optimal, so my last effort focused on optimizing FrozenHashTable.CalcNumBuckets.

To understand my approaches, it's crucial to know what the method is supposed to do and how it has been achieving that goal so far.

The method tries to find a prime number of buckets that provides less than a 5% collision rate for the unique hash codes.

First, it creates a hash set of integers and iterates over all hash codes to filter out duplicates. This matters because no increase in bucket count can avoid collisions between duplicate input hash codes.

Once the number of unique hash codes is known, it iterates over a precomputed array of prime numbers and finds the first prime that is at least the unique hash code count:

while ((uint)minPrimeIndexInclusive < (uint)primes.Length && uniqueCodesCount > primes[minPrimeIndexInclusive])
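The dedup-then-scan step described above can be sketched as follows. This is a hedged, self-contained sketch with a hypothetical `BucketSizing` helper and a shortened primes table, not the actual runtime code in `FrozenHashTable`:

```csharp
using System;
using System.Collections.Generic;

static class BucketSizing
{
    // Hypothetical abbreviation of the runtime's precomputed primes table.
    static readonly int[] Primes = { 3, 7, 11, 17, 23, 29, 37, 47, 59, 71, 89, 107, 131, 163, 197 };

    public static int FindMinPrimeIndex(ReadOnlySpan<int> hashCodes, out int uniqueCodesCount)
    {
        // Duplicates can never be separated by adding buckets, so drop them first.
        var unique = new HashSet<int>();
        foreach (int code in hashCodes)
        {
            unique.Add(code);
        }
        uniqueCodesCount = unique.Count;

        // Find the index of the first precomputed prime that is >= the unique code count.
        int index = 0;
        while ((uint)index < (uint)Primes.Length && uniqueCodesCount > Primes[index])
        {
            index++;
        }
        return index;
    }
}
```

For example, four hash codes with one duplicate yield three unique codes, so the search stops at the first prime (3) at index 0.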

Then it chooses a multiplier for the maximum bucket count: 16 for small inputs (fewer than 1,000 unique hash codes) and 3 for larger ones.

// Determine the largest number of buckets we're willing to use, based on a multiple of the number of inputs.
// For smaller inputs, we allow for a larger multiplier.
int maxNumBuckets =
uniqueCodesCount *
(uniqueCodesCount >= LargeInputSizeThreshold ? MaxLargeBucketTableMultiplier : MaxSmallBucketTableMultiplier);

Then it sets the best number of collisions found so far to the unique hash code count, which means "so far, the best result found gives a 100% collision rate".

Finally, it iterates over all the prime numbers in the selected range and, in a nested loop, iterates over every unique hash code, marking each bucket as seen.

for (int primeIndex = minPrimeIndexInclusive; primeIndex < maxPrimeIndexExclusive; primeIndex++)

When it finds a bucket that was already visited, it increases the collision count.
When the collision count reaches the best known collision count, it bails out and continues the search with the next prime number.

if ((seenBuckets[bucketNum / BitsPerInt32] & (1 << (int)bucketNum)) != 0)
{
    numCollisions++;
    if (numCollisions >= bestNumCollisions)
    {
        // If we've already hit the previously known best number of collisions,
        // there's no point in continuing as worst case we'd just use that.
        break;
    }
}
When none of the prime numbers meets our 5% collision rate criterion, it returns the bucket count with the lowest collision count found across all primes.
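Putting the steps above together, the whole search loop can be sketched like this. It is a hedged sketch with a hypothetical `CollisionScan` helper, not the actual `CalcNumBuckets` implementation; note that C# masks the shift count of `1 << (int)bucketNum` to the low 5 bits, which is why no explicit `% 32` is needed for the bit position:

```csharp
using System;

static class CollisionScan
{
    // Sketch of the search described above: for each candidate prime bucket
    // count, count hash-code collisions using a bit array of "seen" buckets,
    // bail out of the inner loop once the current best is matched, and fall
    // back to the bucket count with the fewest collisions when no candidate
    // stays under the 5% collision budget.
    public static int ChooseNumBuckets(int[] uniqueCodes, int[] candidatePrimes)
    {
        const int BitsPerInt32 = 32;
        int acceptableCollisions = uniqueCodes.Length / 20; // 5% budget
        int bestNumBuckets = candidatePrimes[^1];
        int bestNumCollisions = uniqueCodes.Length;         // "100% collisions so far"

        foreach (int numBuckets in candidatePrimes)
        {
            var seenBuckets = new int[(numBuckets + BitsPerInt32 - 1) / BitsPerInt32];
            int numCollisions = 0;

            foreach (int code in uniqueCodes)
            {
                uint bucketNum = (uint)code % (uint)numBuckets;
                if ((seenBuckets[bucketNum / BitsPerInt32] & (1 << (int)bucketNum)) != 0)
                {
                    // Bucket already occupied: a collision.
                    numCollisions++;
                    if (numCollisions >= bestNumCollisions)
                    {
                        break; // can't beat the best candidate found so far
                    }
                }
                else
                {
                    seenBuckets[bucketNum / BitsPerInt32] |= 1 << (int)bucketNum;
                }
            }

            if (numCollisions < bestNumCollisions)
            {
                bestNumBuckets = numBuckets;
                bestNumCollisions = numCollisions;
                if (numCollisions <= acceptableCollisions)
                {
                    break; // good enough: within the 5% collision budget
                }
            }
        }

        return bestNumBuckets;
    }
}
```

For instance, hash codes 0 and 7 collide in 7 buckets but not in 11, so the scan moves past the first candidate and settles on 11.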

The approaches that I've tried:

  • Setting the initial best number of collisions to one more than the 5% threshold. This meant "find the first bucket count that meets the criterion" and was originally part of Faster optimized frozen dictionary creation (3/n) #87688, but I reverted the change. Why? Because for some inputs, no bucket count ever meets the criterion, and since the inner loop stops as soon as the collision count reaches the best known value, we would end up with zero fully verified bucket counts. The process would then need to be repeated with the initial best number of collisions set to the unique hash code count. I added that fallback, wrote an app that generated random sets of hash codes for an hour, and benchmarked the old vs. new approach; overall it turned out to be slower.
  • Binary search, since larger prime numbers are more likely to meet the goal. The perf gains were very nice, but again my random hash set generation app proved that binary search can miss the prime number that meets our goal.
  • Based on the same "larger primes are more likely to match" assumption, I reversed the outer loop so it started at the largest acceptable prime number and ended at the smallest. The CPU time gains were great (up to 4x faster), but at the cost of increased memory consumption (in theory up to 16x in the worst case for small inputs and 3x for large inputs).
  • Increasing the minimum bucket count to 2x the unique hash code count (from 1x). I modified my random hash code generator app to print distributions of the selected prime numbers and bucket counts. It turned out that for small inputs (fewer than 1,000 items), only 0.05% of the prime numbers below 2x the unique hash code count met the criterion; for larger inputs, it was just 0.01%. That is why I decided to narrow the range and keep everything else. On average this gives a 10% perf gain for small inputs, 20% for medium, and 30-40% for large inputs. It's a tradeoff, since 0.05% of cases will end up using larger arrays.
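The accepted change (the last item above) effectively raises the lower bound of the candidate range while keeping the existing upper bound. A hedged sketch of the bound computation, with a hypothetical `Sketch.CandidateBucketRange` helper (the constant names mirror those shown earlier in this description):

```csharp
static class Sketch
{
    // Hypothetical sketch of the narrowed candidate range from this PR:
    // candidate prime bucket counts now start at ~2x the unique hash code
    // count instead of 1x, because smaller bucket counts almost never met
    // the 5% collision budget (0.05% of cases for small inputs, 0.01% for
    // large ones).
    public static (int minBuckets, int maxBuckets) CandidateBucketRange(int uniqueCodesCount)
    {
        const int LargeInputSizeThreshold = 1_000;
        const int MaxSmallBucketTableMultiplier = 16;
        const int MaxLargeBucketTableMultiplier = 3;

        int minBuckets = uniqueCodesCount * 2; // was uniqueCodesCount * 1 before this PR
        int maxBuckets = uniqueCodesCount *
            (uniqueCodesCount >= LargeInputSizeThreshold
                ? MaxLargeBucketTableMultiplier
                : MaxSmallBucketTableMultiplier);
        return (minBuckets, maxBuckets);
    }
}
```

For example, 10 unique hash codes would scan primes between 20 and 160 buckets, while 1,000 unique codes would scan between 2,000 and 3,000.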

In the table below you can see perf numbers with #87688 as the baseline, the loop reversal mentioned above, and the proposed change:

| Method | Job | Count | Mean | Ratio | Allocated | Alloc Ratio |
|--------|-----|------:|-----:|------:|----------:|------------:|
| ToFrozenDictionary_Optimized | #87688 | 10 | 812.14 ns | 1.00 | 1432 B | 1.00 |
| ToFrozenDictionary_Optimized | reverse | 10 | 810.59 ns | 1.00 | 2296 B | 1.60 |
| ToFrozenDictionary_Optimized | #87960 | 10 | 713.51 ns | 0.88 | 1432 B | 1.00 |
| TryGetValue_True_FrozenDictionaryOptimized | #87688 | 10 | 33.76 ns | 1.00 | - | NA |
| TryGetValue_True_FrozenDictionaryOptimized | reverse | 10 | 33.89 ns | 1.01 | - | NA |
| TryGetValue_True_FrozenDictionaryOptimized | #87960 | 10 | 33.57 ns | 0.99 | - | NA |
| ToFrozenDictionary_Optimized | #87688 | 100 | 14,906.04 ns | 1.00 | 28760 B | 1.00 |
| ToFrozenDictionary_Optimized | reverse | 100 | 7,809.07 ns | 0.52 | 30920 B | 1.08 |
| ToFrozenDictionary_Optimized | #87960 | 100 | 12,006.05 ns | 0.81 | 28760 B | 1.00 |
| TryGetValue_True_FrozenDictionaryOptimized | #87688 | 100 | 1,014.49 ns | 1.00 | - | NA |
| TryGetValue_True_FrozenDictionaryOptimized | reverse | 100 | 1,015.42 ns | 1.00 | - | NA |
| TryGetValue_True_FrozenDictionaryOptimized | #87960 | 100 | 1,015.73 ns | 1.00 | - | NA |
| ToFrozenDictionary_Optimized | #87688 | 1000 | 88,396.94 ns | 1.00 | 98608 B | 1.00 |
| ToFrozenDictionary_Optimized | reverse | 1000 | 77,223.89 ns | 0.87 | 98608 B | 1.00 |
| ToFrozenDictionary_Optimized | #87960 | 1000 | 55,214.58 ns | 0.62 | 98608 B | 1.00 |
| TryGetValue_True_FrozenDictionaryOptimized | #87688 | 1000 | 10,774.77 ns | 1.00 | - | NA |
| TryGetValue_True_FrozenDictionaryOptimized | reverse | 1000 | 10,748.01 ns | 1.00 | - | NA |
| TryGetValue_True_FrozenDictionaryOptimized | #87960 | 1000 | 10,768.78 ns | 1.00 | - | NA |
| ToFrozenDictionary_Optimized | #87688 | 10000 | 1,110,513.45 ns | 1.00 | 926256 B | 1.00 |
| ToFrozenDictionary_Optimized | reverse | 10000 | 1,007,656.92 ns | 0.91 | 926521 B | 1.00 |
| ToFrozenDictionary_Optimized | #87960 | 10000 | 779,821.62 ns | 0.70 | 926745 B | 1.00 |
| TryGetValue_True_FrozenDictionaryOptimized | #87688 | 10000 | 223,001.44 ns | 1.00 | - | NA |
| TryGetValue_True_FrozenDictionaryOptimized | reverse | 10000 | 223,533.13 ns | 1.00 | - | NA |
| TryGetValue_True_FrozenDictionaryOptimized | #87960 | 10000 | 223,543.17 ns | 1.00 | - | NA |

…buckets that meets our criteria is at least twice as big as the number of unique hash codes.

+33% gain on average

ghost commented Jun 23, 2023

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

Issue Details

This change may seem very simple, but getting to it was not. I am opening the PR now to start the CI process, and will provide an explanation with numbers within half an hour.

Author: adamsitnik
Assignees: -
Labels:

area-System.Collections, tenet-performance

Milestone: -

@adamsitnik adamsitnik requested a review from stephentoub June 23, 2023 13:16
@stephentoub stephentoub merged commit 949dee3 into dotnet:main Jun 23, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jul 23, 2023