KEMBAR78
Faster optimized frozen dictionary creation (4/n) by adamsitnik · Pull Request #87876 · dotnet/runtime · GitHub
Skip to content

Conversation

@adamsitnik
Copy link
Member

@adamsitnik adamsitnik commented Jun 21, 2023

Instead of creating a dictionary of lists and a multi-dimensional array we rent a single dimension array, where every bucket has five slots.
The bucket starts at (key.Length - minLength) * 5 index of the array.
Each value is an index of the key from _keys array or just -1, which represents "null".
We avoid having two copies and re-ordering of keys and values collections.

Creation time is from x2 to x10 times faster, not more than 2x slower compared to Dictionary.
Indexing is 6-12% slower, but for large inputs it's still and order of magnitude faster than Dictionary.

Method Job Count ItemsPerBucket Mean Ratio Allocated Alloc Ratio
ToDictionary #87688 10 1 102.26 ns - 440 B -
ToFrozenDictionary_Optimized #87876 10 1 218.24 ns 0.25 488 B 0.16
ToFrozenDictionary_Optimized #87688 10 1 889.59 ns 1.00 3120 B 1.00
TryGetValue_True_Dictionary #87688 10 1 97.78 ns - - NA
TryGetValue_True_FrozenDictionaryOptimized #87876 10 1 24.48 ns 1.13 - NA
TryGetValue_True_FrozenDictionaryOptimized #87688 10 1 21.69 ns 1.00 - NA
ToDictionary #87688 10 5 104.26 ns - 440 B -
ToFrozenDictionary_Optimized #87876 10 5 223.41 ns 0.43 328 B 0.33
ToFrozenDictionary_Optimized #87688 10 5 523.13 ns 1.00 1000 B 1.00
TryGetValue_True_Dictionary #87688 10 5 98.22 ns - - NA
TryGetValue_True_FrozenDictionaryOptimized #87876 10 5 90.22 ns 1.12 - NA
TryGetValue_True_FrozenDictionaryOptimized #87688 10 5 80.66 ns 1.00 - NA
ToDictionary #87688 100 1 766.64 ns - 3128 B -
ToFrozenDictionary_Optimized #87876 100 1 995.67 ns 0.14 3728 B 0.12
ToFrozenDictionary_Optimized #87688 100 1 7,179.71 ns 1.00 30320 B 1.00
TryGetValue_True_Dictionary #87688 100 1 1,761.21 ns - - NA
TryGetValue_True_FrozenDictionaryOptimized #87876 100 1 227.97 ns 1.13 - NA
TryGetValue_True_FrozenDictionaryOptimized #87688 100 1 201.72 ns 1.00 - NA
ToDictionary #87688 100 5 761.50 ns - 3128 B -
ToFrozenDictionary_Optimized #87876 100 5 996.27 ns 0.25 2128 B 0.24
ToFrozenDictionary_Optimized #87688 100 5 3,959.95 ns 1.00 8768 B 1.00
TryGetValue_True_Dictionary #87688 100 5 1,042.10 ns - - NA
TryGetValue_True_FrozenDictionaryOptimized #87876 100 5 929.69 ns 1.08 - NA
TryGetValue_True_FrozenDictionaryOptimized #87688 100 5 864.51 ns 1.00 - NA
ToDictionary #87688 1000 1 7,234.20 ns - 31016 B -
ToFrozenDictionary_Optimized #87876 1000 1 8,918.56 ns 0.12 36128 B 0.12
ToFrozenDictionary_Optimized #87688 1000 1 77,083.15 ns 1.00 302344 B 1.00
TryGetValue_True_Dictionary #87688 1000 1 115,510.37 ns - - NA
TryGetValue_True_FrozenDictionaryOptimized #87876 1000 1 2,559.60 ns 1.09 - NA
TryGetValue_True_FrozenDictionaryOptimized #87688 1000 1 2,341.85 ns 1.00 - NA
ToDictionary #87688 1000 5 7,541.47 ns - 31016 B -
ToFrozenDictionary_Optimized #87876 1000 5 9,438.56 ns 0.24 20128 B 0.23
ToFrozenDictionary_Optimized #87688 1000 5 39,774.73 ns 1.00 88040 B 1.00
TryGetValue_True_Dictionary #87688 1000 5 31,294.80 ns - - NA
TryGetValue_True_FrozenDictionaryOptimized #87876 1000 5 9,168.88 ns 1.12 - NA
TryGetValue_True_FrozenDictionaryOptimized #87688 1000 5 8,156.66 ns 1.00 - NA
ToDictionary #87688 10000 1 112,776.03 ns - 283020 B -
ToFrozenDictionary_Optimized #87876 10000 1 263,961.31 ns 0.07 360130 B 0.12
ToFrozenDictionary_Optimized #87688 10000 1 4,023,881.49 ns 1.00 2942142 B 1.00
TryGetValue_True_Dictionary #87688 10000 1 12,874,506.46 ns - - NA
TryGetValue_True_FrozenDictionaryOptimized #87876 10000 1 65,572.38 ns 1.08 - NA
TryGetValue_True_FrozenDictionaryOptimized #87688 10000 1 60,486.95 ns 1.00 - NA
ToDictionary #87688 10000 5 100,764.40 ns - 283090 B -
ToFrozenDictionary_Optimized #87876 10000 5 168,735.65 ns 0.15 200128 B 0.23
ToFrozenDictionary_Optimized #87688 10000 5 1,097,493.94 ns 1.00 871771 B 1.00
TryGetValue_True_Dictionary #87688 10000 5 2,732,495.46 ns - - NA
TryGetValue_True_FrozenDictionaryOptimized #87876 10000 5 113,734.85 ns 1.06 - NA
TryGetValue_True_FrozenDictionaryOptimized #87688 10000 5 107,283.76 ns 1.00 - NA

 we rent a single dimension array, where every bucket has five slots.
 The bucket starts at (key.Length - minLength) * 5.
 Each value is an index of the key from _keys array
 or just -1, which represents "null".

Creation time is from 2 to 10 times faster
Indexing is 4-10% slower, but still and order of magnitude faster than Dictionary
@ghost
Copy link

ghost commented Jun 21, 2023

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

Issue Details

Instead of creating a dictionary of lists and a multi-dimensional array we rent a single dimension array, where every bucket has five slots.
The bucket starts at (key.Length - minLength) * 5 index of the array.
Each value is an index of the key from _keys array or just -1, which represents "null".
We avoid having two copies and re-ordering of keys and values collections.

Creation time is from x2 to x10 times faster, not more than 2x slower compared to Dictionary.
Indexing is 4-10% slower, but still and order of magnitude faster than Dictionary.

Author: adamsitnik
Assignees: -
Labels:

area-System.Collections, tenet-performance

Milestone: -

// If there would be too much empty space in the lookup array, bail.
if (groupedByLength.Count / (double)spread < EmptyLengthsRatio)
{
// This size check prevents OOM.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really :) It just prevents artificial OOMs from arrays of unsupported sizes. You could still try to rent an array of length Array.MaxLength - 1 and OOM.

int arraySize = spread * MaxPerLength;
#if NET6_0_OR_GREATER
List<KeyValuePair<string, TValue>> list = CollectionsMarshal.GetValueRefOrAddDefault(groupedByLength, s.Length, out _) ??= new List<KeyValuePair<string, TValue>>(MaxPerLength);
if (arraySize >= Array.MaxLength)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why >= rather than >?

internal sealed class LengthBucketsFrozenDictionary<TValue> : FrozenDictionary<string, TValue>
{
/// <summary>Allowed ratio between buckets with values and total buckets. Under this ratio, this implementation won't be used due to too much wasted space.</summary>
private const double EmptyLengthsRatio = 0.2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why it's ok to do away with this concept. If I have one string of length 1_000_000 and one string of length 1, won't I now end up allocating an array of 5 million elements?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I have one string of length 1_000_000 and one string of length 1, won't I now end up allocating an array of 5 million elements?

This case is the first thing we check and I've left it untouched:

// If without even looking at the keys we know that some bucket will exceed the max per-bucket
// limit (pigeon hole principle), we can early-exit out without doing any further work.
int spread = maxLength - minLength + 1;
if (keys.Length / spread > MaxPerLength)
{
return null;
}

I am going to benchmark whether it's beneficial to remove this check or not and get back to you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is the first thing we check and I've left it untouched:

No, that's different. The check that's there is seeing whether it can be easily predetermined whether any bucket might have too many elements in it. In contrast, the EmptyLengthsRatio you deleted was used to determine whether the lengths were too spread out such that we'd be wasting a ton of space with holes in the array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to benchmark whether it's beneficial to remove this check or not and get back to you.

For MaxPerLength you mean? For a given bucket, the implementation is doing a linear scan of every element. If 1000 strings were to end up in the same bucket and every look up involved comparing up to 1000 strings, it would absolutely be worse than an O(1) dictionary lookup. So there is going to be some threshold where we no longer want to use this strategy. It's certainly possible, especially with the other changes, that MaxPerLength can be increased, but it can't become infinite.

Comment on lines 78 to 82
if (buckets[index] < 0) buckets[index] = i;
else if (buckets[index + 1] < 0) buckets[index + 1] = i;
else if (buckets[index + 2] < 0) buckets[index + 2] = i;
else if (buckets[index + 3] < 0) buckets[index + 3] = i;
else if (buckets[index + 4] < 0) buckets[index + 4] = i;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's brittle to manually unroll this loop that's tied to the MaxPerLength constant (e.g. if the constant were increased or decreased, this unrolling would be wrong). Can we just make this a loop?

Comment on lines 93 to 96
// AllocateUninitializedArray is slower for small inputs, hence the size check.
int[] copy = arraySize < 1000
? new int[arraySize]
: GC.AllocateUninitializedArray<int>(arraySize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first thing GC.AllocateUninitializedArray<int> does is a length < 512 check. Can we just let it do its thing rather than also checking here?

@danmoseley
Copy link
Member

danmoseley commented Jun 22, 2023

Cc @geeknoid in case he's interested in this series of PRs

@adamsitnik
Copy link
Member Author

Updated perf numbers: Creation time is from x2 to x16 times faster, not more than 2x slower compared to Dictionary.
Indexing is 0-8% slower, but for large inputs it's still and order of magnitude faster than Dictionary.

Method Job Count PerBucket Mean Ratio Allocated Alloc Ratio
ToDictionary #87688 10 1 102.98 ns - 440 B -
ToFrozenDictionary_Optimized #87876 10 1 232.24 ns 0.26 488 B 0.16
ToFrozenDictionary_Optimized #87688 10 1 908.94 ns 1.00 3120 B 1.00
TryGetValue_True_Dictionary #87688 10 1 94.47 ns - - NA
TryGetValue_True_FrozenDictionaryOptimized #87876 10 1 22.66 ns 1.04 - NA
TryGetValue_True_FrozenDictionaryOptimized #87688 10 1 21.77 ns 1.00 - NA
ToDictionary #87688 10 5 101.63 ns - 440 B -
ToFrozenDictionary_Optimized #87876 10 5 236.91 ns 0.46 328 B 0.33
ToFrozenDictionary_Optimized #87688 10 5 520.80 ns 1.00 1000 B 1.00
TryGetValue_True_Dictionary #87688 10 5 89.71 ns - - NA
TryGetValue_True_FrozenDictionaryOptimized #87876 10 5 88.69 ns 1.07 - NA
TryGetValue_True_FrozenDictionaryOptimized #87688 10 5 82.67 ns 1.00 - NA
ToDictionary #87688 100 1 772.06 ns - 3128 B -
ToFrozenDictionary_Optimized #87876 100 1 1,014.77 ns 0.14 3728 B 0.12
ToFrozenDictionary_Optimized #87688 100 1 7,009.70 ns 1.00 30320 B 1.00
TryGetValue_True_Dictionary #87688 100 1 1,749.18 ns - - NA
TryGetValue_True_FrozenDictionaryOptimized #87876 100 1 203.88 ns 1.02 - NA
TryGetValue_True_FrozenDictionaryOptimized #87688 100 1 200.65 ns 1.00 - NA
ToDictionary #87688 100 5 763.90 ns - 3128 B -
ToFrozenDictionary_Optimized #87876 100 5 1,076.07 ns 0.27 2128 B 0.24
ToFrozenDictionary_Optimized #87688 100 5 3,980.54 ns 1.00 8768 B 1.00
TryGetValue_True_Dictionary #87688 100 5 1,031.84 ns - - NA
TryGetValue_True_FrozenDictionaryOptimized #87876 100 5 968.47 ns 1.09 - NA
TryGetValue_True_FrozenDictionaryOptimized #87688 100 5 886.03 ns 1.00 - NA
ToDictionary #87688 1000 1 7,414.00 ns - 31016 B -
ToFrozenDictionary_Optimized #87876 1000 1 9,243.94 ns 0.12 36128 B 0.12
ToFrozenDictionary_Optimized #87688 1000 1 76,829.34 ns 1.00 302344 B 1.00
TryGetValue_True_Dictionary #87688 1000 1 113,945.71 ns - - NA
TryGetValue_True_FrozenDictionaryOptimized #87876 1000 1 2,335.38 ns 1.00 - NA
TryGetValue_True_FrozenDictionaryOptimized #87688 1000 1 2,327.41 ns 1.00 - NA
ToDictionary #87688 1000 5 7,124.45 ns - 31016 B -
ToFrozenDictionary_Optimized #87876 1000 5 9,365.89 ns 0.24 20128 B 0.23
ToFrozenDictionary_Optimized #87688 1000 5 38,455.22 ns 1.00 88040 B 1.00
TryGetValue_True_Dictionary #87688 1000 5 30,516.19 ns - - NA
TryGetValue_True_FrozenDictionaryOptimized #87876 1000 5 8,836.81 ns 1.05 - NA
TryGetValue_True_FrozenDictionaryOptimized #87688 1000 5 8,438.84 ns 1.00 - NA
ToDictionary #87688 10000 1 115,595.22 ns - 283023 B 1.00
ToFrozenDictionary_Optimized #87876 10000 1 253,471.88 ns 0.06 360129 B 0.12
ToFrozenDictionary_Optimized #87688 10000 1 3,989,689.41 ns 1.00 2942141 B 1.00
TryGetValue_True_Dictionary #87688 10000 1 12,715,860.97 ns - - NA
TryGetValue_True_FrozenDictionaryOptimized #87876 10000 1 63,461.63 ns 1.06 - NA
TryGetValue_True_FrozenDictionaryOptimized #87688 10000 1 59,874.28 ns 1.00 - NA
ToDictionary #87688 10000 5 95,454.42 ns - 283088 B -
ToFrozenDictionary_Optimized #87876 10000 5 176,332.31 ns 0.17 200129 B 0.23
ToFrozenDictionary_Optimized #87688 10000 5 1,068,480.18 ns 1.00 871771 B 1.00
TryGetValue_True_Dictionary #87688 10000 5 2,605,945.23 ns - - NA
TryGetValue_True_FrozenDictionaryOptimized #87876 10000 5 114,240.60 ns 1.08 - NA
TryGetValue_True_FrozenDictionaryOptimized #87688 10000 5 105,536.70 ns 1.00 - NA

@stephentoub stephentoub merged commit a05537e into dotnet:main Jun 23, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jul 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants