KEMBAR78
DATAS tuning fixes by Maoni0 · Pull Request #105545 · dotnet/runtime · GitHub
Skip to content

Conversation

@Maoni0
Copy link
Member

@Maoni0 Maoni0 commented Jul 26, 2024

This includes the following tuning fixes

  • unusable fragmentation computation - this was causing gen1's unnecessarily. Also avoid volatility.
  • don't rethread gen2 FL items during HC change because it might take too long. Instead trigger a BGC and disable gen1 GCs during this BGC. During BGC sweep we refresh the gen2 FL completely.
  • reverse the order of calling distribute_free_regions/age_free_regions and setting BCD - what happened was each time we had to get new regions for alloc because BCD can be >> BCS. And since we don't get regions from decommit list, we keep accumulating more. For this I also cleaned up decommit_ephemeral_segment_pages and related methods and only have them defined when needed.
  • initialize generation_skip_ratio to 100, not 0 when we recommission a heap - this was causing every eph GC right after an HC change to unnecessarily be a gen1 GC
  • set dd_fragmentation for gen1 during BGC sweep otherwise we can hit the assert assert (free_list_space_decrease <= dd_fragmentation (dd)); in merge_fl_from_other_heaps because frag simply hasn't been updated after a BGC. This can also be reproed by checking for frag at the beginning of a GC.
  • if we haven't done a gen2 GC yet, and we are limited by DATAS for HC, we should trigger a gen2 immediately. this definitely helps for microbenchmarks but I'll need to make this more generic and not limited to just the initial gen2. But that'll come in a separate PR.

and some instrumentation changes

  • do_pre_gc/do_post_gc
  • other dprintf changes for DATAS investigation (so all you need to do is turning logging on)

@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Maoni0 added 4 commits August 11, 2024 23:14
…f runs; fix to get STRESS_DYNAMIC_HEAP_COUNT compiled
… recommission; fix for gen1 hitting assert (free_list_space_decrease <= dd_fragmentation (dd))
…tribute_free and trigger an initial gen2 if needed
@Maoni0 Maoni0 force-pushed the datas_tuning_fixes branch from f6734fc to f4fe2d4 Compare August 12, 2024 06:31
@Maoni0 Maoni0 changed the title [WIP] DATAS tuning fixes DATAS tuning fixes Aug 12, 2024
@Maoni0
Copy link
Member Author

Maoni0 commented Aug 12, 2024

results from selected aspnet benchmarks -

Benchmark Name Average of Max heap size / baseline Average of Max heap size / fix Average of Max heap size / Comparison % Average of RPS / baseline Average of RPS / fix Average of RPS / Comparison % Average of Mean latency / baseline Average of Mean latency / fix Average of Mean latency / Comparison %
ConnectionCloseHttpSys 17.634 21.379 21.236 102.514 103.404 0.868 0.190 0.188 -1.053
FortunesEf 81.203 90.652 11.636 276.867 280.726 1.394 1.466 1.404 -4.229
FortunesPlatformEF 140.653 147.143 4.615 383.179 385.183 0.523 1.604 1.536 -4.239
JsonHttps 6.492 6.485 -0.102 704.963 706.383 0.201 0.428 0.416 -2.804
JsonMvc 48.929 35.595 -27.250 596.022 588.107 -1.328 0.926 0.972 4.968
MultipleQueriesPlatform 118.580 109.586 -7.584 45.150 44.921 -0.506 11.904 11.962 0.487
PlaintextMvc 50.893 34.822 -31.577 2,955.860 2,759.418 -6.646 1.146 1.096 -4.363
PlaintextWithParametersNoFilter 47.119 36.095 -23.397 3,585.046 3,607.935 0.638 1.188 1.086 -8.586
SingleQueryPlatform 86.337 91.659 6.165 539.281 537.285 -0.370 1.254 1.260 0.478
Stage1GrpcServerGC 85.289 91.799 7.633 1,100.797 1,104.153 0.305 1.636 1.642 0.367
Stage1Pgo 37.703 31.341 -16.873 736.405 730.666 -0.779 0.406 0.412 1.478
UpdatesPlatform 113.116 112.363 -0.665 28.050 28.184 0.478 18.838 18.518 -1.699

in general I'm seeing good results - they reduce volatility and avoids OOM in certain scenarios. some benchmarks use more memory but with increased RPS, mostly due to the fact we are updating the stable size more accurately.

unfortunately this does make PlaintextMvc's RPS lower albeit with lower memory usage - the continuous gen1 problems due to the incorrect unusable frag is fixed which is definitely improvement but there are other problems that makes baseline have better RPS by chance.

baseline did a gen1 right after HC change because skip ratio is 0 for the new heaps and that made us do a gen1. This gen1 increased the gen2 size from ~1mb to 3.4mb. Fix did a BGC with a gen0 GC and didn’t increase the gen2 obj size -

image

After this baseline kept doing gen1s due to the unusable frag and kept promoting while fix kept not promoting therefore gen2 size stayed small. The unusable frag problem is fixed but the promotion decision is still a problem especially in the baseline. We promote because we kept hitting the 1st condition in decide_on_promotion_surv

if ((threshold > (older_gen_size)) || (promoted > threshold))

but the thing is because we have >10 heaps some heap’s gen2 is simply very small so of course threshold could be more than old_gen_size like this

[ 3112]h0 promotion threshold: 314572, promoted bytes: 600 size n+1: 68864 -> promote

(68864 is older_gen_size)

And in fix we don’t hit this because gen1 size is big enough and we don’t consume gen1 budget since we don’t promote. So this is a bad cycle. fix should be doing gen1 GCs - right now it's not because we never detect that we have too many cards coming from gen1 to gen0 (generation_skip_ratio only detects when there are too many gen2 to gen1 refs vs gen2 to gen0 refs, but not gen1 to gen0). and as long as we don't demote, we still have the cards coming from gen1 to gen0. so this is I'll be fixing. also we could consider doing a gen1 collection if gen1 is much larger than gen2. In baseline gen1 is very small while in fix gen1 is much larger than gen2 (2.7mb vs 1mb).

I will be working on these fixes when I'm back from OOF to make them into RC2.

Copy link
Member

@mrsharm mrsharm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM based on the extensive testing conducted!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants