KEMBAR78
PGO: Add new tiers by EgorBo · Pull Request #70941 · dotnet/runtime · GitHub
Skip to content

Conversation

@EgorBo
Copy link
Member

@EgorBo EgorBo commented Jun 18, 2022

This PR implements @jkotas's idea in #70410 (comment) when DOTNET_TieredPGO is enabled (it's off by default and will be so in .NET 7.0)

  1. Use R2R code for process startup
  2. Once the process startups up, switch to instrumented code for hot methods
  3. Once you collect enough PGO data, create optimized JITed version

Design

flowchart
    prestub(.NET Function) -->|Compilation| hasAO{"Marked with<br/>[AggressiveOpts]?"}
    hasAO-->|Yes|tier1ao["JIT to <b><ins>Tier1</ins></b><br/><br/>(no dynamic profile data)"]
    hasAO-->|No|hasR2R
    hasR2R{"Is prejitted (R2R)?"} -->|No| tier000

    tier000["JIT to <b><ins>Tier0</ins></b><br/><br/>(not optimized, not instrumented,<br/> with patchpoints)"]-->|Running...|ishot555
    ishot555{"Is hot?<br/>(called >30 times)"}
    ishot555-.->|No,<br/>keep running...|ishot555
    ishot555-->|Yes|tier0
   
    hasR2R -->|Yes| R2R
    R2R["Use <b><ins>R2R</ins></b> code<br/><br/>(optimized, not instrumented,<br/>no patchpoints)"] -->|Running...|ishot1
    ishot1{"Is hot?<br/>(called >30 times)"}-.->|No,<br/>keep running...|ishot1
    ishot1--->|"Yes"|tier1inst

    tier0["JIT to <b><ins>Tier0Instrumented</ins></b><br/><br/>(not optimized, instrumented,<br/> with patchpoints)"]-->|Running...|ishot5
    tier1pgo2["JIT to <b><ins>Tier1</ins></b><br/><br/>(optimized with profile data)"]
      
    tier1inst["JIT to <b><ins>Tier1Instrumented</ins></b><br/><br/>(optimized, instrumented, <br/>no patchpoints)"]
    tier1inst-->|Running...|ishot5
    ishot5{"Is hot?<br/>(called >30 times)"}-->|Yes|tier1pgo2
    ishot5-.->|No,<br/>keep running...|ishot5

Loading

@ghost ghost added the area-VM-coreclr label Jun 18, 2022
@ghost ghost assigned EgorBo Jun 18, 2022
@EgorBo EgorBo changed the title r2r -> instrumented tier0 -> optimized tier1 R2R -> instrumented tier0 -> optimized tier1 Jun 18, 2022
@jkotas
Copy link
Member

jkotas commented Jun 18, 2022

cc @noahfalk @kouvel

@EgorBo

This comment was marked as outdated.

@EgorBo

This comment was marked as outdated.

@EgorBo

This comment was marked as outdated.

@EgorBo

This comment was marked as outdated.

@EgorBo

This comment was marked as outdated.

@EgorBo EgorBo mentioned this pull request Jun 20, 2022
10 tasks
@janvorli
Copy link
Member

2. "Patch" the initial [callcounting] stub to look at tier0 instead of r2r and reset the call-counting-cell

It sounds better to me, as we never delete the call counting stubs, so this would leave less garbage.

@noahfalk
Copy link
Member

When I was doing tiered compilation originally one of the benchmarks I found helpful was MusicStore and I had the app measure its own latency for requests 0-500, 501-1000, 1000-1500, and so on. This helped me get an idea how quickly an app was able to converge to the steady state behavior. Completely up to you if a similar analysis would be useful now. One hypothesis I'd have for the worse TE numbers is that the benchmark might be short enough that it is capturing a substantial amount of pre-steady-state behavior.

@EgorBo
Copy link
Member Author

EgorBo commented Jun 20, 2022

When I was doing tiered compilation originally one of the benchmarks I found helpful was MusicStore and I had the app measure its own latency for requests 0-500, 501-1000, 1000-1500, and so on. This helped me get an idea how quickly an app was able to converge to the steady state behavior. Completely up to you if a similar analysis would be useful now. One hypothesis I'd have for the worse TE numbers is that the benchmark might be short enough that it is capturing a substantial amount of pre-steady-state behavior.

Thanks! When I will be patching the call-counting stub we might consider resetting the call-counting-cell to some smaller number (e.g. 10)

@EgorBo
Copy link
Member Author

EgorBo commented Jun 21, 2022

Just realized that I can also introduce a new tier for non-r2r cases:

tier0 -> instrumented tier0 -> tier1

This solves a different problem that we have now - instrumentation is quite heavy (both in terms of TP and perf). However, as Andy noted, we need to be careful around OSR.

I have a demo locally, for now I am allocating a new callcounting stub every time because it's simpler

@EgorBo

This comment was marked as outdated.

@EgorBo EgorBo changed the title R2R -> instrumented tier0 -> optimized tier1 tier0/R2R -> instrumented tier0 -> optimized tier1 Jun 22, 2022
@EgorBo EgorBo force-pushed the new-tier branch 2 times, most recently from 69f1cf6 to ef66369 Compare June 22, 2022 12:34
@EgorBo

This comment was marked as outdated.

@EgorBo

This comment was marked as outdated.

@EgorBo EgorBo marked this pull request as ready for review June 22, 2022 14:05
@AndyAyersMS
Copy link
Member

We can't currently leave Tier1-OSR and get back to Tier0 (at least mid-method; if the method is called every so often we could switch at a call boundary. But it wouldn't help say Main which is only called once).

What we could do instead is go from Tier0 (uninstrumented) to Tier0-OSR (instrumented) and then to Tier1-OSR (instrumented). This would give some PGO data, but we might not see the early parts of the method execute with instrumentation.

@EgorBo

This comment was marked as outdated.

@AndyAyersMS
Copy link
Member

Sort of? I can post what I think of as the right flow in a bit.

@EgorBo

This comment was marked as outdated.

Copy link
Contributor

@kouvel kouvel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one comment, otherwise the tiering stuff LGTM, thanks!

@EgorBo
Copy link
Member Author

EgorBo commented Oct 19, 2022

/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@EgorBo
Copy link
Member Author

EgorBo commented Oct 19, 2022

image
X axis - time from start, seconds
Y axis - average RPS per 10 seconds.

RPS during startup for the BingSNR, X - RPS, Y - time from start, seconds

Observations:

  1. Since a lot of stuff is prejitted (they've fixed R2R issues with their builds yesterday) we don't see RPS improvements from DynamicPGO in current Main since it gives up on R2R'd code - blue and green lines have the same RPS, green one (PGO) starts a bit slower.
  2. This PR enables DynamicPGO for R2R'd code so we can see actual RPS improvements (not much but it's likely because of my setup where I had to limit bombardier to send only 2 requests at once) - we pay a big price for that during start, as you can see it takes some time for the grey line to hit the top RPS.
  3. Red line is this PR + some tunning around the queue of methods pending call-counting installation (e.g. put all R2R'd code into a separate "low priority" queue because R2R'd code is fast as is). Also, @AndyAyersMS is right, the threshold for OSR needs to be lowered, to something like 200-100 because we don't want to stuck in a slow loop for too long - lower threshold also helped the red line.

The measurements are quite stable, I did a lot of runs locally (made a script)

@EgorBo
Copy link
Member Author

EgorBo commented Oct 19, 2022

The problem with the queue of methods pending the call counting installation (not even promotion to tier1, just to start counting calls) visualized (Bing SNR again):

image

X axis - time in seconds after start.
Y axis - number of methods waiting for getting a call-counting stub

It means that a lot of methods were stuck in tier0 and had no chance to get a call-counting stub to start counting - every 100ms a new compilation occurs and it delays call-counting stub installation till we have a window large enough for it.

@EgorBo
Copy link
Member Author

EgorBo commented Oct 19, 2022

Anyway, I'd like to work on improvements for that queue in a separate PR (and will adjust the OSR limit), e.g. to introduce a separate "low priority" queue for methods coming from R2R since it's fast as is

@AndyAyersMS can you review/approve the jit part? It seems to be passing outerloop PGO pipelines (both runtime and libraries)
@davidwrighton this PR is blocked on your change request, could you take a look if I addressed your concerns (lack of docs)?

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. Left notes on a few things to consider.

I take it COMPlus_WritePGOData still works as long as you also enable TieredPGO? It is occasionally quite useful.

const bool osrMethod = opts.IsOSR();
const bool useEdgeProfiles = (JitConfig.JitEdgeProfiling() > 0) && !prejit && !tier0WithPatchpoints && !osrMethod;
const bool instrOpt = opts.IsInstrumentedOptimized();
const bool useEdgeProfiles = (JitConfig.JitEdgeProfiling() > 0) && !prejit && !tier0WithPatchpoints && !instrOpt;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need block profiles for full instrumented/optimized methods? Seems like edge profiles might work -- unless perhaps the think you need is the is special handling for tail calls.

if so can you add a clarifying comment here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately we still need it for now, I'll allocate some time to investigate what is needed to enable edge-profiling after this PR

// Only schedule importation if we're not currently importing.
//
if (mustImportEntryBlock && (compCurBB != fgEntryBB))
if ((opts.IsInstrumentedOptimized() || opts.IsOSR()) && mustImportEntryBlock && (compCurBB != entryBb))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an OSR only issue, as full method compilation will naturally start importing with fgFirstBB.

// * We only instrument root method blocks in OSR methods,
//
if (opts.IsOSR() && !compIsForInlining())
if ((opts.IsInstrumentedOptimized() || opts.IsOSR()) && !compIsForInlining())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted below, I think this is an OSR only problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it could be

Suggested change
if ((opts.IsInstrumentedOptimized() || opts.IsOSR()) && !compIsForInlining())
if ((opts.IsInstrumentedOptimized() && opts.IsOSR()) && !compIsForInlining())

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opts.IsInstrumentedOptimized() && opts.IsOSR() is not working, for some reason it hits an assert even in non PGO OSR mode. I should be able to revert these changes once I fix "edge profiling" for optimized code

@EgorBo
Copy link
Member Author

EgorBo commented Oct 22, 2022

/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

Co-authored-by: Andy Ayers <andya@microsoft.com>
@build-analysis build-analysis bot mentioned this pull request Oct 25, 2022
2 tasks
@EgorBo
Copy link
Member Author

EgorBo commented Oct 26, 2022

I'm merging the PR, we should see improvements at the PGO tab at https://aka.ms/aspnet/benchmarks in a couple of days.
Meanwhile I'll work on the queue of methods pending call-counting because this PR introduces new compilations hence delays promotion to tier1, I already have a couple of improvements in that area.

Just in case: this PR doesn't affect the current defaults, it only kicks in in DOTNET_TieredPGO=1

PS: I've just kicked off a new SPMI collection

@EgorBo EgorBo merged commit 46021ad into dotnet:main Oct 26, 2022
@EgorBo EgorBo deleted the new-tier branch October 26, 2022 11:14
@AndyAyersMS
Copy link
Member

Keep a close eye on BDN results from the lab -- we may need to adjust things since some benchmarks implicitly rely on the old tiering strategy.

@EgorBo
Copy link
Member Author

EgorBo commented Oct 27, 2022

Out of the oven public results from https://aka.ms/aspnet/benchmarks (14th page)

image

🙂

DynamicPGO now have almost the same RPS as FullPGO, slightly less because of less accurate profile in optimized code (after R2R) - as expected.
Both DynamicPGO and FullPGO now start faster because they only instrument hot code, not just everything.

Improvements in the working set are unrelated - #76985. I checked that this PR doesn't regress working set (due to more compilations) or at least it's around noise.
^ UPD: #76985 also improved start-time on Linux so on the screenshot it's a joined effort from these two PRs for Time to first request. RPS is still this PR-only. On windows where #76985 has no effect we still see improvements around start-time on FullPGO as expected:
image

@ghost ghost locked as resolved and limited conversation to collaborators Nov 28, 2022
@jeffhandley jeffhandley added the blog-candidate Completed PRs that are candidate topics for blog post coverage label Mar 14, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-VM-coreclr blog-candidate Completed PRs that are candidate topics for blog post coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Provide block counts in tiered compilation from R2R images

9 participants