-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Introduce WASM JITs for interpreter opcodes, do_jit_call, and interp_entry wrappers #76477
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tagging subscribers to 'arch-wasm': @lewing Issue DetailsThis (draft) PR introduces a 'jiterpreter' which on-the-fly compiles 'traces' of interpreter opcodes into WebAssembly so that the interpreter can transition into and out of them for closer-to-native performance when AOT isn't possible. See https://paper.dropbox.com/doc/Jiterpreter-For-WebAssembly--BqLn88GjC3z4GDuE09vamtw4Ag-mJ5TRc7xF1PNAGMgwU49K for a detailed overview. The main changes can be broken down into:
This PR also contains a few small optimizations that significantly improve the performance of the interpreter (and as a result, the jiterpreter) on some compute-heavy test cases like raytracers. One of the raytracer test cases is currently included in the PR (but will probably be removed). While the PR is not 'ready', review for the runtime/interpreter internals bits would be appreciated, there are definitely parts that feel like they need cleanup and suggestions on how to do them right would be great. At present this is able to pass the full System.Runtime test suite in the console with performance roughly at parity with it disabled (including the time spent actually compiling WASM), and for the raytracer test cases I'm using the performance improvements are in the neighborhood of 11x native -> 6x native. The biggest current limitation is the lack of support for backward branches and calls - I don't expect to ever add support for calls, but ideally I can either add backward branches or introduce a mechanism for turning loop/try bodies into their own traces. The jiterpreter also currently ignores debugger opcodes, but it's possible to implement them. The biggest performance problem is that calling through function pointers to a dynamically linked function in browsers is just plain slow, so traces have to be relatively large to overcome that.
|
PR description updated to cover the new 'jiterpreter trampolines', which are jitted replacements for the AOT interp_entry wrappers. The now-sabotaged raytracer sample in the PR forces aot/interp transitions during execution due to runtime generics and delegates, so it shows wildly different timings in each of the 4 possible execution modes (regular AOT, AOT+trampolines, regular interpreter, interpreter+jiterpreter+trampolines). There are opportunities to JIT other wrappers to further improve the performance of this test in AOT mode, I think - it spends a lot of time on overhead in various places instead of actually doing work, and before I sabotaged it it was MUCH faster in AOT. Comparison timings follow (best time selected from a set of runs in each case). Note that if you are running it for measurement, do not open devtools until all your runs are done, because devtools deoptimize JS and wasm code. If you want to profile it, start the profiler and then reload the page while the profiler is running.
|
I reworked the size threshold logic used to select where to insert traces, and implemented a C-side heuristic that is able to make fairly accurate guesses about where traces should go which reduces the overhead introduced by the jiterpreter when it fails to JIT a method. The timings for the toy raytracer testcase are still roughly as above, but the addition of the heuristic means less time is spent generating code in the runtime test suite, so the jiterpreter is now a measurable perf improvement for it. Some summary timings for fresh runs of the toy raytracer, pavel's more complex raytracer sample (w/o SIMD), and the system.runtime test suite (which currently is incompatible with AOT):
(edited to add more timings) |
8d16571
to
9b93f14
Compare
I would love this to be optional part of |
With the options interface in this PR, we can set the default to off and then you use a runtime flag to turn it on. See the |
I mean, to link it out via rollup. |
Here are some updated timings for various scenarios, since some of the performance characteristics have changed since the previous timings (a few key reasons are that batching the JIT operation can cause slower code to run for a period of time - this slows down console tests but not most real applications, and that I had to revert the inlining threshold). System.Runtime.Tests now passes with AOT enabled, but while the jiterpreter is active for it, the fact that the suite runs synchronously means that the improvement is small enough to be within the margin of error, so I omitted it along with a couple others that were ~100%. (Nothing relevant got slower)
|
The latest commit repurposes the little toy .wasm module used for WASM EH support detection to function as an optimized replacement for I verified that all This new optimization works even if you disable the rest of the jiterpreter. |
My current plan for landing this is to disable the jiterpreter JIT features (traces, jit_call wrappers, and interp_entry wrappers) and only land it with the new optimized jit call exception handling path enabled, since it provides a good speed improvement and is lower risk. That lets us flush out any unexpected issues with JITing wasm this way along with the parts of the scaffolding that are always enabled (like the options engine, feature detection, etc). Then in a followup PR I can flip the jiterpreter JITs to enabled so that it's easy to revert that part without having to change tons of files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to add these as options to utils/options-def.h. It would auto generate the argument parsing code, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I saw while looking at the header I wouldn't be able to use it to do this, but I'll give it another try.
The mono changes look ok to me otherwise. |
Jiterpreter traces work (yay alignment) Use function pointers to invoke jiterpreter traces Implement binops, basic arithmetic, and some unary ops cknull Strlen ldloca_s + print statistics at exit getchr, track bailout count Use a JS dispatcher for traces because emscripten addFunction is bad. Use unchecked memory setters in traces. Enable tracing for more code. Checkpoint ldfld Checkpoint stfld Checkpoint: F32 arithmetic Add enable flag Move back to addFunction because while the overhead of adding the trace functions is higher, the invoke performance is better Checkpoint: Traces continue past brfalse/brtrue if the branch is not taken Checkpoint wasm generation Generates dummy traces Checkpoint: Works again Optimized trace generator Traces work again Fixed codegen for memmove and ldloca Use wasm memmove instruction Implement cknull Implement brfalse and brtrue Refactor null checks stloc refactoring Basic ldfld/stfld support Checkpoint ldind (broken) ldind works ldfld_o stfld_o ldfld_vt Unary math intrinsics and ldlen/ldelem_ref Fix math intrinsics, add some i8 operations Working (u|i)32 -> i64 promotion stind Add fp remainder operation, adjust thresholds to avoid making tests way slower Fix remainder codegen, implement mono_ldptr ldelema1 ldelema1 error handling ldelem for ints and floats Checkpoint trace enter optimization (broken) Generate moves instead of a memmove for small constant-sized memmoves More accurate trace length measurement, higher trace length requirement for better perf Check in simple raytracer sample code De-inline intersection test because the branches cause lots of trace bailouts add_iN_imm and mul_iN_imm ldc_i8 for small constants 64-bit relops and a few 64-bit superinsns Checkpoint relop branches Fixed relop branches Specify alignment of ldloc/stloc where possible Fix imm add/mul opcodes and improve debugging Pass locals to traces instead of frame. Hard-code data items into the trace instead of loading them. Pass method name to prepare_jiterpreter, include method name in trace name, add flag to trace method name when aborts happen. Add Min and Max interpreter opcodes Try to keep interpreter tiering working Add abs interp opcode, add missing float32 interp opcodes. Fix codegen for f32 pow/min/etc. Don't generate additional jiterpreter enter opcodes when inlining. Fix a warning Fix more warnings, don't pass ip to traces Checkpoint work on static fields; fix unbalanced stack on stfld Static fields work Support for generating trace entry points at the target of all backward branches. When bailing out for a branch, add the displacement to the bailout target. Restructure the simple raytracer Categorize bailouts and count each type separately when counting is active Integer divide and remainder Implement MUL_OVF_I4 and MUL_OVF_UN_I4. Fix the getelema failure check being inverted. Add support for more precise tracking of what method calls cause traces to abort Implement INITOBJ Implement STIND_REF i4 to u1/u2 conversions Add i4->i2 and i4->i1 conversions Fix min and max interp opcodes Detect when a trace is approaching the browser wasm size limit (4kb) and abort it so that it doesn't fail to compile Debugging and logging improvements Optimize generated module boilerplate for size Rework branch patch point system to be less sloppy Handle different branch depths Checkpoint new approach to branches Fix ldc failures unbalancing the stack. Fix some branch problems Implement GETCHR and STRLEN Fully qualified trace names Move some stuff into a new file Optimize some null checks to not perform a double memory load Cache i4 mul/div/rem operands in locals instead of re-loading them from memory after range checks Less aggressive 'a branch can never skip past this' abort for calls Fix math operand caching Fix ldlen Improved error handling and instrumentation u4 -> i4 with underflow check Shifts with immediates ldind with constant offset Re-enable stelem Add ldlen_span, fix incorrect base in trace error messages newobj_vt_inlined and newobj_inlined ldc_i8 for full value range ldobj_vt and cpobj_vt stobj_vt br(true|false)_i8_s and ldftn_addr Fix local index assignment. implement unsigned i8 div and rem intrins_span_ctor Improved instrumentation for estimated opcode importance, implement more intrinsics If a method has the Intrinsic attribute, treat it as if AggressiveInlining were set in the interpreter. This makes System.Numerics.Vectors types much faster. Fix remainder opcodes. Implement castclass and isinst Implement INTRINS_MEMORYMARSHAL_GETARRAYDATAREF. Add partial support for safepoint branches and with-immediate relop branches ldelem_vt Replace JS remainder implementation with a wrapper around C fmod. Implement atan2. Fix casing so aot-cross can be located during build of samples Fix i32/i64 type mismatch for an opcode. Rebase damage fixes. When rejecting a trace, set its abort reason to trace-too-small. Implement r8 conditional branches. Implement ldind_offset stfld_vt Checkpoint: Easier enable/disable for jiterpreter, mess up the raytracer sample so that it causes AOT to bail into the interp Implement MINT_UNBOX, fix counters Optimize some jiterpreter entry points by moving them into interp.c so they can be inlined Don't call stackval_from_data for pointers Checkpoint backwards branch support Backwards branch entry points seem to work now Rename various jiterpreter entry points and make sure they don't use external APIs or volatile Add some missing indirect opcodes and rework the indirect code Fix negative constant indirect offsets Implement MINT_BOX Adjust thresholds and add some missing offset opcodes Code cleanup Move various jiterpreter code to its own .c file Checkpoint Checkpoint Add heuristic filter to avoid inserting trace enter opcodes in places where they are unlikely to produce a trace Fix crash at startup with jiterpreter turned off Branch block heuristic fixes When estimating trace lengths, don't count opcodes like nop and dummy_use Lower trace length threshold since it ignores dummy opcodes now, adjust test benchmark to be slower, prevent leb being linked out when jit wrappers are enabled Repair rebase damage eslint and wasi build fixes --amend C4206 workaround Diff cleanup Diff cleanup Fix jiterp opdefs Checkpoint: Move most jiterpreter options into c so they can be accessed from everywhere Move the opcode abort counts into C and update them in the heuristic abort logic as well. More options improvements Revert inlining limit to see if it fixes the crash; fix osx build Manually enable jiterpreter stats in browser-bench lint fix; implement ld_delegate_method_ptr since it shows up in browser bench wrappers Disable jiterpreter and wrapper jit by default if threading is enabled. Update comments. Checkpoint code cleanup Code cleanup Code and project file cleanup Implement castclass_interface and isinst_interface Implement localloc and newarr. Pass frame pointer to traces. CI build fix Rework call target tracking so it's more useful. Move jiterpreter APIs to INTERNAL. Implement add_ovf for i4 and u4, implement leave_s, improve usefulness of hottest failed traces output Implement LDTSFLDA, improve trace list filtering Implement BOX_VT Unify overflow check and castclass/isinst helpers to reduce number of imports. Implement more conversions and cast/isinst variants Unify overflow conversions and implement a few more. Implement i8 immediate comparison branches. Fix overflow in bailout counting Checkpoint: add jit for specialized do_jit_call trampolines Checkpoint Optimized out wasmtable.get overhead when invoking jit call wrappers Revert test changes Checkpoint repairing merge damage Temporarily disable getitem_span since it changed Implement getitem_span and getitem_localspan CI build fixes Address PR feedback Address PR feedback Address PR feedback Address PR feedback Refactor import section generation; checkpoint non-indirect jit call wrappers Use direct call instead of indirect call for jit call wrappers. Optimize some of the __linker_exports generated wrappers to use .call instead of .apply so there is less runtime overhead on C-to-JS calls Improved linker export wrappers for lower c-to-js overhead Replace linker export wrappers with the actual export functions at startup to remove overhead Remove debug print and fix infinitely growing import list Revert lib.js changes since the optimized wrapper complexity is not needed anymore lint fix for CI build Fix computed goto non-wasm interpreter build due to jiterpreter opcodes Implement LEAVE and BR alongside the short versions Implement support for wasm exception handling in jit_call wrappers Aligned memops in a few key places Remove MINT_NEWARR support since it appears to break NDPin test and it's not terribly important to have. Fix instrumented method support Remove instrumented method JIT do_jit_call wrappers in groups if possible instead of one at a time. If multiple do_jit_call cinfos share the same target function, share a single wrapper Improve error message for missing lazy cwraps Lower initial heap size for browser-bench because the default of 512mb causes OOM during the appstart tests Also generate groups of interp_entry wrappers in one go instead of generating one module each Repair merge damage Fix name collision when generating multiple interp entry wrappers at once Inline stackval_from_data logic for common types into interp_entry wrappers Optimize jitcall hit counter Fix jitcall counter continuing to increment even if jit fails Detect whether the runtime supports WASM EH and if not, disable it automatically Improve formatting of log messages Code cleanup Fix crash due to not handling the unbox flag for the this-reference in interp_entry wrappers Rename some of the jiterpreter configuration bools Add a missing jiterpreter config switch Don't rely on atob for the wasm EH support probe, just convert a hex literal by hand instead Clean up some options, adjust some thresholds, and add a couple new comments Code cleanups, rearrangement, and add comments Add support for optimized JS and WASM EH based implementations of mono_llvm_cpp_catch_exception Formatting improvements
// multiple times and waste some work. At present this is unavoidable because | ||
// control flow means we can end up with two traces covering different subsets | ||
// of the same method in order to handle loops and resuming | ||
gboolean should_generate = enabled && should_generate_trace_here(bb, td->last_ins); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code around here operates on the wrong assumption that interp instructions are linked together from the first all the way to the last instruction in the method which is wrong. Instructions are linked together only within the same basic block. td->last_ins
is pretty much useless and I should probably get rid of it at some point since it is bad design (after an interp_add_ins
we can access the added instruction from td->last_ins
so we set data into it)
// A preceding trace may have been in a branch block, but we only care whether the current | ||
// trace will have a branch block opened, because that determines whether calls and branches | ||
// will unconditionally abort the trace or not. | ||
gboolean inside_branch_block = FALSE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This serves no purpose currently. Once we hit a branching opcode the bblock ends and we stop the iteration. If you want to continue probing the trace after a branch you should continue with bb->next_bb
and restart with first_ins
from there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, this used to do something but it doesn't anymore. I'll figure out whether it's dead code to be removed or whether I need to fix it, since the heuristic is working pretty well as-is.
|
||
// Keep this file in sync with mintops.def. The order and values need to match exactly. | ||
|
||
export const enum MintOpcode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mintops.def
changes very often. Can't we do anything to simply include it ? Or have this automatically generated ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could probably include it at build time and text parse it at runtime, but that would break the typescript type checking (which has caught quite a few bugs). I could probably also figure out some way to generate this during the build process. I'll look into it.
Note that if the mintops change is actually changing the implementation of an opcode, that'll break the jiterpreter unless it's also updated. Do you have thoughts on how that should be handled? The easiest solution is to disable the opcode until someone who knows typescript can update it and re-enabled, but it means that modifying the interp requires potentially modifying the jiterp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be possible for the runtime (src/mono/mono) build to create a .ts file using some C preprocessor steps (or, worst case, using an MSBuild task). and then the wasm build (src/mono/wasm/runtime) can just consume the .ts file from elsewhere.
Note that if the mintops change is actually changing the implementation of an opcode, that'll break the jiterpreter unless it's also updated. Do you have thoughts on how that should be handled?
You mean if the semantics of an opcode change? I think we should probably just handle it ad-hoc and disable the opcode.
This PR introduces a "jiterpreter" just-in-time WASM compiler for interpreter opcodes along with a set of specialized WASM JIT compilers to improve the performance of
interp_entry
anddo_jit_call
. The result is significantly improved performance for pure-interpreter workloads along with measurable speedups for interp->aot and aot->interp transitions in mixed mode AOT applications.The jiterpreter can be thought of as an additional layer on top of interpreter tiering, and works by inserting "trace entry points" into interpreter opcode streams. Once an entry point is reached often enough, the JIT kicks in to compile as many following interpreter opcodes into WASM as possible. Once a jitted 'trace' is available, when the interpreter reaches the entry point it can invoke the trace to natively execute dozens of interpreter opcodes at once without loop dispatch overhead. Any trace entry point that fails to JIT is replaced with a special NOP.
See https://paper.dropbox.com/doc/Jiterpreter-For-WebAssembly--BqLn88GjC3z4GDuE09vamtw4Ag-mJ5TRc7xF1PNAGMgwU49K for a more detailed overview of the jiterpreter design.
Due to browser limitations it is not possible to JIT entire methods at once, so entry points have to be placed at key locations within a given method, like immediately after a backwards branch target (for loop bodies). The jiterpreter has a simple single-pass compiler, so when it encounters an opcode or complex pattern it can't handle it 'bails out' to return control to the interpreter. Complex control flow like exceptions, calls and returns are also handled by bailing out to the interpreter. Traces share the interpreter's state and locals instead of maintaining their own, which allows seamless bailouts and maintains existing GC/thread safety characteristics.
The jiterpreter infrastructure is also used to on-demand compile specialized wrappers for
interp_entry
anddo_jit_call
. All the critical reflective/dynamic work (like enumerating parameter types) is done at compile time to produce a small, branchless dispatcher that is able to set up an interpreter stack frame or invoke native code with fewer copies than the generic wrappers currently used. To determine whichdo_jit_call
sites should be optimized, we maintain a hit counter much like we do for traces, and only JIT the most frequently executed ones. Forinterp_entry
we optimize all of them.An additional optimization in this PR leverages the jiterpreter infrastructure to optimize all
do_jit_call
execution by generating a dedicated function (either in WASM or JS) to trap exceptions and dispatch native calls. The current implementation usesmono_llvm_cpp_catch_exception
to perform an indirect call bounced through JavaScript, which adds a measurable amount. By using specialized code we can perform a direct call using WASM Exception Handling (if available) or using a specialized JavaScript function that doesn't do any indirection or table lookups. This optimization by itself speeds up System.Runtime.Tests in AOT mode by about 2 seconds. You can find the function template in the newdo-jit-call.wat
anddo-jit-call.wasm
files.The jiterpreter's features are configurable using a set of new runtime options, the key ones being:
--jiterpreter-enable-traces
Enables the insertion of trace entry points in interpreter functions and JIT compilation for those entry points.
--jiterpreter-enable-interp-entry
Enables the generation of specialized interp_entry wrappers. Note that this option is not retroactive, and these are typically all generated at app startup.
--jiterpreter-enable-jit-call
Enables the generation of specialized do_jit_call wrappers.
--jiterpreter-enable-stats
Enables dumping detailed statistics on the jiterpreter, including how many WASM functions were jitted, which interpreter opcodes caused the most failures, and how much time was spent. If statistics are enabled they are automatically dumped at app exit, but they can be manually dumped using an INTERNAL api.
This PR also contains some scaffolding changes and interpreter optimizations that were necessary in order to finish it and test it on benchmark scenarios: