-
Notifications
You must be signed in to change notification settings - Fork 5.2k
[wasm] Add Vector128 and PackedSimd support to the jiterpreter; add PackedSimd to the interpreter #82773
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tagging subscribers to 'arch-wasm': @lewing Issue DetailsThis updates the jiterpreter to implement some of the new SIMD opcodes that were added to the interp, which removes the need for most traces using those opcodes to bail out. It operates in one of two modes:
I probably need to add a dedicated feature detection path that compiles a test wasm module containing v128 opcodes to sense the availability of SIMD, but the automated fallback may be good enough. Right now the C functions are invoked using call_indirect, so they're virtual calls. They could be static if I overhaul jiterpreter-support to allow defining new imports during code generation. On the other hand, the interpreter also does a virtual call, so this is at least not any worse. Not all of the SIMD opcodes are implemented directly, so even with v128 opcodes turned on some of them will be dispatched to the interpreter's C. I stuck to ones I could be relatively confident I was implementing correctly, since the wasm spec's documentation for its vector opcodes is written in cipher. This PR also updates genmintops to generate SIMD tables. There is some stray merge damage left in the PR right now I need to figure out how to resolve, and I need to decide whether the v128 opcodes should be enabled by default when merging this. I'm inclined to leave them switched off, so the baseline is just calling the interp C helpers and we can see if that's an improvement by itself.
|
4ade1c6
to
6af7e2e
Compare
0cea840
to
9c07f4f
Compare
browser-bench timings with this rebased on main:
My opinion is pretty mixed based on this. It looks like there are some workloads in our bench suite here that vectorize, and the performance improves (especially if we enable native use of wasm v128 opcodes). But a lot of stuff seems to hold steady or even be worse - presumably cases where the interpreter intrinsics support isn't broad enough, so saying 'v128 is hardware accelerated'! hurts us. Not sure what the right approach to take is with this. Maybe if we had wider support for Vector128 at the interpreter intrinsic level we'd see more gains? @BrzVlad what do you think? I think for now it makes sense to leave |
@kg I don't think the way you test is correct, since the numbers make no sense. It is like you are testing the same thing. What does Also I don't think we should bother with perf investigations of this until we get some reference numbers from the microbenchmark suite on desktop interp |
I enable interp SIMD for the first two columns by changing I'll retest by editing the browser-bench csproj to add that property; because SIMD had been enabled on main I thought it would be sufficient to set the constant in the interp. If the code is being trimmed out that would explain not seeing much of a difference - the opcodes I'm seeing must be for stuff that escaped trimming. |
I rebuilt the runtime with |
@kg I merged main can you double check the legacy_cwraps bit |
Looks right |
0cc7e5d
to
9bfd63d
Compare
I still have no explanation for why this performs the way it does.
A handful of things like Span.Reverse and Vector.Normalize are obviously improved, some are worse like Span.IndexOf, and some of these timing samples just make no sense and make me wonder if browser-bench is broken on my hardware. EDIT: Ah, the ones that look way wrong are a msec/usec disconnect. Wish it didn't do that. |
I think it is best to investigate why |
The custom conditional select and bitwise equality/inequality opcodes in the jiterpreter side are broken somehow, so I've disabled them. I'm not sure what's going on there since tests pass, but unless they're disabled the beginning of xharness output XML ends up going missing. This means that the interpreter C helpers get used instead, which isn't the end of the world. |
8d8a71d
to
f97e897
Compare
This is now blocked on real failures in the vectorized IndexOfAny code that didn't run before due to lack of PackedSimd. I am unable to come up with a satisfactory explanation for why it's not working. (Incidentally, the set of tests that fail locally is different from the set that fail on CI, and the values are different, which makes me think some sort of memory corruption or uninitialized memory.)
While I'm mildly suspicious of the vectorized code since it was previously being guarded, I don't see anything wrong with it. It's possible we're now exercising some part of the interp v128 code that we weren't before, but it doesn't seem to heavily use those APIs. I double and triple checked that I am using the correct clang intrinsic for the relevant opcode, and even experimented with using different intrinsics anyway in case I misread something. It also fails even if jiterpreter is disabled, so that isn't a factor. |
Disabling the interp v128 intrinsics doesn't seem to fix it either (I verified that doing this results in mostly-scalar code with wasm packedsimd opcodes scattered inside). So that seems to indicate that somehow the PackedSimd operation(s) are the problem. |
Radek pointed out that this failure could be #84775 . It could be that there is something wrong with the way interp aligns v128 arguments or return values, or my copy-paste produced some incorrect alignment. |
Fix bugs in the jiterpreter-support import management implementation Enable interpreter SIMD for WASM Minor cleanups Fix merge Checkpoint PackedSimd Implement a few PackedSimd methods in the interp Add packedsimd ops to jiterpreter Move the interp simd opcode -> wasm opcode mapping into the C tables Add intrinsic ids for most of the remaining packedsimd methods Map most of the PackedSimd methods to intrinsics Update genmintops Fix merge damage Fix build Add more wasm opcodes Add more wasm opcodes Add bitmask intrinsics Add missing opcodes to transform-simd Use HOST_BROWSER instead of HOST_WASM to fix wasi build Implement the pack-n-elements vector instructions Disable bitselect because it's broken somehow Simplify vector packing Add browser-bench measurements for packing vectors Disable more opcodes Disable more opcodes Fix PackedSimd feature detection on non-wasm targets Maybe fix linux interp assertion Don't fail transform for unsupported PackedSimd methods on non-browser targets Fix i64 popcnt Add basic R4 v128 intrinsics (add/sub/div/mul) Re-enable more jiterpreter simd
I added some basic R4 vector ops to the interp and changed some asserts so that it's possible to run the SIMD version of pavel's raytracer. It completes in ~16k msec with these changes compared to ~31k msec before, nearly a 50% reduction. The crash issues and test failures I was hitting before seem to have at least partially been caused by an arglist alignment issue that is now fixed on main, so hopefully tests will pass now... |
The remaining issue on this PR seems to be that once we report PackedSimd support, writing XML for the testResults.xml file ends up dropping a few characters at the start or end of the stream. It's not clear to me why this would happen. |
Can this be turned off by default so the PR can be merged ? |
I could probably comment out the PackedSimd support, but it makes this PR much less useful since the BCL uses it now. I'm fine with doing that assuming none of the CI tests fail and the only remaining problem is the xml, since it means we're not merging tremendously broken code. |
All the failures I saw were the test results xml size problem, so I added runtime options to govern interp v128 and packedsimd support and set them both to FALSE for browser. |
This updates the jiterpreter to implement simd intrinsics that were added to the interp, which removes the need for most traces using those opcodes to bail out. It operates in one of two modes:
Not all of the SIMD opcodes are implemented directly, so even with v128 opcodes turned on some of them will be dispatched to the interpreter's C. I stuck to ones I could be relatively confident I was implementing correctly, since the wasm spec's documentation for its vector opcodes is written in cipher.
Along with this I added interp opcodes for most of the PackedSimd API, with C interp intrinsics auto-generated for each one by wrapping clang's matching intrinsics.
This PR also updates genmintops to generate SIMD tables, and inlines trace entry logic because we saw a performance regression when it was outlined.
This PR also improves and unifies the imported function logic in jiterpreter-support so it's easier to work with.