KEMBAR78
[wasm] SIMD support improvements by radekdoulik · Pull Request #73289 · dotnet/runtime · GitHub
Skip to content

Conversation

radekdoulik
Copy link
Member

@radekdoulik radekdoulik commented Aug 3, 2022

Add internal PackedSimd class with wasm specific SIMD intrinsics methods.

The llvm code generator works nicely with them.
So that C#

    WasmBase.Constant(0xff11ff22ff33ff44, 0xff55ff66ff77ff88)

is compiled into wasm code

    v128.const 0xff11ff22ff33ff44ff55ff66ff77ff88    [SIMD]
This will need more work, as it crashes clang during 'WebAssembly
Instruction Selection' pass:

    WasmApp.Native.targets(353,5): error : 3.    Running pass 'WebAssembly Instruction Selection' on function '@corlib_System_Runtime_Intrinsics_Wasm_WasmBase_Shuffle_System_Runtime_Intrinsics_Vector128_1_byte_System_Runtime_Intrinsics_Vector128_1_byte_System_Runtime_Intrinsics_Vector128_1_byte'
| measurement | no SIMD | SIMD |
|-:|-:|-:|
|                    Span, Reverse bytes |     0.0341ms |     0.0028ms |
|                    Span, Reverse chars |     0.0394ms |     0.0062ms |
length -= numIters * numElements * 2;
}
else if (Ssse3.IsSupported && (nuint)Vector128<byte>.Count * 2 <= length)
else if ((Ssse3.IsSupported || WasmBase.IsSupported) && (nuint)Vector128<byte>.Count * 2 <= length)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, but using Vector128.Shuffle(tempFirst, Vector128.Create((byte)15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)); and Vector128.Shuffle(tempLast, Vector128.Create((byte)15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0));

The Create calls do need to be "inline" due to a JIT limitation in .NET 7, but they will get optimized and hoisted regardless since we have the relevant GT_CNS_VEC support in RyuJIT (Mono LLVM has something similar).

namespace System.Runtime.Intrinsics.Wasm
{
[Intrinsic]
internal abstract class WasmBase
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should take this (well #53730) to actual API review for .NET 8.

Provided that the relevant hookups exist for the xplat Vector128<T> (and Vector<T>) APIs, WASM will also be able to light up on any paths using those APIs prior to WASM support being directly exposed.

Copy link
Member Author

@radekdoulik radekdoulik Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Provided that the relevant hookups exist for the xplat Vector128<T> (and Vector<T>) APIs, WASM will also be able to light up on any paths using those APIs prior to WASM support being directly exposed.

IIRC we already handle xplat for Vector128<T> on wasm. Vector<T> is next for me to look into, I hope I can speedup string and other parts on wasm.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the info!

We're looking at moving more code to use Vector128<T> xplat APIs where possible so that we can support more platforms with less code overall (and only using the platform specific APIs where it gives significant improvement or where functionality can't be represented using the xplat APIs).

So supporting Vector128<T> will allow more WASM light-up faster.

@radekdoulik
Copy link
Member Author

Any more extensive perf numbers for these changes?

I can add more wasm measurents, if that is what you are looking for?

We have many paths which use the Vector128<T> (and several which still use Vector<T>) xplat APIs now and which could get WASM lightup provided the recognition existed.

We already have some of Vector128<T> intrinsics in place for wasm.

@tannergooding
Copy link
Member

I can add more wasm measurents, if that is what you are looking for?

I misunderstood exactly what lightup this was bringing at first. I thought it was adding Vector128<T> light up as well and so was expecting perf numbers for more than just Span<T>.

Now that I understand its just adding WasmBase and using that directly for Span<T> where we were already using xplat intrinsics, I think the numbers provided are fine.

@radical
Copy link
Member

radical commented Aug 5, 2022

@radekdoulik btw, you can use /azp run runtime-wasm-perf to run all the wasm benchmarks from dotnet/performance with this PR.


namespace System.Runtime.Intrinsics.Wasm
{
internal abstract class WasmBase
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not critical for this PR, but just noting that the approved API surface differs slightly from what the proposal initially had: #53730 (comment)

WasmBase -> PackedSimd, Constant is cut as you should just use Vector128.Create(...) instead, most other things just had the parameter names updated from x, a, b, and imm to be things like left, right, value, index, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@radekdoulik btw, you can use /azp run runtime-wasm-perf to run all the wasm benchmarks from dotnet/performance with this PR.

nice to have this pipeline. it would not measure these changes though, as they are hidden behind WasmSIMD property, which is disabled by default.

@ghost
Copy link

ghost commented Aug 11, 2022

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

Issue Details

Add internal WasmBase class with wasm specific SIMD intrinsics methods.

Improve Span Reverse and IndexOf performance on wasm with SIMD

Measurements chrome/aot/amd64:

measurement no SIMD SIMD
Span, Reverse bytes 0.0332ms 0.0025ms
Span, Reverse chars 0.0332ms 0.0060ms
Span, IndexOf bytes 0.2068us 0.1002us
Span, IndexOf chars 0.0146ms 0.0028ms

Measurements firefox/aot/amd64:

measurement no SIMD SIMD
Span, Reverse bytes 0.0338ms 0.0022ms
Span, Reverse chars 0.0339ms 0.0048ms
Span, IndexOf bytes 0.2533us 0.1394us
Span, IndexOf chars 0.0201ms 0.0039ms

Code size reduction, the browser-bench sample: -5184 bytes (no SIMD 19539839 --> SIMD 19534655)

Author: radekdoulik
Assignees: radekdoulik
Labels:

arch-wasm, area-System.Runtime.Intrinsics

Milestone: -

@radekdoulik
Copy link
Member Author

Nice surprise to have the API approved! :-) I have updated the PR to reflect the changes and removed the span helper changes, as @adamsitnik's change uses the Vector128 now. The perf is not yet where it was, I will address that with next PR.

I am moving the old measurements here from PR description to keep them for later comparison.

Improve Span Reverse and IndexOf performance on wasm with SIMD

Measurements chrome/aot/amd64:

measurement no SIMD SIMD
Span, Reverse bytes 0.0332ms 0.0025ms
Span, Reverse chars 0.0332ms 0.0060ms
Span, IndexOf bytes 0.2068us 0.1002us
Span, IndexOf chars 0.0146ms 0.0028ms

Measurements firefox/aot/amd64:

measurement no SIMD SIMD
Span, Reverse bytes 0.0338ms 0.0022ms
Span, Reverse chars 0.0339ms 0.0048ms
Span, IndexOf bytes 0.2533us 0.1394us
Span, IndexOf chars 0.0201ms 0.0039ms

Code size reduction, the browser-bench sample: -5184 bytes (no SIMD 19539839 --> SIMD 19534655)

@radekdoulik
Copy link
Member Author

I think I addressed all things from reviews. Is there anything else to change? I would like it to merge before it becomes outdated again ;-)

@radical
Copy link
Member

radical commented Aug 25, 2022

/azp run runtime-wasm,runtime-wasm-perf

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@radical
Copy link
Member

radical commented Aug 25, 2022

Running runtime-wasm-perf just as a sanity check for the perf pipeline.

@lewing lewing requested review from tannergooding and vargaz August 30, 2022 16:17
@lewing lewing merged commit 4190ef8 into dotnet:main Aug 31, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Sep 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants