KEMBAR78
Vectorize Convert.FromHexString by EgorBo · Pull Request #82521 · dotnet/runtime · GitHub
Skip to content

Conversation

@EgorBo
Copy link
Member

@EgorBo EgorBo commented Feb 23, 2023

Vectorize Convert.FromHexString (UTF-16) using "Algorithm 3" in http://0x80.pl/notesen/2022-01-17-validating-hex-parse.html by Geoff Langdale and Wojciech Mula

Motivation

Convert.ToHexString is accelerated while Convert.FromHexString is not. Also, this PR's UTF8-based impl can be re-used for other things like Guid parsing (both UTF16 and UTF8) to e.g. speed up json parsing with guids inside.

Benchmark

public static IEnumerable<string> HexTestData()
{
    yield return "ABCD";
    yield return "01234567";
    yield return "0123456789ABCDEF";
    yield return "0123456789ABCDEF0123456789ABCDEF";
    yield return "0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF";
    yield return "0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF";
}

[Benchmark]
[ArgumentsSource(nameof(HexTestData))]
public byte[] FromHexString(string hex) => Convert.FromHexString(hex);
|        Method |                   Toolchain |                  hex |       Mean | Ratio |
|-------------- |---------------------------- |--------------------- |-----------:|------:|
| FromHexString |      \Core_Root\corerun.exe |                 ABCD |  11.549 ns |  0.97 |
| FromHexString | \Core_Root_base\corerun.exe |                 ABCD |  11.926 ns |  1.00 |
|               |                             |                      |            |       |
| FromHexString |      \Core_Root\corerun.exe |             01234567 |  15.492 ns |  0.96 |
| FromHexString | \Core_Root_base\corerun.exe |             01234567 |  16.212 ns |  1.00 |
|               |                             |                      |            |       |
| FromHexString |      \Core_Root\corerun.exe |     0123456789ABCDEF |   9.597 ns |  0.39 |
| FromHexString | \Core_Root_base\corerun.exe |     0123456789ABCDEF |  24.550 ns |  1.00 |
|               |                             |                      |            |       |
| FromHexString |      \Core_Root\corerun.exe | 01234(...)BCDEF [32] |  11.541 ns |  0.31 |
| FromHexString | \Core_Root_base\corerun.exe | 01234(...)BCDEF [32] |  37.090 ns |  1.00 |
|               |                             |                      |            |       |
| FromHexString |      \Core_Root\corerun.exe | 01234(...)BCDEF [64] |  15.928 ns |  0.21 |
| FromHexString | \Core_Root_base\corerun.exe | 01234(...)BCDEF [64] |  75.480 ns |  1.00 |
|               |                             |                      |            |       |
| FromHexString |      \Core_Root\corerun.exe |  0123(...)CDEF [128] |  24.333 ns |  0.17 | ~6x improvement
| FromHexString | \Core_Root_base\corerun.exe |  0123(...)CDEF [128] | 140.145 ns |  1.00 |

Regression for small inputs is around noise, I assume it's hidden under allocation overhead

@ghost ghost assigned EgorBo Feb 23, 2023
@EgorBo EgorBo marked this pull request as ready for review February 23, 2023 20:56
@EgorBo
Copy link
Member Author

EgorBo commented Feb 24, 2023

This is ready for review, arm64 is a bit slower but still shows x3 improvement on my M1.

PTAL @dotnet/area-system-memory

@ghost
Copy link

ghost commented Feb 27, 2023

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Vectorize Convert.FromHexString (UTF-16) using "Algorithm 3" in http://0x80.pl/notesen/2022-01-17-validating-hex-parse.html by Geoff Langdale and Wojciech Mula

Motivation

Convert.ToHexString is accelerated while Convert.FromHexString is not. Also, this PR's UTF8-based impl can be re-used for other things like Guid parsing (both UTF16 and UTF8) to e.g. speed up json parsing with guids inside.

Benchmark

public static IEnumerable<string> HexTestData()
{
    yield return "ABCD";
    yield return "01234567";
    yield return "0123456789ABCDEF";
    yield return "0123456789ABCDEF0123456789ABCDEF";
    yield return "0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF";
    yield return "0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF";
}

[Benchmark]
[ArgumentsSource(nameof(HexTestData))]
public byte[] FromHexString(string hex) => Convert.FromHexString(hex);
|        Method |                   Toolchain |                  hex |       Mean | Ratio |
|-------------- |---------------------------- |--------------------- |-----------:|------:|
| FromHexString |      \Core_Root\corerun.exe |                 ABCD |  11.549 ns |  0.97 |
| FromHexString | \Core_Root_base\corerun.exe |                 ABCD |  11.926 ns |  1.00 |
|               |                             |                      |            |       |
| FromHexString |      \Core_Root\corerun.exe |             01234567 |  15.492 ns |  0.96 |
| FromHexString | \Core_Root_base\corerun.exe |             01234567 |  16.212 ns |  1.00 |
|               |                             |                      |            |       |
| FromHexString |      \Core_Root\corerun.exe |     0123456789ABCDEF |   9.597 ns |  0.39 |
| FromHexString | \Core_Root_base\corerun.exe |     0123456789ABCDEF |  24.550 ns |  1.00 |
|               |                             |                      |            |       |
| FromHexString |      \Core_Root\corerun.exe | 01234(...)BCDEF [32] |  11.541 ns |  0.31 |
| FromHexString | \Core_Root_base\corerun.exe | 01234(...)BCDEF [32] |  37.090 ns |  1.00 |
|               |                             |                      |            |       |
| FromHexString |      \Core_Root\corerun.exe | 01234(...)BCDEF [64] |  15.928 ns |  0.21 |
| FromHexString | \Core_Root_base\corerun.exe | 01234(...)BCDEF [64] |  75.480 ns |  1.00 |
|               |                             |                      |            |       |
| FromHexString |      \Core_Root\corerun.exe |  0123(...)CDEF [128] |  24.333 ns |  0.17 | ~6x improvement
| FromHexString | \Core_Root_base\corerun.exe |  0123(...)CDEF [128] | 140.145 ns |  1.00 |

Regression for small inputs is around noise, I assume it's hidden under allocation overhead

Author: EgorBo
Assignees: EgorBo
Labels:

area-System.Memory

Milestone: -

@adamsitnik adamsitnik added the tenet-performance Performance related issue label Feb 27, 2023
@EgorBo
Copy link
Member Author

EgorBo commented Mar 3, 2023

Ping @dotnet/area-system-memory @stephentoub any interest in this? The final goal was to use this work for Guid parsing (should be a small change)

@EgorBo
Copy link
Member Author

EgorBo commented Mar 29, 2023

Closing due to lack of interest after 3 pings

@EgorBo EgorBo closed this Mar 29, 2023
@stephentoub
Copy link
Member

I'm interested :)

@stephentoub stephentoub reopened this Mar 29, 2023
Comment on lines 292 to 308
if (Ssse3.IsSupported)
{
output = Ssse3.MultiplyAddAdjacent(nibbles,
Vector128.Create((short)0x0110).AsSByte()).AsByte();
}
else
{
// Workaround for missing MultiplyAddAdjacent on ARM
Vector128<short> ones = Vector128.Create(0x10010).AsInt16();
Vector128<short> vl = AdvSimd.Multiply(
AdvSimd.ZeroExtendWideningLower(nibbles.GetLower()).AsInt16(), ones).AsInt16();
Vector128<short> vu = AdvSimd.Multiply(
AdvSimd.ZeroExtendWideningLower(nibbles.GetUpper()).AsInt16(), ones).AsInt16();
output = AdvSimd.AddSaturate(
AdvSimd.Arm64.UnzipEven(vl, vu),
AdvSimd.Arm64.UnzipOdd(vl, vu)).AsByte();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be a helper function for easier readability of the loop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not sure how to call that helper because aparently it's not a cross-platform MultiplyAddAdjacent so that's why I decided to keep it here. Maybe we'll simplify it for arm in #82521 (comment)

@tannergooding
Copy link
Member

Closing due to lack of interest after 3 pings

Sorry, this got missed in the general hectic period of recent area ownership changes. Looking now.

Comment on lines 299 to 307
// Workaround for missing MultiplyAddAdjacent on ARM
Vector128<short> ones = Vector128.Create(0x10010).AsInt16();
Vector128<short> vl = AdvSimd.Multiply(
AdvSimd.ZeroExtendWideningLower(nibbles.GetLower()).AsInt16(), ones).AsInt16();
Vector128<short> vu = AdvSimd.Multiply(
AdvSimd.ZeroExtendWideningLower(nibbles.GetUpper()).AsInt16(), ones).AsInt16();
output = AdvSimd.AddSaturate(
AdvSimd.Arm64.UnzipEven(vl, vu),
AdvSimd.Arm64.UnzipOdd(vl, vu)).AsByte();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there not a better way to do this either? Arm64 has many specialized multiply instructions, including those that simultaneously widen or do addition or saturation.

@TamarChristinaArm might have some insight...


For reference, Vector128<short> MultiplyAddAdjacent(Vector128<byte> left, Vector128<sbyte> right) is PMADDUBSW. This effectively does the following:

DEST[0] = SaturateToSignedWord((short)(SRC[1] * DEST[1]) + (short)(SRC[0] * DEST[0]));
DEST[1] = SaturateToSignedWord((short)(SRC[3] * DEST[3]) + (short)(SRC[2] * DEST[2]));
...

That is, it takes an unsigned byte (left) and a signed byte (right). It multiplies these to produce a signed short. It then does a pairwise addition.


In the case of xarch, we're taking an unsigned byte (nibbles) of unknown value and a signed byte of known value 0x0110 which is effectively unsigned.

Since for x-bit * x-bit = y-bit the sign is only relevant when y-bit > x-bit, we can effectively treat this as an unsigned multiplication and then reinterpret it back as signed.

That is, we can at least do:

Vector128<byte> multiplier = Vector128.Create((ushort)0x0110).AsByte());
Vector128<short> vl = AdvSimd.MultiplyWideningLower(nibbles.GetLower(), multiplier.GetLower()).AsInt16();
Vector128<short> vu = AdvSimd.MultiplyWideningUpper(nibbles, multiplier).AsInt16()

This will allow us to remove a couple of the operations. I'm not aware of a trivial way to do a pairwise saturation here, however so we may need to keep that latter bit as is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I did here is I took a general impl of MultiplyAddAdjacent and simplified it for the constant argument. I ran benchmarks on M1 and they overall demonstrated similar ratios as for x64. I also do think it's possible to make it more efficient

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi :),

I'm on holidays till Monday, but Tanner's version is much better.
Unfortunately AArch64 lacks a lot of pairwise operations.

However from a quick look (so please double check the below before implementing) looking at the constant you used

Vector128<short> ones = Vector128.Create(0x10010).AsInt16();

This looks like it's multiplying even elements by 0x10 and odd by 0x1.
So evens are just shift left by 4 and odds just zero extend?

Atm the sequence from Tanner is:

movi + umull + umull + (uzp1 + uzp2) + sqadd

where the permutes run in parallel. So roughly 2 + 4 + 4 + 2 + 2 = 14 cycles on Neoverse-N1, 12 cycles assuming the movi got lifted.

Since you need the permute anyway and nibbles is unsigned, how about instead, create a zero vector outside the loop and do the permutes earlier to separate even/odd, but take advantage of this to get a zero extend free.

zip1 (with zero vector) + uzp2 + ushll 4 + sqadd

Which would be 2 + 2 + 2 = 6 since the permutes again run in parallel.

Should be a 2x speedup.

Copy link
Member Author

@EgorBo EgorBo Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TamarChristinaArm @tannergooding thanks for the suggestion! Did I get it correctly:

var zip1 = AdvSimd.Arm64.ZipLow(nibbles, Vector128<byte>.Zero); // zip1
var uzp2 = AdvSimd.Arm64.UnzipOdd(nibbles, nibbles); // uzp2
var ushl = AdvSimd.ShiftLeftLogicalWideningLower(uzp2.GetLower(), 4); // ushll 4
output = AdvSimd.AddSaturate(zip1, ushl.AsByte()); // sqadd

?
unfortunately it doesn't produce expected results 🙁

Copy link
Member

@tannergooding tannergooding Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure about the uzip2, but I think the principle is that because the multiplier is effectively multiplying even * 16 and odd * 1 and that we know both inputs are unsigned, we can do this:

var even = AdvSimd.Arm64.ZipLow(nibbles, Vector128<byte>.Zero).AsInt16();
var odd  = AdvSimd.Arm64.ZipHigh(nibbles, Vector128<byte>.Zero).AsInt16();
even = AdvSimd.ShiftLeftLogical(even, 4);
output = AdvSimd.AddSaturate(even, odd);

This gives us a 4 instruction sequence: zip1, zip2, shl, sqadd


This works because ZipLow and ZipHigh interleave the elements. That is ZipLow is:

for (int p = 0; p < Vector128<T>.Count / 2; p++)
{
    result[(2 * p) + 0] = left[p];
    result[(2 * p) + 1] = right[p];
}

While ZipHigh is:

for (int p = 0; p < Vector128<T>.Count / 2; p++)
{
    result[(2 * p) + 0] = left[p + 1];
    result[(2 * p) + 1] = right[p + 1];
}

This means we can get the effect of decoupling the even and odd elements while zero-extending at the same time.

We then just need to multiply the even results by 16, which is the same as shifting left by 4. We don't need to touch the odd results since they are just multiplied by 1.

Then we do the saturating add and are complete.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's not quite right

ZipLow is doing lhs[0], rhs[0], lhs[1], rhs[1], ...
and we want lhs[0], rhs[0], lhs[2], rhs[2], ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work.. TransposeEven is the one that does lhs[0], rhs[0], lhs[2], rhs[2], ... (And odd indices for TransposeOdd)

var even = AdvSimd.TransposeEven(nibbles, Vector128<byte>.Zero).AsInt16();
var odd  = AdvSimd.TransposeOdd(nibbles, even.AsByte()).AsInt16();
even = AdvSimd.ShiftLeftLogical(even, 4);
output = AdvSimd.AddSaturate(even, odd);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work.. TransposeEven is the one that does lhs[0], rhs[0], lhs[2], rhs[2], ...

Args. yes sorry, flipped the permutes operation in my head. So should be:

var zip1 = AdvSimd.Arm64.TransposeEven(nibbles, Vector128<byte>.Zero); // trn1
var uzp2 = AdvSimd.Arm64.UnzipOdd(nibbles, nibbles); // uzp2
var ushl = AdvSimd.ShiftLeftLogicalWideningLower(uzp2.GetLower(), 4); // ushll 4
output = AdvSimd.AddSaturate(zip1, ushl.AsByte()); // sqadd

or @tannergooding 's version above.

we don't get two movni reg, #0 emitted for Vector128.Zero

The loop doesn't look that big, and I assume there's no loop invariant motion? It might be worth lifting the 5 constants out of the loop. You'll definitely have enough left over on Advanced SIMD, but am unsure about AVX/SSE. But worth trying to see what comes out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume there's no loop invariant motion?

We do loop invariant motion, it's just never done for simple things like zero vectors. Even if I do hoisting by hands it will be propagated back at its use. The main issue that our RA is not perfect so we try to minimize cheap cases to occupy registers. We'll improve that eventually. Overall, most constant vectors are hoisted from this loop by JIT (at least till the IsAscii check)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main issue that our RA is not perfect so we try to minimize cheap cases to occupy registers.

Ah that's fine, Zero is typically free anyway, the movi idiom is handled at rename. So should only really be an issue in cases where it prevents you from squeezing stuff inside a single fetch block. So no reall priority, if the complicated ones are hoisted :)

Co-authored-by: Günther Foidl <gue@korporal.at>
@EgorBo
Copy link
Member Author

EgorBo commented Mar 29, 2023

I'm interested :)

Sorry, this got missed in the general hectic period of recent area ownership changes. Looking now.

No problem! I was just trying to minimize number of open PRs. I quite often close my PRs and then re-open them after a while and finish them 🙂

@EgorBo
Copy link
Member Author

EgorBo commented Mar 31, 2023

@tannergooding anything else on this PR? I assume we can optimize more things for arm64 specifcally here but it's a trade-off between code readability/portability and perf?

@tannergooding
Copy link
Member

I think what we have is good now, and makes up for the perf difference of xarch having 1 instruction vs Arm64 having 4

@EgorBo EgorBo merged commit f1b32fe into dotnet:main Mar 31, 2023
@EgorBo EgorBo deleted the vectorize-from-hex branch March 31, 2023 19:02
@EgorBo
Copy link
Member Author

EgorBo commented Mar 31, 2023

Failure is #83655

@ghost ghost locked as resolved and limited conversation to collaborators May 1, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants