Closed
Labels
area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Description
On x64, memset and memmove are forwarded to the CRT implementation, as seen below:
jmp memset ; forward to the CRT implementation
jmp memmove ; forward to the CRT implementation
However, on Arm64 they are hand-written in assembly, as seen in https://github.com/dotnet/runtime/blob/2453f16807b85b279efc26d17d6f20de87801c09/src/coreclr/vm/arm64/crthelpers.asm. We should experiment to see whether the CRT implementation of memset/memmove for Arm64 is faster and, if so, just use it. We might also need to readjust the heuristics we use today to decide when to unroll a block copy.
Here is the benchmark run difference between x64 (base) and arm64 (diff)
Here is the x64 code for CopyBlock128() benchmark that just uses memcpy:
G_M19447_IG03:
lea rcx, bword ptr [rsp+08H]
lea rdx, bword ptr [rsp+88H]
mov r8d, 128
call CORINFO_HELP_MEMCPY
inc edi
cmp edi, 100
jl SHORT G_M19447_IG03
Arm64, however, unrolls the copy inline instead of calling a helper:
G_M19447_IG03:
ldr x1, [fp,#152]
str x1, [fp,#24]
ldp q16, q17, [fp,#160]
stp q16, q17, [fp,#32]
ldp q16, q17, [fp,#192]
stp q16, q17, [fp,#64]
ldp q16, q17, [fp,#224]
stp q16, q17, [fp,#96]
ldr q16, [fp,#0xd1ffab1e]
str q16, [fp,#128]
ldr x1, [fp,#0xd1ffab1e]
str x1, [fp,#144]
add w0, w0, #1
cmp w0, #100
blt G_M19447_IG03
I will perform some experiments and update the results here.
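As a rough starting point for such an experiment, the two strategies shown above (calling the library routine versus copying a fixed 128-byte block in fixed-size chunks) could be compared with a small C harness like the one below. This is only a sketch under stated assumptions: the names `copy128_libc`, `copy128_chunked`, and `time_copy` are hypothetical and do not come from the runtime, and a C compiler is free to inline or rewrite either candidate, so the timings only approximate what the JIT-generated code would show.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* Block size taken from the CopyBlock128() benchmark discussed above. */
enum { BLOCK_SIZE = 128 };

/* Hypothetical signature shared by both candidates. */
typedef void (*copy_fn)(void *dst, const void *src);

/* Candidate A: defer to the C library, similar in spirit to the x64 helper call. */
void copy128_libc(void *dst, const void *src) {
    memcpy(dst, src, BLOCK_SIZE);
}

/* Candidate B: copy the block in fixed 16-byte chunks, loosely mirroring
   the q-register loads and stores in the Arm64 listing. */
void copy128_chunked(void *dst, const void *src) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t off = 0; off < BLOCK_SIZE; off += 16) {
        unsigned char tmp[16];
        memcpy(tmp, s + off, 16);  /* load one 16-byte chunk */
        memcpy(d + off, tmp, 16);  /* store it at the same offset */
    }
}

/* Time `iterations` copies through `fn` and return elapsed CPU seconds.
   The checksum keeps the copied data observable so the loop is not
   trivially removed by the optimizer. */
double time_copy(copy_fn fn, long iterations, unsigned *checksum) {
    unsigned char src[BLOCK_SIZE];
    unsigned char dst[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; i++) {
        src[i] = (unsigned char)i;
    }
    memset(dst, 0, sizeof dst);

    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        src[0] = (unsigned char)i;  /* vary the input between iterations */
        fn(dst, src);
        *checksum += dst[0];
    }
    clock_t end = clock();

    /* Sanity check: the last copy must have reproduced the source block. */
    assert(memcmp(dst, src, BLOCK_SIZE) == 0);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A driver would call `time_copy` once per candidate with the same iteration count and compare the returned times. Any conclusion about the JIT itself should still come from measuring the actual generated code on Arm64 hardware.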
