
Arm64: Forward memset/memcpy to CRT implementation  #67326

@kunalspathak


On x64, memset and memmove are forwarded to the CRT implementation, as seen below:

jmp memset ; forward to the CRT implementation

jmp memmove ; forward to the CRT implementation
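
In C terms, the forwarding pattern above amounts to a helper whose whole body is a tail call into the C runtime. The following is a minimal sketch, not the runtime's actual code: the names `helper_memset` and `helper_memmove` are hypothetical, and whether the compiler emits a single jump for each body depends on the optimization settings.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper names, for illustration only. */
void *helper_memset(void *dst, int value, size_t count)
{
    /* Nothing but a call in tail position: forward to the CRT implementation. */
    return memset(dst, value, count);
}

void *helper_memmove(void *dst, const void *src, size_t count)
{
    /* Same pattern for memmove, which also handles overlapping ranges. */
    return memmove(dst, src, count);
}
```

Keeping the helper this thin means any tuning done in the platform's CRT is picked up automatically.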

However, on Arm64 they are hand-written in assembly, as seen in https://github.com/dotnet/runtime/blob/2453f16807b85b279efc26d17d6f20de87801c09/src/coreclr/vm/arm64/crthelpers.asm. We should experiment to see whether the CRT implementation of memset/memmove is faster on Arm64 and, if so, just use it. We might also need to readjust the heuristics we use today to decide when to unroll a block copy.
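
One rough way to frame such an experiment is a small timing harness that runs two candidate implementations through the same loop. The sketch below is an assumption-laden illustration, not the benchmark used for this issue: `handwritten_memset` is a placeholder byte loop standing in for the assembly routine, and `time_memset` is a hypothetical helper. A real comparison would use the runtime's own performance benchmarks and a range of block sizes.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Placeholder for a hand-written routine: a plain byte loop.
   The real crthelpers.asm code is more elaborate than this. */
void *handwritten_memset(void *dst, int value, size_t count)
{
    unsigned char *p = dst;
    for (size_t i = 0; i < count; i++) {
        p[i] = (unsigned char)value;
    }
    return dst;
}

typedef void *(*memset_fn)(void *, int, size_t);

/* Hypothetical harness: returns the CPU seconds spent filling `buf`
   `iterations` times through the given implementation. */
double time_memset(memset_fn fn, void *buf, size_t size, int iterations)
{
    clock_t start = clock();
    for (int i = 0; i < iterations; i++) {
        fn(buf, i & 0xff, size);   /* vary the fill value on each pass */
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_memset(memset, buf, size, n)` and `time_memset(handwritten_memset, buf, size, n)` with the same arguments gives two numbers to compare. `clock()` is coarse, so the iteration count needs to be large for the difference to be meaningful.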

Here is the benchmark difference between x64 (base) and Arm64 (diff):

[image: perf_diff]

Here is the x64 code for the CopyBlock128() benchmark, which just calls the memcpy helper:

G_M19447_IG03:
       lea      rcx, bword ptr [rsp+08H]
       lea      rdx, bword ptr [rsp+88H]
       mov      r8d, 128
       call     CORINFO_HELP_MEMCPY
       inc      edi
       cmp      edi, 100
       jl       SHORT G_M19447_IG03

Arm64, however, unrolls the block copy into inline loads and stores:

G_M19447_IG03:
            ldr     x1, [fp,#152]
            str     x1, [fp,#24]
            ldp     q16, q17, [fp,#160]
            stp     q16, q17, [fp,#32]
            ldp     q16, q17, [fp,#192]
            stp     q16, q17, [fp,#64]
            ldp     q16, q17, [fp,#224]
            stp     q16, q17, [fp,#96]
            ldr     q16, [fp,#0xd1ffab1e]
            str     q16, [fp,#128]
            ldr     x1, [fp,#0xd1ffab1e]
            str     x1, [fp,#144]
            add     w0, w0, #1
            cmp     w0, #100
            blt     G_M19447_IG03
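
Reading the store offsets in the Arm64 listing (24, 32, 64, 96, 128, 144), the unrolled sequence splits the 128-byte block into chunks of 8 + 32 + 32 + 32 + 16 + 8 bytes. The C sketch below mirrors that split using offsets relative to the start of the block. The function names are hypothetical, and the comments reflect one reading of the disassembly rather than a statement of how the JIT picks chunk sizes.

```c
#include <stddef.h>
#include <string.h>

/* Copy `size` bytes at offset `off`. A fixed-size memcpy like this is
   normally lowered by the compiler to plain loads and stores. */
static void copy_chunk(unsigned char *dst, const unsigned char *src,
                       size_t off, size_t size)
{
    memcpy(dst + off, src + off, size);
}

/* Hypothetical name: copies exactly 128 bytes in the same chunk pattern
   as the Arm64 listing (8 + 3 * 32 + 16 + 8 = 128). */
void copy_block_128_unrolled(unsigned char *dst, const unsigned char *src)
{
    copy_chunk(dst, src,   0,  8);  /* ldr/str x1:        8 bytes */
    copy_chunk(dst, src,   8, 32);  /* ldp/stp q16, q17: 32 bytes */
    copy_chunk(dst, src,  40, 32);  /* ldp/stp q16, q17: 32 bytes */
    copy_chunk(dst, src,  72, 32);  /* ldp/stp q16, q17: 32 bytes */
    copy_chunk(dst, src, 104, 16);  /* ldr/str q16:      16 bytes */
    copy_chunk(dst, src, 120,  8);  /* ldr/str x1:        8 bytes */
}
```

The x64 listing instead corresponds to a single `memcpy(dst, src, 128)` call, which is the trade-off the unrolling heuristics decide.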

I will perform some experiments and update the results here.

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
