Suboptimal codegen for memset/memcpy unrolling

When the JIT unrolls memset/memcpy, it makes suboptimal decisions for certain sizes. For example, to zero 30 bytes:

```csharp
unsafe struct MyStruct
{
    fixed byte a[30]; // 30-byte fixed buffer; requires an unsafe context
}

MyStruct Test()
{
    MyStruct s = default; // zero-initialize all 30 bytes
    return s;
}
```
```asm
  xor      eax, eax
  vxorps   xmm0, xmm0
  vmovdqu  xmmword ptr [rdx], xmm0
  mov      qword ptr [rdx+10H], rax
  mov      qword ptr [rdx+16H], rax
```
So to zero 30 bytes, it uses one 16-byte SIMD store followed by two 8-byte GPR stores. It would be better to keep using SIMD and overlap the last store with the previously zeroed part:
```asm
  vxorps   xmm0, xmm0, xmm0
  vmovups  xmmword ptr [rdx+14], xmm0
  vmovups  xmmword ptr [rdx], xmm0
```
The same applies to other sizes that are not a multiple of the vector width.
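
To illustrate how the overlapping approach could generalize to other sizes, here is a minimal sketch that only computes the store offsets. It is not JIT code: `OverlapPlan`, `Plan`, and the `vectorWidth` parameter are hypothetical names made up for this example, and it assumes the block is at least one vector wide.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of the JIT.
static class OverlapPlan
{
    // Returns the (offset, width) stores needed to zero `size` bytes using
    // only `vectorWidth`-byte stores. If a tail remains, the last store is
    // shifted back so that it ends exactly at `size`, overlapping bytes
    // that an earlier store has already zeroed.
    public static List<(int Offset, int Width)> Plan(int size, int vectorWidth = 16)
    {
        if (vectorWidth <= 0 || size < vectorWidth)
            throw new ArgumentOutOfRangeException(nameof(size));

        var stores = new List<(int Offset, int Width)>();
        int offset = 0;

        // Full-width stores from the start of the block.
        for (; offset + vectorWidth <= size; offset += vectorWidth)
            stores.Add((offset, vectorWidth));

        // Remaining tail: one more full-width store, moved back to overlap.
        if (offset < size)
            stores.Add((size - vectorWidth, vectorWidth));

        return stores;
    }
}
```

For `Plan(30)` this yields stores at offsets 0 and 14, both 16 bytes wide, which matches the two `vmovups` stores above. The order in which the stores are emitted does not matter, since they all write the same value.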

PS: arm64 seems to be doing the right thing here.

Suboptimal codegen for memset/memcpy unrolling #83277
