
[CPU] Fix ARM float32 fmsub #149292

@zhangfeiv0


🐛 Describe the bug

In vec_base.h, the fmsub function is defined to compute a * b - c. However, the specialization in vec128_float_neon.h uses vfmsq_f32. According to the Arm manual, for the argument order c, a, b this intrinsic computes c - a * b, so the NEON path returns the negation of the intended result.

Below is the testing environment:

# lscpu
Architecture:             aarch64
  CPU op-mode(s):         64-bit
  Byte Order:             Little Endian
CPU(s):                   16
  On-line CPU(s) list:    0-15
Vendor ID:                HiSilicon
  BIOS Vendor ID:         QEMU
  Model name:             Kunpeng-920
    BIOS Model name:      virt-rhel8.2.0  CPU @ 2.0GHz
    BIOS CPU family:      1
    Model:                0
    Thread(s) per core:   1
    Core(s) per socket:   16
    Socket(s):            1
    Stepping:             0x1
    BogoMIPS:             200.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm

The running result of ./build/bin/vec_test_all_types_DEFAULT:

[ RUN      ] BitwiseFloatsAdditional/0.Fmsub
/root/pytorch/aten/src/ATen/test/vec_test_all_types.h:880: Failure
Expected equality of these values:
  nearlyEqual<UVT>(expArr[i], actArr[i], absErr)
    Which is: false
  true
24127.314453125!=-24127.314453125
Failure Details:
fmsub "/root/pytorch/aten/src/ATen/test/vec_test_all_types.cpp":940
Test Seed to reproduce: 1742184039034921432
Arguments:
#        vec[-108.478317, 430.048676, 439.19342, 111.896461]
#        vec[443.884338, 151.219467, -189.899826, -492.905579]
Expected:
#       vec[24127.3145, 257714.359, -55222.8984, 15263.2383]
Actual:
#       vec[-24127.3145, -257714.359, 55222.8945, -15263.2383]
First mismatch Index: 0

Modify the NEON implementation to negate the vfmsq_f32 result, since -(c - a * b) = a * b - c:

template <>
Vectorized<float> inline fmsub(const Vectorized<float>& a, const Vectorized<float>& b, const Vectorized<float>& c) {
  return Vectorized<float>(vnegq_f32(vfmsq_f32(c, a, b)));
}

The running result of ./build/bin/vec_test_all_types_DEFAULT after the modification:

[----------] 6 tests from BitwiseFloatsAdditional/0, where TypeParam = at::vec::DEFAULT::Vectorized<float>
[ RUN      ] BitwiseFloatsAdditional/0.ZeroMask
[       OK ] BitwiseFloatsAdditional/0.ZeroMask (0 ms)
[ RUN      ] BitwiseFloatsAdditional/0.Convert
[       OK ] BitwiseFloatsAdditional/0.Convert (0 ms)
[ RUN      ] BitwiseFloatsAdditional/0.Fmadd
[       OK ] BitwiseFloatsAdditional/0.Fmadd (78 ms)
[ RUN      ] BitwiseFloatsAdditional/0.Fmsub
[       OK ] BitwiseFloatsAdditional/0.Fmsub (78 ms)
[ RUN      ] BitwiseFloatsAdditional/0.FmaddVecN
[       OK ] BitwiseFloatsAdditional/0.FmaddVecN (79 ms)
[ RUN      ] BitwiseFloatsAdditional/0.Blendv
[       OK ] BitwiseFloatsAdditional/0.Blendv (0 ms)
[----------] 6 tests from BitwiseFloatsAdditional/0 (236 ms total)

Versions

main

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01


    Labels

    module: arm — Related to ARM architectures builds of PyTorch. Includes Apple M1
    module: correctness (silent) — issue that returns an incorrect result silently
    module: vectorization — Related to SIMD vectorization, e.g., Vec256
    triage review
    triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
