
[CPU] Fix ARM float32 fmsub #149292

@zhangfeiv0


🐛 Describe the bug

In vec_base.h, the fmsub function is defined to compute a * b - c. However, the specialization in vec128_float_neon.h uses vfmsq_f32. According to the Arm manual, for the argument order c, a, b this intrinsic computes c - a * b, so the NEON path returns the negation of the intended result.

Below is the testing environment:

# lscpu
Architecture:             aarch64
  CPU op-mode(s):         64-bit
  Byte Order:             Little Endian
CPU(s):                   16
  On-line CPU(s) list:    0-15
Vendor ID:                HiSilicon
  BIOS Vendor ID:         QEMU
  Model name:             Kunpeng-920
    BIOS Model name:      virt-rhel8.2.0  CPU @ 2.0GHz
    BIOS CPU family:      1
    Model:                0
    Thread(s) per core:   1
    Core(s) per socket:   16
    Socket(s):            1
    Stepping:             0x1
    BogoMIPS:             200.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm

The running result of ./build/bin/vec_test_all_types_DEFAULT:

[ RUN      ] BitwiseFloatsAdditional/0.Fmsub
/root/pytorch/aten/src/ATen/test/vec_test_all_types.h:880: Failure
Expected equality of these values:
  nearlyEqual<UVT>(expArr[i], actArr[i], absErr)
    Which is: false
  true
24127.314453125!=-24127.314453125
Failure Details:
fmsub "/root/pytorch/aten/src/ATen/test/vec_test_all_types.cpp":940
Test Seed to reproduce: 1742184039034921432
Arguments:
#        vec[-108.478317, 430.048676, 439.19342, 111.896461]
#        vec[443.884338, 151.219467, -189.899826, -492.905579]
Expected:
#       vec[24127.3145, 257714.359, -55222.8984, 15263.2383]
Actual:
#       vec[-24127.3145, -257714.359, 55222.8945, -15263.2383]
First mismatch Index: 0

Modify the NEON implementation to negate the vfmsq_f32 result, since -(c - a * b) = a * b - c:

template <>
Vectorized<float> inline fmsub(const Vectorized<float>& a, const Vectorized<float>& b, const Vectorized<float>& c) {
  return Vectorized<float>(vnegq_f32(vfmsq_f32(c, a, b)));
}

The running result of ./build/bin/vec_test_all_types_DEFAULT after the modification:

[----------] 6 tests from BitwiseFloatsAdditional/0, where TypeParam = at::vec::DEFAULT::Vectorized<float>
[ RUN      ] BitwiseFloatsAdditional/0.ZeroMask
[       OK ] BitwiseFloatsAdditional/0.ZeroMask (0 ms)
[ RUN      ] BitwiseFloatsAdditional/0.Convert
[       OK ] BitwiseFloatsAdditional/0.Convert (0 ms)
[ RUN      ] BitwiseFloatsAdditional/0.Fmadd
[       OK ] BitwiseFloatsAdditional/0.Fmadd (78 ms)
[ RUN      ] BitwiseFloatsAdditional/0.Fmsub
[       OK ] BitwiseFloatsAdditional/0.Fmsub (78 ms)
[ RUN      ] BitwiseFloatsAdditional/0.FmaddVecN
[       OK ] BitwiseFloatsAdditional/0.FmaddVecN (79 ms)
[ RUN      ] BitwiseFloatsAdditional/0.Blendv
[       OK ] BitwiseFloatsAdditional/0.Blendv (0 ms)
[----------] 6 tests from BitwiseFloatsAdditional/0 (236 ms total)

Versions

main

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01


    Labels

    module: arm — Related to ARM architectures builds of PyTorch. Includes Apple M1
    module: correctness (silent) — issue that returns an incorrect result silently
    module: vectorization — Related to SIMD vectorization, e.g., Vec256
    triage review
    triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
