Description
🐛 Describe the bug
We have identified 5 tests that fail with a segmentation fault (or, in one case, a heap corruption abort) on AArch64 neoverse-v1. neoverse-v2 (i.e. AWS c8g) seems to be unaffected.
How we identified this: our workflow runs the unit tests against a manywheel build, whereas the linux-aarch64.yml workflow builds inside a jammy container. This is why these tests are currently passing in CI.
test_ops.py::TestCommonCPU::test_noncontiguous_samples_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_grid_sampler_2d_cpu Fatal Python error: Segmentation fault
test_ops.py::TestMathBitsCPU::test_neg_view_nn_functional_grid_sample_cpu_float64 free(): invalid next size (fast)
test_ops.py::TestCompositeComplianceCPU::test_backward_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_nn_functional_grid_sample_cpu Fatal Python error: Segmentation fault
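Because each of the tests listed above kills the interpreter, it can be convenient to run them one per process and record how each process ended. The following is a minimal sketch of such a driver, not part of the PyTorch test suite; the helper names `describe_returncode` and `run_all` are hypothetical, and the default path `test/test_ops.py` assumes the script is started from the root of a PyTorch checkout.

```python
import signal
import subprocess
import sys

# Test IDs copied from the list above, in the Class.test_name form
# accepted on the command line by test/test_ops.py.
FAILING_TESTS = [
    "TestCommonCPU.test_noncontiguous_samples_grid_sampler_2d_cpu_float32",
    "TestCommonCPU.test_dtypes_grid_sampler_2d_cpu",
    "TestMathBitsCPU.test_neg_view_nn_functional_grid_sample_cpu_float64",
    "TestCompositeComplianceCPU.test_backward_grid_sampler_2d_cpu_float32",
    "TestCommonCPU.test_dtypes_nn_functional_grid_sample_cpu",
]


def describe_returncode(returncode):
    """Turn a subprocess return code into a short human-readable outcome."""
    if returncode == 0:
        return "passed"
    if returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            return "killed by signal %d" % -returncode
    return "failed (exit code %d)" % returncode


def run_all(test_file="test/test_ops.py"):
    """Run every test in its own interpreter so one crash cannot hide the rest."""
    results = {}
    for test_id in FAILING_TESTS:
        proc = subprocess.run(
            [sys.executable, test_file, test_id],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results[test_id] = describe_returncode(proc.returncode)
    return results


if __name__ == "__main__":
    for test_id, outcome in run_all().items():
        print(f"{outcome:>24}  {test_id}")
```

A `free(): invalid next size` failure would normally show up here as an abort signal rather than SIGSEGV, since glibc aborts the process when it detects heap corruption.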
How to reproduce
This can be reproduced consistently with a nightly build. You will need a neoverse-v1 machine (e.g. AWS c7g). First install the PyTorch nightly, then run any of these tests.
pytorch$ python test/test_ops.py TestCommonCPU.test_dtypes_grid_sampler_2d_cpu
Segmentation fault
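Since the crash has only been observed on neoverse-v1, it may help to confirm the core type before attempting the reproduction. The sketch below parses the `CPU part` field of `/proc/cpuinfo`. The part numbers (0xd40 for Neoverse V1, 0xd4f for Neoverse V2) are an assumption taken from Arm's published MIDR values and should be double-checked against the core's reference manual; `core_names` is a hypothetical helper name.

```python
import re

# Assumed MIDR part numbers as shown in the "CPU part" field of /proc/cpuinfo.
KNOWN_PARTS = {
    0xD40: "neoverse-v1",
    0xD4F: "neoverse-v2",
}


def core_names(cpuinfo_text):
    """Return the set of core names found in the text of /proc/cpuinfo."""
    parts = re.findall(
        r"^CPU part\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, flags=re.MULTILINE
    )
    return {KNOWN_PARTS.get(int(p, 16), "unknown (%s)" % p) for p in parts}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(core_names(f.read()))
    except OSError:
        # Not a Linux host, or /proc is not mounted.
        print("could not read /proc/cpuinfo")
```

An empty set means no `CPU part` field was found, which is expected on non-Arm hosts.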
Cause
We have identified the cause to be PR #152825, which was merged about 2 months ago.
I have confirmed that reverting this patch makes all of the above tests pass again. It also explains why CI is currently passing: the PR did not upgrade jammy to GCC 13 at the same time, and AFAIK the linux-aarch64.yml build is executed in a jammy container, not the manylinux container.
Next Steps
There are a few possible resolutions we could take here:
- We could revert Use gcc13 in Manylinux 2.28 images #152825 in its entirety.
- We could revert ONLY the AArch64 images back to GCC 11.
- It has been suggested that a likely cause is GCC's auto-vectorizer, so we could try disabling that.
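For the last option, one way to experiment locally is to rebuild from source with GCC's `-fno-tree-vectorize` flag, which turns off the tree vectorizer. The sketch below only prepares an environment with that flag appended to `CFLAGS` and `CXXFLAGS`. Whether the manywheel build scripts honor these variables for every target is an assumption that would need checking, and `env_without_vectorizer` is a hypothetical helper name.

```python
import os


def env_without_vectorizer(base_env=None, flag="-fno-tree-vectorize"):
    """Return a copy of the environment with `flag` appended to CFLAGS/CXXFLAGS.

    The flag is added only if it is not already present, so calling this
    twice does not duplicate it.
    """
    env = dict(os.environ if base_env is None else base_env)
    for var in ("CFLAGS", "CXXFLAGS"):
        existing = env.get(var, "")
        if flag not in existing.split():
            env[var] = (existing + " " + flag).strip()
    return env


# Example use (assumes a PyTorch source checkout as the working directory):
#
#   import subprocess, sys
#   subprocess.run([sys.executable, "setup.py", "develop"],
#                  env=env_without_vectorizer(), check=True)
```

If the tests pass with a build made this way, that would support the auto-vectorizer theory without yet pinpointing the affected kernel.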
Versions
Collecting environment information...
PyTorch version: 2.9.0.dev20250704+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.5 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.35
Python version: 3.10.18 | packaged by conda-forge | (main, Jun 4 2025, 14:39:45) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.8.0-1031-aws-aarch64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: ARM
Model: 1
Thread(s) per core: 1
Core(s) per cluster: 16
Socket(s): -
Cluster(s): 1
Stepping: r1p1
BogoMIPS: 2100.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
L1d cache: 1 MiB (16 instances)
L1i cache: 1 MiB (16 instances)
L2 cache: 16 MiB (16 instances)
L3 cache: 32 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, BHB
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] mypy==1.16.0
[pip3] mypy_extensions==1.1.0
[pip3] numpy==1.22.4
[pip3] onnx==1.18.0
[pip3] onnx-ir==0.1.3
[pip3] onnxscript==0.3.1
[pip3] optree==0.13.0
[pip3] torch==2.9.0.dev20250704+cpu
[conda] No relevant packages
cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01