Labels: module: rocm (AMD GPU support for PyTorch), triaged (this issue has been looked at by a team member and triaged into an appropriate module)
Description
🐛 Describe the bug
Hi,
I noticed that torch.cholesky_inverse raises "Memory access fault by GPU node-9 (Agent handle: 0x61edf0a17400) on address 0x71b279281000. Reason: Unknown." with the PyTorch ROCm distribution on an MI300, while it does not with the PyTorch NVIDIA distribution on an H100. I can reproduce the issue both with torch==2.7.0+rocm6.3 and with torch==2.8.0.dev20250602+rocm6.4 (the latter inside the rocm/dev-ubuntu-22.04:6.4.1-complete container).
Reproduction:

import torch

size = 48000
A = torch.randn(size, size, device="cuda").abs() + 1e-3
A = torch.triu(A)
L = torch.linalg.cholesky(A)
print("call cholesky_inverse")
L_inv = torch.cholesky_inverse(L)  # this segfaults; 48000 * 48000 * 4 bytes ≈ 9.2 GB seems reasonably small
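
For reference, a possible workaround would be to assemble the inverse from the triangular factor directly, bypassing torch.cholesky_inverse. This is only an untested sketch; that torch.linalg.solve_triangular takes a different kernel path than the MAGMA batched trsm seen in the backtrace below is my assumption, not something I have verified:

import torch

size = 48000
A = torch.randn(size, size, device="cuda").abs() + 1e-3
A = torch.triu(A)
L = torch.linalg.cholesky(A)

# cholesky_inverse(L) computes (L @ L.T)^{-1}, which can also be assembled
# from a triangular solve against the identity: with L_inv = L^{-1},
# (L @ L.T)^{-1} = L_inv.T @ L_inv. Whether this dispatch avoids the MAGMA
# batched trsm kernels is an assumption on my part.
eye = torch.eye(size, device="cuda")
L_inv = torch.linalg.solve_triangular(L, eye, upper=False)
A_inv = L_inv.T @ L_inv  # should match torch.cholesky_inverse(L) up to numerics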
I have the following backtrace on torch 2.7.0+rocm6.3 (I did not run gdb with the torch nightly + ROCm 6.4 build):
#0 0x00007ffff7d7b7f8 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0,
req=req@entry=0x7fffffffb200, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
#1 0x00007ffff7d80677 in __GI___nanosleep (req=req@entry=0x7fffffffb200, rem=rem@entry=0x0)
at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2 0x00007ffff7db1f2f in usleep (useconds=<optimized out>) at ../sysdeps/posix/usleep.c:31
#3 0x00007fff2885582a in rocr::core::BusyWaitSignal::WaitRelaxed(hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) () from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libhsa-runtime64.so
#4 0x00007fff2885546a in rocr::core::BusyWaitSignal::WaitAcquire(hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) () from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libhsa-runtime64.so
#5 0x00007fff288593a1 in rocr::HSA::hsa_signal_wait_scacquire(hsa_signal_s, hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) () from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libhsa-runtime64.so
#6 0x00007fffa5e0dd9c in roctracer::hsa_support::detail::hsa_signal_wait_scacquire_callback(hsa_signal_s, hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libroctracer64.so
#7 0x00007fffa64edd13 in amd::roc::VirtualGPU::allocKernArg(unsigned long, unsigned long) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libamdhip64.so
#8 0x00007fffa64f4646 in amd::roc::VirtualGPU::submitKernelInternal(amd::NDRangeContainer const&, amd::Kernel const&, unsigned char const*, void*, unsigned int, amd::NDRangeKernelCommand*, hsa_kernel_dispatch_packet_s*) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libamdhip64.so
#9 0x00007fffa64f4d34 in amd::roc::VirtualGPU::submitKernel(amd::NDRangeKernelCommand&) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libamdhip64.so
#10 0x00007fffa64c27b7 in amd::Command::enqueue() ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libamdhip64.so
#11 0x00007fffa63eeab5 in hip::ihipModuleLaunchKernel(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, unsigned long, unsigned int) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libamdhip64.so
#12 0x00007fffa64072f1 in hip::ihipLaunchKernel(void const*, dim3, dim3, void**, unsigned long, ihipStream_t*, ihipEvent_t*, ihipEvent_t*, int) () from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libamdhip64.so
#13 0x00007fffa63e46aa in hip::hipLaunchKernel_common(void const*, dim3, dim3, void**, unsigned long, ihipStream_t*) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libamdhip64.so
#14 0x00007fffa63e5c1d in hip::hipLaunchKernel(void const*, dim3, dim3, void**, unsigned long, ihipStream_t*) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libamdhip64.so
#15 0x00007fff32f782ce in void trsm_template_batched_lNx<float, 16, 16>(magma_uplo_t, magma_diag_t, int, int, float, float**, int, float**, int, int, int, int, int, int, magma_queue*) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#16 0x00007fff32f6d875 in magmablas_strsm_small_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#17 0x00007fff32f6b271 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#18 0x00007fff32f6a707 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#19 0x00007fff32f6a707 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#20 0x00007fff32f6a707 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#21 0x00007fff32f6a707 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#22 0x00007fff32f6a707 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#23 0x00007fff32f6a707 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#24 0x00007fff32f6a707 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#25 0x00007fff32f6a707 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#26 0x00007fff32f6a707 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#27 0x00007fff32f6a707 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#28 0x00007fff32f6a707 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#29 0x00007fff32f6a707 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#30 0x00007fff32f6a707 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#31 0x00007fff32f6a707 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#32 0x00007fff32f6a707 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#33 0x00007fff32f6a707 in magmablas_strsm_recursive_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#34 0x00007fff32f6b359 in magmablas_strsm_batched ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#35 0x00007fff32d2149e in magma_spotrs_batched () from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libmagma.so
#36 0x00007fffd73d385b in void at::native::apply_cholesky_solve<float>(at::Tensor&, at::Tensor&, bool, long&) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libtorch_hip.so
#37 0x00007fffd73e1e83 in void at::native::apply_cholesky_inverse<float>(at::Tensor&, at::Tensor&, bool) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libtorch_hip.so
#38 0x00007fffd73e2143 in at::native::cholesky_inverse_kernel_impl(at::Tensor&, at::Tensor&, bool) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libtorch_hip.so
#39 0x00007fffe2c8dea4 in at::native::cholesky_inverse_out_info(at::Tensor&, at::Tensor&, at::Tensor const&, bool) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
#40 0x00007fffe2c8e35d in at::native::cholesky_inverse_out(at::Tensor const&, bool, at::Tensor&) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
#41 0x00007fffd74338fe in at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_cholesky_inverse_out(at::Tensor const&, bool, at::Tensor&) () from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libtorch_hip.so
#42 0x00007fffe38a9644 in at::_ops::cholesky_inverse_out::call(at::Tensor const&, bool, at::Tensor&) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
#43 0x00007fffe2c83f60 in at::native::cholesky_inverse(at::Tensor const&, bool) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
#44 0x00007fffd742d55e in at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__cholesky_inverse(at::Tensor const&, bool) () from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libtorch_hip.so
#45 0x00007fffd742d5e1 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, bool), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__cholesky_inverse>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, bool> >, at::Tensor (at::Tensor const&, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, bool) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libtorch_hip.so
#46 0x00007fffe37fa73d in at::_ops::cholesky_inverse::redispatch(c10::DispatchKeySet, at::Tensor const&, bool) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
#47 0x00007fffe5c8240b in torch::autograd::VariableType::(anonymous namespace)::cholesky_inverse(c10::DispatchKeySet, at::Tensor const&, bool) () from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
#48 0x00007fffe5c82ab4 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, bool), &torch::autograd::VariableType::(anonymous namespace)::cholesky_inverse>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, bool> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, bool) () from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
#49 0x00007fffe3849844 in at::_ops::cholesky_inverse::call(at::Tensor const&, bool) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
#50 0x00007ffff62d4aa1 in torch::autograd::THPVariable_cholesky_inverse(_object*, _object*, _object*) ()
from /root/miniforge3/lib/python3.12/site-packages/torch/lib/libtorch_python.so
#51 0x0000555555778918 in cfunction_call (func=<built-in method cholesky_inverse of type object at remote 0x7ffff6f5bf40>,
args=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.12.9/Objects/methodobject.c:537
#52 0x0000555555758c23 in _PyObject_MakeTpCall (tstate=0x555555be2e50 <_PyRuntime+458992>,
callable=<built-in method cholesky_inverse of type object at remote 0x7ffff6f5bf40>, args=0x7ffff7fb7788,
nargs=<optimized out>, keywords=0x0) at /usr/local/src/conda/python-3.12.9/Objects/call.c:240
#53 0x00005555556667e4 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=0x7ffff7fb75d8,
throwflag=<optimized out>) at Python/bytecodes.c:2715
#54 0x000055555580f341 in PyEval_EvalCode (co=co@entry=<code at remote 0x555555dc6600>,
globals=globals@entry={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFileLoader(name='__main__', path='/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py') at remote 0x7ffff7c1fa40>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0x7ffff7bd0270>, '__file__': '/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py', '__cached__': None, 'sys': <module at remote 0x7ffff7bc27a0>, 'os': <module at remote 0x7ffff7c2cea0>, 'Path': <type at remote 0x555555d0f070>, 'torch': <module at remote 0x7ffff7942390>, 'warnings': <module at remote 0x7ffff7a01760>, 'argparse': <module at remote 0x7ffdbda4b650>, 'ModelQuantizer': <type at remote 0x55557979ee40>, 'ModelImporter': <function at remote 0x7fbcec7e8ae0>, 'ModelExporter': <type at remote 0x555579c55640>, 'load_params': <function at remote 0x7fbd03e47ec0>, 'save_params': <function at remote 0x7fbd03e93d80>, 'SUPPORTED_QUANT_SCHEME': ['w_uint4_per_group_asym', 'w_b...(truncated),
locals=locals@entry={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFileLoader(name='__main__', path='/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py') at remote 0x7ffff7c1fa40>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0x7ffff7bd0270>, '__file__': '/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py', '__cached__': None, 'sys': <module at remote 0x7ffff7bc27a0>, 'os': <module at remote 0x7ffff7c2cea0>, 'Path': <type at remote 0x555555d0f070>, 'torch': <module at remote 0x7ffff7942390>, 'warnings': <module at remote 0x7ffff7a01760>, 'argparse': <module at remote 0x7ffdbda4b650>, 'ModelQuantizer': <type at remote 0x55557979ee40>, 'ModelImporter': <function at remote 0x7fbcec7e8ae0>, 'ModelExporter': <type at remote 0x555579c55640>, 'load_params': <function at remote 0x7fbd03e47ec0>, 'save_params': <function at remote 0x7fbd03e93d80>, 'SUPPORTED_QUANT_SCHEME': ['w_uint4_per_group_asym', 'w_b...(truncated))
at /usr/local/src/conda/python-3.12.9/Python/ceval.c:580
#55 0x00005555558339ba in run_eval_code_obj (tstate=tstate@entry=0x555555be2e50 <_PyRuntime+458992>,
co=co@entry=0x555555dc6600,
globals=globals@entry={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFileLoader(name='__main__', path='/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py') at remote 0x7ffff7c1fa40>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0x7ffff7bd0270>, '__file__': '/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py', '__cached__': None, 'sys': <module at remote 0x7ffff7bc27a0>, 'os': <module at remote 0x7ffff7c2cea0>, 'Path': <type at remote 0x555555d0f070>, 'torch': <module at remote 0x7ffff7942390>, 'warnings': <module at remote 0x7ffff7a01760>, 'argparse': <module at remote 0x7ffdbda4b650>, 'ModelQuantizer': <type at remote 0x55557979ee40>, 'ModelImporter': <function at remote 0x7fbcec7e8ae0>, 'ModelExporter': <type at remote 0x555579c55640>, 'load_params': <function at remote 0x7fbd03e47ec0>, 'save_params': <function at remote 0x7fbd03e93d80>, 'SUPPORTED_QUANT_SCHEME': ['w_uint4_per_group_asym', 'w_b...(truncated),
locals=locals@entry={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFileLoader(name='__main__', path='/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py') at remote 0x7ffff7c1fa40>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0x7ffff7bd0270>, '__file__': '/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py', '__cached__': None, 'sys': <module at remote 0x7ffff7bc27a0>, 'os': <module at remote 0x7ffff7c2cea0>, 'Path': <type at remote 0x555555d0f070>, 'torch': <module at remote 0x7ffff7942390>, 'warnings': <module at remote 0x7ffff7a01760>, 'argparse': <module at remote 0x7ffdbda4b650>, 'ModelQuantizer': <type at remote 0x55557979ee40>, 'ModelImporter': <function at remote 0x7fbcec7e8ae0>, 'ModelExporter': <type at remote 0x555579c55640>, 'load_params': <function at remote 0x7fbd03e47ec0>, 'save_params': <function at remote 0x7fbd03e93d80>, 'SUPPORTED_QUANT_SCHEME': ['w_uint4_per_group_asym', 'w_b...(truncated))
at /usr/local/src/conda/python-3.12.9/Python/pythonrun.c:1716
#56 0x000055555582e9c5 in run_mod (mod=mod@entry=0x555555dc7da8,
filename=filename@entry='/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py',
globals=globals@entry={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFileLoader(name='__main__', path='/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py') at remote 0x7ffff7c1fa40>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0x7ffff7bd0270>, '__file__': '/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py', '__cached__': None, 'sys': <module at remote 0x7ffff7bc27a0>, 'os': <module at remote 0x7ffff7c2cea0>, 'Path': <type at remote 0x555555d0f070>, 'torch': <module at remote 0x7ffff7942390>, 'warnings': <module at remote 0x7ffff7a01760>, 'argparse': <module at remote 0x7ffdbda4b650>, 'ModelQuantizer': <type at remote 0x55557979ee40>, 'ModelImporter': <function at remote 0x7fbcec7e8ae0>, 'ModelExporter': <type at remote 0x555579c55640>, 'load_params': <function at remote 0x7fbd03e47ec0>, 'save_params': <function at remote 0x7fbd03e93d80>, 'SUPPORTED_QUANT_SCHEME': ['w_uint4_per_group_asym', 'w_b...(truncated),
locals=locals@entry={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFileLoader(name='__main__', path='/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py') at remote 0x7ffff7c1fa40>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0x7ffff7bd0270>, '__file__': '/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py', '__cached__': None, 'sys': <module at remote 0x7ffff7bc27a0>, 'os': <module at remote 0x7ffff7c2cea0>, 'Path': <type at remote 0x555555d0f070>, 'torch': <module at remote 0x7ffff7942390>, 'warnings': <module at remote 0x7ffff7a01760>, 'argparse': <module at remote 0x7ffdbda4b650>, 'ModelQuantizer': <type at remote 0x55557979ee40>, 'ModelImporter': <function at remote 0x7fbcec7e8ae0>, 'ModelExporter': <type at remote 0x555579c55640>, 'load_params': <function at remote 0x7fbd03e47ec0>, 'save_params': <function at remote 0x7fbd03e93d80>, 'SUPPORTED_QUANT_SCHEME': ['w_uint4_per_group_asym', 'w_b...(truncated), flags=flags@entry=0x7fffffffdee0,
arena=arena@entry=0x7ffff7b4d830) at /usr/local/src/conda/python-3.12.9/Python/pythonrun.c:1737
#57 0x00005555558472d0 in pyrun_file (fp=fp@entry=0x555555be59e0,
filename=filename@entry='/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py',
start=start@entry=257,
globals=globals@entry={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFileLoader(name='__main__', path='/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py') at remote 0x7ffff7c1fa40>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0x7ffff7bd0270>, '__file__': '/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py', '__cached__': None, 'sys': <module at remote 0x7ffff7bc27a0>, 'os': <module at remote 0x7ffff7c2cea0>, 'Path': <type at remote 0x555555d0f070>, 'torch': <module at remote 0x7ffff7942390>, 'warnings': <module at remote 0x7ffff7a01760>, 'argparse': <module at remote 0x7ffdbda4b650>, 'ModelQuantizer': <type at remote 0x55557979ee40>, 'ModelImporter': <function at remote 0x7fbcec7e8ae0>, 'ModelExporter': <type at remote 0x555579c55640>, 'load_params': <function at remote 0x7fbd03e47ec0>, 'save_params': <function at remote 0x7fbd03e93d80>, 'SUPPORTED_QUANT_SCHEME': ['w_uint4_per_group_asym', 'w_b...(truncated),
locals=locals@entry={'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <SourceFileLoader(name='__main__', path='/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py') at remote 0x7ffff7c1fa40>, '__spec__': None, '__annotations__': {}, '__builtins__': <module at remote 0x7ffff7bd0270>, '__file__': '/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py', '__cached__': None, 'sys': <module at remote 0x7ffff7bc27a0>, 'os': <module at remote 0x7ffff7c2cea0>, 'Path': <type at remote 0x555555d0f070>, 'torch': <module at remote 0x7ffff7942390>, 'warnings': <module at remote 0x7ffff7a01760>, 'argparse': <module at remote 0x7ffdbda4b650>, 'ModelQuantizer': <type at remote 0x55557979ee40>, 'ModelImporter': <function at remote 0x7fbcec7e8ae0>, 'ModelExporter': <type at remote 0x555579c55640>, 'load_params': <function at remote 0x7fbd03e47ec0>, 'save_params': <function at remote 0x7fbd03e93d80>, 'SUPPORTED_QUANT_SCHEME': ['w_uint4_per_group_asym', 'w_b...(truncated), closeit=closeit@entry=1, flags=0x7fffffffdee0)
at /usr/local/src/conda/python-3.12.9/Python/pythonrun.c:1637
#58 0x000055555584694e in _PyRun_SimpleFileObject (fp=0x555555be59e0,
filename='/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py', closeit=1,
flags=0x7fffffffdee0) at /usr/local/src/conda/python-3.12.9/Python/pythonrun.c:433
#59 0x0000555555846614 in _PyRun_AnyFileObject (fp=0x555555be59e0,
filename=filename@entry='/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py',
closeit=closeit@entry=1, flags=flags@entry=0x7fffffffdee0) at /usr/local/src/conda/python-3.12.9/Python/pythonrun.c:78
#60 0x000055555583f89e in pymain_run_file_obj (skip_source_first_line=0,
filename='/shared_volume/repos/quark/examples/torch/language_modeling/llm_ptq/quantize_quark.py',
program_name='/root/miniforge3/bin/python') at /usr/local/src/conda/python-3.12.9/Modules/main.c:361
#61 pymain_run_file (config=0x555555b85a30 <_PyRuntime+77008>) at /usr/local/src/conda/python-3.12.9/Modules/main.c:380
#62 pymain_run_python (exitcode=0x7fffffffdeb4) at /usr/local/src/conda/python-3.12.9/Modules/main.c:634
#63 Py_RunMain () at /usr/local/src/conda/python-3.12.9/Modules/main.c:714
#64 0x00005555557f9477 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>)
at /usr/local/src/conda/python-3.12.9/Modules/main.c:768
#65 0x00007ffff7cbfd90 in __libc_start_call_main (main=main@entry=0x5555557f93c0 <main>, argc=argc@entry=15,
argv=argv@entry=0x7fffffffe138) at ../sysdeps/nptl/libc_start_call_main.h:58
#66 0x00007ffff7cbfe40 in __libc_start_main_impl (main=0x5555557f93c0 <main>, argc=15, argv=0x7fffffffe138,
init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe128)
at ../csu/libc-start.c:392
#67 0x00005555557f9321 in _start ()
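
Based on the backtrace, the fault appears to originate in MAGMA's magmablas_strsm_recursive_batched, reached through at::native::apply_cholesky_solve and magma_spotrs_batched, i.e. cholesky_inverse seems to be implemented on top of cholesky_solve with an identity right-hand side. If that reading is right, the following direct call should hit the same kernel and reproduce the fault; this is a sketch of what I would try, not something I have run:

import torch

size = 48000
A = torch.randn(size, size, device="cuda").abs() + 1e-3
L = torch.linalg.cholesky(torch.triu(A))

# Per the backtrace (apply_cholesky_inverse -> apply_cholesky_solve ->
# magma_spotrs_batched), this call is expected to exercise the same MAGMA
# strsm path that faults in the cholesky_inverse reproduction above.
B = torch.eye(size, device="cuda")
X = torch.cholesky_solve(B, L)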
Versions
Python platform: Linux-6.8.0-59-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI300X (gfx942:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.4.43482
MIOpen runtime version: 3.4.0
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9554 64-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU max MHz: 3762.9880
CPU min MHz: 1500.0000
BogoMIPS: 6190.48
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc amd_ibpb_ret arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
Virtualization: AMD-V
L1d cache: 4 MiB (128 instances)
L1i cache: 4 MiB (128 instances)
L2 cache: 128 MiB (128 instances)
L3 cache: 512 MiB (16 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==2.1.2
[pip3] pytorch-triton-rocm==3.3.1+gitc8757738
[pip3] torch==2.8.0.dev20250602+rocm6.4
[pip3] torchaudio==2.8.0.dev20250603+rocm6.4
[pip3] torchvision==0.23.0.dev20250603+rocm6.4
[conda] numpy 2.1.2 pypi_0 pypi
[conda] pytorch-triton-rocm 3.3.1+gitc8757738 pypi_0 pypi
[conda] torch 2.8.0.dev20250602+rocm6.4 pypi_0 pypi
[conda] torchaudio 2.8.0.dev20250603+rocm6.4 pypi_0 pypi
[conda] torchvision 0.23.0.dev20250603+rocm6.4 pypi_0 pypi
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd