`baddbmm` result affected by prior content of `out` tensor (NaNs preserved) #96037

### 🐛 Describe the bug

NaNs already present in the tensor passed as `out` to `baddbmm` persist after the op runs, even though `out` is supposed to be a write-only argument.

This implies that the op reads `out`'s contents, which at the very least wastes memory bandwidth.
It also means that tensors obtained from, for example, `torch.empty` or `empty_like` cannot be trusted as the `out` argument, since their uninitialized contents may happen to contain NaNs. A naive user will expect `out` to be fully overwritten, as in all other ops that have an `out` argument.

```python
import torch

a, b, c, z = [torch.rand((3,2,2)) for _ in range(4)]

z[:] = torch.nan
torch.addcmul(c, a, b, out=z)
print(z.isnan().sum())   # -> tensor(0), NaNs overwritten, great

z[:] = torch.nan
torch.baddbmm(c, a, b, alpha=1, beta=0, out=z)
print(z.isnan().sum())   # -> tensor(12), `z` is all NaNs

z = c   # `out` now aliases `input`
z[1,1,1] = z[0,0,0] = torch.nan   # plant two NaNs
torch.baddbmm(c, a, b, alpha=1, beta=0, out=z)
print(z.isnan().sum())   # -> tensor(2)  two NaNs preserved
```
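
The NaN-prefill check used above can be wrapped in a small helper to probe other ops that accept `out`. This is only a sketch: `overwrites_out` is a hypothetical name, not a PyTorch API, and it detects only stale NaNs, not other ways an op might read `out`.

```python
import torch

def overwrites_out(op, *args, shape, **kwargs):
    """Return True if `op` leaves no NaN in an `out` tensor pre-filled with NaN.

    Hypothetical helper for illustration; assumes the inputs contain no NaNs.
    """
    out = torch.full(shape, float("nan"))
    op(*args, out=out, **kwargs)
    return not bool(out.isnan().any())

a, b, c = (torch.rand(3, 2, 2) for _ in range(3))
print("addcmul:", overwrites_out(torch.addcmul, c, a, b, shape=c.shape))
# The result for baddbmm depends on the PyTorch build; on the affected build it is False.
print("baddbmm:", overwrites_out(torch.baddbmm, c, a, b, shape=c.shape, alpha=1, beta=0))
```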

BTW, it's also annoying that `torch.baddbmm(None, a, b, alpha=1, beta=0)` raises `TypeError`; it should ignore `input` when `beta` is 0.
Anyway, the bug above does not depend on `alpha` or `beta`.
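
Until this is fixed, one possible workaround is to compute into a fresh tensor and copy the result into `out`, so the prior contents of `out` are never used as an operand. The sketch below also accepts `input=None` when `beta` is 0. `safe_baddbmm` is a hypothetical name, and the sketch assumes `torch.bmm` and the non-`out` form of `baddbmm` are not affected, which this report does not verify. It costs an extra allocation and copy, so it gives up the memory benefit of `out`.

```python
import torch

def safe_baddbmm(input, batch1, batch2, *, alpha=1, beta=1, out):
    """Hypothetical workaround: out = beta * input + alpha * (batch1 @ batch2).

    The result is built in a temporary tensor and then copied into `out`.
    """
    if beta == 0:
        # `input` is ignored entirely, so it may be None here.
        result = torch.bmm(batch1, batch2)
        if alpha != 1:
            result = result * alpha
    else:
        result = torch.baddbmm(input, batch1, batch2, alpha=alpha, beta=beta)
    out.copy_(result)  # overwrites every element of `out`
    return out

a, b, c = (torch.rand(3, 2, 2) for _ in range(3))
z = torch.full_like(c, float("nan"))
safe_baddbmm(None, a, b, alpha=1, beta=0, out=z)
print(z.isnan().sum())
```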


### Versions

```
Collecting environment information...
PyTorch version: 1.13.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.9.16 (main, Mar  1 2023, 18:22:10)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.31
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          16
On-line CPU(s) list:             0-15
Thread(s) per core:              2
Core(s) per socket:              8
Socket(s):                       1
Vendor ID:                       AuthenticAMD
CPU family:                      25
Model:                           80
Model name:                      AMD Ryzen 9 5900HS with Radeon Graphics
Stepping:                        0
CPU MHz:                         3293.727
BogoMIPS:                        6587.45
Virtualization:                  AMD-V
Hypervisor vendor:               Microsoft
Virtualization type:             full
L1d cache:                       256 KiB
L1i cache:                       256 KiB
L2 cache:                        4 MiB
L3 cache:                        16 MiB
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm

Versions of relevant libraries:
[pip3] torch==1.13.1
[conda] blas                      1.0                         mkl
[conda] cpuonly                   2.0                           0    pytorch
[conda] mkl                       2023.0.0         h6d00ec8_25399
[conda] pytorch                   1.13.1              py3.9_cpu_0    pytorch
[conda] pytorch-mutex             1.0                         cpu    pytorch
```

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @Lezcano
