torch.norm on CPU gives incorrect results for large Tensors

## 🐛 Bug

```python
>>> import torch
>>> x = torch.ones(40000)
>>> torch.norm(x)
tensor(266.0605)
>>> torch.norm(x.to(dtype=torch.float32, device='cuda'))
tensor(200., device='cuda:0')
```

Originally reported at https://discuss.pytorch.org/t/output-of-torch-norm-x-depends-on-size-of-x-not-in-a-good-way/33299/8