Add missing boundary checks to cunn_SoftMaxForward by xinyazhang · Pull Request #140682 · pytorch/pytorch · GitHub

Conversation

@xinyazhang
Collaborator

@xinyazhang xinyazhang commented Nov 14, 2024

This fixes OOB memory access for the following code:

```python
import torch
qk = torch.randn((1024, 587), dtype=torch.float64, device='cuda')
smqk = torch.softmax(qk, dim=-1)
```

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd
@pytorch-bot

pytorch-bot bot commented Nov 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140682

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 9f8c51c with merge base 99c8d5a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: cuda release notes category label Nov 14, 2024
@xinyazhang
Collaborator Author

xinyazhang commented Nov 14, 2024

The segfault can be reproduced reliably on the ROCm platform with the following environment variables, which add guard pages to detect the OOB memory access:

PYTORCH_NO_HIP_MEMORY_CACHING=1 HSA_SVM_GUARD_PAGES=1 HSA_DISABLE_FRAGMENT_ALLOCATOR=1 AMD_SERIALIZE_KERNEL=3
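
For example, combining those variables with the snippet from the PR description (the exact command line below is only a sketch, not taken from CI):

PYTORCH_NO_HIP_MEMORY_CACHING=1 HSA_SVM_GUARD_PAGES=1 HSA_DISABLE_FRAGMENT_ALLOCATOR=1 AMD_SERIALIZE_KERNEL=3 python -c "import torch; qk = torch.randn((1024, 587), dtype=torch.float64, device='cuda'); print(torch.softmax(qk, dim=-1))"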

@xinyazhang
Collaborator Author

@jithunnair-amd @pruthvistony @jataylo @jeffdaily
I suspect our large SDPA error is partly due to the OOB access in the softmax operator, which reads extra data when the input is irregular.

@pruthvistony pruthvistony added topic: not user facing topic category module: rocm AMD GPU support for Pytorch rocm This tag is for PRs from ROCm team ciflow/rocm Trigger "default" config CI on ROCm and removed release notes: cuda release notes category labels Nov 14, 2024
@pruthvistony pruthvistony marked this pull request as ready for review November 14, 2024 18:12
@pruthvistony pruthvistony requested a review from malfet November 14, 2024 18:13
Collaborator

@eqy eqy left a comment

If you have a reproducer, could you add a test case for this change please?

Contributor

@malfet malfet left a comment

@xinyazhang please add some sort of a test to the PR description, but otherwise it looks good to me (we cannot have a unit test, unfortunately, because at least on CUDA something like that would not be detectable without compute-sanitizer)

@xinyazhang
Collaborator Author

If you have a reproducer, could you add a test case for this change please?

@eqy I have a test case in mind, but where should I add it? (or, according to @malfet, we cannot add it)

@malfet
Contributor

malfet commented Nov 14, 2024

I have a test case in mind, but where should I add it? (or, according to @malfet, we cannot add it)

I think the PR description should be sufficient; I'm testing it with compute-sanitizer now. And it fails:

PYTORCH_NO_CUDA_MEMORY_CACHING=1 /usr/local/cuda-12.3/bin/compute-sanitizer python -c "import torch;qk = torch.randn((1024,587), dtype=torch.float64, device='cuda'); print(torch.softmax(qk, dim=-1))"
========= COMPUTE-SANITIZER
========= Invalid __global__ read of size 8 bytes
=========     at void at::native::<unnamed>::cunn_SoftMaxForward<(int)2, double, double, double, at::native::<unnamed>::SoftMaxForwardEpilogue>(T4 *, const T2 *, int)+0x190
=========     by thread (588,0,0) in block (1023,0,0)
=========     Address 0x7fcb5f696000 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7fcb5f200000 of size 4808704 bytes
...
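
The invalid read at thread (588,0,0) in a 587-element row is exactly the kind of access a per-element bounds check prevents. Below is a minimal sketch of that pattern, with illustrative names and structure (this is not the actual cunn_SoftMaxForward code or the PR diff):

```cuda
// Illustrative only: with ILP, one thread handles several elements per
// iteration, and every computed offset must be checked against the row
// length before it is dereferenced.
template <int ILP, typename T>
__device__ void load_row_guarded(T (&regs)[ILP], const T* row, int base, int classes) {
  #pragma unroll
  for (int j = 0; j < ILP; ++j) {
    int idx = base + j * blockDim.x;
    if (idx < classes) {  // boundary check: skip lanes past the end of the row
      regs[j] = row[idx];
    }
  }
}
```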

@eqy
Collaborator

eqy commented Nov 14, 2024

@xinyazhang please add some sort of a test to the PR description, but otherwise it looks good to me (we cannot have a unit test, unfortunately, because at least on CUDA something like that would not be detectable without compute-sanitizer)

Would we get some signal with torch.cuda.empty_cache() around the test?

@malfet
Contributor

malfet commented Nov 14, 2024

@xinyazhang please add some sort of a test to the PR description, but otherwise it looks good to me (we cannot have a unit test, unfortunately, because at least on CUDA something like that would not be detectable without compute-sanitizer)

Would we get some signal with torch.cuda.empty_cache() around the test?

Nope, as the GPU maps memory into its address space in pretty large chunks (the A100 page size is probably 2 MB or something), see:

$ PYTORCH_NO_CUDA_MEMORY_CACHING=1 python -c "import torch;qk = torch.randn((1024,587), dtype=torch.float64, device='cuda'); print(torch.softmax(qk, dim=-1))"; echo $?
0

versus launching the same thing under compute-sanitizer, which will catch the error.

@malfet
Contributor

malfet commented Nov 14, 2024

@pytorchbot merge -f "Looks reasonable"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@xinyazhang
Collaborator Author

xinyazhang commented Nov 14, 2024

For documentation, @malfet @eqy @jeffdaily @jithunnair-amd, here is the standalone unit test code.

#!/usr/bin/env python

import torch
import pytest

@pytest.mark.parametrize("dtype",
                         [torch.float16, torch.bfloat16, torch.float32, torch.float64],
                         ids=['fp16', 'bf16', 'fp32', 'fp64'])
def test_softmax_oob_access(dtype):
    # Allocate one extra row, poison it with NaN, then slice it off; reads that
    # run past the end of the sliced view land in the NaN row, and the NaN then
    # shows up in the softmax output.
    qk_with_margin = torch.randn((1024 + 1, 587), dtype=dtype, device="cuda")
    qk_with_margin[-1].fill_(float('nan'))
    qk = qk_with_margin[:-1, :]
    smqk = torch.softmax(qk, dim=-1)
    assert not torch.isnan(smqk).any(), 'NaN indicates OOB memory access'

Apparently it only fails on float64 (but our SDPA unit tests suffer from it and have to fall back to the CPU implementation in certain cases).
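
Assuming the snippet is saved as, say, test_softmax_oob.py (the filename is illustrative), it can be run on a CUDA or ROCm machine with:

pytest -q test_softmax_oob.py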

@malfet
Contributor

malfet commented Nov 14, 2024

@xinyazhang that's smart and indeed reliable, please do not hesitate to propose a PR that adds this unit test.

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
This fixes OOB memory access for the following code
```python
import torch
qk = torch.randn((1024,587), dtype=torch.float64, device='cuda')
smqk = torch.softmax(qk, dim=-1)
```

Pull Request resolved: pytorch#140682
Approved by: https://github.com/jeffdaily, https://github.com/malfet