Add missing boundary checks to cunn_SoftMaxForward by xinyazhang · Pull Request #140682 · pytorch/pytorch · GitHub

Conversation

@xinyazhang
Collaborator

@xinyazhang xinyazhang commented Nov 14, 2024

This fixes OOB memory access for the following code:

```python
import torch
qk = torch.randn((1024, 587), dtype=torch.float64, device='cuda')
smqk = torch.softmax(qk, dim=-1)
```

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd
@pytorch-bot

pytorch-bot bot commented Nov 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140682

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 9f8c51c with merge base 99c8d5a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: cuda release notes category label Nov 14, 2024
@xinyazhang
Collaborator Author

xinyazhang commented Nov 14, 2024

The segfault can be reproduced reliably on the ROCm platform with the following environment variables, which add guard pages to detect the OOB memory access:

PYTORCH_NO_HIP_MEMORY_CACHING=1 HSA_SVM_GUARD_PAGES=1 HSA_DISABLE_FRAGMENT_ALLOCATOR=1 AMD_SERIALIZE_KERNEL=3
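
For example, combining those variables with the snippet from the PR description (the exact command line below is only a sketch, not taken from CI):

PYTORCH_NO_HIP_MEMORY_CACHING=1 HSA_SVM_GUARD_PAGES=1 HSA_DISABLE_FRAGMENT_ALLOCATOR=1 AMD_SERIALIZE_KERNEL=3 python -c "import torch; qk = torch.randn((1024, 587), dtype=torch.float64, device='cuda'); print(torch.softmax(qk, dim=-1))"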

@xinyazhang
Collaborator Author

@jithunnair-amd @pruthvistony @jataylo @jeffdaily
I suspect our large SDPA error is partly due to the OOB access in the softmax operator, which reads extra data when the input is irregular.

@pruthvistony pruthvistony added topic: not user facing topic category module: rocm AMD GPU support for Pytorch rocm This tag is for PRs from ROCm team ciflow/rocm Trigger "default" config CI on ROCm and removed release notes: cuda release notes category labels Nov 14, 2024
@pruthvistony pruthvistony marked this pull request as ready for review November 14, 2024 18:12
@pruthvistony pruthvistony requested a review from malfet November 14, 2024 18:13
Collaborator

@eqy eqy left a comment

If you have a reproducer, could you add a test case for this change please?

Contributor

@malfet malfet left a comment

@xinyazhang please add some sort of a test to the PR description, but otherwise it looks good to me (we cannot have a unit test, unfortunately, because at least on CUDA something like that would not be detectable without compute-sanitizer)

@xinyazhang
Collaborator Author

If you have a reproducer, could you add a test case for this change please?

@eqy I have a test case in mind, but where should I add it? (or, according to @malfet, we cannot add it)

@malfet
Contributor

malfet commented Nov 14, 2024

I have a test case in mind, but where should I add it? (or, according to @malfet, we cannot add it)

I think the PR description should be sufficient; I'm testing it with compute-sanitizer now. And it fails:

PYTORCH_NO_CUDA_MEMORY_CACHING=1 /usr/local/cuda-12.3/bin/compute-sanitizer python -c "import torch;qk = torch.randn((1024,587), dtype=torch.float64, device='cuda'); print(torch.softmax(qk, dim=-1))"
========= COMPUTE-SANITIZER
========= Invalid __global__ read of size 8 bytes
=========     at void at::native::<unnamed>::cunn_SoftMaxForward<(int)2, double, double, double, at::native::<unnamed>::SoftMaxForwardEpilogue>(T4 *, const T2 *, int)+0x190
=========     by thread (588,0,0) in block (1023,0,0)
=========     Address 0x7fcb5f696000 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7fcb5f200000 of size 4808704 bytes
...
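
The invalid read at thread (588,0,0) in a 587-element row is exactly the kind of access a per-element bounds check prevents. Below is a minimal sketch of that pattern, with illustrative names and structure (this is not the actual cunn_SoftMaxForward code or the PR diff):

```cuda
// Illustrative only: with ILP, one thread handles several elements per
// iteration, and every computed offset must be checked against the row
// length before it is dereferenced.
template <int ILP, typename T>
__device__ void load_row_guarded(T (&regs)[ILP], const T* row, int base, int classes) {
  #pragma unroll
  for (int j = 0; j < ILP; ++j) {
    int idx = base + j * blockDim.x;
    if (idx < classes) {  // boundary check: skip lanes past the end of the row
      regs[j] = row[idx];
    }
  }
}
```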

@eqy
Collaborator

eqy commented Nov 14, 2024

@xinyazhang please add some sort of a test to the PR description, but otherwise it looks good to me (we cannot have a unit test, unfortunately, because at least on CUDA something like that would not be detectable without compute-sanitizer)

Would we get some signal with torch.cuda.empty_cache() around the test?

@malfet
Contributor

malfet commented Nov 14, 2024

@xinyazhang please add some sort of a test to the PR description, but otherwise it looks good to me (we cannot have a unit test, unfortunately, because at least on CUDA something like that would not be detectable without compute-sanitizer)

Would we get some signal with torch.cuda.empty_cache() around the test?

Nope, as the GPU maps memory into its address space in pretty large chunks (the A100 page size is probably 2 MB or something), see:

$ PYTORCH_NO_CUDA_MEMORY_CACHING=1 python -c "import torch;qk = torch.randn((1024,587), dtype=torch.float64, device='cuda'); print(torch.softmax(qk, dim=-1))"; echo $?
0

versus launching the same thing under compute-sanitizer, which will catch the error.

@malfet
Contributor

malfet commented Nov 14, 2024

@pytorchbot merge -f "Looks reasonable"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@xinyazhang
Collaborator Author

xinyazhang commented Nov 14, 2024

For documentation, @malfet @eqy @jeffdaily @jithunnair-amd, here is the standalone unit test code.

#!/usr/bin/env python

import torch
import pytest

@pytest.mark.parametrize("dtype",
                         [torch.float16, torch.bfloat16, torch.float32, torch.float64],
                         ids=['fp16', 'bf16', 'fp32', 'fp64'])
def test_softmax_oob_access(dtype):
    # Allocate one extra row, poison it with NaN, then slice it off; reads that
    # run past the end of the sliced view land in the NaN row, and the NaN then
    # shows up in the softmax output.
    qk_with_margin = torch.randn((1024 + 1, 587), dtype=dtype, device="cuda")
    qk_with_margin[-1].fill_(float('nan'))
    qk = qk_with_margin[:-1, :]
    smqk = torch.softmax(qk, dim=-1)
    assert not torch.isnan(smqk).any(), 'NaN indicates OOB memory access'

Apparently it only fails on float64 (but our SDPA unit tests suffer from it and have to fall back to the CPU implementation in certain cases).
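
Assuming the snippet is saved as, say, test_softmax_oob.py (the filename is illustrative), it can be run on a CUDA or ROCm machine with:

pytest -q test_softmax_oob.py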

@malfet
Contributor

malfet commented Nov 14, 2024

@xinyazhang that's smart and indeed reliable, please do not hesitate to propose a PR that adds this unit test.

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
This fixes OOB memory access for the following code
```python
import torch
qk = torch.randn((1024,587), dtype=torch.float64, device='cuda')
smqk = torch.softmax(qk, dim=-1)
```

Pull Request resolved: pytorch#140682
Approved by: https://github.com/jeffdaily, https://github.com/malfet