[MPS] Allow nan mean reduction in `nll_loss` by hvaara · Pull Request #135434 · pytorch/pytorch · GitHub

Conversation

@hvaara
Contributor

@hvaara hvaara commented Sep 8, 2024

This PR allows results from `nll_loss` to be `nan`, which is the same behavior as with CUDA and CPU #64572 (comment).

Fixes #134431

Ref #64572 #119108
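For illustration, a minimal sketch of the behavior this PR aligns MPS with (the all-zero `weight` vector is a hypothetical repro, not taken from the PR's test suite):

```python
import torch
import torch.nn.functional as F

# With an all-zero class-weight vector, both the weighted loss sum and the
# normalizer (sum of the selected weights) are 0, so "mean" reduction
# computes 0/0 and yields NaN -- the established CPU/CUDA behavior.
input = torch.randn(3, 5).log_softmax(dim=1)
target = torch.tensor([1, 0, 4])
weight = torch.zeros(5)

loss = F.nll_loss(input, target, weight, reduction="mean")
print(loss)  # tensor(nan)
```

Before this change, the MPS backend clamped the 0/0 to 0 instead of letting NaN propagate.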

@pytorch-bot

pytorch-bot bot commented Sep 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135434

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 2eac2f0 with merge base 042f2f7:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@hvaara
Contributor Author

hvaara commented Sep 8, 2024

Would appreciate it if someone could add the ciflow/mps label 🙏

@ezyang ezyang added the ciflow/mps Run MPS tests (subset of trunk) label Sep 9, 2024
@pytorch-bot

pytorch-bot bot commented Sep 9, 2024

Please seek CI approval before scheduling CIFlow labels

@pytorch-bot pytorch-bot bot removed the ciflow/mps Run MPS tests (subset of trunk) label Sep 9, 2024
@ezyang ezyang added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Sep 9, 2024
@hvaara
Contributor Author

hvaara commented Sep 9, 2024

@pytorchbot label "ciflow/mps"

@pytorch-bot pytorch-bot bot added the ciflow/mps Run MPS tests (subset of trunk) label Sep 9, 2024
@hvaara
Contributor Author

hvaara commented Sep 9, 2024

Interesting. TestModuleMPS.test_forward_nn_Bilinear_mps_float16 fails in CI, but passes locally:

test_modules.py::TestModuleMPS::test_forward_nn_Bilinear_mps_float16 PASSED [0.0456s]
test_modules.py::TestModuleMPS::test_forward_nn_Bilinear_mps_float32 PASSED [0.0388s]

Not sure why, but I'll disable these again for now.

Contributor

@malfet malfet left a comment


Hmm, is this a regression from #94226, which attempted to solve the same problem a while back and even added regression tests for it?

Comment on lines -540 to +556
mpsGraphReducedTensor = divisionNoNaN(mpsGraph, mpsGraphReducedTensor, mpsGraphBatchSizeTensor);
mpsGraphReducedTensor = [mpsGraph divisionWithPrimaryTensor:mpsGraphReducedTensor
secondaryTensor:mpsGraphBatchSizeTensor
name:@"divisionTensor"];
Contributor

@malfet malfet Sep 9, 2024


So, why is this safe/needed? Are you saying that NaN in reduced tensors should propagate through?

Contributor Author


This is partially what this PR is fixing. It is needed in the case where the weight elements are zero. Without it,

self.assertEqual(F.nll_loss(input, target, weight, reduction="mean").item(), float("nan"))

will fail.
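To sketch the difference between the two code paths in the diff above (a plain PyTorch analogue, not the actual MPSGraph code): `divisionNoNaN` clamps 0/0 to 0, while ordinary IEEE-754 division propagates NaN, which is what the fix relies on.

```python
import torch

num = torch.tensor(0.0)  # weighted loss sum when every selected weight is 0
den = torch.tensor(0.0)  # total weight (the mean-reduction normalizer)

# Plain division: 0/0 propagates NaN (the behavior this PR adopts).
print(num / den)  # tensor(nan)

# divisionNoNaN-style behavior (the old code path): 0/0 is clamped to 0.
print(torch.where(den == 0, torch.zeros_like(num), num / den))  # tensor(0.)
```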

test/test_nn.py Outdated
from torch.testing._internal.common_device_type import dtypesIfMPS, instantiate_device_type_tests, dtypes, \
dtypesIfCUDA, precisionOverride, skipCUDAIfCudnnVersionLessThan, onlyCUDA, onlyCPU, \
skipCUDAIfRocm, skipCUDAIf, skipCUDAIfNotRocm, \
skipCUDAIfRocm, skipCUDAIf, skipCUDAIfNotRocm, skipMPSVersionIfLessThan, \
Contributor Author

@hvaara hvaara Sep 10, 2024


@malfet I just saw #134858. Sorry if I broke something. I think that was added in test_nn.py by me (#134184). I guess I shouldn't use skipMPSVersionIfLessThan? Can you walk me through why it's bad? Or point me to some place where I can learn why it's discouraged in PyTorch?

@malfet malfet added topic: bug fixes topic category ciflow/trunk Trigger trunk jobs on your pull request labels Sep 10, 2024
@malfet
Contributor

malfet commented Sep 10, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@hvaara hvaara deleted the nll-loss-nan-fix branch September 13, 2024 22:39
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
Pull Request resolved: pytorch#135434
Approved by: https://github.com/malfet

Labels

ciflow/mps Run MPS tests (subset of trunk) ciflow/trunk Trigger trunk jobs on your pull request Merged open source release notes: mps Release notes category topic: bug fixes topic category triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module


Development

Successfully merging this pull request may close these issues.

[MPS] F.nll_loss errors with empty tensor

5 participants