[MPS] Allow nan mean reduction in `nll_loss` by hvaara · Pull Request #135434 · pytorch/pytorch · GitHub

Conversation

@hvaara
Contributor

@hvaara hvaara commented Sep 8, 2024

This PR allows results from `nll_loss` to be `nan`, which is the same behavior as with CUDA and CPU #64572 (comment).

Fixes #134431

Ref #64572 #119108
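For illustration, a minimal sketch of the behavior this PR aligns MPS with (the all-zero `weight` vector is a hypothetical repro, not taken from the PR's test suite):

```python
import torch
import torch.nn.functional as F

# With an all-zero class-weight vector, both the weighted loss sum and the
# normalizer (sum of the selected weights) are 0, so "mean" reduction
# computes 0/0 and yields NaN -- the established CPU/CUDA behavior.
input = torch.randn(3, 5).log_softmax(dim=1)
target = torch.tensor([1, 0, 4])
weight = torch.zeros(5)

loss = F.nll_loss(input, target, weight, reduction="mean")
print(loss)  # tensor(nan)
```

Before this change, the MPS backend clamped the 0/0 to 0 instead of letting NaN propagate.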

@pytorch-bot

pytorch-bot bot commented Sep 8, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135434

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 2eac2f0 with merge base 042f2f7:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@hvaara
Contributor Author

hvaara commented Sep 8, 2024

Would appreciate it if someone could add the ciflow/mps label 🙏

@ezyang ezyang added the ciflow/mps Run MPS tests (subset of trunk) label Sep 9, 2024
@pytorch-bot

pytorch-bot bot commented Sep 9, 2024

Please seek CI approval before scheduling CIFlow labels

@pytorch-bot pytorch-bot bot removed the ciflow/mps Run MPS tests (subset of trunk) label Sep 9, 2024
@ezyang ezyang added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Sep 9, 2024
@hvaara
Contributor Author

hvaara commented Sep 9, 2024

@pytorchbot label "ciflow/mps"

@pytorch-bot pytorch-bot bot added the ciflow/mps Run MPS tests (subset of trunk) label Sep 9, 2024
@hvaara
Contributor Author

hvaara commented Sep 9, 2024

Interesting. TestModuleMPS.test_forward_nn_Bilinear_mps_float16 fails in CI, but passes locally:

test_modules.py::TestModuleMPS::test_forward_nn_Bilinear_mps_float16 PASSED [0.0456s]
test_modules.py::TestModuleMPS::test_forward_nn_Bilinear_mps_float32 PASSED [0.0388s]

Not sure why, but I'll disable these again for now.

Contributor

@malfet malfet left a comment


Hmm, is this a regression from #94226, which attempted to solve the same problem a while back and even added regression tests for it?

Comment on lines -540 to +556
mpsGraphReducedTensor = divisionNoNaN(mpsGraph, mpsGraphReducedTensor, mpsGraphBatchSizeTensor);
mpsGraphReducedTensor = [mpsGraph divisionWithPrimaryTensor:mpsGraphReducedTensor
secondaryTensor:mpsGraphBatchSizeTensor
name:@"divisionTensor"];
Contributor

@malfet malfet Sep 9, 2024


So, why is this safe/needed? Are you saying that NaN in reduced tensors should propagate through?

Contributor Author


This is partially what this PR is fixing. It is needed in the case where the weight elements are zero. Without it,

self.assertEqual(F.nll_loss(input, target, weight, reduction="mean").item(), float("nan"))

will fail.
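To sketch the difference between the two code paths in the diff above (a plain PyTorch analogue, not the actual MPSGraph code): `divisionNoNaN` clamps 0/0 to 0, while ordinary IEEE-754 division propagates NaN, which is what the fix relies on.

```python
import torch

num = torch.tensor(0.0)  # weighted loss sum when every selected weight is 0
den = torch.tensor(0.0)  # total weight (the mean-reduction normalizer)

# Plain division: 0/0 propagates NaN (the behavior this PR adopts).
print(num / den)  # tensor(nan)

# divisionNoNaN-style behavior (the old code path): 0/0 is clamped to 0.
print(torch.where(den == 0, torch.zeros_like(num), num / den))  # tensor(0.)
```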

test/test_nn.py Outdated
from torch.testing._internal.common_device_type import dtypesIfMPS, instantiate_device_type_tests, dtypes, \
dtypesIfCUDA, precisionOverride, skipCUDAIfCudnnVersionLessThan, onlyCUDA, onlyCPU, \
skipCUDAIfRocm, skipCUDAIf, skipCUDAIfNotRocm, \
skipCUDAIfRocm, skipCUDAIf, skipCUDAIfNotRocm, skipMPSVersionIfLessThan, \
Contributor Author

@hvaara hvaara Sep 10, 2024


@malfet I just saw #134858. Sorry if I broke something. I think that was added in test_nn.py by me (#134184). I guess I shouldn't use skipMPSVersionIfLessThan? Can you walk me through why it's bad? Or point me to some place where I can learn why it's discouraged in PyTorch?

@malfet malfet added topic: bug fixes topic category ciflow/trunk Trigger trunk jobs on your pull request labels Sep 10, 2024
@malfet
Contributor

malfet commented Sep 10, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@hvaara hvaara deleted the nll-loss-nan-fix branch September 13, 2024 22:39
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
Pull Request resolved: pytorch#135434
Approved by: https://github.com/malfet

Labels

ciflow/mps Run MPS tests (subset of trunk) ciflow/trunk Trigger trunk jobs on your pull request Merged open source release notes: mps Release notes category topic: bug fixes topic category triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module


Development

Successfully merging this pull request may close these issues.

[MPS] F.nll_loss errors with empty tensor

5 participants