[foreach] Fix 0-size handling for real for real by janeyx99 · Pull Request #109402 · pytorch/pytorch · GitHub

Conversation

@janeyx99
Contributor

@janeyx99 janeyx99 commented Sep 15, 2023

@crcrpar's last attempt to fix the 0-size problem unfortunately did not pass all cases. See my comment in #100701. When we have a tail tensor of size 0, the old code would mess with the chunk logic to check the previous tensor's length. This is flawed because:

  1. if the previous tensor was also 0-sized (so a tensor list like [tensor, tensor, tensor, ..., 0-sized tensor, 0-sized tensor]), chunks would still be 0 and the nested for loop would be skipped.
  2. the nested for loop produces side effects on tensorListMeta that shouldn't be there! This can mess up the compute in unexpected ways that I haven't fully reasoned through.

We noticed that the problem had not been fixed due to an internal report. This PR solves the issue by:

  • removing the finagling of chunks when the tail tensor is 0-sized
  • adding a surefire way for the kernel to be launched when the last tensor is 0-sized AND there's content in the metadata, signifying there is still work to compute.
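To make the failure concrete, the batching loop can be modeled in a few lines of plain Python. This is a hypothetical, heavily simplified sketch of multi_tensor_apply's grouping logic, not the real CUDA code; `batch_for_launch`, `chunk_size`, and `max_meta` are illustrative names:

```python
def batch_for_launch(sizes, chunk_size=4, max_meta=3):
    """Model of the batching loop: returns the groups of tensor indices
    each kernel launch would cover, given per-tensor element counts."""
    launches = []
    meta = []  # tensor indices accumulated into the launch metadata
    for i, n in enumerate(sizes):
        meta.append(i)
        chunks = -(-n // chunk_size)  # ceil division; 0 when n == 0
        for c in range(chunks):
            tensors_full = len(meta) == max_meta
            last = i == len(sizes) - 1 and c == chunks - 1
            if tensors_full or last:
                launches.append(list(meta))
                # re-add the current tensor if it still has chunks left
                meta = [i] if c + 1 < chunks else []
    # The fix: when the tail tensor is 0-sized, the inner loop above never
    # runs for it, so any accumulated metadata must still be launched here.
    if meta and any(sizes[j] > 0 for j in meta):
        launches.append(list(meta))
    return launches
```

With a 0-sized tail (e.g. sizes `[8, 8, 0]`), the inner loop never executes for the last tensor, so without the post-loop launch the metadata accumulated for the first two tensors would be silently dropped.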

Test plan

As I went through the code, I also added some comments explaining what's up and modified our tensor inputs to ensure that this case is tested in the test_parity test in test_foreach.py. Yes, I do realize there is quite a bit of duplication and that this file could be due for a refactor. That said, the primary goal of this PR is to fix the pretty egregious bug and refactoring can be a followup.

cc @awgu @crcrpar

Stack from ghstack (oldest at bottom):

@pytorch-bot

pytorch-bot bot commented Sep 15, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109402

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 26a342a with merge base a565f1b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: foreach_frontend release notes category label Sep 15, 2023
@janeyx99 janeyx99 added the topic: bug fixes topic category label Sep 15, 2023
janeyx99 added a commit that referenced this pull request Sep 15, 2023
ghstack-source-id: 701970c
Pull Request resolved: #109402
@crcrpar
Collaborator

crcrpar commented Sep 16, 2023

I vaguely remember @ngimel suggesting it should be possible to filter out zero-size tensors and pass only non-zero-size tensors to the multi_tensor_apply kernel. I failed to do so at the time, but would it be worth a try now?

@janeyx99
Contributor Author

haha i remember us saying we should refactor. i've taken a look and the refactoring wouldn't get more efficient than now, and could get a bit more complicated due to the fact that blocks OR tensors could fill up. could be worth a shot rewriting with filtering, but running through the tensors once seems better than having two passes.

@janeyx99
Contributor Author

janeyx99 commented Sep 16, 2023

actually we may be able to do so as part of the check-fast-path API... that may be a lot easier... especially because currently this PR still wouldn't fix foreach_norm.

it looks like I'd be changing the signature of the check_fast_path API to take std::vectors instead of ArrayRefs so we could easily drop size-0 tensors. I will explore the viability of this later, but here's the plan:

  1. if just doing a filtering is simplest and works, we can just go with that on Monday
  2. if the filtering is nontrivial and will take longer than Monday, I will land this change with XFAILs for things that should work but never worked to unblock people + then keep iterating on idea 1.
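For reference, the filtering idea in option 1 could look something like the following in spirit. This is a hypothetical Python sketch; the real change would live in the C++ fast-path check (std::vector instead of ArrayRef), and `drop_empty` and its parameters are made-up names:

```python
def drop_empty(tensor_lists, numels):
    """Keep only the positions whose tensors are non-empty, so the
    batching kernel never sees 0-sized entries.

    `tensor_lists` holds parallel lists (e.g. self, other, out);
    `numels[i]` is the element count shared by position i across all
    lists, since foreach ops require matching shapes per position."""
    keep = [i for i, n in enumerate(numels) if n > 0]
    return [[lst[i] for i in keep] for lst in tensor_lists]
```

The tradeoff noted above applies: this is a second pass over the tensors before the single batching pass, in exchange for never having to special-case 0-sized entries inside the kernel-launch loop.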

@janeyx99 janeyx99 marked this pull request as draft September 16, 2023 04:16
janeyx99 added a commit that referenced this pull request Sep 18, 2023
ghstack-source-id: bd07c18
Pull Request resolved: #109402
@janeyx99 janeyx99 added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 21, 2023
@janeyx99
Contributor Author

Update: I've gone on a grand endeavor to do filtering first, but it is a much more involved change that seems to affect other parts of the stack. (See my attempt at #109550).

The safest path forward is:

  • tidy up and land this change
  • land the big change cautiously as a refactor

here's to hoping for green CI

janeyx99 added a commit that referenced this pull request Sep 21, 2023
ghstack-source-id: 1a5add0
Pull Request resolved: #109402
@janeyx99 janeyx99 marked this pull request as ready for review September 21, 2023 20:15
```diff
 toleranceOverride(
     {
-        torch.complex64: tol(atol=1e-05, rtol=1e-05)
+        torch.complex64: tol(atol=3e-04, rtol=2e-05)
```
Contributor Author

@janeyx99 janeyx99 Sep 21, 2023

@parth-desai @peterbell10 hi, 3e-04 seems like a decently large atol to me, and I did confirm locally that the reason for this disparity is the jiterator change. One can repro: the following provides different results before and after #102427.

```python
import torch
x = torch.tensor(-7.8167-0.0451j, device='cuda:0')
torch.set_printoptions(precision=10)
print(torch.tan(x))
print(torch._foreach_tan([x])[0])
print(torch._foreach_tan([x.to(device="cpu")])[0])
```

Before: (screenshot of the printed values omitted)

After: (screenshot of the printed values omitted)

This PR just happened to catch this since I added more sample inputs to test the empty tensor case so the seed changed. I'm wondering if this is acceptable or whether an issue should be raised to call this out.

Contributor Author
Or one can run python test/test_foreach.py -k test_parity__foreach_tan_slowpath_outplace_cuda_complex64 without the tolerance changes to repro as well

Collaborator

Agreed that does look quite bad. I think it's okay to revert the changes in UnaryGeometricTanKernel.cu.

Contributor

Hi Jane, please raise an issue. I will try to fix it in a separate PR.

Collaborator

jiterator uses different complex math implementations (from LLVM) than thrust (which is used throughout the eager codebase). I think we already had similar discrepancies with sigmoid? Worth checking.

Contributor Author

Opened an issue #110014

@janeyx99 janeyx99 requested a review from albanD September 21, 2023 20:22
janeyx99 added a commit that referenced this pull request Sep 25, 2023
ghstack-source-id: c6ff3f3
Pull Request resolved: #109402
Collaborator

@albanD albanD left a comment

Sounds good!
Let's follow up on the precision issue in detail.

```python
for num_tensors, rightmost_arg_type, intersperse_empty_tensors in itertools.product(
        num_input_tensors, self._rightmost_arg_types, (True, False)):
    if intersperse_empty_tensors and (num_tensors != max(num_input_tensors) or str(device) == 'cpu'):
        # generate interspersed empty tensors for only 1 N on non-cpu device to lessen redundancy
```
Collaborator
only 1 N ?

Contributor Author

Yes, like the largest N.

@janeyx99
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
release notes: foreach_frontend (release notes category)
topic: bug fixes (topic category)


7 participants