[foreach] Fix 0-size handling for real for real #109402
Conversation
🔗 Helpful Links 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109402
Note: Links to docs will display an error until the docs builds have been completed. ⏳ No Failures, 1 Pending. As of commit 26a342a with merge base a565f1b. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@crcrpar's last attempt to fix the 0-size problem unfortunately did not pass all cases. See my comment in #100701. When we have a tail tensor of size 0, the old code would mess with the chunk logic to check the previous tensor's length. This is flawed because:

1. If the previous tensor was also 0-sized (so a tensor list of [tensor, tensor, tensor, ..., 0-sized tensor, 0-sized tensor]), chunks would still be 0 and the nested for loop would be skipped.
2. The nested for loop produces side effects on tensorListMeta that _shouldn't_ be there. This can corrupt the computation in unexpected ways that I haven't fully reasoned through.

We noticed that the problem had not been fixed due to an internal report. This PR solves the issue by:

- removing the finagling of chunks when the tail tensor is 0-sized
- adding a surefire way for the kernel to be launched when the last tensor is 0-sized AND there is content in the metadata, signifying there is still work to compute

## test plan

As I went through the code, I also added some comments explaining what's going on and modified our tensor inputs to ensure that this case is exercised by the test_parity test in test_foreach.py. Yes, I do realize there is quite a bit of duplication and that this file is due for a refactor. That said, the primary goal of this PR is to fix this pretty egregious bug; refactoring can be a follow-up.

cc @awgu @crcrpar
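For concreteness, here is a minimal sketch of the kind of input that exercises the trailing 0-sized-tensor path described above (not part of the PR; the op and sizes are only illustrative):

import torch

# a tensor list whose tail is several 0-sized tensors -- the case the old
# chunk-adjustment logic mishandled
ts = [torch.randn(3, device="cuda") for _ in range(4)]
ts += [torch.empty(0, device="cuda"), torch.empty(0, device="cuda")]

out = torch._foreach_add(ts, 1.0)
expected = [t + 1.0 for t in ts]
assert all(torch.allclose(o, e) for o, e in zip(out, expected))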
I vaguely remember ngimel suggested it should be possible to filter out zero-size tensors and pass only the non-zero-size tensors to the multi tensor apply kernel. I failed to do so at the time, but would it be worth a try now?
Haha, I remember us saying we should refactor. I've taken a look, and the refactoring wouldn't be more efficient than what we have now, and could get a bit more complicated because either blocks OR tensors could fill up. It could be worth a shot rewriting with filtering, but running through the tensors once seems better than having two passes.
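A rough sketch of the filtering idea in Python terms, rather than the actual multi_tensor_apply machinery (the drop_empty helper below is hypothetical): 0-element entries are dropped up front so the kernel only ever sees non-empty tensors, at the cost of an extra pass over the list.

import torch

def drop_empty(tensor_list):
    # hypothetical helper: keep only positions whose tensor has elements
    keep = [i for i, t in enumerate(tensor_list) if t.numel() > 0]
    return [tensor_list[i] for i in keep]

xs = [torch.randn(4, device="cuda"), torch.empty(0, device="cuda"), torch.randn(2, device="cuda")]
out = torch._foreach_exp(drop_empty(xs))  # 0-sized entries never reach the kernel

For ops that take several tensor lists, the same kept indices would have to be applied to every list and empty outputs reinserted so the result still lines up with the inputs, which is part of why a single pass over the tensors is attractive.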
Actually, we may be able to do so as part of the check-fast-path API... that may be a lot easier, especially because this PR currently still wouldn't fix foreach_norm. It looks like I'd be changing the check_fast_path API to take in std::vectors instead of ArrayRefs so we could easily drop size-0 tensors. I will explore the viability of this later, but here's the plan:
Update: I've gone on a grand endeavor to do the filtering first, but it is a much more involved change that seems to affect other parts of the stack (see my attempt at #109550). The safest path forward is:
Here's to hoping for green CI!
  toleranceOverride(
      {
-         torch.complex64: tol(atol=1e-05, rtol=1e-05)
+         torch.complex64: tol(atol=3e-04, rtol=2e-05)
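For context, the override feeds the standard allclose-style comparison; here is a quick sketch with made-up values (not the actual test-harness code) of what atol=3e-04, rtol=2e-05 permits:

import torch

actual = torch.tensor(0.99962 + 0.0j)    # e.g. the foreach result
expected = torch.tensor(0.99991 + 0.0j)  # e.g. the eager reference
atol, rtol = 3e-04, 2e-05
# an element passes if |actual - expected| <= atol + rtol * |expected|
print((actual - expected).abs() <= atol + rtol * expected.abs())
print(torch.allclose(actual, expected, atol=atol, rtol=rtol))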
@parth-desai @peterbell10 Hi, 3e-04 seems like a decently large atol to me, and I did confirm locally that the reason for this disparity is the jiterator change. The following repro gives different results before and after #102427:
import torch
x = torch.tensor(-7.8167-0.0451j, device='cuda:0')
torch.set_printoptions(precision=10)
print(torch.tan(x))                                 # eager CUDA kernel
print(torch._foreach_tan([x])[0])                   # foreach CUDA kernel
print(torch._foreach_tan([x.to(device="cpu")])[0])  # CPU reference
This PR just happened to catch this since I added more sample inputs to test the empty-tensor case, so the seed changed. I'm wondering if this is acceptable or whether an issue should be raised to call this out.
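One hedged way to judge whether 3e-04 is in the right ballpark is to compare both complex64 kernels against a higher-precision reference (just an illustration, not something this PR adds):

import torch

x = torch.tensor(-7.8167 - 0.0451j, device='cuda:0')  # the value from the repro above
ref = torch.tan(x.to(torch.complex128))                # higher-precision reference
eager = torch.tan(x)                                   # eager complex64 kernel
foreach = torch._foreach_tan([x])[0]                   # foreach (jiterator) complex64 kernel
print((eager.to(torch.complex128) - ref).abs())
print((foreach.to(torch.complex128) - ref).abs())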
Or one can run python test/test_foreach.py -k test_parity__foreach_tan_slowpath_outplace_cuda_complex64 without the tolerance changes to repro as well
Agreed that does look quite bad. I think it's okay to revert the changes in UnaryGeometricTanKernel.cu.
Hi Jane, please raise an issue. I will try to fix it in a separate PR.
jiterator uses different complex math implementations (from LLVM) than thrust (which is used throughout the eager codebase). I think we already had similar discrepancies with sigmoid? Worth checking.
Opened an issue #110014
Sounds good!
Let's follow up on the precision issue in detail.
for num_tensors, rightmost_arg_type, intersperse_empty_tensors in itertools.product(
        num_input_tensors, self._rightmost_arg_types, (True, False)):
    if intersperse_empty_tensors and (num_tensors != max(num_input_tensors) or str(device) == 'cpu'):
        # generate interspersed empty tensors for only 1 N on non-cpu device to lessen redundancy
only 1 N ?
Yes, like the largest N.
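Roughly, the effect on the generated samples is something like the following sketch (the intersperse_empty helper is hypothetical, not the actual OpInfo sample-input code): empty tensors get mixed into the tensor list, including at the tail, and only for the largest N on CUDA to keep the test matrix small.

import torch

def intersperse_empty(tensors, every=3):
    # hypothetical helper: insert a 0-element tensor after every `every` entries
    out = []
    for i, t in enumerate(tensors, 1):
        out.append(t)
        if i % every == 0:
            out.append(torch.empty(0, dtype=t.dtype, device=t.device))
    return out

ts = intersperse_empty([torch.randn(5, device="cuda") for _ in range(6)])
ts.append(torch.empty(0, device="cuda"))  # trailing 0-sized tensor, the case this PR fixes
outs = torch._foreach_neg(ts)
assert all(torch.allclose(o, -t) for o, t in zip(outs, ts))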
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Follow-up edits to #109402 as suggested by @r-barnes. Pull Request resolved: #110228. Approved by: https://github.com/drisspg

