-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Fix a bug of distributed 'gather' with uncontiguous tensors on the Gloo backend #158903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158903
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure)As of commit 0844cd0 with merge base 42a69f7 ( UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution, a test change would be nice as well
| if (context_->rank == root) { | ||
| flatOutputTensor = newLikeFlat(outputs[0]); | ||
| GENERATE_ALL_TYPES(scalarType, setOutput, opts, flatOutputTensor); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you look into test_gather_noncontiguous_input
pytorch/test/distributed/test_c10d_gloo.py
Lines 1156 to 1158 in 85ee2fb
| def test_gather_noncontiguous_input(self): | |
| # Take a column of 2D tensor, such that memory is not dense | |
| self._test_gather_basics(lambda t: t.expand(2, 2).contiguous()[:, 0]) |
run with python test/distributed/test_c10d_gloo.py -k test_gather_noncontiguous_input
Why doesn't the test cover this case and could you update it to test with something similar to the example you provided? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran the test case, and the answer was indeed correct and the test was passed.
The reason is that the tensors used in the test case are filled with same value, so gather can return correct results even if it did not gather the tensors properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, I notice that the test case for allgather_noncontiguous_input has the same problem.
I tested allgather and the answer was correct. It seems allgather does not have such bug. I am not sure whether I should also update the test case for allgather_noncontiguous_input together or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing the test!
I think the allgather test case should also be fixed, if you have the time you can do it in a follow up PR
…the Gloo backend.
|
I have edited the test case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…oo backend (#158903) Fixes #158902 Pull Request resolved: #158903 Approved by: https://github.com/H-Huang
…ous inputs in the Gloo backend (#159542) As suggested in the pull request #158903 by @H-Huang, this pull request edits a test case to detect potential bugs in all-gathering noncontiguous inputs in the Gloo backend. Pull Request resolved: #159542 Approved by: https://github.com/d4l3k, https://github.com/H-Huang
Fixes #158902
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta