KEMBAR78
Fix a bug of distributed 'gather' with uncontiguous tensors on the Gloo backend by tiandeyu-cs · Pull Request #158903 · pytorch/pytorch · GitHub
Skip to content

Conversation

@tiandeyu-cs
Copy link
Contributor

@tiandeyu-cs tiandeyu-cs commented Jul 23, 2025

@pytorch-bot
Copy link

pytorch-bot bot commented Jul 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158903

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 0844cd0 with merge base 42a69f7 (image):

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jul 23, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Jul 23, 2025
@soulitzer soulitzer requested review from H-Huang and d4l3k July 23, 2025 13:45
@soulitzer soulitzer added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 23, 2025
Copy link
Member

@H-Huang H-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution, a test change would be nice as well

if (context_->rank == root) {
flatOutputTensor = newLikeFlat(outputs[0]);
GENERATE_ALL_TYPES(scalarType, setOutput, opts, flatOutputTensor);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you look into test_gather_noncontiguous_input

def test_gather_noncontiguous_input(self):
# Take a column of 2D tensor, such that memory is not dense
self._test_gather_basics(lambda t: t.expand(2, 2).contiguous()[:, 0])

run with python test/distributed/test_c10d_gloo.py -k test_gather_noncontiguous_input

Why doesn't the test cover this case and could you update it to test with something similar to the example you provided? Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the test case, and the answer was indeed correct and the test was passed.
The reason is that the tensors used in the test case are filled with same value, so gather can return correct results even if it did not gather the tensors properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I notice that the test case for allgather_noncontiguous_input has the same problem.
I tested allgather and the answer was correct. It seems allgather does not have such bug. I am not sure whether I should also update the test case for allgather_noncontiguous_input together or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the test!

I think the allgather test case should also be fixed, if you have the time you can do it in a follow up PR

@tiandeyu-cs
Copy link
Contributor Author

I have edited the test case test_gather_noncontiguous_input and it now can detect this bug.

Copy link
Member

@H-Huang H-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@H-Huang
Copy link
Member

H-Huang commented Jul 30, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 30, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

yangw-dev pushed a commit that referenced this pull request Aug 1, 2025
pytorchmergebot pushed a commit that referenced this pull request Aug 1, 2025
…ous inputs in the Gloo backend (#159542)

As suggested in the pull request #158903 by @H-Huang, this pull request edits a test case to detect potential bugs in all-gathering noncontiguous inputs in the Gloo backend.

Pull Request resolved: #159542
Approved by: https://github.com/d4l3k, https://github.com/H-Huang
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (c10d) release notes category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Distributed 'gather' with Gloo backend returns wrong results on uncontiguous tensors.

5 participants