[pgnccl][simple] log started work numel by shuqiangzhang · Pull Request #139773 · pytorch/pytorch · GitHub
Conversation

@shuqiangzhang
Contributor

@shuqiangzhang shuqiangzhang commented Nov 5, 2024

Stack from ghstack (oldest at bottom):

Summary:
We saw cases where the same work was started on multiple ranks but did
not complete. Logging the numel of the started work tells us whether the
element counts match across ranks.
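
As an illustration only (this is not code from the PR), the comparison that the logged numel makes possible could be sketched as below. The record layout `(rank, seq_id, numel)` and the helper name `find_numel_mismatches` are hypothetical; the actual log format emitted by ProcessGroupNCCL is not shown in this conversation.

```python
from collections import defaultdict


def find_numel_mismatches(records):
    """Return {seq_id: {rank: numel}} for work whose numel differs across ranks.

    `records` is an iterable of hypothetical (rank, seq_id, numel) tuples,
    assumed to have been parsed from per-rank "work started" log lines.
    """
    numel_by_seq = defaultdict(dict)
    for rank, seq_id, numel in records:
        numel_by_seq[seq_id][rank] = numel
    # Keep only the sequence ids where ranks disagree on the element count.
    return {
        seq_id: per_rank
        for seq_id, per_rank in numel_by_seq.items()
        if len(set(per_rank.values())) > 1
    }


# Made-up sample data: work 7 matches on both ranks, work 8 does not.
records = [
    (0, 7, 1024),
    (1, 7, 1024),
    (0, 8, 4096),
    (1, 8, 2048),
]
print(find_numel_mismatches(records))  # {8: {0: 4096, 1: 2048}}
```

A mismatch like the one for work 8 would point toward ranks issuing differently sized collectives, while matching numel would suggest looking elsewhere for the hang.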
Test Plan:
CI

Tags:

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot

pytorch-bot bot commented Nov 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139773

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c554bdd with merge base 546318e (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Nov 5, 2024
shuqiangzhang added a commit that referenced this pull request Nov 5, 2024
ghstack-source-id: fce040a
Pull Request resolved: #139773
Contributor

@kwen2501 kwen2501 left a comment

Let's revisit at some point to see how we can structure things better.

@shuqiangzhang
Contributor Author

@pytorchbot merge -f "unrelated failure"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
Pull Request resolved: pytorch#139773
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
@github-actions github-actions bot deleted the gh/shuqiangzhang/60/head branch December 6, 2024 02:13
Labels

Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category
