KEMBAR78
[PGNCCL] Make sure we do not use split for P2P comm creation by kwen2501 · Pull Request #139013 · pytorch/pytorch · GitHub
Skip to content

Conversation

@kwen2501
Copy link
Contributor

@kwen2501 kwen2501 commented Oct 27, 2024

Stack from ghstack (oldest at bottom):

Resolve comment #138527 (comment)

There was a split-vs-P2P bug:
When P2P comm creation invokes getNCCLComm, it may see a split_from options which is meant for the previous PG creation. Then the P2P comm creation may use ncclCommSplit and hang, because not all ranks join this call. The bug slips previously/today because there is no CI test with the following recipe: eager init + new group + P2P in that new group.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot
Copy link

pytorch-bot bot commented Oct 27, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139013

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 3717d2a with merge base 3a6f014 (image):

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Oct 27, 2024
kwen2501 added a commit that referenced this pull request Oct 27, 2024
@kwen2501 kwen2501 added topic: bug fixes topic category ciflow/trunk Trigger trunk jobs on your pull request labels Oct 27, 2024
@kwen2501
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@ZainRizvi
Copy link
Contributor

@pytorchmergebot revert -c nosignal -m "Sorry but this appears to be breaking on trunk. See: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False GH job link HUD commit link"

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Oct 28, 2024
…139013)"

This reverts commit 74878ac.

Reverted #139013 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be breaking on trunk. See: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False [GH job link](https://github.com/pytorch/pytorch/actions/runs/11559910615/job/32177150816) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/74878ac271feecfa3ff3d32f78c7d889bcac97d6) ([comment](#139013 (comment)))
@pytorchmergebot
Copy link
Collaborator

@kwen2501 your PR has been successfully reverted.

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Oct 29, 2024
…ytorch#139013)"

This reverts commit 74878ac.

Reverted pytorch#139013 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be breaking on trunk. See: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False [GH job link](https://github.com/pytorch/pytorch/actions/runs/11559910615/job/32177150816) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/74878ac271feecfa3ff3d32f78c7d889bcac97d6) ([comment](pytorch#139013 (comment)))
@kwen2501
Copy link
Contributor Author

kwen2501 commented Nov 2, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@kwen2501
Copy link
Contributor Author

kwen2501 commented Nov 2, 2024

@pytorchmergebot revert -c nosignal -m "More flavor of test_manual_with_data_parallel failed"

@kwen2501 kwen2501 reopened this Nov 2, 2024
@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Copy link
Collaborator

Reverting PR 139013 failed

Reason: list index out of range

Details for Dev Infra team Raised by workflow job

@kwen2501
Copy link
Contributor Author

kwen2501 commented Nov 2, 2024

@pytorchmergebot revert -c nosignal -m "More flavor of test_manual_with_data_parallel failed"

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Copy link
Collaborator

Reverting PR 139013 failed

Reason: list index out of range

Details for Dev Infra team Raised by workflow job

@kwen2501 kwen2501 closed this Nov 2, 2024
@kwen2501
Copy link
Contributor Author

kwen2501 commented Nov 2, 2024

@pytorchmergebot revert -c nosignal -m "More flavor of test_manual_with_data_parallel failed"

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Copy link
Collaborator

@kwen2501 your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Nov 2, 2024
…139013)"

This reverts commit 55038aa.

Reverted #139013 on behalf of https://github.com/kwen2501 due to More flavor of test_manual_with_data_parallel failed ([comment](#139013 (comment)))
@pytorchmergebot pytorchmergebot added the ci-no-td Do not run TD on this PR label Nov 2, 2024
@kwen2501 kwen2501 added the keep-going Don't stop on first failure, keep running tests until the end label Nov 2, 2024
Resolve comment #138527 (comment)

There was a split-vs-P2P bug:
When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` options which is meant for the previous PG creation. Then the P2P comm creation may use `ncclCommSplit` and hang, because not all ranks join this call. The bug slips previously/today because there is no CI test with the following recipe: eager init + new group + P2P in that new group.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Nov 2, 2024
Also moved test_manual_with_data_parallel to use lazy init to unblock
this bug fix

ghstack-source-id: 98743fb
Pull Request resolved: #139013
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
…ytorch#139013)"

This reverts commit 74878ac.

Reverted pytorch#139013 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be breaking on trunk. See: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False [GH job link](https://github.com/pytorch/pytorch/actions/runs/11559910615/job/32177150816) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/74878ac271feecfa3ff3d32f78c7d889bcac97d6) ([comment](pytorch#139013 (comment)))
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
…#139013)

Resolve comment pytorch#138527 (comment)

There was a split-vs-P2P bug:
When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` options which is meant for the previous PG creation. Then the P2P comm creation may use `ncclCommSplit` and hang, because not all ranks join this call. The bug slips previously/today because there is no CI test with the following recipe: eager init + new group + P2P in that new group.

Pull Request resolved: pytorch#139013
Approved by: https://github.com/shuqiangzhang
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
…ytorch#139013)"

This reverts commit 55038aa.

Reverted pytorch#139013 on behalf of https://github.com/kwen2501 due to More flavor of test_manual_with_data_parallel failed ([comment](pytorch#139013 (comment)))
@kwen2501
Copy link
Contributor Author

kwen2501 commented Dec 9, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@github-actions github-actions bot deleted the gh/kwen2501/88/head branch January 9, 2025 02:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-no-td Do not run TD on this PR ciflow/trunk Trigger trunk jobs on your pull request keep-going Don't stop on first failure, keep running tests until the end Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category Reverted topic: bug fixes topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants