-
Notifications
You must be signed in to change notification settings - Fork 25.7k
[PGNCCL] Make sure we do not use split for P2P comm creation #139013
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139013
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure)As of commit 3717d2a with merge base 3a6f014 ( BROKEN TRUNK - The following job failed but were present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
@pytorchmergebot revert -c nosignal -m "Sorry but this appears to be breaking on trunk. See: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False GH job link HUD commit link" |
|
@pytorchbot successfully started a revert job. Check the current status here. |
…139013)" This reverts commit 74878ac. Reverted #139013 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be breaking on trunk. See: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False [GH job link](https://github.com/pytorch/pytorch/actions/runs/11559910615/job/32177150816) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/74878ac271feecfa3ff3d32f78c7d889bcac97d6) ([comment](#139013 (comment)))
|
@kwen2501 your PR has been successfully reverted. |
…ytorch#139013)" This reverts commit 74878ac. Reverted pytorch#139013 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be breaking on trunk. See: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False [GH job link](https://github.com/pytorch/pytorch/actions/runs/11559910615/job/32177150816) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/74878ac271feecfa3ff3d32f78c7d889bcac97d6) ([comment](pytorch#139013 (comment)))
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
@pytorchmergebot revert -c nosignal -m "More flavor of test_manual_with_data_parallel failed" |
|
@pytorchbot successfully started a revert job. Check the current status here. |
Reverting PR 139013 failedReason: list index out of range Details for Dev Infra teamRaised by workflow job |
|
@pytorchmergebot revert -c nosignal -m "More flavor of test_manual_with_data_parallel failed" |
|
@pytorchbot successfully started a revert job. Check the current status here. |
Reverting PR 139013 failedReason: list index out of range Details for Dev Infra teamRaised by workflow job |
|
@pytorchmergebot revert -c nosignal -m "More flavor of test_manual_with_data_parallel failed" |
|
@pytorchbot successfully started a revert job. Check the current status here. |
|
@kwen2501 your PR has been successfully reverted. |
…139013)" This reverts commit 55038aa. Reverted #139013 on behalf of https://github.com/kwen2501 due to More flavor of test_manual_with_data_parallel failed ([comment](#139013 (comment)))
Resolve comment #138527 (comment) There was a split-vs-P2P bug: When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` options which is meant for the previous PG creation. Then the P2P comm creation may use `ncclCommSplit` and hang, because not all ranks join this call. The bug slips previously/today because there is no CI test with the following recipe: eager init + new group + P2P in that new group. cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o [ghstack-poisoned]
…ytorch#139013)" This reverts commit 74878ac. Reverted pytorch#139013 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be breaking on trunk. See: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False [GH job link](https://github.com/pytorch/pytorch/actions/runs/11559910615/job/32177150816) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/74878ac271feecfa3ff3d32f78c7d889bcac97d6) ([comment](pytorch#139013 (comment)))
…#139013) Resolve comment pytorch#138527 (comment) There was a split-vs-P2P bug: When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` options which is meant for the previous PG creation. Then the P2P comm creation may use `ncclCommSplit` and hang, because not all ranks join this call. The bug slips previously/today because there is no CI test with the following recipe: eager init + new group + P2P in that new group. Pull Request resolved: pytorch#139013 Approved by: https://github.com/shuqiangzhang
…ytorch#139013)" This reverts commit 55038aa. Reverted pytorch#139013 on behalf of https://github.com/kwen2501 due to More flavor of test_manual_with_data_parallel failed ([comment](pytorch#139013 (comment)))
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Stack from ghstack (oldest at bottom):
Resolve comment #138527 (comment)
There was a split-vs-P2P bug:
When P2P comm creation invokes
getNCCLComm, it may see asplit_fromoptions which is meant for the previous PG creation. Then the P2P comm creation may usencclCommSplitand hang, because not all ranks join this call. The bug slips previously/today because there is no CI test with the following recipe: eager init + new group + P2P in that new group.cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o