-
Notifications
You must be signed in to change notification settings - Fork 25.7k
[c10d] Fix setGroupName and setGroupDesc in group_split
and merge_remote_group
#159429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159429
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (3 Unrelated Failures)As of commit 0520b21 with merge base 1465757 ( FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This pull request was exported from Phabricator. Differential Revision: D79201132 |
…remote_group` (pytorch#159429) Summary: We found that we don't really set group_name inside group_split correctly, because we are setting group_name to `deviceTypeToBackend_` which is set after `setBackend`. Same thing as group_desc. I added more unit tests for it. We need to setGroupName correctly, otherwise, this will break DeviceMesh use case when `split_group` is used in DeviceMesh Also ncclx needs to be aware of that its Option is a subclass of BackendOption Test Plan: CI ``` buck test -c hpc_comms.use_ncclx=stable comms/ncclx/pg/tests:test_c10d_ncclx_persistent -- DeviceMeshTest ``` ``` buck test -c hpc_comms.use_ncclx=stable comms/ncclx/pg/tests:test_c10d_ncclx -- test_group_split_and_merge ``` Rollback Plan: Differential Revision: D79201132
This pull request was exported from Phabricator. Differential Revision: D79201132 |
…remote_group` (pytorch#159429) Summary: We found that we don't really set group_name inside group_split correctly, because we are setting group_name to `deviceTypeToBackend_` which is set after `setBackend`. Same thing as group_desc. I added more unit tests for it. We need to setGroupName correctly, otherwise, this will break DeviceMesh use case when `split_group` is used in DeviceMesh Also ncclx needs to be aware of that its Option is a subclass of BackendOption Test Plan: CI ``` buck test -c hpc_comms.use_ncclx=stable comms/ncclx/pg/tests:test_c10d_ncclx_persistent -- DeviceMeshTest ``` ``` buck test -c hpc_comms.use_ncclx=stable comms/ncclx/pg/tests:test_c10d_ncclx -- test_group_split_and_merge ``` Rollback Plan: Differential Revision: D79201132
This pull request was exported from Phabricator. Differential Revision: D79201132 |
LG. Thanks for the fix! |
…remote_group` (pytorch#159429) Summary: We found that we don't really set group_name inside group_split correctly, because we are setting group_name to `deviceTypeToBackend_` which is set after `setBackend`. Same thing as group_desc. I added more unit tests for it. We need to setGroupName correctly, otherwise, this will break DeviceMesh use case when `split_group` is used in DeviceMesh Also ncclx needs to be aware of that its Option is a subclass of BackendOption Test Plan: CI Reviewed By: xunnanxu Differential Revision: D79201132
This pull request was exported from Phabricator. Differential Revision: D79201132 |
…remote_group` (pytorch#159429) Summary: We found that we don't really set group_name inside group_split correctly, because we are setting group_name to `deviceTypeToBackend_` which is set after `setBackend`. Same thing as group_desc. I added more unit tests for it. We need to setGroupName correctly, otherwise, this will break DeviceMesh use case when `split_group` is used in DeviceMesh Also ncclx needs to be aware of that its Option is a subclass of BackendOption Test Plan: CI Reviewed By: xunnanxu Differential Revision: D79201132
This pull request was exported from Phabricator. Differential Revision: D79201132 |
group_split
and merge_remote_group
The inductor test failed with "RuntimeError: Expected to find "buf7 = empty" but did not find it" which is not related to this PR. We don't interfere with how Triton kernel is generated and split_group is not used there as well. |
@pytorchbot merge -f "failure test not related" |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…remote_group` (#159429) Summary: We found that we don't really set group_name inside group_split correctly, because we are setting group_name to `deviceTypeToBackend_` which is set after `setBackend`. Same thing as group_desc. I added more unit tests for it. We need to setGroupName correctly, otherwise, this will break DeviceMesh use case when split_group is used in DeviceMesh Also ncclx needs to be aware of that its Option is a subclass of BackendOption Test Plan: CI Rollback Plan: Differential Revision: D79201132 Pull Request resolved: #159429 Approved by: https://github.com/xunnanxu
Summary:
We found that we don't really set group_name inside group_split correctly, because we are setting group_name to
deviceTypeToBackend_
which is set aftersetBackend
. Same thing as group_desc. I added more unit tests for it.We need to setGroupName correctly, otherwise, this will break DeviceMesh use case when split_group is used in DeviceMesh
Also ncclx needs to be aware of that its Option is a subclass of BackendOption
Test Plan:
CI
Rollback Plan:
Differential Revision: D79201132
cc @H-Huang @awgu @wanchaol @fegin @wz337 @wconstab @d4l3k @pragupta