Add test for init_process_group timeout #112803
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112803
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit b908a59 with merge base d084a02.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Looks great! Thanks for adding
# self.assertTrue("pg_options._timeout was specified" in str(w[-1].message))
_check_nccl_timeout(torch.distributed.distributed_c10d.default_pg_timeout)
dist.destroy_process_group()
nit: could add one more example that calls init_process_group with both the options and timeout args, as pyper does
added
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
…#112893)
Previous PRs changed the C++ default timeout for PGNccl, but that path was only hit in some cases; in the others, the Python defaults took over. This PR ensures that an NCCL pg always defaults to the changed NCCL-specific timeout value.
Pull Request resolved: #112893
Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu
ghstack dependencies: #112611, #112803
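To make the defaulting behavior described in that commit message concrete, here is a minimal pure-Python sketch of the idea: an explicit timeout wins, and otherwise the NCCL backend falls back to its own default rather than the generic one. The helper name `resolve_timeout`, the two constants, and their values are hypothetical placeholders for illustration; the actual logic lives in `torch.distributed` and may differ in detail.

```python
from datetime import timedelta
from typing import Optional

# Placeholder values chosen for illustration only (assumption, not from the PR).
GENERIC_DEFAULT_TIMEOUT = timedelta(minutes=30)
NCCL_DEFAULT_TIMEOUT = timedelta(minutes=10)


def resolve_timeout(backend: str, timeout: Optional[timedelta] = None) -> timedelta:
    """Hypothetical helper: pick the effective timeout for a process group."""
    # An explicitly supplied timeout always takes precedence.
    if timeout is not None:
        return timeout
    # With no explicit timeout, NCCL gets its backend-specific default.
    if backend.lower() == "nccl":
        return NCCL_DEFAULT_TIMEOUT
    # Every other backend falls back to the generic default.
    return GENERIC_DEFAULT_TIMEOUT
```

A test in the spirit of this PR would then check each path separately: no timeout given, a timeout given explicitly, and a timeout given alongside backend options.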
Pull Request resolved: pytorch#112803
Approved by: https://github.com/H-Huang
ghstack dependencies: pytorch#112611
Stack from ghstack (oldest at bottom):