-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Make init_process_group timeout kwarg override pg_options #112611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112611
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (3 Unrelated Failures)As of commit e52c613 with merge base d084a02 ( UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
| if pg_options._timeout: | ||
| raise ValueError( | ||
| "pg_options._timeout was specified, " | ||
| "but timeout kwarg has a default value that will always override it. " | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this option only matter under NCCL? how does this interact w/ base_pg_options._timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, gloo/mpi paths actually use this other object called 'base_pg_options' which has its OWN timeout field. That's a whole 'nother problem, which i kinda want to try fixing too.
it looks like if someone passes in a 'pg_options' object, it only gets used for nccl. and if you use the gloo/mpi backends, the 'base_pg_options' thing gets used. pretty sketchy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: if we do fix this, we might be blocked on the fact that ProcessGroupNCCL::Options is bound to python differently than ProcessGroup::Options. that is also something to fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One reason for the weirdness is that it always used to be init_pg -> creates 1 backend. Now init_pg -> can create multiple backends. However, pg_options in init_pg is only for NCCL and no other backends use it.
Hence all the other backends are using base_pg_options._timeout and this is not exposed to the user. We are just using the options as a property bag for the timeout value and backend name.
| pg_options, ProcessGroupNCCL.Options | ||
| ), "Expected pg_options argument to be of type ProcessGroupNCCL.Options" | ||
| if pg_options._timeout: | ||
| raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you want to just throw a warning instead of Error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, good point. If this discrepancy hasn't been bothering people, then they can take their time to fix their usage. No point introducing a bunch of breakages for people.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think many people are using timeout in pg_options, so warning sounds good. If we really wanted to we could just save the timeout and then set it again after if it has a value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay with throwing a warning as Junjie proposed. Stamping to unblock
| pg_options, ProcessGroupNCCL.Options | ||
| ), "Expected pg_options argument to be of type ProcessGroupNCCL.Options" | ||
| if pg_options._timeout: | ||
| raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think many people are using timeout in pg_options, so warning sounds good. If we really wanted to we could just save the timeout and then set it again after if it has a value.
| if pg_options._timeout: | ||
| raise ValueError( | ||
| "pg_options._timeout was specified, " | ||
| "but timeout kwarg has a default value that will always override it. " | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One reason for the weirdness is that it always used to be init_pg -> creates 1 backend. Now init_pg -> can create multiple backends. However, pg_options in init_pg is only for NCCL and no other backends use it.
Hence all the other backends are using base_pg_options._timeout and this is not exposed to the user. We are just using the options as a property bag for the timeout value and backend name.
[ghstack-poisoned]
This used to be ambiguous but the pg_options._timeout value, if passed in, is being ignored. Make it sane and warn if 2 values are provided. [ghstack-poisoned]
| assert isinstance( | ||
| pg_options, ProcessGroupNCCL.Options | ||
| ), "Expected pg_options argument to be of type ProcessGroupNCCL.Options" | ||
| if pg_options._timeout: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to check if pg_options._timeout != default timeout instead?
This used to be ambiguous but the pg_options._timeout value, if passed in, is being ignored. Make it sane and warn if 2 values are provided. [ghstack-poisoned]
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 11 jobs have failed, first few of them are: pull / linux-focal-py3.8-clang10 / test (default, 2, 3, linux.2xlarge), pull / linux-focal-py3.8-clang10 / test (crossref, 2, 2, linux.2xlarge), pull / linux-focal-py3.8-clang10 / test (dynamo, 2, 2, linux.2xlarge), pull / linux-focal-py3.11-clang10 / test (default, 2, 3, linux.2xlarge), pull / linux-focal-py3.11-clang10 / test (crossref, 1, 2, linux.2xlarge) Details for Dev Infra teamRaised by workflow job |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 14 jobs have failed, first few of them are: pull / linux-focal-py3.8-clang10 / test (default, 2, 3, linux.2xlarge), pull / linux-focal-py3.8-clang10 / test (crossref, 2, 2, linux.2xlarge), pull / linux-focal-py3.8-clang10 / test (dynamo, 2, 2, linux.2xlarge), pull / linux-focal-py3.11-clang10 / test (default, 2, 3, linux.2xlarge), pull / linux-focal-py3.11-clang10 / test (crossref, 1, 2, linux.2xlarge) Details for Dev Infra teamRaised by workflow job |
This used to be ambiguous but the pg_options._timeout value, if passed in, is being ignored. Make it sane and warn if 2 values are provided. [ghstack-poisoned]
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Pull Request resolved: #112803 Approved by: https://github.com/H-Huang ghstack dependencies: #112611
…#112893) Previous PRs changed the c++ default timeout for PGNccl, but this path was only hit in some cases, and the python defaults took over in other cases. This PR ensures that NCCL pg always default to the changed NCCL-specific timeout value. Pull Request resolved: #112893 Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu ghstack dependencies: #112611, #112803
…2611) This used to be ambiguous but the pg_options._timeout value, if passed in, is being ignored. Make it sane and warn if 2 values are provided. Pull Request resolved: pytorch#112611 Approved by: https://github.com/H-Huang
Pull Request resolved: pytorch#112803 Approved by: https://github.com/H-Huang ghstack dependencies: pytorch#112611
…pytorch#112893) Previous PRs changed the c++ default timeout for PGNccl, but this path was only hit in some cases, and the python defaults took over in other cases. This PR ensures that NCCL pg always default to the changed NCCL-specific timeout value. Pull Request resolved: pytorch#112893 Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu ghstack dependencies: pytorch#112611, pytorch#112803
…2611) This used to be ambiguous but the pg_options._timeout value, if passed in, is being ignored. Make it sane and warn if 2 values are provided. Pull Request resolved: pytorch#112611 Approved by: https://github.com/H-Huang
Pull Request resolved: pytorch#112803 Approved by: https://github.com/H-Huang ghstack dependencies: pytorch#112611
…pytorch#112893) Previous PRs changed the c++ default timeout for PGNccl, but this path was only hit in some cases, and the python defaults took over in other cases. This PR ensures that NCCL pg always default to the changed NCCL-specific timeout value. Pull Request resolved: pytorch#112893 Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu ghstack dependencies: pytorch#112611, pytorch#112803
Stack from ghstack (oldest at bottom):
This used to be ambiguous but the pg_options._timeout value, if passed
in, is being ignored. Make it sane and warn if 2 values are provided.