KEMBAR78
Make init_process_group timeout kwarg override pg_options by wconstab · Pull Request #112611 · pytorch/pytorch · GitHub
Skip to content

Conversation

@wconstab
Copy link
Contributor

@wconstab wconstab commented Nov 1, 2023

Stack from ghstack (oldest at bottom):

This used to be ambiguous but the pg_options._timeout value, if passed
in, is being ignored. Make it sane and warn if 2 values are provided.

@pytorch-bot
Copy link

pytorch-bot bot commented Nov 1, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112611

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit e52c613 with merge base d084a02 (image):

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Nov 1, 2023
Comment on lines 1287 to 1291
if pg_options._timeout:
raise ValueError(
"pg_options._timeout was specified, "
"but timeout kwarg has a default value that will always override it. "
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this option only matter under NCCL? how does this interact w/ base_pg_options._timeout?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, gloo/mpi paths actually use this other object called 'base_pg_options' which has its OWN timeout field. That's a whole 'nother problem, which i kinda want to try fixing too.

it looks like if someone passes in a 'pg_options' object, it only gets used for nccl. and if you use the gloo/mpi backends, the 'base_pg_options' thing gets used. pretty sketchy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: if we do fix this, we might be blocked on the fact that ProcessGroupNCCL::Options is bound to python differently than ProcessGroup::Options. that is also something to fix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One reason for the weirdness is that it always used to be init_pg -> creates 1 backend. Now init_pg -> can create multiple backends. However, pg_options in init_pg is only for NCCL and no other backends use it.

Hence all the other backends are using base_pg_options._timeout and this is not exposed to the user. We are just using the options as a property bag for the timeout value and backend name.

pg_options, ProcessGroupNCCL.Options
), "Expected pg_options argument to be of type ProcessGroupNCCL.Options"
if pg_options._timeout:
raise ValueError(
Copy link
Contributor

@fduwjj fduwjj Nov 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want to just throw a warning instead of Error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, good point. If this discrepancy hasn't been bothering people, then they can take their time to fix their usage. No point introducing a bunch of breakages for people.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think many people are using timeout in pg_options, so warning sounds good. If we really wanted to we could just save the timeout and then set it again after if it has a value.

Copy link
Member

@H-Huang H-Huang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay with throwing a warning as Junjie proposed. Stamping to unblock

pg_options, ProcessGroupNCCL.Options
), "Expected pg_options argument to be of type ProcessGroupNCCL.Options"
if pg_options._timeout:
raise ValueError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think many people are using timeout in pg_options, so warning sounds good. If we really wanted to we could just save the timeout and then set it again after if it has a value.

Comment on lines 1287 to 1291
if pg_options._timeout:
raise ValueError(
"pg_options._timeout was specified, "
"but timeout kwarg has a default value that will always override it. "
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One reason for the weirdness is that it always used to be init_pg -> creates 1 backend. Now init_pg -> can create multiple backends. However, pg_options in init_pg is only for NCCL and no other backends use it.

Hence all the other backends are using base_pg_options._timeout and this is not exposed to the user. We are just using the options as a property bag for the timeout value and backend name.

This used to be ambiguous but the pg_options._timeout value, if passed
in, is being ignored.  Make it sane and warn if 2 values are provided.

[ghstack-poisoned]
@wconstab wconstab changed the title WIP make timeout kwarg work consistently over pg_options Make init_process_group timeout kwarg override pg_options Nov 2, 2023
assert isinstance(
pg_options, ProcessGroupNCCL.Options
), "Expected pg_options argument to be of type ProcessGroupNCCL.Options"
if pg_options._timeout:
Copy link
Member

@H-Huang H-Huang Nov 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to check if pg_options._timeout != default timeout instead?

This used to be ambiguous but the pg_options._timeout value, if passed
in, is being ignored.  Make it sane and warn if 2 values are provided.

[ghstack-poisoned]
wconstab added a commit that referenced this pull request Nov 2, 2023
This used to be ambiguous but the pg_options._timeout value, if passed
in, is being ignored.  Make it sane and warn if 2 values are provided.

ghstack-source-id: 0a5abbb
Pull Request resolved: #112611
@wconstab
Copy link
Contributor Author

wconstab commented Nov 2, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 2, 2023
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@wconstab
Copy link
Contributor Author

wconstab commented Nov 3, 2023

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

This used to be ambiguous but the pg_options._timeout value, if passed
in, is being ignored.  Make it sane and warn if 2 values are provided.

[ghstack-poisoned]
@wconstab
Copy link
Contributor Author

wconstab commented Nov 3, 2023

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot pushed a commit that referenced this pull request Nov 4, 2023
Pull Request resolved: #112803
Approved by: https://github.com/H-Huang
ghstack dependencies: #112611
pytorchmergebot pushed a commit that referenced this pull request Nov 6, 2023
…#112893)

Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.
Pull Request resolved: #112893
Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu
ghstack dependencies: #112611, #112803
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023
…2611)

This used to be ambiguous but the pg_options._timeout value, if passed
in, is being ignored.  Make it sane and warn if 2 values are provided.
Pull Request resolved: pytorch#112611
Approved by: https://github.com/H-Huang
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023
…pytorch#112893)

Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.
Pull Request resolved: pytorch#112893
Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu
ghstack dependencies: pytorch#112611, pytorch#112803
@facebook-github-bot facebook-github-bot deleted the gh/wconstab/209/head branch November 7, 2023 15:24
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
…2611)

This used to be ambiguous but the pg_options._timeout value, if passed
in, is being ignored.  Make it sane and warn if 2 values are provided.
Pull Request resolved: pytorch#112611
Approved by: https://github.com/H-Huang
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
…pytorch#112893)

Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.
Pull Request resolved: pytorch#112893
Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu
ghstack dependencies: pytorch#112611, pytorch#112803
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants