-
Notifications
You must be signed in to change notification settings - Fork 25.7k
[Reland] Fix default timeouts for python entrypoints (e.g. init_process_group) #113094
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ss_group) Previous PRs changed the c++ default timeout for PGNccl, but this path was only hit in some cases, and the python defaults took over in other cases. This PR ensures that NCCL pg always default to the changed NCCL-specific timeout value. [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113094
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit a68c9ad with merge base 75adb9f ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
…ss_group) Previous PRs changed the c++ default timeout for PGNccl, but this path was only hit in some cases, and the python defaults took over in other cases. This PR ensures that NCCL pg always default to the changed NCCL-specific timeout value. ghstack-source-id: 26fa84b Pull Request resolved: #113094
| asynchronously and the process will crash. ``NCCL_BLOCKING_WAIT`` | ||
| will provide errors to the user which can be caught and handled, | ||
| but due to its blocking nature, it has a performance overhead. On | ||
| the other hand, ``NCCL_ASYNC_ERROR_HANDLING`` has very little |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok with removing the document about NCCL_ASYNC_ERROR_HANDLING here. But shall we document it (maybe future) somewhere? I don't where is the best place to put them. Maybe we can have a debugging section for PTD?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a replacement for https://github.com/pytorch/pytorch/pull/112893/files?
Yes this is a reland of 112893, it got reverted due to inductor-CI-cpu moco model failing. Basically that model is bad, it creates nccl PG for 1 process even when there is no cuda in the build, for CPU benchmarking. But I worked around it by warning instead of asserting. |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…ss_group) (pytorch#113094) Previous PRs changed the c++ default timeout for PGNccl, but this path was only hit in some cases, and the python defaults took over in other cases. This PR ensures that NCCL pg always default to the changed NCCL-specific timeout value. Pull Request resolved: pytorch#113094 Approved by: https://github.com/fduwjj
Stack from ghstack (oldest at bottom):
Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.
This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.