[Reland] Fix default timeouts for python entrypoints (e.g. init_process_group) #113094

wconstab · 2023-11-06T23:36:12Z

Stack from ghstack (oldest at bottom):

-> [Reland] Fix default timeouts for python entrypoints (e.g. init_process_group) #113094

Previous PRs changed the C++ default timeout for PGNccl, but that path
was only hit in some cases; in other cases the Python defaults took
over.

This PR ensures that NCCL PGs always default to the changed NCCL-specific
timeout value.
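
To illustrate the behavior described above, here is a minimal sketch of backend-aware timeout resolution. It is not the code from this PR: the helper `resolve_timeout` and both constants are hypothetical names, and the durations are placeholder values chosen only for illustration. The point is that the Python entrypoint takes `None` as its default and only then picks a backend-specific value, so a generic Python default can no longer shadow the NCCL one.

```python
from datetime import timedelta
from typing import Optional

# Placeholder values for illustration only; not taken from this PR.
GENERIC_DEFAULT_TIMEOUT = timedelta(minutes=30)
NCCL_DEFAULT_TIMEOUT = timedelta(minutes=10)


def resolve_timeout(backend: str, timeout: Optional[timedelta] = None) -> timedelta:
    """Hypothetical helper: choose the effective process-group timeout."""
    if timeout is not None:
        # An explicit value passed by the caller always wins.
        return timeout
    if backend.lower() == "nccl":
        # No explicit value: fall back to the NCCL-specific default.
        return NCCL_DEFAULT_TIMEOUT
    # Every other backend keeps the generic default.
    return GENERIC_DEFAULT_TIMEOUT
```

With this shape, `resolve_timeout("nccl")` and `resolve_timeout("gloo")` return different defaults, while `resolve_timeout("nccl", timedelta(seconds=5))` returns the caller's value unchanged.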


pytorch-bot · 2023-11-06T23:36:16Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113094

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a68c9ad with merge base 75adb9f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


fduwjj · 2023-11-07T01:16:40Z

torch/distributed/distributed_c10d.py

-            asynchronously and the process will crash. ``NCCL_BLOCKING_WAIT``
-            will provide errors to the user which can be caught and handled,
-            but due to its blocking nature, it has a performance overhead. On
-            the other hand, ``NCCL_ASYNC_ERROR_HANDLING`` has very little


I am OK with removing the documentation about NCCL_ASYNC_ERROR_HANDLING here. But shall we document it somewhere (maybe in the future)? I don't know where the best place to put it is. Maybe we can have a debugging section for PTD?

fduwjj

Is this a replacement for https://github.com/pytorch/pytorch/pull/112893/files?

wconstab · 2023-11-07T01:29:04Z

> Is this a replacement for https://github.com/pytorch/pytorch/pull/112893/files?

Yes, this is a reland of #112893, which got reverted because the inductor-CI-cpu moco model was failing. Basically that model is bad: it creates an NCCL PG for 1 process even when there is no CUDA in the build, for CPU benchmarking. I worked around it by warning instead of asserting.
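
The workaround mentioned above (warn instead of assert) can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `validate_nccl_backend`, its parameter, and the warning text are all hypothetical.

```python
import warnings


def validate_nccl_backend(cuda_available: bool) -> bool:
    """Hypothetical check: report whether an NCCL process group looks usable."""
    if not cuda_available:
        # A hard assert here would crash CPU-only jobs that still create an
        # NCCL process group, so emit a warning and let the caller continue.
        warnings.warn(
            "NCCL process group requested but CUDA is not available; "
            "continuing anyway.",
            RuntimeWarning,
        )
        return False
    return True
```

The trade-off is that a misconfigured job keeps running rather than failing fast, which is what allows a CPU-only benchmark like the one described to proceed.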

wconstab · 2023-11-07T05:30:33Z

@pytorchbot merge

pytorchmergebot · 2023-11-07T05:33:44Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here


wconstab requested review from H-Huang, LucasLLC, awgu, d4l3k, fduwjj, fegin, kiukchung, kwen2501, mrshenli, rohan-varma, wanchaol, wz337 and zhaojuanmao as code owners November 6, 2023 23:36

pytorch-bot bot added the release notes: distributed (c10d) release notes category label Nov 6, 2023

wconstab added the ciflow/inductor label Nov 6, 2023

fduwjj reviewed Nov 7, 2023


fduwjj approved these changes Nov 7, 2023


wconstab added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 7, 2023

pytorchmergebot added the merging label Nov 7, 2023

pytorchmergebot added Merged and removed merging labels Nov 7, 2023

pytorchmergebot closed this in ff51f94 Nov 7, 2023

facebook-github-bot deleted the gh/wconstab/216/head branch November 10, 2023 15:24
