KEMBAR78
[Reland] Fix default timeouts for python entrypoints (e.g. init_process_group) by wconstab · Pull Request #113094 · pytorch/pytorch · GitHub
Skip to content

Conversation

wconstab
Copy link
Contributor

@wconstab wconstab commented Nov 6, 2023

Stack from ghstack (oldest at bottom):

Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.

…ss_group)

Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Nov 6, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113094

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a68c9ad with merge base 75adb9f (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Nov 6, 2023
…ss_group)

Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.

ghstack-source-id: 26fa84b
Pull Request resolved: #113094
asynchronously and the process will crash. ``NCCL_BLOCKING_WAIT``
will provide errors to the user which can be caught and handled,
but due to its blocking nature, it has a performance overhead. On
the other hand, ``NCCL_ASYNC_ERROR_HANDLING`` has very little
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with removing the document about NCCL_ASYNC_ERROR_HANDLING here. But shall we document it (maybe future) somewhere? I don't where is the best place to put them. Maybe we can have a debugging section for PTD?

Copy link
Contributor

@fduwjj fduwjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wconstab
Copy link
Contributor Author

wconstab commented Nov 7, 2023

Is this a replacement for https://github.com/pytorch/pytorch/pull/112893/files?

Yes this is a reland of 112893, it got reverted due to inductor-CI-cpu moco model failing. Basically that model is bad, it creates nccl PG for 1 process even when there is no cuda in the build, for CPU benchmarking. But I worked around it by warning instead of asserting.

@wconstab wconstab added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 7, 2023
@wconstab
Copy link
Contributor Author

wconstab commented Nov 7, 2023

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@facebook-github-bot facebook-github-bot deleted the gh/wconstab/216/head branch November 10, 2023 15:24
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
…ss_group) (pytorch#113094)

Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.
Pull Request resolved: pytorch#113094
Approved by: https://github.com/fduwjj
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants