Add test for init_process_group timeout by wconstab · Pull Request #112803 · pytorch/pytorch · GitHub

@wconstab
Contributor

@wconstab wconstab commented Nov 2, 2023

@pytorch-bot

pytorch-bot bot commented Nov 2, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112803

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b908a59 with merge base d084a02:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab added a commit that referenced this pull request Nov 2, 2023
ghstack-source-id: 2c76c6f
Pull Request resolved: #112803
Member

@H-Huang H-Huang left a comment

Looks great! Thanks for adding

# self.assertTrue("pg_options._timeout was specified" in str(w[-1].message))
_check_nccl_timeout(torch.distributed.distributed_c10d.default_pg_timeout)
dist.destroy_process_group()

Member

nit: could add one more example that calls init_process_group with both the options and timeout args, as pyper does

Contributor Author

added
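
The case requested in this thread, passing both an options object and a timeout argument, can be pictured with a small pure-Python model. This is only a sketch under stated assumptions: `FakeOptions`, `resolve_timeout`, and `PLACEHOLDER_DEFAULT` are hypothetical names and values, not PyTorch APIs. The precedence shown (an explicit timeout wins and a warning is emitted) is inferred from the warning text in the commented-out assertion above and is not confirmed by this page.

```python
import warnings
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

# Placeholder value for illustration only; not taken from PyTorch.
PLACEHOLDER_DEFAULT = timedelta(minutes=30)


@dataclass
class FakeOptions:
    """Hypothetical stand-in for a backend options object with its own timeout."""
    timeout: timedelta = PLACEHOLDER_DEFAULT


def resolve_timeout(
    timeout: Optional[timedelta] = None,
    options: Optional[FakeOptions] = None,
) -> timedelta:
    """Assumed precedence: explicit timeout, then options.timeout, then default."""
    if timeout is not None:
        if options is not None and options.timeout != timeout:
            # Assumed behavior: tell the caller their options value is ignored.
            warnings.warn("options.timeout was specified but timeout overrides it")
        return timeout
    if options is not None:
        return options.timeout
    return PLACEHOLDER_DEFAULT


# The three combinations a test along these lines might cover.
only_default = resolve_timeout()
only_options = resolve_timeout(options=FakeOptions(timeout=timedelta(seconds=123)))
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    both = resolve_timeout(
        timeout=timedelta(seconds=5),
        options=FakeOptions(timeout=timedelta(seconds=123)),
    )
```

A real test would call `init_process_group` and inspect the resulting process group instead of a helper like this; the model only makes the three argument combinations explicit.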

@wconstab
Contributor Author

wconstab commented Nov 4, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Nov 4, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Nov 6, 2023
…#112893)

Previous PRs changed the C++ default timeout for PGNccl, but this path
was only hit in some cases, and the Python defaults took over in other
cases.

This PR ensures that the NCCL pg always defaults to the changed
NCCL-specific timeout value.
Pull Request resolved: #112893
Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu
ghstack dependencies: #112611, #112803
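
The fix described in this commit message amounts to picking the default by backend whenever the caller passes no timeout. Below is a toy model of that idea, assuming hypothetical names (`GENERIC_DEFAULT`, `BACKEND_DEFAULTS`, `effective_timeout`) and placeholder durations that are not taken from PyTorch:

```python
from datetime import timedelta
from typing import Optional

# Placeholder durations for illustration only; not PyTorch's actual values.
GENERIC_DEFAULT = timedelta(minutes=30)
BACKEND_DEFAULTS = {"nccl": timedelta(minutes=10)}


def effective_timeout(backend: str, timeout: Optional[timedelta] = None) -> timedelta:
    """Return the caller's timeout, else the backend's default, else the generic one."""
    if timeout is not None:
        return timeout
    # Look up a backend-specific default first so it is applied on every path.
    return BACKEND_DEFAULTS.get(backend.lower(), GENERIC_DEFAULT)


nccl_default = effective_timeout("nccl")
other_default = effective_timeout("gloo")
explicit = effective_timeout("nccl", timedelta(seconds=42))
```

Routing every caller through one lookup like this is one way to avoid the situation the message describes, where a generic default silently takes over on some code paths.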
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023
…pytorch#112893)

Previous PRs changed the C++ default timeout for PGNccl, but this path
was only hit in some cases, and the Python defaults took over in other
cases.

This PR ensures that the NCCL pg always defaults to the changed
NCCL-specific timeout value.
Pull Request resolved: pytorch#112893
Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu
ghstack dependencies: pytorch#112611, pytorch#112803
@facebook-github-bot facebook-github-bot deleted the gh/wconstab/210/head branch November 8, 2023 15:25
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
…pytorch#112893)

Previous PRs changed the C++ default timeout for PGNccl, but this path
was only hit in some cases, and the Python defaults took over in other
cases.

This PR ensures that the NCCL pg always defaults to the changed
NCCL-specific timeout value.
Pull Request resolved: pytorch#112893
Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu
ghstack dependencies: pytorch#112611, pytorch#112803

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, topic: not user facing (topic category)
