[c10d] ProcessGroupGloo: support per operation timeouts by d4l3k · Pull Request #158128 · pytorch/pytorch · GitHub

Conversation

Member

@d4l3k d4l3k commented Jul 11, 2025

This updates ProcessGroupGloo to support per-operation timeouts. Previously, per-operation timeouts were ignored even when they were set.

  • This checks whether the operation's timeout is kUnsetTimeout and uses either the provided timeout or the default timeout from the context accordingly (see the sketch below).
  • This exposes set_timeout as a standard method on ProcessGroup/Backend so we can test the global timeout.
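
A minimal sketch of that resolution logic, assuming kUnsetTimeout is a negative-millisecond sentinel; the standalone helper and its name are illustrative, not the exact PR diff:

```cpp
#include <chrono>

// Assumption: sentinel meaning "no per-operation timeout was provided",
// modeled here as a negative millisecond value.
constexpr std::chrono::milliseconds kUnsetTimeout{-1};

// Pick the timeout an operation should actually use: the per-op value if the
// caller supplied one in the operation's options, otherwise the default the
// gloo context was configured with when the process group was created.
inline std::chrono::milliseconds resolveTimeout(
    std::chrono::milliseconds opTimeout,
    std::chrono::milliseconds contextTimeout) {
  return opTimeout == kUnsetTimeout ? contextTimeout : opTimeout;
}
```

Each collective would then hand the resolved value to its gloo work item instead of always reading the context default.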

Test plan:

pytest test/distributed/test_c10d_gloo.py -v -k allreduce_timeout

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab

@d4l3k d4l3k requested review from H-Huang, fduwjj and kwen2501 July 11, 2025 17:28
@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Jul 11, 2025

pytorch-bot bot commented Jul 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158128

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 2862077 with merge base b4476ca:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Member

@H-Huang H-Huang left a comment

Looks good! Were there any changes needed on the Gloo side, or was this PR mostly to make sure the plumbing to Gloo is correct?

Member

Maybe add a test case where the collective timeout takes precedence over the PG timeout that's set?

Member Author

Done -- I did some cleanups and exposed set_timeout on ProcessGroup/Backend so we can test the default operation timeout without causing issues with a short timeout during init.
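
As a rough illustration of that cleanup (the member name contexts_ and the stand-in types here are assumptions, not the PR's actual code), a backend-level setter that forwards the new default to every gloo context lets a test finish setup with a comfortable timeout and only shrink it afterwards:

```cpp
#include <chrono>
#include <memory>
#include <vector>

// Stand-in for gloo::Context, which holds the process-group-wide default
// timeout used whenever a collective does not supply its own.
struct FakeGlooContext {
  std::chrono::milliseconds timeout{std::chrono::seconds(30)};
  void setTimeout(std::chrono::milliseconds t) { timeout = t; }
  std::chrono::milliseconds getTimeout() const { return timeout; }
};

// Sketch of a backend-level set_timeout: update the default on every context
// owned by the backend so that later collectives pick up the new value.
struct FakeGlooBackend {
  std::vector<std::shared_ptr<FakeGlooContext>> contexts_;
  void setTimeout(std::chrono::milliseconds timeout) {
    for (auto& ctx : contexts_) {
      ctx->setTimeout(timeout);
    }
  }
};
```

On the Python side this surfaces as the set_timeout method mentioned above, so a test can complete rendezvous with the default timeout, then set a short one and check that a stalled allreduce times out.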

Member

Where is the timeout from context_->getTimeout() coming from?

Member Author

This is coming from the overall PG timeout that's set when the PG is created. There's a global timeout on the gloo Context.
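
A compressed sketch of that flow, using a stand-in for the gloo context so it compiles on its own (the real object is the gloo::Context held by ProcessGroupGloo):

```cpp
#include <cassert>
#include <chrono>

// Stand-in for the gloo context's global timeout accessors.
struct ContextTimeout {
  std::chrono::milliseconds value{0};
  void setTimeout(std::chrono::milliseconds t) { value = t; }
  std::chrono::milliseconds getTimeout() const { return value; }
};

int main() {
  // The PG-wide timeout supplied when the process group is created...
  const std::chrono::milliseconds pgTimeout = std::chrono::seconds(30);

  // ...is installed on the gloo context at construction time,
  ContextTimeout context;
  context.setTimeout(pgTimeout);

  // ...so context_->getTimeout() later hands it back as the per-op fallback.
  assert(context.getTimeout() == pgTimeout);
  return 0;
}
```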

Member Author

d4l3k commented Jul 11, 2025

@H-Huang No Gloo-side changes were required; it was just a matter of plumbing things correctly. I was surprised to see that this wasn't plumbed correctly before.

@d4l3k d4l3k force-pushed the d4l3k/gloo_timeouts branch from 1a20ccb to ff16b99 on July 11, 2025 18:08
Contributor

@fduwjj fduwjj left a comment

Makes sense, LGTM.

@d4l3k d4l3k force-pushed the d4l3k/gloo_timeouts branch from ff16b99 to 8814ea0 on July 11, 2025 18:39
@d4l3k d4l3k force-pushed the d4l3k/gloo_timeouts branch from 8814ea0 to 2862077 on July 11, 2025 19:38
Member Author

d4l3k commented Jul 11, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Jul 11, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team · Raised by workflow job

Failing merge rule: Core Maintainers

Member Author

d4l3k commented Jul 11, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@d4l3k d4l3k deleted the d4l3k/gloo_timeouts branch July 12, 2025 00:02

Labels

ciflow/trunk · Merged · oncall: distributed · release notes: distributed (c10d)

4 participants