[c10d] use a promise to delay watchdog shutdown #138828

c-p-i-o · 2024-10-24T16:55:22Z

Summary:
We always need to give the heartbeat monitor thread time to write out Flight Recorder dumps. Otherwise, the watchdog thread kills the heartbeat monitor thread before it has finished writing the Flight Recorder logs.
This change:

1. Removes the "sleep after exception" JK. We don't need to sleep for 8 minutes.
2. Uses a promise between the watchdog thread and the heartbeat monitor thread to delay shutdown by at most one minute, giving Flight Recorder time to write out its log on timeout.
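
The promise-based handoff described in this summary can be illustrated with a minimal, standalone sketch. This is not the code from ProcessGroupNCCL.cpp: the names `writeDump`, `waitForDump`, and `ShutdownWait` are hypothetical, and the dump is simulated with a sleep. It only shows the general pattern, assuming the heartbeat side fulfils a `std::promise` when its dump is done and the watchdog side waits on the matching `std::future` with an upper bound.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for writing a Flight Recorder dump; it only sleeps.
void writeDump(std::chrono::milliseconds dumpDuration) {
  std::this_thread::sleep_for(dumpDuration);
}

// Result of the bounded wait: whether the dump finished in time, and how
// long the waiting side actually blocked.
struct ShutdownWait {
  bool dumpFinished;
  std::chrono::milliseconds waited;
};

// Starts a "heartbeat monitor" thread that writes a dump and then fulfils a
// promise. The calling thread plays the watchdog: it blocks on the future
// for at most maxWait before it would continue with shutdown.
ShutdownWait waitForDump(std::chrono::milliseconds dumpDuration,
                         std::chrono::milliseconds maxWait) {
  assert(maxWait.count() >= 0);

  std::promise<void> dumpDone;
  std::future<void> dumpDoneFuture = dumpDone.get_future();

  std::thread heartbeat([&dumpDone, dumpDuration] {
    writeDump(dumpDuration);
    dumpDone.set_value();  // signal the watchdog that the dump is complete
  });

  const auto start = std::chrono::steady_clock::now();
  // Returns as soon as the promise is fulfilled, or after maxWait at the latest.
  const std::future_status status = dumpDoneFuture.wait_for(maxWait);
  const auto waited = std::chrono::duration_cast<std::chrono::milliseconds>(
      std::chrono::steady_clock::now() - start);

  // The sketch joins here only so that it cleans up after itself; a real
  // shutdown path may handle the thread differently.
  heartbeat.join();
  return {status == std::future_status::ready, waited};
}
```

With a one-minute bound, a call such as `waitForDump(std::chrono::milliseconds(1500), std::chrono::minutes(1))` would return once the simulated dump completes rather than after the full minute, which is the behavior the "slept for 1647ms" log line in the test plan suggests.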

Test Plan:
Tested on my local job and flight recorder successfully executed for the job.
https://fburl.com/mlhub/38fj5yne
The watchdog thread gives the heartbeat thread time to write out the logs.

In the logs we see:

[trainer4]:I1023 17:39:29.755507 12592 ProcessGroupNCCL.cpp:1950] [PG ID 0 PG GUID 0(precheck) Rank 12] slept for 1647ms giving time for flight recorder dumps to finish.

Differential Revision: D64857928

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

pytorch-bot · 2024-10-24T16:55:25Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138828

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e4c1e43 with merge base 10a34dc:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-10-24T16:55:40Z

This pull request was exported from Phabricator. Differential Revision: D64857928

d4l3k

LGTM -- would be nice to add a unit test if possible

torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp

facebook-github-bot · 2024-10-24T18:57:51Z

This pull request was exported from Phabricator. Differential Revision: D64857928

fduwjj

LGTM

facebook-github-bot · 2024-10-24T20:00:27Z

This pull request was exported from Phabricator. Differential Revision: D64857928

Summary:
Pull Request resolved: pytorch#138828

We always need to give the heartbeat monitor thread time to write out flight recorder dumps. Otherwise, the watchdog thread kills the heartbeat monitor thread too fast before it has time to write out the Flight Recorder logs.

This is resulting in Flight Recorder dumps missing for many jobs.

This change:
1. Removes the "sleep after exception" JK. We don't need to sleep for 8 minutes.
2. Use a promise between watchdog thread and heartbeat monitor thread to delay, at most, one minute to give Flight Recorder time to write out its log on timeout.

Test Plan:
Tested on my local job and flight recorder successfully executed for the job.
https://fburl.com/mlhub/38fj5yne
The watchdog thread gives heartbeat thread time to write out the logs.

In the logs we see:
```
[trainer4]:I1023 17:39:29.755507 12592 ProcessGroupNCCL.cpp:1950] [PG ID 0 PG GUID 0(precheck) Rank 12] slept for 1647ms giving time for flight recorder dumps to finish.
```

Reviewed By: d4l3k

Differential Revision: D64857928

facebook-github-bot · 2024-10-24T20:09:16Z

This pull request was exported from Phabricator. Differential Revision: D64857928

facebook-github-bot · 2024-10-24T23:04:41Z

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorchmergebot · 2024-10-24T23:06:22Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

wdvr · 2024-10-24T23:40:46Z

@pytorchmergebot merge -f "stuck on macos job - force merging"

pytorchmergebot · 2024-10-24T23:41:04Z

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

pytorchmergebot · 2024-10-24T23:42:21Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.


pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Oct 24, 2024

facebook-github-bot added the fb-exported label Oct 24, 2024

d4l3k approved these changes Oct 24, 2024

torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (outdated, resolved)

pytorch-bot bot added the ciflow/trunk label Oct 24, 2024

fduwjj reviewed Oct 24, 2024

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (outdated, resolved)

fduwjj reviewed Oct 24, 2024

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (outdated, resolved)

fduwjj reviewed Oct 24, 2024

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (outdated, resolved)

fduwjj reviewed Oct 24, 2024

torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (outdated, resolved)

c-p-i-o force-pushed the export-D64857928 branch from a9ae572 to 186983f Compare October 24, 2024 18:57

fduwjj approved these changes Oct 24, 2024

c-p-i-o force-pushed the export-D64857928 branch from 186983f to 07d6b4d Compare October 24, 2024 20:00

c-p-i-o force-pushed the export-D64857928 branch from 07d6b4d to e4c1e43 Compare October 24, 2024 20:09

pytorchmergebot added the merging label Oct 24, 2024

pytorchmergebot closed this in 425ce2a Oct 24, 2024

pytorchmergebot added Merged and removed merging labels Oct 24, 2024


6 participants
