-
Notifications
You must be signed in to change notification settings - Fork 25.7k
[c10d] use a promise to delay watchdog shutdown #138828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138828
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit e4c1e43 with merge base 10a34dc ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
This pull request was exported from Phabricator. Differential Revision: D64857928 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM -- would be nice to add a unit test if possible
|
This pull request was exported from Phabricator. Differential Revision: D64857928 |
a9ae572 to
186983f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
This pull request was exported from Phabricator. Differential Revision: D64857928 |
186983f to
07d6b4d
Compare
Summary: Pull Request resolved: pytorch#138828 We always need to give the heartbeat monitor thread time to write out flight recorder dumps. Otherwise, the watchdog thread kills the heartbeat monitor thread too fast before it has time to write out the Flight Recorder logs. This is resulting in Flight Recorder dumps missing for many jobs. This change: 1. Removes the "sleep after exception" JK. We don't need to sleep for 8 minutes. 2. Use a promise between watchdog thread and heartbeat monitor thread to delay, at most, one minute to give Flight Recorder time to write out its log on timeout. Test Plan: Tested on my local job and flight recorder successfully executed for the job. https://fburl.com/mlhub/38fj5yne The watchdog thread gives heartbeat thread time to write out the logs. In the logs we see: ``` [trainer4]:I1023 17:39:29.755507 12592 ProcessGroupNCCL.cpp:1950] [PG ID 0 PG GUID 0(precheck) Rank 12] slept for 1647ms giving time for flight recorder dumps to finish. ``` Reviewed By: d4l3k Differential Revision: D64857928
|
This pull request was exported from Phabricator. Differential Revision: D64857928 |
07d6b4d to
e4c1e43
Compare
|
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged) |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
@pytorchmergebot merge -f "stuck on macos job - force merging" |
|
The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Summary:
We always need to give the heartbeat monitor thread time to write out flight recorder dumps. Otherwise, the watchdog thread kills the heartbeat monitor thread too fast before it has time to write out the Flight Recorder logs.
This change:
Test Plan:
Tested on my local job and flight recorder successfully executed for the job.
https://fburl.com/mlhub/38fj5yne
The watchdog thread gives heartbeat thread time to write out the logs.
In the logs we see:
Differential Revision: D64857928
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k