-
Notifications
You must be signed in to change notification settings - Fork 25.7k
[PGNCCL] Deprecate suppport of onCompletionHook #142390
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142390
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ✅ No FailuresAs of commit 6fee2fa with merge base db13bd9 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc: @mrshenli
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
isn't the onCompletionHook sort of orthogonal to the flight-recorder? The flight recorder today does not have its own thread, so if a user wants to register some kind of callback it is reasonable to assume that we'd have to have a thread to do this. Potentially OK to deprecate this but only if we're sure we don't want to add it back soon. Otoh as we aim to generalize the FR infra, maybe its still better to move the thread into that general infra instead of ProcessGroupNCCL. |
True, except they have one thing in common: call site = on completion.
The thing is I can't think of anything else than profiling or status book-keeping a user may want to do for all work, in the background. Definitely not computation on the tensor, because the user would not be able to know if it completes or consume the result if it is run in the background. |
The usage of `onCompletionHook` is mostly similar to what Flight Recorder does today -- for example, measuring how long a collective takes and put it into a profiler's "database". Since FR already records and can dump info like this, we are considering deprecating the onCompletionHook support to save a side thread. (Each PG runs 3 side threads today, which is resource consuming and complicates the code) User can file an issue if additional information needs to be recorded. They can also file an RFC if Flight Recorder needs to accept plugins that customize the recording. Pull Request resolved: pytorch#142390 Approved by: https://github.com/fduwjj, https://github.com/fegin
Stack from ghstack (oldest at bottom):
The usage of
onCompletionHookis mostly similar to what Flight Recorder does today -- for example, measuring how long a collective takes and put it into a profiler's "database".Since FR already records and can dump info like this, we are considering deprecating the onCompletionHook support to save a side thread. (Each PG runs 3 side threads today, which is resource consuming and complicates the code)
User can file an issue if additional information needs to be recorded.
They can also file an RFC if Flight Recorder needs to accept plugins that customize the recording.
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o