[Log] add a hook for recompile user context #157961
Conversation
🔗 Helpful Links: see artifacts and rendered test results at hud.pytorch.org/pr/157961
Note: links to docs will display an error until the docs builds have completed. ❌ 1 Cancelled Job, 1 Unrelated Failure as of commit 2c27345 with merge base a3ec6d6.
CANCELLED JOB - The following job was cancelled. Please retry.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
torch/_dynamo/convert_frame.py (outdated)

    output_codes = Tracker()
    ...
    initial_global_state: Optional[GlobalStateGuard] = None
    recompile_user_contexts: Optional[list[Callable[[], str]]] = None
Users may want to log multiple different user contexts, so use a list here.
add a test?
    )
    metrics_context.update_outer({"recompile_reason": recompile_reason})
    ...
    if recompile_user_contexts:
Do you only want to do this if it's a recompile? It looks like this is called unconditionally?
It should only happen once for the first compile and once for every recompile; it should not happen in other cases. I added a test for that.
We probably don't want to distinguish the first compile from recompiles here, since other info (e.g., compile id) in the dynamo_compile table already tells us whether it's a first compile or a recompile.
Ok, I got fooled by the name. I assumed that you only wanted it called on the first recompile (but not the first compile).
torch/_dynamo/convert_frame.py (outdated)

    metrics_context.update_outer({"recompile_reason": recompile_reason})
    ...
    if recompile_user_contexts:
        user_contexts_msg = "\n".join(
I don't think "\n" is going to render very well in scuba. Maybe you want to separate by semi-colon instead (or some other separator). Alternatively, it's possible for scuba fields to be lists (called normvectors). I commented on the associated diff about that.
We could probably use set[str], following other fields (e.g., compliant_custom_ops)?
Yeah, that's another option. I think they're called "tags" or "tagsets" in scuba?? https://www.internalfb.com/wiki/Scuba/Quick_Start_Guide/Data_Types/
torch/_dynamo/convert_frame.py (outdated)

    ...
    if recompile_user_contexts:
        user_contexts_msg = "\n".join(
            [user_context() for user_context in recompile_user_contexts]
It's kind of dangerous to allow an unbounded string here because if a user decides to add a very long string (by purpose or by mistake), it could cause retention problems. I wonder if there's a more restricted API we could use here. At the very least, we should probably truncate whatever they return to some maximum. Do you have a sense of how big these contexts need to be? I don't know what scuba recommends, but if we don't expect something very large then we could cap to 512, 1k, or something like that?
It should be short. One use case is just "training epoch: some_integer". But I'm hesitant to add an arbitrarily selected cap to the user message. Since users have full control of the log message, shouldn't we allow them to log whatever they want?
Uh... I wouldn't give them complete freedom because the scuba table is a shared resource. So if one customer logs a ton of data, it affects retention for the whole table and we can't debug other jobs. I think it's probably fine to allow the field to be fairly large if you think it's needed. But I wouldn't allow it to be entirely unbounded.
Sure, we can cap each user context at 256 characters.
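A minimal sketch of the capping behavior agreed on above. The cap value follows the discussion, but the function name, separator choice, and join logic are illustrative assumptions, not the merged implementation:

```python
from typing import Callable

# Illustrative cap from the discussion; the merged code may differ.
USER_CONTEXT_CAP = 256

def collect_user_contexts(hooks: list[Callable[[], str]]) -> str:
    # Call each registered hook and truncate its result so a single
    # misbehaving hook cannot bloat the shared dynamo_compile table.
    return "; ".join(hook()[:USER_CONTEXT_CAP] for hook in hooks)
```

With this shape, a 300-character hook result is silently trimmed to 256 characters while short contexts pass through untouched.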
        user_contexts_msg = "\n".join(
            [user_context() for user_context in recompile_user_contexts]
        )
        metrics_context.set("recompile_user_contexts", user_contexts_msg)
What if the user context raises an exception? Do we just want to crash?
Yes. Since users register the context only for themselves, we should let them do whatever they want. It's no different from a user raising an exception in any other PyTorch code.
        user_contexts_msg = "\n".join(
            [user_context() for user_context in recompile_user_contexts]
        )
        metrics_context.set("recompile_user_contexts", user_contexts_msg)
I think you'll also need to add a field to this thing: https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/utils.py#L1234
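For illustration, adding the field suggested above to a metrics dataclass might look like the following. The class name and sibling fields here are assumed for the sketch, not copied from utils.py:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CompilationMetricsSketch:
    # Existing-style fields (names assumed for illustration).
    compile_id: Optional[str] = None
    recompile_reason: Optional[str] = None
    # New optional field carrying the joined user-context strings.
    recompile_user_contexts: Optional[str] = None
```

Defaulting the new field to None keeps existing call sites that construct the metrics object unchanged.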
torch/_dynamo/convert_frame.py (outdated)

    recompile_user_contexts: Optional[list[Callable[[], str]]] = None
    ...

    def register_hook_for_recompile_user_context(hook: Callable[[], str]) -> None:
nit: Isn't this kind of a weird place to put this API? Most of the logging-related stuff is currently in _dynamo/utils.py. Maybe that would be a better location?
@BoyuanFeng has imported this pull request. If you are a Meta employee, you can view this in D78115274.
LGTM
@pytorchbot merge -f "skip unrelated fail"

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary: Users may want to send compile-related but customized logging info to dynamo_compile. One example is logging the current training iteration index when recompilation happens. In general, the current training iteration index is not available to the compiler, since the same compiled function may be called multiple times in the same training iteration. The user can provide the training iteration index in a user hook that torch.compile invokes when recompilation happens. X-link: pytorch/pytorch#157961 Approved by: https://github.com/masnesral Reviewed By: masnesral Differential Revision: D78115274 fbshipit-source-id: 6e2023146772734515a356139aaac9263a9ece2a
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames