KEMBAR78
[Log] add a hook for recompile user context by BoyuanFeng · Pull Request #157961 · pytorch/pytorch · GitHub
Skip to content

Conversation

@BoyuanFeng
Copy link
Contributor

@BoyuanFeng BoyuanFeng commented Jul 9, 2025

Users may want compile-related but customized logging info to dynamo_compile. One example is to logging the current training iteration index when recompilation happens. In general, current training iteration index is not available to compiler, since the same compiled function may be called multiple times in the same training iteration. The user could provide the training iteration index in a user hook where torch.compile logs it when recompilation happens.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames

@BoyuanFeng BoyuanFeng added ciflow/trunk Trigger trunk jobs on your pull request topic: not user facing topic category module: dynamo labels Jul 9, 2025
@pytorch-bot
Copy link

pytorch-bot bot commented Jul 9, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157961

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job, 1 Unrelated Failure

As of commit 2c27345 with merge base a3ec6d6 (image):

CANCELLED JOB - The following job was cancelled. Please retry:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

output_codes = Tracker()

initial_global_state: Optional[GlobalStateGuard] = None
recompile_user_contexts: Optional[list[Callable[[], str]]] = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

users may want log multiple different user contexts so use a list here

Copy link
Member

@williamwen42 williamwen42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a test?

)
metrics_context.update_outer({"recompile_reason": recompile_reason})

if recompile_user_contexts:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you only want to do this if it's a recompile? It looks like this is called unconditionally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should only happen once for the first compile and once for every recompile. It should not happen for other cases. I added a test for that.

We probably don't want to distinguish the first compile and recompile here since other info (e.g., compile id) on dynamo_compile table tells us whether it's first compile or recompile.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I got fooled by the name. I assumed that you only wanted it called on the first recompile (but not the first compile).

metrics_context.update_outer({"recompile_reason": recompile_reason})

if recompile_user_contexts:
user_contexts_msg = "\n".join(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think "\n" is going to render very well in scuba. Maybe you want to separate by semi-colon instead (or some other separator). Alternatively, it's possible for scuba fields to be lists (called normvectors). I commented on the associated diff about that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably can use set[str] following other fields (e.g., compliant_custom_ops)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's another option. I think they're called "tags" or "tagsets" in scuba?? https://www.internalfb.com/wiki/Scuba/Quick_Start_Guide/Data_Types/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set becomes tagset here. Originally added by Richard in D50884828


if recompile_user_contexts:
user_contexts_msg = "\n".join(
[user_context() for user_context in recompile_user_contexts]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of dangerous to allow an unbounded string here because if a user decides to add a very long string (by purpose or by mistake), it could cause retention problems. I wonder if there's a more restricted API we could use here. At the very least, we should probably truncate whatever they return to some maximum. Do you have a sense of how big these contexts need to be? I don't know what scuba recommends, but if we don't expect something very large then we could cap to 512, 1k, or something like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be short. One use case is just "training epoch: some_integer". But I am hesitating to add an arbitrarily selected cap to user message. Since users have full control of the log message, we should allow users to log what they want?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh... I wouldn't give them complete freedom because the scuba table is a shared resource. So if one customer logs a ton of data, it affects retention for the whole table and we can't debug other jobs. I think it's probably fine to allow the field to be fairly large if you think it's needed. But I wouldn't allow it to be entirely unbounded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure we can cap each user context at 256 characters.

user_contexts_msg = "\n".join(
[user_context() for user_context in recompile_user_contexts]
)
metrics_context.set("recompile_user_contexts", user_contexts_msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the user context raises an exception? Do we just want to crash?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Since a user registers the context only for himself/herself, we should allow users to do whatever they want. It's like a user can raise an exception in other pytorch code.

user_contexts_msg = "\n".join(
[user_context() for user_context in recompile_user_contexts]
)
metrics_context.set("recompile_user_contexts", user_contexts_msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you'll also need to add a field to this thing: https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/utils.py#L1234

recompile_user_contexts: Optional[list[Callable[[], str]]] = None


def register_hook_for_recompile_user_context(hook: Callable[[], str]) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Isn't this kind of a weird place to put this API? Most of the logging-related stuff is currently in _dynamo/utils.py. Maybe that would be a better location?

@facebook-github-bot
Copy link
Contributor

@BoyuanFeng has imported this pull request. If you are a Meta employee, you can view this in D78115274.

Copy link
Contributor

@masnesral masnesral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@BoyuanFeng
Copy link
Contributor Author

@pytorchbot merge -f "skip unrelated fail"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

facebook-github-bot pushed a commit to pytorch/benchmark that referenced this pull request Jul 14, 2025
Summary:
Users may want compile-related but customized logging info to dynamo_compile. One example is to logging the current training iteration index when recompilation happens. In general, current training iteration index is not available to compiler, since the same compiled function may be called multiple times in the same training iteration. The user could provide the training iteration index in a user hook where torch.compile logs it when recompilation happens.

X-link: pytorch/pytorch#157961
Approved by: https://github.com/masnesral

Reviewed By: masnesral

Differential Revision: D78115274

fbshipit-source-id: 6e2023146772734515a356139aaac9263a9ece2a
@github-actions github-actions bot deleted the bf/recompile-context branch August 11, 2025 02:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants