KEMBAR78
codecache: pull out some Graph serialization code into common helpers by aorenste · Pull Request #141502 · pytorch/pytorch · GitHub
Skip to content

Conversation

@aorenste
Copy link
Contributor

@aorenste aorenste commented Nov 25, 2024

Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile.

Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache.

Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@pytorch-bot
Copy link

pytorch-bot bot commented Nov 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141502

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 530614a with merge base 4959784 (image):

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

aorenste added a commit that referenced this pull request Nov 25, 2024
ghstack-source-id: 47f3ac3
Pull Request resolved: #141502
@aorenste aorenste changed the title WIP: prepare_for_serialization codecache: make some Graph serialization common Nov 25, 2024
@aorenste aorenste changed the title codecache: make some Graph serialization common codecache: pull out some Graph serialization code into common helpers Nov 25, 2024
…mon helpers"

Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile.

Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache.




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
@aorenste aorenste added the topic: not user facing topic category label Nov 25, 2024
…mon helpers"

Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile.

Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache.




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
@aorenste aorenste requested a review from oulgen November 26, 2024 17:31
@aorenste aorenste marked this pull request as ready for review November 26, 2024 17:31
…mon helpers"

Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile.

Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache.




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
…mon helpers"

Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile.

Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache.




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
…mon helpers"

Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile.

Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache.




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
…mon helpers"

Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile.

Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache.




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
…mon helpers"

Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile.

Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache.




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
…mon helpers"

Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile.

Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache.




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
@jamesjwu jamesjwu requested a review from ezyang December 3, 2024 17:08
# models to disk.
self.current_callable = None

def after_deserialization(self, gm: Optional[torch.fx.GraphModule]) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ezyang , this is basically just pulling out the code you wanted to refactor into post_compile, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also qq (maybe for @aorenste) do we do this serialization even when the outputcode is generated via AOTI (not CompiledFxGraph)? I.e. does it belong in OutputCode, or does it belong in CompiledFxGraph?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we want this separate from post_compile (there's probably a good reason, I'm just not seeing it). I am not sure why this isn't a shared method in the top level protocol.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may not be a good reason. I did all this code before post_compile/OutputCode existed - so it may make sense to merge it all together better.

Copy link
Contributor

@jamesjwu jamesjwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual purpose of the PR seems good, but we may want to change/bikeshed some names around to figure out the order these functions should be called.

I think it's confusing to both ahve a post_compile step and a after_deserialization step, but I also think after_deserialization is a better name. I guess concretely I think, assuming this after_deserialization always runs on all types of OutputCodes:

  • Define OutputCode.post_serialize to be this after_deserialization function
  • Rename CompiledFxGraph.post_compile to CompiledFxGraph.post_serialize, and have CompiledFxGraph inherit from OutputCode's definition by calling OutputCode first.

That said, today CompiledFxGraph isn't even an actual child class of CompiledFxGraph. Not sure what's stopping us from making that happen though. Will wait for @ezyang 's thoughts.

except OSError:
# Not expected, but in case the PyCodeCache entry is removed from
# underneath us, treat it as a cache miss and recompile.
log.error("Failed to load cached artifact: %s", artifact_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the log line seems worse

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log still happens - just inside after_deserialization which will raise an exception (because it doesn't know if ignoring the exception is the right thing or not).

get_metrics_context().increment("num_triton_bundles", 1)

inductor_meta = autotune_cache.inductor_meta_from_config()
AutotuneCacheBundler.begin_compile(inductor_meta, code=code)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this after the OSError catch looks like a bug fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really. The problem is that in order to look up the autotune cache we need the artifact code - but we don't have the artifact code until after after_deserialization() is called. AutotuneCacheBundler.begin_compile() just needs to get called sometime after you have the artifact code and before the first autotune occurs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug fix is having begun the compile and then immediately erroring out (without ending compile)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you're saying. Yeah - the AutotuneCacheBundler is a weird thing because our compile doesn't really have a good place to "wrap" (it would have to be somewhere in the dynamo call bytecode handler). begin_compile really just means "go fetch the compiled artifacts and spit them out on disk as local artifacts". end_compile really just means "collect all the local artifacts and bundle them into a single remote artifact". So not having an end doesn't really mean anything bad - it's supposed to handle that case properly (since it can happen in lots of ways).

# so we serialize their PyCodeCache disk cache location instead.
# TODO: This could be better if we're ever able to serialize compiled
# models to disk.
self.current_callable = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is OK for refactoring but I don't want a "mutate the object so I can serialize it" API to be the final way of doing this. Definitely want a better API.

from .graph import GraphLowering

# This is used by tests to check the output for specific details.
GraphLowering.save_output_code(code)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was moved here

Copy link
Contributor

@ezyang ezyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like forward progress, probably need to work on this code more though

pytorchmergebot pushed a commit that referenced this pull request Dec 3, 2024
- Turn fx_codegen_and_compile() into a class (FxCompile) so we can override the implementation.
- Pull the current body into an implementation (_InProcessFxCompile) which just performs the existing behavior.
- Add an async interface. (See below)

The intended future behavior of Async Compile will be to allow dynamo functions to start compiling in the background (and on a separate machine) while we continue to run eager in the foreground. As such we'll need to put the compilation behind some kind of Future implementation - it makes sense to reuse the existing python futures for that.  An async function is just a syntactic way to return an asyncio.Future.

Because asyncio.run() adds confusion to the stack traces when the called function isn't actually being used in an asynchronous way we also provide a synchronous interface which can be directly called.

Pull Request resolved: #141505
Approved by: https://github.com/ezyang
ghstack dependencies: #141502
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
…pytorch#141502)

Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile.

Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache.

Pull Request resolved: pytorch#141502
Approved by: https://github.com/ezyang
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
- Turn fx_codegen_and_compile() into a class (FxCompile) so we can override the implementation.
- Pull the current body into an implementation (_InProcessFxCompile) which just performs the existing behavior.
- Add an async interface. (See below)

The intended future behavior of Async Compile will be to allow dynamo functions to start compiling in the background (and on a separate machine) while we continue to run eager in the foreground. As such we'll need to put the compilation behind some kind of Future implementation - it makes sense to reuse the existing python futures for that.  An async function is just a syntactic way to return an asyncio.Future.

Because asyncio.run() adds confusion to the stack traces when the called function isn't actually being used in an asynchronous way we also provide a synchronous interface which can be directly called.

Pull Request resolved: pytorch#141505
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#141502
AmdSampsa pushed a commit to AmdSampsa/pytorch that referenced this pull request Dec 9, 2024
…pytorch#141502)

Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile.

Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache.

Pull Request resolved: pytorch#141502
Approved by: https://github.com/ezyang
AmdSampsa pushed a commit to AmdSampsa/pytorch that referenced this pull request Dec 9, 2024
- Turn fx_codegen_and_compile() into a class (FxCompile) so we can override the implementation.
- Pull the current body into an implementation (_InProcessFxCompile) which just performs the existing behavior.
- Add an async interface. (See below)

The intended future behavior of Async Compile will be to allow dynamo functions to start compiling in the background (and on a separate machine) while we continue to run eager in the foreground. As such we'll need to put the compilation behind some kind of Future implementation - it makes sense to reuse the existing python futures for that.  An async function is just a syntactic way to return an asyncio.Future.

Because asyncio.run() adds confusion to the stack traces when the called function isn't actually being used in an asynchronous way we also provide a synchronous interface which can be directly called.

Pull Request resolved: pytorch#141505
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#141502
@github-actions github-actions bot deleted the gh/aorenste/145/head branch January 3, 2025 02:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants