-
Notifications
You must be signed in to change notification settings - Fork 25.7k
[AOTInductor] Update performance benchmark code #109560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109560
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 83eaaec with merge base b91ba22 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
35165da
to
827bf15
Compare
torch/_inductor/utils.py
Outdated
void run( | ||
std::vector<at::Tensor>& input_tensors, | ||
std::vector<at::Tensor>& output_tensors) { | ||
class AOTInductorModelContainer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use a different name as this can cause confusion with
using AOTInductorModelContainerHandle = AOTInductorModelContainerOpaque*; |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud |
b89ce6b
to
83eaaec
Compare
@angelayi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: Same as #109560, made a new PR because we need to land from internal Previously during performance benchmark testing, we would create an AOTInductorModelContainerHandle every time the compiled function is run with new inputs. However after #108473 we now load the constants needed in the runtime when initializing the AOTInductorModelContainerHandle. This resulted in our benchmarks displaying a ~0.4x speedup. This diff moves the initialization of AOTInductorModelContainerHandle outside of the code where we run the compiled function with different inputs. For example, ``` python benchmarks/dynamo/huggingface.py --performance --cold-start-latency --inference --bfloat16 --export-aot-inductor --disable-cudagraphs --device cuda --total-partitions 3 --partition-id 0 --only AlbertForMaskedLM ``` results in `1.359x` speedup. Specifically, this adds a `create_container_handle` and `delete_container_handle` function which need to called before `run`. We call `create_container_handle` to initialize the AOTInductorModelContainerHandle, call `run` to run the compiled .so with different inputs, and then `delete_container_handle` to delete it. [Updated dashboard results](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2013%20Sep%202023%2021%3A03%3A55%20GMT&stopTime=Wed%2C%2020%20Sep%202023%2021%3A03%3A55%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=angelayi/aot_inductor_benchmark&lCommit=f9aa49c4c9a1a140b6f0c4520d1d6d99b57e12fa&rBranch=main&rCommit=015be4cedba357eb931e24bf188479235db7c5c8) Test Plan: CI Differential Revision: D49513934 Pull Request resolved: #109820 Approved by: https://github.com/desertfire
Summary: Change AOTInductor to directly return output tensors instead of taking pre-allocated output tensors to return the results. This gives several benefits: * It makes sure AOTInductor has the same behavior when managing the output tensors as the default Inductor, which is widely tested and thus more reliable. * As we have debugged before, there are cases we still have to codegen extra copy_ ops to fill the pre-allocated output tensors which doesn't make sense for performance. * With the coming enhanced memory planning, this again will make sure the memory planning logic is the between AOTInductor and Inductor, which will greatly simplify the problem and improve the reliability. This change also combines D49494954 from Yang and #109560 from Angela. Differential Revision: D49502318 Pull Request resolved: #109790 Approved by: https://github.com/chenyang78
Previously during performance benchmark testing, we would create an AOTInductorModelContainerHandle every time the compiled function is run with new inputs. However after #108473 we now load the constants needed in the runtime when initializing the AOTInductorModelContainerHandle. This resulted in our benchmarks displaying a ~0.4x speedup.
This diff moves the initialization of AOTInductorModelContainerHandle outside of the code where we run the compiled function with different inputs.
For example,
results in
1.359x
speedup.Specifically, this adds a
create_container_handle
anddelete_container_handle
function which need to called beforerun
. We callcreate_container_handle
to initialize the AOTInductorModelContainerHandle, callrun
to run the compiled .so with different inputs, and thendelete_container_handle
to delete it.Updated dashboard results
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov