[AOTI] Switch GPU codegen to one-pass #141980
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141980
✅ No failures as of commit 411d5cc with merge base 05c1f37. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
LGTM overall. Thanks!
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Summary: Update multi-kernel codegen to one-pass, following #141980.
Differential Revision: [D66936717](https://our.internmc.facebook.com/intern/diff/D66936717)
Pull Request resolved: #142333
Approved by: https://github.com/chenyang78
ghstack dependencies: #141980
Summary: There is a grid computation issue after switching to one-pass codegen in #141980: when max-autotune is turned on, the generated launch grid is incorrect in some cases. This follow-up fixes it.
Reviewed By: henrylhtsang
Differential Revision: D67120987
Pull Request resolved: #143098
Approved by: https://github.com/henrylhtsang
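For readers less familiar with Triton autotuning, the sketch below (plain Triton, not Inductor's internal codegen) illustrates why the launch grid is tied to the autotuned config: the grid is computed from the selected BLOCK_SIZE, so any code that emits the grid expression must be deferred until, or parameterized by, the winning config. The kernel, configs, and shapes are illustrative assumptions, not code from this PR.

```python
# Illustrative only: shows why a kernel's launch grid depends on the
# config chosen by autotuning.
import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 128}),
        triton.Config({"BLOCK_SIZE": 1024}),
    ],
    key=["n_elements"],
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
n = x.numel()

# The grid callable receives the selected config's meta-parameters, so the
# number of launched blocks is only known after autotuning picks BLOCK_SIZE.
grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
add_kernel[grid](x, y, out, n)
```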
Stack from ghstack (oldest at bottom):
Summary: With autotune_at_compile_time enabled, AOTI can now perform CUDA codegen in one pass: CUDA kernel-related code is generated in a deferred way, after autotuning is done. This eliminates the issues caused by disparities between passes in the previous two-pass implementation (the source of multiple bug reports in the past). The one-pass implementation also avoids cloning mutated inputs, which the two-pass implementation required, reducing GPU memory consumption.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @chauhang @aakhundov
Differential Revision: D66739414
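For illustration, here is a minimal, hedged sketch of how a user might exercise the path described in the summary: compiling a model ahead of time with AOT Inductor while compile-time autotuning is turned on. The module, shapes, and the exact spelling of the config keys ("max_autotune", "triton.autotune_at_compile_time") are assumptions based on the PR summary and common Inductor config names; check torch._inductor.config for the authoritative names.

```python
# A minimal sketch (not from this PR) of enabling compile-time autotuning
# when building a model with AOT Inductor.
import torch


class M(torch.nn.Module):
    def forward(self, x, y):
        return (x @ y).relu()


model = M().cuda()
example_inputs = (
    torch.randn(64, 64, device="cuda"),
    torch.randn(64, 64, device="cuda"),
)

# aot_compile lowers the model ahead of time and returns the path to a
# compiled shared library; the options dict patches inductor config.
so_path = torch._export.aot_compile(
    model,
    example_inputs,
    options={
        "max_autotune": True,                     # autotune Triton/template kernels
        "triton.autotune_at_compile_time": True,  # assumed key: tune during compilation
    },
)
print(so_path)
```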