[AOTI] Switch GPU codegen to one-pass by desertfire · Pull Request #141980 · pytorch/pytorch

Conversation

@desertfire (Contributor) commented on Dec 3, 2024

Stack from ghstack (oldest at bottom):

Summary: With autotune_at_compile_time enabled, AOTI can now perform CUDA codegen in a single pass: CUDA kernel-related code is generated in a deferred way, after autotuning completes. The one-pass implementation eliminates the class of issues caused by disparities between the two passes of the previous implementation (the source of multiple bug reports in the past). It also avoids cloning the mutated inputs that the two-pass implementation required, reducing GPU memory consumption.
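For readers who want to try this mode, here is a minimal sketch of enabling compile-time autotuning for an AOTI compile. The `aoti_compile_and_package` call and the `triton.autotune_at_compile_time` config key reflect my reading of the public Inductor surface and may differ across PyTorch versions; `TinyModel` is a hypothetical stand-in:

```python
import torch
import torch.nn as nn
from torch._inductor import aoti_compile_and_package

# A tiny stand-in model, for illustration only.
class TinyModel(nn.Module):
    def forward(self, x):
        return torch.relu(x @ x.T)

model = TinyModel().cuda().eval()
example_inputs = (torch.randn(8, 16, device="cuda"),)

# Export, then AOTI-compile with compile-time autotuning enabled, so
# CUDA codegen can run in a single pass and kernel-related code is
# finalized only after autotuning has selected kernel configs.
ep = torch.export.export(model, example_inputs)
pkg_path = aoti_compile_and_package(
    ep,
    inductor_configs={"triton.autotune_at_compile_time": True},
)
```

With the flag off, AOTI falls back to its previous behavior; the option only affects ahead-of-time compilation paths, not eager `torch.compile` runs.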

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @chauhang @aakhundov

Differential Revision: D66739414

[ghstack-poisoned]
pytorch-bot bot commented Dec 3, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141980

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 411d5cc with merge base 05c1f37:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

desertfire requested review from chenyang78 and then removed the request on December 3, 2024 at 22:56
@desertfire (Contributor, Author) has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

pytorch-bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Dec 4, 2024
desertfire added a commit that referenced this pull request Dec 4, 2024
Summary: same as the PR description above.

ghstack-source-id: 074d07e
Pull Request resolved: #141980
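Conceptually, the "deferred" kernel codegen the summary describes can be pictured as emitting placeholder lines during the single codegen pass and materializing them once autotuning has chosen kernel configs. A toy sketch of the idea follows; the class and field names here are illustrative, not Inductor's actual deferred-line machinery:

```python
# Conceptual sketch of deferred codegen: placeholder lines are emitted
# during the single codegen pass and materialized after autotuning.
class DeferredKernelLine:
    def __init__(self, kernel_name, template):
        self.kernel_name = kernel_name
        self.template = template  # e.g. "launch {name} with grid {grid}"

    def materialize(self, autotune_results):
        # Fill in the parts only known after autotuning (best config, grid).
        best = autotune_results[self.kernel_name]
        return self.template.format(name=self.kernel_name, grid=best["grid"])

lines = [
    "// wrapper preamble",
    DeferredKernelLine("triton_mm_0", "launch {name} with grid {grid}"),
]
autotune_results = {"triton_mm_0": {"grid": "(32, 1, 1)"}}  # filled in by the autotuner
code = "\n".join(
    l.materialize(autotune_results) if isinstance(l, DeferredKernelLine) else l
    for l in lines
)
print(code)
```

Because nothing kernel-specific is committed to the wrapper until after autotuning, there is no second codegen pass whose output could diverge from the first.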
@desertfire (Contributor, Author) has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

desertfire added a commit that referenced this pull request Dec 4, 2024
Summary: same as the PR description above.

ghstack-source-id: 916ea46
Pull Request resolved: #141980
@desertfire (Contributor, Author) has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@chenyang78 (Contributor) left a comment:

LGTM overall. Thanks!

@desertfire (Contributor, Author) has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

desertfire added a commit that referenced this pull request Dec 9, 2024
Summary: Update multi-kernel codegen to one-pass, following #141980.

ghstack-source-id: 543ef77
Pull Request resolved: #142333
@desertfire (Contributor, Author) has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Dec 9, 2024
Summary: Update multi-kernel codegen to one-pass, following #141980.

Differential Revision: [D66936717](https://our.internmc.facebook.com/intern/diff/D66936717)
Pull Request resolved: #142333
Approved by: https://github.com/chenyang78
ghstack dependencies: #141980
facebook-github-bot pushed a commit that referenced this pull request Dec 12, 2024
Summary:

Switching to one-pass codegen in #141980 introduced a grid-computation issue: when max-autotune is turned on, incorrect grid code is generated in some cases.

Reviewed By: henrylhtsang

Differential Revision: D67120987
pytorchmergebot pushed a commit that referenced this pull request Dec 13, 2024
Summary: Switching to one-pass codegen in #141980 introduced a grid-computation issue: when max-autotune is turned on, incorrect grid code is generated in some cases.

Reviewed By: henrylhtsang

Differential Revision: D67120987

Pull Request resolved: #143098
Approved by: https://github.com/henrylhtsang
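For reference, the regime the follow-up fix targets is the combination of max-autotune with one-pass AOTI codegen. A hedged sketch of the relevant knobs, assuming the public `torch._inductor.config` option names (which may vary between releases):

```python
import torch._inductor.config as inductor_config

# Configuration combination under which the incorrect grid codegen
# surfaced, per the commit summaries above. Option names follow the
# public inductor config and may differ between PyTorch releases.
inductor_config.max_autotune = True
inductor_config.triton.autotune_at_compile_time = True
```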
github-actions bot deleted the gh/desertfire/516/head branch on January 9, 2025 at 02:21