[AOTI] Switch GPU codegen to one-pass #141980
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141980
✅ No failures as of commit 411d5cc with merge base 05c1f37. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
LGTM overall. Thanks!
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Summary: Update multi-kernel codegen to one-pass, following #141980.
Differential Revision: [D66936717](https://our.internmc.facebook.com/intern/diff/D66936717)
Pull Request resolved: #142333
Approved by: https://github.com/chenyang78
ghstack dependencies: #141980
Summary: There is a grid computation issue after switching to one-pass codegen in #141980: when max-autotune is turned on, the generated launch grid is incorrect in some cases. This follow-up fixes it.
Reviewed By: henrylhtsang
Differential Revision: D67120987
Pull Request resolved: #143098
Approved by: https://github.com/henrylhtsang
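For readers less familiar with Triton autotuning, the sketch below (plain Triton, not Inductor's internal codegen) illustrates why the launch grid is tied to the autotuned config: the grid is computed from the selected BLOCK_SIZE, so any code that emits the grid expression must be deferred until, or parameterized by, the winning config. The kernel, configs, and shapes are illustrative assumptions, not code from this PR.

```python
# Illustrative only: shows why a kernel's launch grid depends on the
# config chosen by autotuning.
import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 128}),
        triton.Config({"BLOCK_SIZE": 1024}),
    ],
    key=["n_elements"],
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
n = x.numel()

# The grid callable receives the selected config's meta-parameters, so the
# number of launched blocks is only known after autotuning picks BLOCK_SIZE.
grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
add_kernel[grid](x, y, out, n)
```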
Stack from ghstack (oldest at bottom):
Summary: With autotune_at_compile_time enabled, AOTI can now perform CUDA codegen in one pass: CUDA kernel-related code is generated in a deferred way, after autotuning is done. This eliminates the issues caused by disparities between passes in the previous two-pass implementation (the source of multiple bug reports in the past). The one-pass implementation also avoids cloning mutated inputs, which the two-pass implementation required, reducing GPU memory consumption.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @chauhang @aakhundov
Differential Revision: D66739414
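For illustration, here is a minimal, hedged sketch of how a user might exercise the path described in the summary: compiling a model ahead of time with AOT Inductor while compile-time autotuning is turned on. The module, shapes, and the exact spelling of the config keys ("max_autotune", "triton.autotune_at_compile_time") are assumptions based on the PR summary and common Inductor config names; check torch._inductor.config for the authoritative names.

```python
# A minimal sketch (not from this PR) of enabling compile-time autotuning
# when building a model with AOT Inductor.
import torch


class M(torch.nn.Module):
    def forward(self, x, y):
        return (x @ y).relu()


model = M().cuda()
example_inputs = (
    torch.randn(64, 64, device="cuda"),
    torch.randn(64, 64, device="cuda"),
)

# aot_compile lowers the model ahead of time and returns the path to a
# compiled shared library; the options dict patches inductor config.
so_path = torch._export.aot_compile(
    model,
    example_inputs,
    options={
        "max_autotune": True,                     # autotune Triton/template kernels
        "triton.autotune_at_compile_time": True,  # assumed key: tune during compilation
    },
)
print(so_path)
```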