[aoti] Remove dir after packaging by angelayi · Pull Request #140022 · pytorch/pytorch · GitHub

Contributor

@angelayi angelayi commented Nov 7, 2024

Update AOTI to return a list of files that it generates when aot_inductor.package=True. Then we will only package the files that are in that list.

This should fix the caching issue and hopefully #140053.
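
A minimal sketch of the idea, not the actual AOTI code: the compile step reports the files it generated, and the packaging step zips only those. The helper name `package_generated_files`, the file names, and the `model.pt2` archive name are hypothetical and chosen for illustration.

```python
import os
import tempfile
import zipfile
from typing import List


def package_generated_files(files: List[str], archive_path: str) -> str:
    """Zip only the listed files (hypothetical helper, not the AOTI API)."""
    with zipfile.ZipFile(archive_path, "w") as archive:
        for path in files:
            # Store each file under its base name so the archive stays flat.
            archive.write(path, arcname=os.path.basename(path))
    return archive_path


def demo() -> List[str]:
    with tempfile.TemporaryDirectory() as workdir:
        # Files "generated" by the current compilation.
        generated = []
        for name in ("model.so", "kernel_a.cubin"):
            path = os.path.join(workdir, name)
            with open(path, "wb") as f:
                f.write(b"current")
            generated.append(path)

        # A stale artifact left behind by an earlier run in the same directory.
        with open(os.path.join(workdir, "kernel_old.cubin"), "wb") as f:
            f.write(b"stale")

        archive_path = package_generated_files(
            generated, os.path.join(workdir, "model.pt2")
        )
        with zipfile.ZipFile(archive_path) as archive:
            return sorted(archive.namelist())
```

In this sketch the stale `kernel_old.cubin` stays on disk but never reaches the archive, because packaging is driven by the returned list rather than by the directory contents.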

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@angelayi angelayi requested a review from desertfire November 7, 2024 17:32
pytorch-bot bot commented Nov 7, 2024

✅ No Failures

As of commit 464e2a8 with merge base 22dfb5b (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

@angelayi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 7, 2024
Contributor

@desertfire desertfire left a comment

This doesn't solve the problem that different runs can generate different cubin files, so we can end up packaging unnecessary cubin files.

I think a better way to solve this is in codecache.py: use a unique subdirectory to store the .so and other relevant files, and package them afterwards. This way, all the previous auto-tuning results are kept and we will not package unnecessary files.
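
A rough sketch of what a unique subdirectory per run could look like. This is an illustration only and does not reflect how codecache.py is actually structured; the function names, the `aoti_` prefix, and the placeholder file names are assumptions.

```python
import os
import shutil
import tempfile


def compile_into_unique_subdir(cache_root: str) -> str:
    """Create a fresh subdirectory under cache_root and write outputs there."""
    os.makedirs(cache_root, exist_ok=True)
    # mkdtemp picks a name that does not collide with earlier runs.
    out_dir = tempfile.mkdtemp(prefix="aoti_", dir=cache_root)
    for name in ("model.so", "kernel.cubin"):
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(b"placeholder")
    return out_dir


def package_dir(out_dir: str, archive_base: str) -> str:
    """Zip everything in out_dir; out_dir holds the outputs of one run only."""
    return shutil.make_archive(archive_base, "zip", root_dir=out_dir)


def demo():
    with tempfile.TemporaryDirectory() as cache_root:
        first = compile_into_unique_subdir(cache_root)
        second = compile_into_unique_subdir(cache_root)
        archive = package_dir(second, os.path.join(cache_root, "pkg"))
        # Both run directories survive, so earlier results stay cached.
        return first != second, os.path.isdir(first), os.path.basename(archive)
```

Because each run writes to its own directory, packaging a whole directory cannot pick up files from another run, and nothing has to be deleted to keep the package clean.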

@desertfire
Contributor

Also your test script should still be added as a unit test. It should work with some tweaks.

@desertfire
Contributor

I see you are actually deleting the whole directory afterwards. This doesn't address the request to keep the cache that Henry raised.

@henrylhtsang
Contributor

This can unblock us (previously we would run into errors), but I would like to see a way to cache things so that iteration speed improves.

For torch.compile, subsequent runs take 1/10 of the time to compile thanks to the local cache. It would be nice if AOTI had a similar feature.

@angelayi angelayi force-pushed the angelayi/package_cache branch from e7a3730 to dbc512a Compare November 9, 2024 01:05
@angelayi angelayi requested a review from desertfire November 9, 2024 01:08
@angelayi angelayi force-pushed the angelayi/package_cache branch 2 times, most recently from 48bb762 to b4339ac Compare November 11, 2024 17:56
Contributor

@desertfire desertfire left a comment

LGTM overall.

The CI "out of disk" failure is real. I think it's related to the fact that when we call aoti_load_package, we unzip the files to the top-level /tmp directory, which is not removed after each benchmark run. The large weight files gradually eat up the disk and eventually trigger the error.
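
One way to avoid this kind of buildup, sketched with the standard library only: extract into a scoped temporary directory that is deleted when the work is done. This does not show what aoti_load_package does internally, and a real loader may need the extracted files to outlive the call; `extract_and_use` and the file names are hypothetical.

```python
import os
import tempfile
import zipfile


def extract_and_use(archive_path: str):
    """Extract into a scoped temp dir, use the files, then clean up."""
    with tempfile.TemporaryDirectory() as extract_dir:
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(extract_dir)
        # Stand-in for loading the model: list what was extracted.
        names = sorted(os.listdir(extract_dir))
    # The directory is gone once the with-block exits.
    return names, os.path.exists(extract_dir)


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        archive_path = os.path.join(workdir, "model.pt2")
        with zipfile.ZipFile(archive_path, "w") as archive:
            archive.writestr("model.so", b"code")
            archive.writestr("weights.bin", b"weights")
        return extract_and_use(archive_path)
```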

Contributor

This feels like a complicated API. Why not always return a list? For .so it would just be a single-element list.

Contributor Author

I don't want to break any existing callsites 😅 but yes! I can fix the rest of the callsites to return a list in a followup.
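
Until every callsite returns a list, a caller could normalize the two return shapes discussed here. This is only a sketch; `as_file_list` is a hypothetical helper, not something in the PR.

```python
from typing import List, Union


def as_file_list(result: Union[str, List[str]]) -> List[str]:
    """Normalize a single path or a list of paths to a list of paths."""
    if isinstance(result, str):
        # Non-package mode: a single .so path.
        return [result]
    # Package mode: already a list of generated files; copy it defensively.
    return list(result)
```

With such a helper, downstream code can always iterate over a list, whichever form the compile step returned.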

Contributor

Perhaps a discussion for a different PR, but should it really be .so on all platforms? Wouldn't it be more reasonable to check sysconfig.get_config_var('EXT_SUFFIX')?
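
For reference, a small sketch of querying that value. Note that `EXT_SUFFIX` is the suffix Python uses for its own extension modules, so it usually carries an ABI tag as well as the platform extension. Whether that is the right suffix for AOTI's output is an open question here, and the `.so` fallback below is an assumption.

```python
import sysconfig


def extension_suffix() -> str:
    """Return the platform's Python extension-module suffix."""
    suffix = sysconfig.get_config_var("EXT_SUFFIX")
    # Fall back to ".so" if the variable is unset (assumed default).
    return suffix if suffix else ".so"


if __name__ == "__main__":
    print(extension_suffix())
```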

@angelayi angelayi force-pushed the angelayi/package_cache branch 2 times, most recently from f3e385f to 423e0cb Compare November 12, 2024 01:29
pytorchmergebot added a commit that referenced this pull request Nov 13, 2024
This reverts commit ba136a7.

Reverted #140022 on behalf of https://github.com/angelayi due to sorry I realized I need to land from internal ([comment](#140022 (comment)))
@pytorchmergebot
Collaborator

@angelayi your PR has been successfully reverted.

Summary:
Update AOTI to return a list of files that it generates when `aot_inductor.package=True`. Then we will only package the files that are in that list. 

This should fix the [caching issue](https://fb.workplace.com/groups/1028545332188949/permalink/1081702043539944/) and hopefully #140053.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov


Reviewed By: pianpwk

Differential Revision: D65862850

Pulled By: angelayi
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D65862850

pytorch-bot bot pushed a commit that referenced this pull request Nov 14, 2024
Summary: Reland  #140022

Test Plan: CI

Differential Revision: D65929964
@angelayi angelayi closed this Nov 14, 2024
pytorchmergebot pushed a commit that referenced this pull request Nov 15, 2024
Summary: Reland  #140022

Test Plan: CI

Differential Revision: D65929964

Pull Request resolved: #140675
Approved by: https://github.com/desertfire
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
Update AOTI to return a list of files that it generates when `aot_inductor.package=True`. Then we will only package the files that are in that list.

This should fix the [caching issue](https://fb.workplace.com/groups/1028545332188949/permalink/1081702043539944/) and hopefully pytorch#140053.

Pull Request resolved: pytorch#140022
Approved by: https://github.com/larryliu0820, https://github.com/desertfire, https://github.com/malfet
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
This reverts commit 8c6abe5.

Reverted pytorch#140022 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the lint failure is legit ([comment](pytorch#140022 (comment)))
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
Update AOTI to return a list of files that it generates when `aot_inductor.package=True`. Then we will only package the files that are in that list.

This should fix the [caching issue](https://fb.workplace.com/groups/1028545332188949/permalink/1081702043539944/) and hopefully pytorch#140053.

Pull Request resolved: pytorch#140022
Approved by: https://github.com/larryliu0820, https://github.com/desertfire, https://github.com/malfet
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
This reverts commit ba136a7.

Reverted pytorch#140022 on behalf of https://github.com/angelayi due to sorry I realized I need to land from internal ([comment](pytorch#140022 (comment)))
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024

Summary: Reland  pytorch#140022

Test Plan: CI

Differential Revision: D65929964

Pull Request resolved: pytorch#140675
Approved by: https://github.com/desertfire
@github-actions github-actions bot deleted the angelayi/package_cache branch December 15, 2024 02:16