KEMBAR78
[ONNX] A better way to safe guard 2GB model serialization by BowenBao · Pull Request #111984 · pytorch/pytorch · GitHub
Skip to content

Conversation

@BowenBao
Copy link
Collaborator

@BowenBao BowenBao commented Oct 25, 2023

Stack from ghstack (oldest at bottom):

Summary

  • faster than previous try-catch.
  • more stable than previous try-catch. In some circumstances serializing models > 2GB into a single protobuf file ends up with a corrupted file without raising an exception.

@pytorch-bot pytorch-bot bot added the release notes: onnx torch.onnx related changes that should show up in the release notes label Oct 25, 2023
@pytorch-bot
Copy link

pytorch-bot bot commented Oct 25, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111984

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ab2f075 with merge base b01e87d (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

BowenBao added a commit that referenced this pull request Oct 25, 2023
@BowenBao BowenBao added the topic: improvements topic category label Oct 25, 2023
@BowenBao BowenBao marked this pull request as ready for review October 25, 2023 00:52
Copy link
Collaborator

@thiagocrepaldi thiagocrepaldi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on magic numbers is usually not reliable and might need tweaks from time to time.

Ideally, we should find out the root cause on why we get a corrupted protobuf when the try-catch doesn't raise

Is there a repro that motivated this change?

Summary
- faster than previous try-catch.
- more stable than previous try-catch. In some circumstances serializing models > 2GB into a single protobuf file ends up with a corrupted file without raising an exception.


[ghstack-poisoned]
BowenBao added a commit that referenced this pull request Oct 25, 2023
@BowenBao
Copy link
Collaborator Author

This is the most proper way. try-catch was a hack because back then I didn't realize we can access the accurate total bytesize. Back in export.cpp for torchscript exporter we cannot.

This issue was discovered by models from the torchbench. It may be repro-ed by exporting a large model such as llama2 or stable diffusion xl. However the issue only happened recently. Feel free to take a look if you have bandwidth.

@BowenBao
Copy link
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 25, 2023
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@facebook-github-bot facebook-github-bot deleted the gh/BowenBao/309/head branch October 29, 2023 14:23
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023
…1984)

Summary
- faster than previous try-catch.
- more stable than previous try-catch. In some circumstances serializing models > 2GB into a single protobuf file ends up with a corrupted file without raising an exception.

Pull Request resolved: pytorch#111984
Approved by: https://github.com/justinchuby
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
…1984)

Summary
- faster than previous try-catch.
- more stable than previous try-catch. In some circumstances serializing models > 2GB into a single protobuf file ends up with a corrupted file without raising an exception.

Pull Request resolved: pytorch#111984
Approved by: https://github.com/justinchuby
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged open source release notes: onnx torch.onnx related changes that should show up in the release notes topic: improvements topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants