Hacky support for meta tensor serialization. by ezyang · Pull Request #62192 · pytorch/pytorch · GitHub

Conversation

ezyang (Contributor) commented Jul 26, 2021

Stack from ghstack:

This support is hacky because it doesn't preserve meta tensor storage
sharing: if you serialize a model containing tensors that share storage
(e.g., a tensor and a view on that tensor), the viewing relationship is
broken on deserialization and you get back independent tensors. The
hack is also durable, in the sense that we will be on the hook for
supporting _rebuild_meta_tensor_no_storage in perpetuity, even if we
later change our mind about the serialization format.

This unblocks an FB production use case. I didn't add C++ support to minimize
blast area of this patch.

Signed-off-by: Edward Z. Yang ezyang@fb.com

Differential Revision: D29910535
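In practice this enables a round trip like the following, while the storage-sharing caveat remains. This is a minimal illustrative sketch, not the PR's own test code; the shapes and the explicit `weights_only=False` are assumptions made here for the example:

```python
import io
import torch

# A "meta" tensor carries shape/dtype/stride metadata but no storage.
x = torch.empty(2, 3, device="meta")

buf = io.BytesIO()
torch.save(x, buf)
buf.seek(0)
y = torch.load(buf, weights_only=False)

assert y.device.type == "meta"
assert y.size() == x.size()

# Caveat from the description: storage sharing is NOT preserved.
base = torch.empty(4, device="meta")
view = base[:2]  # a view over base

buf = io.BytesIO()
torch.save((base, view), buf)
buf.seek(0)
b2, v2 = torch.load(buf, weights_only=False)
# b2 and v2 come back as two independent meta tensors; the viewing
# relationship between them is lost.
```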

facebook-github-bot (Contributor) commented Jul 26, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 8c1a1a1 (more details on the Dr. CI page):



❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_build (1/1), Step: "Build" ❄️

fatal: the remote end hung up unexpectedly
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 4 (delta 3), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (4/4), 985 bytes | 492.00 KiB/s, done.
From ssh://github.com/pytorch/cpuinfo
 * branch            5916273f79a21551890fd3d56fc5375a78d1598d -> FETCH_HEAD
remote: Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
Connection to github.com closed by remote host.

fatal: the remote end hung up unexpectedly
Fetched in submodule path 'third_party/cub', but it did not contain d106ddb991a56c3df1b6d51b2409e36ba8181ce4. Direct fetching of that commit failed.

Exited with code exit status 1


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

ezyang added a commit that referenced this pull request Jul 26, 2021
ghstack-source-id: 9ff1bc9
Pull Request resolved: #62192

ezyang commented Jul 26, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang ezyang requested review from albanD and zou3519 July 26, 2021 15:25
albanD (Collaborator) left a comment

LGTM
Maybe a little bit more testing for more general serialization?

f.seek(0)
state = torch.load(f)

self.assertEqual(state.weight.size(), big_model.weight.size())
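Expanded into a standalone sketch of the kind of test being suggested. The fixtures `big_model` and `f` are reconstructed here as assumptions (a small meta-device `nn.Linear`), and `weights_only=False` is passed explicitly since a pickled module is not a plain tensor archive:

```python
import io
import torch

# Assumed stand-ins for the test fixtures referenced in the snippet above.
big_model = torch.nn.Linear(8, 8, device="meta")
f = io.BytesIO()
torch.save(big_model, f)

f.seek(0)
state = torch.load(f, weights_only=False)

assert state.weight.size() == big_model.weight.size()
# Per the follow-up review question, also check the device round-trips.
assert state.weight.device.type == "meta"
```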
A collaborator asked in an inline comment: Do you want to check the device?

ezyang (Contributor, Author) replied:

I think the tensor is plenty big enough :) I would certainly like it if we had more structured serialization tests for different device types; maybe a more involved refactor here is in order.
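A device-parametrized serialization check of the sort alluded to here might look like the following. This is a sketch, not PyTorch's actual test infrastructure; the function name and device list are assumptions, and only devices needing no special hardware are exercised:

```python
import io
import torch

def check_roundtrip(device):
    # Serialize and deserialize a tensor on `device`, checking that
    # shape and device type survive the round trip.
    x = torch.empty(3, 5, device=device)
    f = io.BytesIO()
    torch.save(x, f)
    f.seek(0)
    y = torch.load(f, weights_only=False)
    assert y.size() == x.size()
    assert y.device.type == device
    return y

# Exercise the devices available everywhere; "cuda" etc. could be
# appended where the hardware exists.
for device in ["cpu", "meta"]:
    check_roundtrip(device)
```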

facebook-github-bot commented:

@ezyang merged this pull request in cf1f594.
