Fix param and buffer mapping for state_dict when there are state_dict hooks by yushangdi · Pull Request #137609 · pytorch/pytorch · GitHub

Conversation

@yushangdi
Contributor

@yushangdi yushangdi commented Oct 9, 2024

Resolve #137540

Summary:

We might get different state_dict and named_parameters() results when the module has registered custom state_dict hooks.
For the exported program's state_dict, we want the state_dict to reflect the actual module hierarchy at runtime, and it might be different from the model's state_dict() output if the model has state_dict hooks.
To do weight swapping, one needs to either re-export or turn off the hooks when saving the model's state_dict().
Previously, ExportedProgram used nn.Module's state_dict() method to populate its own state_dict, but this doesn't work for some models (e.g. llama3_3_vision) because ExportedProgram's state_dict and an nn.Module's state_dict have some subtle semantic differences.

nn.Module's state_dict is about how the state should be serialized, and it reflects the structure of the original user model code. In contrast, export specializes on a “run” of a model, and its state_dict needs to reflect the runtime module hierarchy.

One example where these two are different is TorchTune's Llama3_2_vision text decoder. Here, a FusionLayer is added as a local optimization and it is not part of the "static model definition". At runtime, we have mod.layers[3].layer.sa_norm.scale.

But in nn.Module's state_dict, the authors of the model added a state_dict hook to remove the "layer" in mod.state_dict() to reflect the static model definition, so we have mod.state_dict()["layers.3.sa_norm.scale"].
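
For illustration, here is a toy sketch (hypothetical Wrapper/Norm modules, not the actual TorchTune code) of how such a state_dict hook makes named_parameters() and state_dict() disagree on FQNs:

```
import torch
import torch.nn as nn


class Norm(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(4))


class Wrapper(nn.Module):
    # Toy stand-in for FusionLayer: wraps a submodule behind an extra "layer" attribute.
    def __init__(self):
        super().__init__()
        self.layer = Norm()
        # Hook that strips the extra "layer." segment so state_dict() matches the
        # static model definition rather than the runtime hierarchy.
        self._register_state_dict_hook(self._strip_layer)

    @staticmethod
    def _strip_layer(module, state_dict, prefix, local_metadata):
        for key in list(state_dict.keys()):
            if key.startswith(prefix + "layer."):
                state_dict[prefix + key[len(prefix + "layer."):]] = state_dict.pop(key)
        return state_dict


m = Wrapper()
print([name for name, _ in m.named_parameters()])  # ['layer.scale']  <- runtime hierarchy
print(list(m.state_dict().keys()))                 # ['scale']        <- hook-rewritten keys
```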
In this Diff, we change ExportedProgram to populate its state_dict using named_parameters() and named_buffers() instead. So in ExportedProgram's state_dict, we have "layers.3.layer.sa_norm.scale", which reflects the runtime module hierarchy.
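
A simplified sketch of gathering the state this way (gather_ep_state_dict is a hypothetical helper name, not the exact code in this Diff):

```
import torch


def gather_ep_state_dict(mod: torch.nn.Module) -> dict:
    # Build the state dict from named_parameters()/named_buffers() rather than
    # state_dict(), so registered state_dict hooks are bypassed and the FQNs
    # follow the runtime module hierarchy (e.g. "layers.3.layer.sa_norm.scale").
    state_dict = {}
    state_dict.update(dict(mod.named_parameters(remove_duplicate=False)))
    state_dict.update(dict(mod.named_buffers(remove_duplicate=False)))
    return state_dict
```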

Now one problem this presents is weight swapping. Since ExportedProgram's state and the model's state are not the same anymore, the weight-swapping procedure also needs to change slightly.

In internal Ads and RecSys model deployments, weight swapping is where they have one model that is currently deployed and serving traffic, and they want to swap out the weights with newly trained model weights without having to redo the whole exporting/lowering process and create a new artifact. So they would move the deployed model’s pointer to the state dict over to the new state dict. Because of this, it was previously a requirement that the FQNs match between the exported and the eager model’s state dict.

The new ExportedProgram's state dict still supports weight swapping, but the state_dict to be swapped needs to be obtained from torch.export.exported_program instead of model.state_dict() if the model has state_dict hooks.
The new requirement is that the FQNs match between the exported program’s state dict and the state_dict obtained under the _disabled_load_state_dict_hooks(M) context manager. One benefit of having this new API is that we are now in full control within export of gathering and updating the model state.
If a model doesn't have any state_dict hooks, one can still use model.state_dict() for weight swapping, so this is backward compatible.
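
A minimal sketch of the new weight-swapping flow (model and example_args are placeholders; the context manager name and import path follow this summary and may differ after the rename suggested in review):

```
import torch
from torch.export import export
from torch._export.utils import _disabled_load_state_dict_hooks

ep = export(model, example_args)  # example_args: a tuple of example inputs (placeholder)

# After retraining, gather a state dict whose FQNs match ep.state_dict()
# by disabling the model's state_dict hooks while reading it.
with _disabled_load_state_dict_hooks(model):
    new_weights = model.state_dict()

# Swap the exported program's state over to the new weights
# (the exact update mechanism depends on the serving/deployment stack).
ep.state_dict.update(new_weights)
```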

Test Plan:

buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_export_for_training_with_state_dict_hooks

Differential Revision: D64080561

@pytorch-bot

pytorch-bot bot commented Oct 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137609

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 0468fa1 with merge base 93bbc8a:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D64080561

Contributor

It looks like for the FQN we store the post-processed state_dict key, not necessarily the path where the attribute exists. From the discussion I'm not 100% sure what the semantics are, and whether this breaks unflattening. Do you know if export() + unflatten() works for the test case you added?

cc: @angelayi

Contributor Author

@yushangdi yushangdi Oct 9, 2024

Discussed in chat; summarizing the discussion here:

we can modify the verifier to match against the state_dict without the hooks

  • add a test for unflattening
  • add a context manager, so that within export the module’s state_dict hooks are removed. Then in the verifiers we are matching against the state_dict without hooks. Effectively, we ignore state_dict hooks in export.

One caveat is that, for any model with state_dict hooks, one won’t be able to interchange between exported_program.state_dict() and mod.state_dict().

exported_program.state_dict() still works with itself, but you can’t load a model’s state_dict into the exported_program, or vice versa.

Contributor

I think we could add a context manager around this call that removes/puts back the state dict hooks on mod upon enter/exit; that way we don't have to make the nn.Module changes.
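
Something like this rough sketch, for reference (not the code that eventually landed; it only touches the top-level module's private _state_dict_hooks for brevity):

```
import contextlib

import torch


@contextlib.contextmanager
def _remove_state_dict_hooks(mod: torch.nn.Module):
    # Detach the module's registered state_dict hooks on enter and put them
    # back on exit, so state_dict() reflects the raw runtime hierarchy inside
    # the with-block.
    saved = dict(mod._state_dict_hooks)
    mod._state_dict_hooks.clear()
    try:
        yield mod
    finally:
        mod._state_dict_hooks.update(saved)
```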

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D64080561

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D64080561

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D64080561

Contributor

Hmm what's the intended usage for this function?

Contributor Author

Hmm what's the intended usage for this function?

This is for weight swapping.

At time T1: you're running export, getting an exported program.
At time T2: some serving service serves an artifact from the exported program.
At time T3: a recurring training job has just finished and updates the stored model state.
At time T4: the serving service picks up the same compiled artifact with the new state that was just updated.

This is used to store the new model state at time T3.

something like:

ep = export(model)
d = exported_program_state_dict(model)
# update ep's state_dict with d.

Contributor

@pianpwk pianpwk Oct 10, 2024

Ah, what I had in mind with this was to wrap it around some broad chunk of code in export (maybe _export_func), so that anyone working on export who doesn't know about this issue can just call state_dict(). That way we also don't have to do the manual construction from named_parameters + named_buffers. But I'm happy with the chunk of code in _trace.py.

Contributor

Maybe this test case could match what we were seeing before, with more modules? Like if a user is trying to remove a layer from the state dict.

Contributor Author

Maybe this test case could match what we were seeing before, with more modules? Like if a user is trying to remove a layer from the state dict.

fixed now.

Contributor

let's keep this private for now, and not in this file... maybe in utils?

Contributor Author

let's keep this private for now, and not in this file... maybe in utils?

moved to utils now.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D64080561

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D64080561

@yushangdi yushangdi requested review from angelayi and pianpwk October 10, 2024 16:07
Contributor

@angelayi angelayi left a comment

thanks for pushing this through!

Contributor

nit:

Suggested change
- def _disabled_load_state_dict_hooks(mod: torch.nn.Module):
+ def _disable_load_state_dict_hooks(mod: torch.nn.Module):

Comment on lines 923 to 929
Contributor

nit: I don't think we need this function; we can just tell people to directly use the disable hook?

Contributor Author

sure, removed now.

@pytorch-bot pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Oct 10, 2024
yushangdi added a commit to yushangdi/pytorch that referenced this pull request Oct 10, 2024
… hooks (pytorch#137609)

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D64080561

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D64080561

@facebook-github-bot
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Development

Successfully merging this pull request may close these issues.

export_for_training regression on Llama3_2_vision text decoder
