Stateful Checkpointing for Distributed [1/N] #113867

LucasLLC · 2023-11-16T14:35:25Z

First pass at adding a save/load API, as well as definition of Stateful objects.

Among a couple of TODOs, we still need to explore adding an all_gather and potentially a barrier while iterating through state keys.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @kiukchung @d4l3k @LucasLLC

pytorch-bot · 2023-11-16T14:35:29Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113867

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3c2dcb7 with merge base ec124b9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

fegin

LGTM, the failing test is real.

wz337

LGTM! Thanks!

wz337 · 2023-11-30T19:15:23Z

torch/distributed/checkpoint/examples/stateful_example.py

@@ -0,0 +1,104 @@
+# Owner(s): ["oncall: distributed"]


We can later move this to https://github.com/pytorch/examples or update this tutorial in https://github.com/pytorch/tutorials/blob/main/recipes_source/distributed_checkpoint_recipe.rst.

From what I learned, we should not keep examples in the PT repo, but we still have a couple for checkpoint and other distributed components.

wconstab · 2023-11-30T23:44:25Z

torch/distributed/checkpoint/examples/stateful_example.py

+        return torch.rand(8, 8, device="cuda")
+
+
+def _make_stateful(model, optim):


What does this do? Should users rely on it, or should they write their own state function on their model?

This question is a bit loaded; I'm still thinking about the best UX around this item in particular. In general, if users are defining objects with custom Stateful behavior, they should define state_dict and load_state_dict on those objects.
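As a rough illustration of that last point, the sketch below shows an object that defines state_dict and load_state_dict, plus a pair of helpers that call those methods while iterating through state keys. It is pure Python, with an in-memory dict standing in for storage. `Stateful` here is a local `typing.Protocol`, and `StepCounter`, `save_app_state`, and `load_app_state` are hypothetical names chosen for this example, not the DCP API added in this PR.

```python
import copy
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class Stateful(Protocol):
    """Local stand-in protocol: anything with state_dict / load_state_dict."""

    def state_dict(self) -> Dict[str, Any]: ...

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None: ...


class StepCounter:
    """Hypothetical user object with custom Stateful behavior."""

    def __init__(self) -> None:
        self.step = 0

    def state_dict(self) -> Dict[str, Any]:
        return {"step": self.step}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        self.step = state_dict["step"]


def save_app_state(app_state: Dict[str, Any], storage: Dict[str, Any]) -> None:
    """Copy each entry into storage, expanding Stateful objects via state_dict()."""
    for key, value in app_state.items():
        payload = value.state_dict() if isinstance(value, Stateful) else value
        storage[key] = copy.deepcopy(payload)


def load_app_state(app_state: Dict[str, Any], storage: Dict[str, Any]) -> None:
    """Restore entries in place; Stateful objects go through load_state_dict()."""
    for key, value in list(app_state.items()):
        if isinstance(value, Stateful):
            value.load_state_dict(copy.deepcopy(storage[key]))
        else:
            app_state[key] = copy.deepcopy(storage[key])
```

In this sketch, plain values (such as an epoch number) are stored as-is, while anything satisfying the protocol is asked for its own state, so callers never have to unpack objects by hand.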

For model/optim, which need to call get_state_dict and set_state_dict in state_dict and load_state_dict, the two options so far are:

Create a wrapper:
    class DOptim:
        def __init__(self, model, optim):
            self.model = model
            self.optim = optim
            ...

We're still evaluating whether it makes sense to include the wrapper as part of DCP, since it can be a little tricky and could lead to a less than ideal UX.

The other option, which I don't think is a bad one, is using the _patch methods as above. The patch methods are still in testing, but I think it's pretty reasonable.
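To make the wrapper option more concrete, here is a minimal sketch of a class that bundles a model and an optimizer behind a single state_dict / load_state_dict pair. It deliberately avoids importing torch: `ToyModel`, `ToyOptim`, `ModelOptimWrapper`, and the two `_stub` helpers are hypothetical stand-ins for a real module, a real optimizer, and the get_state_dict / set_state_dict calls mentioned above, whose actual signatures are not shown in this thread.

```python
from typing import Any, Dict, Tuple


class ToyModel:
    """Stand-in for a model; holds a small list of weights."""

    def __init__(self) -> None:
        self.weights = [0.0, 0.0]

    def state_dict(self) -> Dict[str, Any]:
        return {"weights": list(self.weights)}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        self.weights = list(state_dict["weights"])


class ToyOptim:
    """Stand-in for an optimizer; tracks a learning rate and a step count."""

    def __init__(self, lr: float) -> None:
        self.lr = lr
        self.steps = 0

    def state_dict(self) -> Dict[str, Any]:
        return {"lr": self.lr, "steps": self.steps}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        self.lr = state_dict["lr"]
        self.steps = state_dict["steps"]


def _get_state_dict_stub(model: ToyModel, optim: ToyOptim) -> Tuple[dict, dict]:
    # Assumption: stands in for a helper that returns both state dicts together.
    return model.state_dict(), optim.state_dict()


def _set_state_dict_stub(
    model: ToyModel, optim: ToyOptim, model_sd: dict, optim_sd: dict
) -> None:
    # Assumption: stands in for a helper that restores both objects together.
    model.load_state_dict(model_sd)
    optim.load_state_dict(optim_sd)


class ModelOptimWrapper:
    """Exposes a model/optimizer pair as one Stateful-style object."""

    def __init__(self, model: ToyModel, optim: ToyOptim) -> None:
        self.model = model
        self.optim = optim

    def state_dict(self) -> Dict[str, Any]:
        model_sd, optim_sd = _get_state_dict_stub(self.model, self.optim)
        return {"model": model_sd, "optim": optim_sd}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        _set_state_dict_stub(
            self.model, self.optim, state_dict["model"], state_dict["optim"]
        )
```

The wrapper keeps the paired get/set calls in one place, at the cost of asking users to hold an extra object. The patching approach would instead attach equivalent methods to the existing model and optimizer.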

LucasLLC · 2023-12-01T16:50:03Z

@pytorchbot merge

pytorchmergebot · 2023-12-01T16:52:54Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

First pass at adding a save/load API, as well as definition of Stateful objects.

Amongst a couple todo's, we still need to explore adding an `all_gather` & potentially a `barrier` while iterating through state keys.

Pull Request resolved: pytorch#113867
Approved by: https://github.com/fegin, https://github.com/wz337

LucasLLC added 23 commits November 7, 2023 10:49

Improves comparision of non loaded state dicts and state dicts for e2e test

b5d2004

Fixes formatting issues

3f65534

Merge branch 'main' into add_non_parallel_model_comparison

7dab909

Fixes merge errors

5715274

Merge branch 'main' into add_non_parallel_model_comparison

766a7c3

linter

3dd8d53

Fixes imports

5763ee4

fixes failing tests failing on mac builds

252d367

fix missing comma

2894c34

Merge branch 'main' into add_non_parallel_model_comparison

deb06fa

Adds missing import and moves common_state_dict into distributed

751031a

fix typo

031573a

Merge branch 'main' into add_non_parallel_model_comparison

7ecc4ee

fixes yet another merge issue

e7b7587

lintrunner

97016e2

removes unused param in verify msd

294e727

adds stateful protocol

359dd80

Merge branch 'main' into add_non_parallel_model_comparison

eca053e

Merge branch 'main' into add_non_parallel_model_comparison

ab2cbd1

first rough drafts of stateful loading/saving

13d06d3

Merge branch 'add_non_parallel_model_comparison' into dcp_stateful

dee659b

uses patching and adds a stateful test

e7aa8e4

Merge branch 'main' into dcp_stateful

85d7be5

LucasLLC self-assigned this Nov 16, 2023

LucasLLC added 5 commits November 16, 2023 06:39

Adds comment to evaluate AppState vs STATE_DICT_TYPE

56dcfb5

adds a rough example

51f873e

remove e2e test in favor of stateful e2e

e88dd5e

deprecates save/load_state_dict in favor of save/load

3e150fb

Deprecates loader, adds stateful type

f373592

LucasLLC added 2 commits November 22, 2023 07:45

corrects comments for state dict saver/loader, some cleanliness changes

7ead269

lintrunner

076fd09

github-actions bot added the module: distributed label Nov 22, 2023

LucasLLC changed the title ~~Stateful Checkpointing for Distributed~~ Stateful Checkpointing for Distributed [1/N] Nov 27, 2023

Merge branch 'main' into dcp_stateful

01edb96

fegin approved these changes Nov 27, 2023

LucasLLC added 7 commits November 27, 2023 11:45

updates docs for failing build doc test

a83a8d5

testing adding class module as well

b0a42f2

adds stateful to distributed docs

7a7a25f

adds comments

ad0ce9a

style fighting doc styling issues, changes model in e2e test for correctness

ebac1c5

formatting warning

28d8c0b

finally fixing the missing indent for sphinx generated docs

9aeb8a8

wz337 approved these changes Nov 30, 2023

wconstab reviewed Nov 30, 2023

Merge branch 'main' into dcp_stateful

3c2dcb7

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 1, 2023

pytorchmergebot added the merging label Dec 1, 2023

pytorchmergebot added Merged and removed merging labels Dec 1, 2023

pytorchmergebot closed this in f073dcd Dec 1, 2023

albanD added oncall: distributed Add this issue/PR to distributed oncall triage queue and removed module: distributed labels Dec 8, 2023

carmocca mentioned this pull request Feb 13, 2024

FSDP checkpointing uses deprecated APIs with PyTorch 2.2 Lightning-AI/pytorch-lightning#19462

Open

github-actions bot deleted the dcp_stateful branch February 19, 2024 02:00
