[state_dict][11/N] Implement cpu_offload and full_state_dict for get_state_dict #112837
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112837
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit 3794436 with merge base 31ded95.
FLAKY - The following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
- ``fsdp_state_dict_type``: if the model contains FSDP sharded submodules,
  what FSDP state_dict type should be used.
  The default value is SHARDED_STATE_DICT.
- ``full_state_dict``: if this is set to True, all the tensors in the
@fegin n00b question - Assuming AsyncCollectiveTensors are still returned without waiting?
I'll update the dtensor gathering to use the full_tensor() API in a follow-up PR, which will always return synchronously so we can make sure that the value in the state_dict is correct.
Still catching up on the underlying implementations but LGTM!
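To make the `full_tensor()` point above concrete, here is a minimal sketch (not part of this PR) of the difference between the two gathering paths. It assumes an initialized process group with one GPU per rank and uses the `torch.distributed._tensor` API; the mesh and tensor shapes are illustrative only.

```python
# Sketch only: assumes torchrun has started one process per GPU and
# dist.init_process_group("nccl") has already been called.
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Replicate, Shard, distribute_tensor

mesh = DeviceMesh("cuda", list(range(dist.get_world_size())))
dt = distribute_tensor(torch.randn(16, 16), mesh, placements=[Shard(0)])

# redistribute(...).to_local() may hand back an AsyncCollectiveTensor whose
# all-gather has not been waited on yet.
maybe_async = dt.redistribute(mesh, placements=[Replicate()]).to_local()

# full_tensor() performs the same gather but waits on the collective, so the
# result is a plain torch.Tensor that is safe to copy into a state_dict.
full = dt.full_tensor()
```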
elif not cpu_offload:
    with SimpleProfiler.profile("clone"):
-       value = value.detach.clone()
+       value = value.detach().clone()
lol. How come our tests haven't caught this?
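For what it's worth, the typo fails loudly rather than silently, so the branch was presumably just never exercised; a quick illustration (not taken from the PR's test suite):

```python
import torch

t = torch.randn(4)

# `t.detach` without parentheses is the bound method itself, not a tensor,
# so chaining `.clone()` onto it raises AttributeError instead of silently
# producing a wrong value.
try:
    t.detach.clone()
except AttributeError as e:
    print(e)  # e.g. 'builtin_function_or_method' object has no attribute 'clone'

# Corrected form: detach first, then clone the detached tensor.
copy = t.detach().clone()
```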
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
[state_dict][11/N] Implement cpu_offload and full_state_dict for get_state_dict (pytorch#112837)
As title
Differential Revision: [D50962991](https://our.internmc.facebook.com/intern/diff/D50962991/)
Pull Request resolved: pytorch#112837
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: pytorch#112836, pytorch#112885
- ``cpu_offload``: offload all the tensors to cpu. To prevent CPU OOM, if
  ``full_state_dict`` is also true, then only the rank0 will get the
  state_dict and all other ranks will get empty state_dict.
Hi! This is a sensible default, but why wasn't rank0_only exposed in this options class in addition to cpu_offload? For setups with enough RAM, loading the full CPU model weights on all ranks could be desirable.
Thank you!
This is a subtle issue that not all users may be aware of, so we would like to avoid OOM issues as much as possible. And in many use cases, when users ask for a full_state_dict, only rank0 is going to save the state_dict. What's the use case for all the ranks to save the duplicated states?
For saving, I agree completely.
I was asking about loading. Doesn't this logic also apply during loading?
You might want to cpu-offload the loaded checkpoint, but still have all ranks load it into the model/optimizer.
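For context, here is a minimal usage sketch of the option combination the docstring above describes, assuming the `torch.distributed.checkpoint.state_dict` API from this stack; the FSDP-wrapped model, optimizer, and checkpoint path are placeholders for illustration only.

```python
# Sketch only: assumes torchrun launched one process per GPU and
# dist.init_process_group("nccl") has already been called; the tiny
# FSDP-wrapped Linear and the checkpoint path are placeholders.
import torch
import torch.distributed as dist
from torch.distributed.checkpoint.state_dict import StateDictOptions, get_state_dict
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(torch.nn.Linear(16, 16).cuda())
optim = torch.optim.Adam(model.parameters())

options = StateDictOptions(full_state_dict=True, cpu_offload=True)

# With both flags set, tensors are gathered and offloaded to CPU; to avoid
# CPU OOM only rank 0 receives populated dicts, all other ranks get empty ones.
model_sd, optim_sd = get_state_dict(model, optimizers=optim, options=options)

if dist.get_rank() == 0:
    torch.save({"model": model_sd, "optim": optim_sd}, "checkpoint.pt")
```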
Stack from ghstack (oldest at bottom):
As title
Differential Revision: D50962991