KEMBAR78
[state_dict] Add cpu_only and ranks_only support for _gather_state_dict by fegin · Pull Request #112836 · pytorch/pytorch · GitHub
Skip to content

Conversation

fegin
Copy link
Contributor

@fegin fegin commented Nov 3, 2023

Add cpu_only and ranks_only support for _gather_state_dict

Differential Revision: [D50962980](https://our.internmc.facebook.com/intern/diff/D50962980/)

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented Nov 3, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112836

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 9752f60 with merge base 31ded95 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

…er_state_dict"

Add cpu_only and ranks_only support for _gather_state_dict

Differential Revision: [D50962980](https://our.internmc.facebook.com/intern/diff/D50962980/)

[ghstack-poisoned]
…er_state_dict"

Add cpu_only and ranks_only support for _gather_state_dict

Differential Revision: [D50962980](https://our.internmc.facebook.com/intern/diff/D50962980/)

[ghstack-poisoned]
@fegin fegin added ciflow/trunk Trigger trunk jobs on your pull request ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR labels Nov 3, 2023
fegin added 4 commits November 7, 2023 10:41
…er_state_dict"

Add cpu_only and ranks_only support for _gather_state_dict

Differential Revision: [D50962980](https://our.internmc.facebook.com/intern/diff/D50962980/)

[ghstack-poisoned]
…er_state_dict"

Add cpu_only and ranks_only support for _gather_state_dict

Differential Revision: [D50962980](https://our.internmc.facebook.com/intern/diff/D50962980/)

[ghstack-poisoned]
…er_state_dict"

Add cpu_only and ranks_only support for _gather_state_dict

Differential Revision: [D50962980](https://our.internmc.facebook.com/intern/diff/D50962980/)

[ghstack-poisoned]
…er_state_dict"

Add cpu_only and ranks_only support for _gather_state_dict

Differential Revision: [D50962980](https://our.internmc.facebook.com/intern/diff/D50962980/)

[ghstack-poisoned]
@fegin fegin marked this pull request as ready for review November 8, 2023 01:31
@fegin fegin requested a review from LucasLLC as a code owner November 8, 2023 01:31
…er_state_dict"

Add cpu_only and ranks_only support for _gather_state_dict

Differential Revision: [D50962980](https://our.internmc.facebook.com/intern/diff/D50962980/)

[ghstack-poisoned]
Args:
state_dict (Dict[str, Any]): the target sharded state_dict.
pg (Optional[dist.ProcessGroup]): the process group that is used to
gather ShardedTensor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: gather DTensor or ShardedTensor.

pg (Optional[dist.ProcessGroup]): the process group that is used to
gather ShardedTensor.
device: (Optional[torch.device]): the device that is used to
perform allgather for ShardedTensor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: gather DTensor or ShardedTensor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ShardedTensor only. DTensor uses DeviceMesh to gather the tensors.

value = value.to(cpu_device)

new_state_dict[key] = value
if not cpu_offload or len(ranks_only) == 0 or dist.get_rank(pg) in ranks_only:
Copy link
Contributor

@wz337 wz337 Nov 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the condition is a bit complicated and I can't think thru lol, but is dist.get_rank(pg) in ranks_only alone not sufficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ranks_only can be empty (the default value). But it seems that we should not restrict this function to be cpu_offload only.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Makes sense!

…er_state_dict"

Add cpu_only and ranks_only support for _gather_state_dict

Differential Revision: [D50962980](https://our.internmc.facebook.com/intern/diff/D50962980/)

[ghstack-poisoned]
Copy link
Contributor

@LucasLLC LucasLLC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

…er_state_dict"

Add cpu_only and ranks_only support for _gather_state_dict

Differential Revision: [D50962980](https://our.internmc.facebook.com/intern/diff/D50962980/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot pushed a commit that referenced this pull request Nov 10, 2023
#112885)

This allows us to do cpu_offload with the same traversal logic

Differential Revision: [D50982355](https://our.internmc.facebook.com/intern/diff/D50982355/)
Pull Request resolved: #112885
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #112836
pytorchmergebot pushed a commit that referenced this pull request Nov 13, 2023
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
…ct (pytorch#112836)

Add cpu_only and ranks_only support for _gather_state_dict

Differential Revision: [D50962980](https://our.internmc.facebook.com/intern/diff/D50962980/)
Pull Request resolved: pytorch#112836
Approved by: https://github.com/LucasLLC, https://github.com/wz337
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
pytorch#112885)

This allows us to do cpu_offload with the same traversal logic

Differential Revision: [D50982355](https://our.internmc.facebook.com/intern/diff/D50982355/)
Pull Request resolved: pytorch#112885
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: pytorch#112836
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Nov 14, 2023
@facebook-github-bot facebook-github-bot deleted the gh/fegin/176/head branch November 14, 2023 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request Merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants