[DCP] Always create requests for non-tensor objects #125334
Conversation
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125334.

❌ 2 New Failures, 1 Unrelated Failure as of commit 9cd89e4 with merge base 746da87. The unrelated failure was likely due to flakiness present on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
LGTM!
lgtm
@pytorchbot merge -f "The failing tests are not related."
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary: Currently, DCP only flattens a mapping (e.g., a dict) if the mapping contains tensor objects. This behavior is problematic because users may save different non-tensor objects on different ranks; without flattening the mappings, we may lose these non-tensor objects. One use case is the dataloader state_dict. We may also want to do the same for lists/tuples, but that would cause extra pickling, so we do not do it for now. Pull Request resolved: #125335 Approved by: https://github.com/LucasLLC, https://github.com/wz337 ghstack dependencies: #125333, #125501, #125334
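The flattening described above can be sketched in plain Python (the helper name and dotted-key scheme are illustrative assumptions, not DCP's actual implementation): nested mappings are recursively expanded into individually addressable keys, so non-tensor leaves each get their own save entry.

```python
def flatten_mapping(obj, prefix="", out=None):
    # Recursively flatten nested dicts into dotted keys so every leaf,
    # tensor or not, becomes its own addressable entry in the save plan.
    if out is None:
        out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_prefix = f"{prefix}.{key}" if prefix else str(key)
            flatten_mapping(value, new_prefix, out)
    else:
        out[prefix] = obj
    return out

# A dataloader state_dict made of plain Python values still produces
# one entry per leaf after flattening.
state = {"dataloader": {"epoch": 3, "sampler_seed": 42}, "step": 100}
flat = flatten_mapping(state)
# flat == {"dataloader.epoch": 3, "dataloader.sampler_seed": 42, "step": 100}
```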
[DSD] Correctly handle _extra_state (#125336)

Summary: distributed_state_dict should not try to use `getattr` to retrieve `_extra_state`, as this is not well-defined.

Pull Request resolved: #125336 Approved by: https://github.com/LucasLLC ghstack dependencies: #125333, #125501, #125334, #125335

Co-authored-by: Chien-Chin Huang <chienchin@fb.com> Co-authored-by: Andrey Talman <atalman@fb.com>
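To illustrate why `getattr` is the wrong tool here: in `nn.Module`, `_extra_state` is a reserved state_dict key produced by the `get_extra_state()`/`set_extra_state()` hooks, not a real attribute. A simplified, torch-free sketch of the convention (class and helper names are hypothetical):

```python
class ConvWithExtraState:
    # Hypothetical module: its extra state exists only via the
    # get_extra_state() hook; there is no _extra_state attribute.
    def get_extra_state(self):
        return {"quant_scale": 0.5}

def read_extra_state(module):
    # Prefer the hook over getattr(module, "_extra_state"), which is
    # not well-defined and may simply raise AttributeError.
    hook = getattr(module, "get_extra_state", None)
    return hook() if callable(hook) else None

m = ConvWithExtraState()
assert not hasattr(m, "_extra_state")   # getattr would fail here
assert read_extra_state(m) == {"quant_scale": 0.5}
```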
[DSD] Fix to remove non-persistent buffers in distributed state dict (#125337)

Summary: Fixes #122792. state_dict() includes only persistent buffers, while named_buffers() also includes non-persistent buffers.

Pull Request resolved: #125337 Approved by: https://github.com/awgu ghstack dependencies: #125333, #125501, #125334, #125335, #125336

Co-authored-by: Chien-Chin Huang <chienchin@fb.com> Co-authored-by: Andrey Talman <atalman@fb.com>
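The persistent/non-persistent distinction above can be seen directly with `register_buffer(..., persistent=False)` (the module below is a minimal example, not code from the PR):

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        # Persistent buffer: appears in state_dict() and named_buffers().
        self.register_buffer("running_mean", torch.zeros(3))
        # Non-persistent buffer: appears in named_buffers() only.
        self.register_buffer("cache", torch.zeros(3), persistent=False)

m = M()
assert "running_mean" in m.state_dict()
assert "cache" not in m.state_dict()
assert {name for name, _ in m.named_buffers()} == {"running_mean", "cache"}
```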
Stack from ghstack (oldest at bottom):
Summary:
If an object only exists on certain non-coordinator ranks, we still need to save it; otherwise, these objects are lost. If objects are duplicated across ranks, DCP will deduplicate them.
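The save-and-deduplicate behavior can be sketched as a toy planning step (the helper name and rank-selection rule are assumptions for illustration; DCP's real planner is more involved): every rank contributes a write request for each of its non-tensor objects, and when the same key appears on multiple ranks only one rank is chosen to write it.

```python
def plan_object_saves(ranks_to_objects):
    # Map each object key to the single rank that will write it.
    # Hypothetical rule: the lowest rank holding a key wins.
    writes = {}
    for rank in sorted(ranks_to_objects):
        for key in ranks_to_objects[rank]:
            writes.setdefault(key, rank)
    return writes

# "lr" exists on both ranks and is deduplicated to rank 0; "dl_state"
# exists only on non-coordinator rank 1 and is still saved.
plan = plan_object_saves({0: ["lr"], 1: ["lr", "dl_state"]})
# plan == {"lr": 0, "dl_state": 1}
```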
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC