[PT2D] Ensure the trace rules are correct with distributed by fegin · Pull Request #125333 · pytorch/pytorch · GitHub

Conversation

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented May 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125333

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 476470c with merge base 746da87:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@fegin
Copy link
Contributor Author

fegin commented May 2, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
…25333)

Summary:
1. Avoid using `torch._dynamo.disable`.
2. Clear the LRU cache of the trace rules. This is a no-op if the rules have not been evaluated before process group (PG) initialization.

Pull Request resolved: pytorch#125333
Approved by: https://github.com/yanboliang
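The cache-clearing idea in the summary can be sketched as follows. This is a hypothetical illustration, not the actual `torch._dynamo.trace_rules` code: `lookup_trace_rule` and `on_process_group_init` are made-up names standing in for a memoized rule lookup and the PG-initialization hook.

```python
from functools import lru_cache

# Hypothetical sketch: trace-rule lookups are memoized, so any cached
# results must be discarded once process-group initialization changes
# which rules apply.
@lru_cache(maxsize=None)
def lookup_trace_rule(name: str) -> str:
    # Placeholder rule table; the real rules live in torch._dynamo.trace_rules.
    rules = {"torch.distributed": "skip"}
    return rules.get(name, "inline")

def on_process_group_init() -> None:
    # If no rule was ever looked up before this point, clearing is a no-op,
    # matching the behavior described in the summary.
    lookup_trace_rule.cache_clear()
```

Calling `on_process_group_init()` after lookups have happened discards the stale cached answers, so subsequent lookups re-evaluate the rules.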
pytorchmergebot pushed a commit that referenced this pull request May 7, 2024
Summary:
1. Remove `gc.collect()`, which is not necessary.
2. Use `lru_cache` to cache `_get_fqns`.

Pull Request resolved: #125501
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #125333
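The `lru_cache` pattern mentioned above can be sketched like this. This is a simplified stand-in, not the real `_get_fqns` from `torch.distributed.checkpoint.state_dict`: the body here just joins names, whereas the real helper resolves module wrappers.

```python
from functools import lru_cache

# Sketch of memoizing an FQN (fully qualified name) lookup.  All arguments
# must be hashable for lru_cache to work, and the returned value should be
# immutable (frozenset) so cached results cannot be mutated by callers.
@lru_cache(maxsize=None)
def get_fqns(prefix: str, name: str) -> frozenset:
    # Stand-in logic; the real _get_fqns strips wrapper prefixes such as
    # "_fsdp_wrapped_module." before returning the canonical names.
    return frozenset({f"{prefix}.{name}" if prefix else name})
```

Repeated calls with the same `(prefix, name)` pair hit the cache instead of recomputing, which is the speedup the commit is after.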
pytorchmergebot pushed a commit that referenced this pull request May 7, 2024
Summary:
If an object exists only on certain non-coordinator ranks, we still need to save it. Otherwise, we lose these objects. If they are duplicated, DCP will deduplicate them.

Pull Request resolved: #125334
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #125333, #125501
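The save-then-deduplicate rule described above can be sketched as a simple merge. This is an illustrative toy, not DCP's actual planner logic; `merge_rank_objects` is a made-up name.

```python
# Sketch: collect objects contributed by every rank, keeping objects that
# exist only on some (possibly non-coordinator) ranks, and collapsing
# duplicates that appear on multiple ranks.
def merge_rank_objects(per_rank_objects):
    merged = {}
    for rank_objs in per_rank_objects:
        for key, value in rank_objs.items():
            # First writer wins; identical duplicates collapse to one entry.
            merged.setdefault(key, value)
    return merged
```

A key present only on rank 1 still ends up in the merged result, which is the behavior the fix guarantees.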
pytorchmergebot pushed a commit that referenced this pull request May 7, 2024
Summary:
Right now DCP only flattens a mapping (e.g., a dict) if that mapping contains tensor objects. This behavior is odd, as users may save different non-tensor objects on different ranks. Without flattening the mappings, we may lose these non-tensor objects. One use case is the dataloader state_dict.

We may also want to do this for a list/tuple, but that would cause extra pickling, so we don't do it for now.

Pull Request resolved: #125335
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #125333, #125501, #125334
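Unconditional flattening of mappings, as the summary describes, can be sketched like this. This is a minimal illustration with a made-up name (`flatten_mapping`), not DCP's actual `_traverse`/flattening code.

```python
# Sketch: flatten every nested dict into dotted keys, regardless of whether
# the leaves are tensors, so non-tensor leaves (e.g., dataloader state) are
# preserved instead of silently dropped.
def flatten_mapping(obj, prefix=""):
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten_mapping(value, path))
    else:
        # Lists/tuples are kept whole to avoid the extra pickling the
        # summary mentions.
        flat[prefix] = obj
    return flat
```

With this rule, a rank-local `{"dataloader": {"epoch": 3}}` survives as the flat key `"dataloader.epoch"` even though no tensor is involved.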
pytorchmergebot pushed a commit that referenced this pull request May 7, 2024
Summary:
distributed_state_dict should not use `getattr` to fetch `_extra_state`, as `_extra_state` is not a well-defined attribute.

Pull Request resolved: #125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: #125333, #125501, #125334, #125335
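The problem with `getattr` here can be shown with a toy class. `_extra_state` is a state_dict *key*, not a stored attribute; `torch.nn.Module` produces it via the `get_extra_state()`/`set_extra_state()` hooks. `ToyModule` below is a hypothetical stand-in that mimics that contract without importing torch.

```python
# Sketch of why getattr(module, "_extra_state") is ill-defined: the value is
# computed on demand by a hook, so no attribute of that name ever exists.
class ToyModule:
    def get_extra_state(self):
        # Computed at state_dict time; never stored on the instance.
        return {"version": 1}

m = ToyModule()
# getattr/hasattr fail, while the hook works as intended:
assert not hasattr(m, "_extra_state")
assert m.get_extra_state() == {"version": 1}
```

This is why the fix routes through the hook instead of attribute access.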
pytorchmergebot pushed a commit that referenced this pull request May 7, 2024
…125337)

Summary:
Fixes #122792

state_dict() includes only persistent buffers, while named_buffers() also returns non-persistent buffers.

Pull Request resolved: #125337
Approved by: https://github.com/awgu
ghstack dependencies: #125333, #125501, #125334, #125335, #125336
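The fix amounts to filtering `named_buffers()` down to the persistent subset before building the distributed state dict. The sketch below avoids importing torch; `persistent_buffers` is a made-up helper, and the name set mirrors the private `_non_persistent_buffers_set` that `torch.nn.Module` keeps internally.

```python
# Sketch: keep only buffers that state_dict() would also report, dropping
# names registered with register_buffer(..., persistent=False).
def persistent_buffers(named_buffers, non_persistent_names):
    return {
        name: buf
        for name, buf in named_buffers.items()
        if name not in non_persistent_names
    }
```

For example, a BatchNorm-style `running_mean` survives the filter while a `persistent=False` scratch buffer is dropped, matching state_dict() semantics.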
mvpatel2000 pushed a commit to mvpatel2000/pytorch that referenced this pull request May 17, 2024
Summary:
distributed_state_dict should not use `getattr` to fetch `_extra_state`, as `_extra_state` is not a well-defined attribute.

Pull Request resolved: pytorch#125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: pytorch#125333, pytorch#125501, pytorch#125334, pytorch#125335
atalman added a commit that referenced this pull request May 27, 2024
* [DSD] Correctly handle _extra_state (#125336)

Summary:
distributed_state_dict should not use `getattr` to fetch `_extra_state`, as `_extra_state` is not a well-defined attribute.

Pull Request resolved: #125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: #125333, #125501, #125334, #125335

* lint

* lint

---------

Co-authored-by: Chien-Chin Huang <chienchin@fb.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
antoinebrl pushed a commit to antoinebrl/pytorch that referenced this pull request May 27, 2024
…ytorch#125337)

Summary:
Fixes pytorch#122792

state_dict() includes only persistent buffers, while named_buffers() also returns non-persistent buffers.

Pull Request resolved: pytorch#125337
Approved by: https://github.com/awgu
ghstack dependencies: pytorch#125333, pytorch#125501, pytorch#125334, pytorch#125335, pytorch#125336
huydhn pushed a commit that referenced this pull request May 27, 2024
…125337) (#127219)

* [DSD] Fix to remove non_persistent buffer in distributed state dict (#125337)

Summary:
Fixes #122792

state_dict() includes only persistent buffers, while named_buffers() also returns non-persistent buffers.

Pull Request resolved: #125337
Approved by: https://github.com/awgu
ghstack dependencies: #125333, #125501, #125334, #125335, #125336

* lintrunner

* lint

---------

Co-authored-by: Chien-Chin Huang <chienchin@fb.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
@github-actions github-actions bot deleted the gh/fegin/230/head branch June 5, 2024 01:53

Labels

ciflow/inductor, ciflow/trunk, Merged, module: dynamo, oncall: distributed, release notes: distributed (c10d)
