[PT2D] Ensure the trace rules are correct with distributed by fegin · Pull Request #125333 · pytorch/pytorch · GitHub

Conversation

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented May 1, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125333

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 476470c with merge base 746da87:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@fegin
Copy link
Contributor Author

fegin commented May 2, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
…25333)

Summary:
1. Avoid using `torch._dynamo.disable`.
2. Clear the LRU cache of the trace rules. This is a no-op if the rules have not been evaluated before process group (PG) initialization.

Pull Request resolved: pytorch#125333
Approved by: https://github.com/yanboliang
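The cache-clearing idea in the summary can be sketched as follows. This is a hypothetical illustration, not the actual `torch._dynamo.trace_rules` code: `lookup_trace_rule` and `on_process_group_init` are made-up names standing in for a memoized rule lookup and the PG-initialization hook.

```python
from functools import lru_cache

# Hypothetical sketch: trace-rule lookups are memoized, so any cached
# results must be discarded once process-group initialization changes
# which rules apply.
@lru_cache(maxsize=None)
def lookup_trace_rule(name: str) -> str:
    # Placeholder rule table; the real rules live in torch._dynamo.trace_rules.
    rules = {"torch.distributed": "skip"}
    return rules.get(name, "inline")

def on_process_group_init() -> None:
    # If no rule was ever looked up before this point, clearing is a no-op,
    # matching the behavior described in the summary.
    lookup_trace_rule.cache_clear()
```

Calling `on_process_group_init()` after lookups have happened discards the stale cached answers, so subsequent lookups re-evaluate the rules.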
pytorchmergebot pushed a commit that referenced this pull request May 7, 2024
Summary:
1. Remove `gc.collect()`, which is not necessary.
2. Use `lru_cache` to cache `_get_fqns`.

Pull Request resolved: #125501
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #125333
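The `lru_cache` pattern mentioned above can be sketched like this. This is a simplified stand-in, not the real `_get_fqns` from `torch.distributed.checkpoint.state_dict`: the body here just joins names, whereas the real helper resolves module wrappers.

```python
from functools import lru_cache

# Sketch of memoizing an FQN (fully qualified name) lookup.  All arguments
# must be hashable for lru_cache to work, and the returned value should be
# immutable (frozenset) so cached results cannot be mutated by callers.
@lru_cache(maxsize=None)
def get_fqns(prefix: str, name: str) -> frozenset:
    # Stand-in logic; the real _get_fqns strips wrapper prefixes such as
    # "_fsdp_wrapped_module." before returning the canonical names.
    return frozenset({f"{prefix}.{name}" if prefix else name})
```

Repeated calls with the same `(prefix, name)` pair hit the cache instead of recomputing, which is the speedup the commit is after.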
pytorchmergebot pushed a commit that referenced this pull request May 7, 2024
Summary:
If an object exists only on certain non-coordinator ranks, we still need to save it. Otherwise, we lose these objects. If they are duplicated, DCP will deduplicate them.

Pull Request resolved: #125334
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #125333, #125501
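The save-then-deduplicate rule described above can be sketched as a simple merge. This is an illustrative toy, not DCP's actual planner logic; `merge_rank_objects` is a made-up name.

```python
# Sketch: collect objects contributed by every rank, keeping objects that
# exist only on some (possibly non-coordinator) ranks, and collapsing
# duplicates that appear on multiple ranks.
def merge_rank_objects(per_rank_objects):
    merged = {}
    for rank_objs in per_rank_objects:
        for key, value in rank_objs.items():
            # First writer wins; identical duplicates collapse to one entry.
            merged.setdefault(key, value)
    return merged
```

A key present only on rank 1 still ends up in the merged result, which is the behavior the fix guarantees.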
pytorchmergebot pushed a commit that referenced this pull request May 7, 2024
Summary:
Right now DCP only flattens a mapping (e.g., a dict) if that mapping contains tensor objects. This behavior is odd, as users may save different non-tensor objects on different ranks. Without flattening the mappings, we may lose these non-tensor objects. One use case is the dataloader state_dict.

We may also want to do this for a list/tuple, but that would cause extra pickling, so we don't do it for now.

Pull Request resolved: #125335
Approved by: https://github.com/LucasLLC, https://github.com/wz337
ghstack dependencies: #125333, #125501, #125334
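Unconditional flattening of mappings, as the summary describes, can be sketched like this. This is a minimal illustration with a made-up name (`flatten_mapping`), not DCP's actual `_traverse`/flattening code.

```python
# Sketch: flatten every nested dict into dotted keys, regardless of whether
# the leaves are tensors, so non-tensor leaves (e.g., dataloader state) are
# preserved instead of silently dropped.
def flatten_mapping(obj, prefix=""):
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten_mapping(value, path))
    else:
        # Lists/tuples are kept whole to avoid the extra pickling the
        # summary mentions.
        flat[prefix] = obj
    return flat
```

With this rule, a rank-local `{"dataloader": {"epoch": 3}}` survives as the flat key `"dataloader.epoch"` even though no tensor is involved.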
pytorchmergebot pushed a commit that referenced this pull request May 7, 2024
Summary:
distributed_state_dict should not use `getattr` to fetch `_extra_state`, as `_extra_state` is not a well-defined attribute.

Pull Request resolved: #125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: #125333, #125501, #125334, #125335
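The problem with `getattr` here can be shown with a toy class. `_extra_state` is a state_dict *key*, not a stored attribute; `torch.nn.Module` produces it via the `get_extra_state()`/`set_extra_state()` hooks. `ToyModule` below is a hypothetical stand-in that mimics that contract without importing torch.

```python
# Sketch of why getattr(module, "_extra_state") is ill-defined: the value is
# computed on demand by a hook, so no attribute of that name ever exists.
class ToyModule:
    def get_extra_state(self):
        # Computed at state_dict time; never stored on the instance.
        return {"version": 1}

m = ToyModule()
# getattr/hasattr fail, while the hook works as intended:
assert not hasattr(m, "_extra_state")
assert m.get_extra_state() == {"version": 1}
```

This is why the fix routes through the hook instead of attribute access.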
pytorchmergebot pushed a commit that referenced this pull request May 7, 2024
…125337)

Summary:
Fixes #122792

state_dict() includes only persistent buffers, while named_buffers() also returns non-persistent buffers.

Pull Request resolved: #125337
Approved by: https://github.com/awgu
ghstack dependencies: #125333, #125501, #125334, #125335, #125336
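The fix amounts to filtering `named_buffers()` down to the persistent subset before building the distributed state dict. The sketch below avoids importing torch; `persistent_buffers` is a made-up helper, and the name set mirrors the private `_non_persistent_buffers_set` that `torch.nn.Module` keeps internally.

```python
# Sketch: keep only buffers that state_dict() would also report, dropping
# names registered with register_buffer(..., persistent=False).
def persistent_buffers(named_buffers, non_persistent_names):
    return {
        name: buf
        for name, buf in named_buffers.items()
        if name not in non_persistent_names
    }
```

For example, a BatchNorm-style `running_mean` survives the filter while a `persistent=False` scratch buffer is dropped, matching state_dict() semantics.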
mvpatel2000 pushed a commit to mvpatel2000/pytorch that referenced this pull request May 17, 2024
Summary:
distributed_state_dict should not use `getattr` to fetch `_extra_state`, as `_extra_state` is not a well-defined attribute.

Pull Request resolved: pytorch#125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: pytorch#125333, pytorch#125501, pytorch#125334, pytorch#125335
atalman added a commit that referenced this pull request May 27, 2024
* [DSD] Correctly handle _extra_state (#125336)

Summary:
distributed_state_dict should not use `getattr` to fetch `_extra_state`, as `_extra_state` is not a well-defined attribute.

Pull Request resolved: #125336
Approved by: https://github.com/LucasLLC
ghstack dependencies: #125333, #125501, #125334, #125335

* lint

* lint

---------

Co-authored-by: Chien-Chin Huang <chienchin@fb.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
antoinebrl pushed a commit to antoinebrl/pytorch that referenced this pull request May 27, 2024
…ytorch#125337)

Summary:
Fixes pytorch#122792

state_dict() includes only persistent buffers, while named_buffers() also returns non-persistent buffers.

Pull Request resolved: pytorch#125337
Approved by: https://github.com/awgu
ghstack dependencies: pytorch#125333, pytorch#125501, pytorch#125334, pytorch#125335, pytorch#125336
huydhn pushed a commit that referenced this pull request May 27, 2024
…125337) (#127219)

* [DSD] Fix to remove non_persistent buffer in distributed state dict (#125337)

Summary:
Fixes #122792

state_dict() includes only persistent buffers, while named_buffers() also returns non-persistent buffers.

Pull Request resolved: #125337
Approved by: https://github.com/awgu
ghstack dependencies: #125333, #125501, #125334, #125335, #125336

* lintrunner

* lint

---------

Co-authored-by: Chien-Chin Huang <chienchin@fb.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
@github-actions github-actions bot deleted the gh/fegin/230/head branch June 5, 2024 01:53

Labels

ciflow/inductor, ciflow/trunk, Merged, module: dynamo, oncall: distributed, release notes: distributed (c10d)
