[PP] Remove modifications to autograd nodes in ZB by H-Huang · Pull Request #136678 · pytorch/pytorch

Conversation

@H-Huang
Member

H-Huang commented Sep 25, 2024

@pytorch-bot

pytorch-bot bot commented Sep 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136678

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c9bbe2d with merge base 9992084:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the oncall: distributed (Add this issue/PR to distributed oncall triage queue) label on Sep 25, 2024
cc XilunWu awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Sep 25, 2024
ghstack-source-id: 74d4756
Pull Request resolved: #136678
H-Huang added the release notes: distributed (pipeline) (release notes category) label on Sep 25, 2024

# backward of loss with respect to weights
dweights = stage_backward_weight(mod.parameters(), param_groups)
stage_backward_weight(mod.parameters(), param_groups, retain_graph=True)
Contributor

Why retain_graph=True? Is it because we reuse the same graph for subsequent microbatches, or does each microbatch have its own graph?

Contributor

Oh this is a test. Never mind

@H-Huang
Member Author

H-Huang commented Sep 26, 2024

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Sep 26, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

@H-Huang
Member Author

H-Huang commented Sep 26, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Comment on lines 39 to +40

def reverse_closure(
-    roots: List[Node], target_nodes: Set[Node]
+    roots: List[Node], target_nodes: Set[Node], reverse_edges_dict
Contributor

nit: add doc to the reverse_edges_dict argument.

Member Author

will do!
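For reference, one possible shape for that docstring (hypothetical wording, not the wording that landed; it assumes reverse_edges_dict maps each autograd Node to the nodes that consume it, i.e. the autograd edges reversed, so the closure can be walked without attaching state to the nodes):

from typing import Dict, List, Set
from torch.autograd.graph import Node

def reverse_closure(
    roots: List[Node], target_nodes: Set[Node], reverse_edges_dict: Dict[Node, List[Node]]
):
    """
    Args:
        roots: autograd nodes to start the traversal from.
        target_nodes: nodes at which the traversal stops.
        reverse_edges_dict: (assumed semantics) mapping from each Node to the
            nodes that consume its output, i.e. the graph edges reversed.
    """
    ...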

Comment on lines 209 to +210
def stage_backward_weight(
-    weights: Iterator[Parameter], param_groups: List[Dict[str, Any]]
+    weights: Iterator[Parameter], param_groups: List[Dict[str, Any]], retain_graph=False
Contributor

If the retain_graph flag is just for testing purposes, shall we make it _retain_graph and add a banner saying "Test only; don't use"?

Member Author

I added retain_graph to better align with the .backward() API, for the case where someone wants to perform multiple backwards (double backward) and accumulate the gradients. I was using this in testing, and I don't think anyone besides us is using stage_backward_input and stage_backward_weight, but they don't necessarily need to apply only to stages and could be used as a more general API.
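As a rough illustration of that analogy with the regular .backward() API (a standalone toy example, not the pipeline test itself), retain_graph=True keeps the graph alive so a second backward can accumulate into the same .grad buffers:

import torch

x = torch.tensor([2.0], requires_grad=True)
loss = (x * x).sum()

# First backward keeps the graph so it can be traversed again.
loss.backward(retain_graph=True)
# Second backward accumulates into x.grad rather than replacing it.
loss.backward()

print(x.grad)  # tensor([4.]) + tensor([4.]) -> tensor([8.])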

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

@kwen2501
Contributor

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

injiiiiil pushed a commit to injiiiiil/654 that referenced this pull request Oct 1, 2024

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
release notes: distributed (pipeline) (release notes category)
