[PP] Remove modifications to autograd nodes in ZB by H-Huang · Pull Request #136678 · pytorch/pytorch

Conversation

@H-Huang
Member

H-Huang commented Sep 25, 2024

@pytorch-bot

pytorch-bot bot commented Sep 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136678

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c9bbe2d with merge base 9992084:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the oncall: distributed (Add this issue/PR to distributed oncall triage queue) label on Sep 25, 2024
cc XilunWu awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Sep 25, 2024
ghstack-source-id: 74d4756
Pull Request resolved: #136678
H-Huang added the release notes: distributed (pipeline) (release notes category) label on Sep 25, 2024

# backward of loss with respect to weights
dweights = stage_backward_weight(mod.parameters(), param_groups)
stage_backward_weight(mod.parameters(), param_groups, retain_graph=True)
Contributor

Why retain_graph=True? Is it because we reuse the same graph for subsequent microbatches, or does each microbatch have its own graph?

Contributor

Oh this is a test. Never mind

@H-Huang
Member Author

H-Huang commented Sep 26, 2024

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Sep 26, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

@H-Huang
Member Author

H-Huang commented Sep 26, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Comment on lines 39 to +40

def reverse_closure(
-    roots: List[Node], target_nodes: Set[Node]
+    roots: List[Node], target_nodes: Set[Node], reverse_edges_dict
Contributor

nit: add doc to the reverse_edges_dict argument.

Member Author

will do!
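For reference, one possible shape for that docstring (hypothetical wording, not the wording that landed; it assumes reverse_edges_dict maps each autograd Node to the nodes that consume it, i.e. the autograd edges reversed, so the closure can be walked without attaching state to the nodes):

from typing import Dict, List, Set
from torch.autograd.graph import Node

def reverse_closure(
    roots: List[Node], target_nodes: Set[Node], reverse_edges_dict: Dict[Node, List[Node]]
):
    """
    Args:
        roots: autograd nodes to start the traversal from.
        target_nodes: nodes at which the traversal stops.
        reverse_edges_dict: (assumed semantics) mapping from each Node to the
            nodes that consume its output, i.e. the graph edges reversed.
    """
    ...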

Comment on lines 209 to +210
def stage_backward_weight(
-    weights: Iterator[Parameter], param_groups: List[Dict[str, Any]]
+    weights: Iterator[Parameter], param_groups: List[Dict[str, Any]], retain_graph=False
Contributor

If the retain_graph flag is just for testing purposes, shall we make it _retain_graph and add a banner saying "Test only; don't use"?

Member Author

I added retain_graph to better align with the .backward() API, for the case where someone wants to perform multiple backwards (double backward) and accumulate the gradients. I was using this in testing, and I don't think anyone besides us is using stage_backward_input and stage_backward_weight, but they don't necessarily need to apply only to stages and could be used as a more general API.
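As a rough illustration of that analogy with the regular .backward() API (a standalone toy example, not the pipeline test itself), retain_graph=True keeps the graph alive so a second backward can accumulate into the same .grad buffers:

import torch

x = torch.tensor([2.0], requires_grad=True)
loss = (x * x).sum()

# First backward keeps the graph so it can be traversed again.
loss.backward(retain_graph=True)
# Second backward accumulates into x.grad rather than replacing it.
loss.backward()

print(x.grad)  # tensor([4.]) + tensor([4.]) -> tensor([8.])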

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

@kwen2501
Contributor

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

injiiiiil pushed a commit to injiiiiil/654 that referenced this pull request Oct 1, 2024

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
release notes: distributed (pipeline) (release notes category)
