Allow schedules to run with single stage by H-Huang · Pull Request #138925 · pytorch/pytorch · GitHub

Conversation

@H-Huang
Member

@H-Huang H-Huang commented Oct 25, 2024

Stack from ghstack (oldest at bottom):

Ran into issues (#138863) when adding a Schedule with a single stage, so this adds code to support that edge case (mostly for test purposes); see the sketch below for what the single-stage setup looks like.

cc @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

[ghstack-poisoned]
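
For context, a minimal sketch of the single-stage case this PR enables, assuming the torch.distributed.pipelining API; the model split, shapes, port, and the choice of ScheduleInterleavedZeroBubble below are illustrative rather than taken from this PR's tests, and the constructor signatures may differ slightly across PyTorch versions.

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed.pipelining import PipelineStage, ScheduleInterleavedZeroBubble

    # Single-process setup: one rank, one stage (address/port are placeholders).
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    # The whole model lives in the single pipeline stage.
    model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
    stage = PipelineStage(model, stage_index=0, num_stages=1, device=torch.device("cpu"))

    # With one stage, the rank is both the first and the last stage, so it takes
    # the inputs and the loss targets; this PR makes the schedules handle that.
    schedule = ScheduleInterleavedZeroBubble(
        [stage], n_microbatches=4, loss_fn=nn.MSELoss()
    )
    schedule.step(torch.randn(8, 16), target=torch.randn(8, 16))

    dist.destroy_process_group()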
@pytorch-bot

pytorch-bot bot commented Oct 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138925

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1388eb8 with merge base 2922b9f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Oct 25, 2024
H-Huang added a commit that referenced this pull request Oct 25, 2024
ghstack-source-id: 5273db0
Pull Request resolved: #138925
@H-Huang H-Huang marked this pull request as draft October 25, 2024 18:03
@H-Huang H-Huang added release notes: distributed (pipeline) release notes category module: pipelining Pipeline Parallelism labels Oct 26, 2024
@H-Huang H-Huang changed the title from "temp changes for fix" to "Allow schedules to run with single stage" Oct 28, 2024
Ran into issues (#138863) when adding a Schedule with a single stage for zero bubble; adding code to support this, mostly for test purposes.

@H-Huang H-Huang marked this pull request as ready for review October 28, 2024 20:11
@H-Huang H-Huang requested review from kwen2501 and wconstab October 28, 2024 20:11
H-Huang added a commit that referenced this pull request Oct 28, 2024
ghstack-source-id: f031067
Pull Request resolved: #138925
Contributor

@wconstab wconstab left a comment

Is it possible to fix this another way such that stage 0 still computes input grad separately?

)
grads_input = []
param_groups = []
# Skip the backward for the first stage since we will perform the weight update with
Contributor

Does this mean that stage 0 will never run separate W/I computations, even in multi-stage pipelines?

I think this is a significant problem, since in ZB it is more common to use separate W/I for earlier stages than for late stages. The last stage may use an almost entirely merged full backward, but the first stage may need mostly separated ones to fill bubbles.

Member Author

Stage 0 still computes W/I, but now the I is essentially a no-op since the real work is done in W. Typically the input grad would not be computed for stage 0 anyway, since the inputs do not require gradients, and this skips the .grad() call entirely.

This only applies to the W/I split case; for the full B, the backward execution remains the same.
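
To make the distinction concrete, here is a rough sketch of the I/W split being described. It uses plain torch.autograd.grad rather than the pipelining internals, and the helper and its arguments are illustrative, not the PR's code.

    import torch

    def split_backward(outputs, grad_outputs, stage_inputs, params, is_first_stage):
        # "I" step: gradients w.r.t. the stage inputs, which a non-first stage
        # would send to the previous stage. On stage 0 the pipeline inputs do
        # not require grad, so there is nothing to send upstream and the
        # autograd call can be skipped entirely -- the I step becomes a no-op.
        if is_first_stage or not any(t.requires_grad for t in stage_inputs):
            input_grads = None
        else:
            input_grads = torch.autograd.grad(
                outputs, stage_inputs, grad_outputs, retain_graph=True
            )

        # "W" step: gradients w.r.t. the parameters. In a real schedule this
        # runs later (to fill bubbles); here it is inlined for brevity.
        param_grads = torch.autograd.grad(outputs, params, grad_outputs)
        for p, g in zip(params, param_grads):
            p.grad = g if p.grad is None else p.grad + g
        return input_grads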

Contributor

@wconstab wconstab left a comment

This mostly makes sense; I agree it is pointless to compute dI on stage 0. I need to revisit how the schedules are designed, because I thought separate I was a common thing for stage 0 of ZB schedules.

    last_backward = self._seen_bwd_chunks == self.chunks - 1  # type: ignore[operator]
else:
    # For backwards that are split into weight and input, we will see twice as many bwd_chunks,
    # -1 because we skip the first bwd_chunk backward
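
One reading of the counting in this excerpt, written out as a standalone helper; the formula for the split case is inferred from the comment above and is not quoted from the PR.

    def is_last_backward(seen_bwd_chunks: int, chunks: int, split_bw: bool) -> bool:
        # Full-B mode: one backward call per microbatch, so the last call has
        # zero-based index chunks - 1.
        if not split_bw:
            return seen_bwd_chunks == chunks - 1
        # Split I/W mode: two calls per microbatch, minus the one input-grad
        # call skipped for the first bwd chunk (the "-1" in the comment above).
        return seen_bwd_chunks == 2 * chunks - 1 - 1

    # e.g. with 4 microbatches: full B finishes at call index 3,
    # split I/W finishes at call index 2 * 4 - 1 - 1 == 6.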
Contributor

For another PR... but in case you didn't see my comment on my own PR for merge bw, this logic will have to be rewritten, since any stage may have some mix of I, W, and B operations, so we can't expect to do it by counting and expecting round numbers.

Member Author

Got it, we can definitely rewrite this logic!

@H-Huang
Member Author

H-Huang commented Oct 30, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 30, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
Ran into issues (pytorch#138863) when adding a Schedule with a single stage, so adding code to support this edge case (mostly for test purposes)

Pull Request resolved: pytorch#138925
Approved by: https://github.com/wconstab
@github-actions github-actions bot deleted the gh/H-Huang/150/head branch November 30, 2024 02:07
