Allow schedules to run with single stage #138925
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138925. Note: links to docs will display an error until the docs builds have completed. ✅ No failures as of commit 1388eb8 with merge base 2922b9f. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Ran into issues (#138863) when adding a schedule with a single stage for zero bubble, so adding code to support this edge case (mostly for test purposes).

cc @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o
Is it possible to fix this another way such that stage 0 still computes input grad separately?
```python
)
grads_input = []
param_groups = []
# Skip the backward for the first stage since we will perform the weight update with
```
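For context, here is a minimal sketch of the weight/input (W/I) backward split under discussion; `split_backward` and its arguments are illustrative names, not the PR's actual implementation:

```python
import torch

def split_backward(stage_output, output_grad, stage_inputs, parameters, is_first_stage):
    """Hypothetical two-phase backward: input grads (I), then weight grads (W)."""
    if is_first_stage:
        # Stage 0's inputs are the raw data batch and don't require grad,
        # so the I phase is a no-op; all real work lands in the W phase.
        grads_input = []
    else:
        grads_input = torch.autograd.grad(
            stage_output, stage_inputs, output_grad, retain_graph=True
        )
    grads_weight = torch.autograd.grad(stage_output, parameters, output_grad)
    return grads_input, grads_weight
```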
Does this mean that stage 0 will never run separate W/I computations, even in multi-stage pipelines?
I think this is a significant problem, since in ZB schedules it is more common to use separate W/I for earlier stages than for later ones: the last stage may use an almost entirely merged full backward, while the first stage may need mostly separated ones to fill bubbles.
Stage 0 still computes W/I, but the I is now effectively a no-op since the real work is done in W. Typically the input grad would not be computed for stage 0 anyway, since the inputs do not require gradients, and this skips the .grad() call entirely.
This only applies to the W/I split case; for the full backward B, the backward execution remains the same.
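To make the "inputs do not require gradients" point concrete, a tiny standalone check:

```python
import torch

batch = torch.randn(8, 16)   # raw data batch entering stage 0
print(batch.requires_grad)   # False -> no input grad to compute, so the
                             # I phase can skip the autograd.grad call
```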
This mostly makes sense; I agree it is pointless to compute dI on stage 0. I need to revisit how the schedules are designed, because I thought a separate I was common for stage 0 of ZB schedules.
```python
    last_backward = self._seen_bwd_chunks == self.chunks - 1  # type: ignore[operator]
else:
    # When backwards are split into weight and input, we will see twice as many
    # bwd_chunks, -1 because we skip the first bwd_chunk's backward
```
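As a toy model of the counting in that comment (an assumed reading of the intent, not the PR's code): a full backward contributes one bwd chunk per microbatch, while the split contributes two, minus the skipped first input backward:

```python
def is_last_backward(seen_bwd_chunks: int, chunks: int, split_backward: bool) -> bool:
    if not split_backward:
        # Full backward B: one bwd chunk per microbatch.
        return seen_bwd_chunks == chunks - 1
    # Split backward: I + W per microbatch gives 2 * chunks chunks, minus the
    # one skipped first-stage input backward, so the last index is 2*chunks - 2.
    return seen_bwd_chunks == 2 * chunks - 2
```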
For another PR, but in case you didn't see my comment on my own PR for merged backward: this logic will have to be rewritten, since any stage may have some mix of I, W, and B operations, so we can't detect the last backward by counting and expecting round numbers.
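A hedged sketch of the kind of rewrite suggested here (illustrative names, not a proposed patch): track the outstanding backward ops per microbatch instead of counting to a fixed total:

```python
def make_last_backward_tracker(actions_per_microbatch):
    # e.g. {0: {"I", "W"}, 1: {"B"}, 2: {"I", "W"}} -- each microbatch may
    # owe a different mix of backward ops, so no single total works.
    outstanding = {mb: set(ops) for mb, ops in actions_per_microbatch.items()}

    def complete(mb, op):
        # Mark one backward op as done; True only once nothing is outstanding.
        outstanding[mb].discard(op)
        return not any(outstanding.values())

    return complete
```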
Got it, we can definitely rewrite this logic!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#138925
Approved by: https://github.com/wconstab