[Pipelining] Support separate dI / dW and V-schedules #131762
self.use_full_backward = False

# Go through two microbatches
# TODO(whc) unify the semantics of the IR for old runtime with new runtime.
Fixed in a later PR in this stack.
ops.extend(stage.get_fwd_send_ops(mb_index))
elif computation_type == _ComputationType.BACKWARD:

# TODO(whc) for now i'm going with the hopefully backward-compatible position that legacy IR with
Fixed in a later PR in this stack.
        return True
    return False
-   elif action.computation_type == B:
+   elif action.computation_type in (BACKWARD_INPUT, FULL_BACKWARD):
I think I should change this to just 'FULL_BACKWARD' to be consistent with the rest of this PR, and then add BACKWARD_INPUT back in the later PR where I fix other inconsistencies.
@@ -0,0 +1,2 @@
0F0,0F1,2F0,,2F1,2I0,2W0,0F2,2I1,2W1,0F3,0I0,0W0,2F2,0I1,0W1,2F3,2I2,2W2,0F4,2I3,2W3,0F5,0I2,0W2,2F4,0I3,0W3,2F5,2I4,2W4,0F6,2I5,2W5,0F7,0I4,0W4,2F6,0I5,0W5,2F7,2I6,2W6,2I7,2W7,0I6,0W6,0I7,0W7
For my understanding: if you load the compute CSV without the comms CSV, will it error? Or will it automatically determine the comms for you?
Within test_csv it is explicit: we load the compute-only one, run the lowering passes, then compare the output of that with the saved comms one.
For real users it is somewhat wrapped in an API: load_csv in PipelineScheduleRuntime accepts a kwarg for whether it's a comms or a compute CSV. If it's a compute one, it will run add_send_recv.
I think before we roll this out more widely, we should better define the API around the lowering passes: probably a function for 'lowering' the schedule and some config flags for any optional passes. For now it's somewhat manual: load_csv just hardcodes which passes to run.
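For reference on the CSV contents above: each cell appears to use the schedule IR's `{stage}{computation_type}{microbatch}` string form, where F is forward, I is backward-input (dI), W is backward-weight (dW), and an empty cell is an idle step. Below is a toy parser for one rank's row, just to make the encoding concrete; it is not the actual loader.

```python
import re

# Toy parser for one rank's row of a compute-only schedule CSV, assuming the
# "{stage}{computation_type}{microbatch}" cell encoding shown above
# (F = forward, I = dI, W = dW, B = full backward); empty cells are idle steps.
_ACTION_RE = re.compile(r"(\d+)([FIWB])(\d+)")

def parse_row(row: str):
    actions = []
    for cell in row.split(","):
        if not cell:
            actions.append(None)  # idle timestep on this rank
            continue
        stage, comp, mb = _ACTION_RE.fullmatch(cell).groups()
        actions.append((int(stage), comp, int(mb)))
    return actions

print(parse_row("0F0,0F1,2F0,,2F1,2I0,2W0"))
# [(0, 'F', 0), (0, 'F', 1), (2, 'F', 0), None, (2, 'F', 1), (2, 'I', 0), (2, 'W', 0)]
```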
# as the input tensor for a fresh autograd graph, not part of the previous stage's autograd graph.
# TODO: confirm, do we use this activation as the root of the backward call for the previous stage? does
# detach have any affect on that?
info.buffer = tensor.detach().requires_grad_(True)
@H-Huang I should remove this TODO. It seems to pass the gradient tests, but I wonder if you have any more insight into whether I am doing the best thing here.
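For anyone else reading this thread, here is a minimal standalone illustration (not the stage code itself) of what the `detach().requires_grad_(True)` pattern above accomplishes: the received activation becomes a leaf of a fresh autograd graph for this stage, and the previous stage's backward is later seeded with the gradient accumulated on that leaf.

```python
import torch

# "Previous stage": produces an activation that belongs to its own autograd graph.
x = torch.randn(4, requires_grad=True)
act = x * 2

# "This stage": treat the received activation as a fresh leaf input.
buf = act.detach().requires_grad_(True)
loss = (buf ** 2).sum()

# Backward on this stage stops at `buf`; it does not reach into `x`.
loss.backward()
assert buf.grad is not None and x.grad is None

# The previous stage then runs its backward rooted at its own output,
# seeded with the gradient received for `buf`.
torch.autograd.backward(act, grad_tensors=buf.grad)
assert x.grad is not None
```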
assert not self.is_first, "can't get bwd output if this stage is first"

self._check_chunk_id(mb_index)
# TODO(whc) we should be indexing mb_index into self.grads_input, but it appears we are only storing
Made an issue: #139404
Moves 'grad input' tensors from the next stage to 'grad_output' on this stage, avoiding a copy or send/recv.
Does not detach or set '_requires_grad'.
"""
# TODO(whc) discrepancy between list/tuple type here. need to clean up
I think this TODO should be removed, as we decided it is expected that users can choose a tuple or a tensor as the return value. The 'normalize' function should be fixing this for us now? (But we aren't calling it?)
Made an issue. It looks like we currently don't know what type this will be and aren't normalizing it. Let's fix it in another PR.
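For illustration, the kind of normalization being discussed could look roughly like the sketch below; this is a hypothetical helper, not the existing 'normalize' function.

```python
from typing import List, Tuple, Union
import torch

def _normalize_to_tuple(
    out: Union[torch.Tensor, List[torch.Tensor], Tuple[torch.Tensor, ...]]
) -> Tuple[torch.Tensor, ...]:
    # Hypothetical helper: whether a stage module returns a single Tensor, a
    # list, or a tuple, downstream code always sees a tuple of Tensors.
    if isinstance(out, torch.Tensor):
        return (out,)
    if isinstance(out, (list, tuple)):
        return tuple(out)
    raise TypeError(f"Unsupported stage output type: {type(out).__name__}")
```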
torch.export tracing, compiled models may also return a list instead of a Tuple, which we will normalize back to a
tuple for consistency.
TODO: should we be stricter about asserting that stage modules (intermediate and output) all return only Tensor
@H-Huang wdyt, should we be asserting this? If a stage returned a non-tensor, we'd fail on send/recv, right?
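If we did want to be stricter, a cheap check at the point where stage outputs are flattened would fail earlier and with a clearer message than a later send/recv error. A sketch, with illustrative names only:

```python
import torch

def _check_outputs_are_tensors(outputs) -> None:
    # Illustrative early validation: raise a descriptive error instead of
    # letting a non-Tensor output fail later inside a send/recv.
    for i, out in enumerate(outputs):
        if not isinstance(out, torch.Tensor):
            raise TypeError(
                f"Pipeline stage output {i} has type {type(out).__name__}; "
                "stage modules must return only Tensors (or tuples/lists of Tensors)."
            )
```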
stage_idx,
n_stages,
device,
# TODO(whc) shape inference shouldn't have needed to run communications in this 1-rank, 2-stage scenario,
Removed this; it is fixed now. Probably fixed by @H-Huang's PR to fix single-stage schedule usage.
Used in both the simulator and the add_send_recv pass, the ready_to_schedule logic works by looking at all the previously scheduled ops on a rank to see if any of them 'unblocks' the current op to be scheduled. For example, to schedule a FORWARD op, a previous RECV_F op is needed, unless this is stage 0 or a previous stage on the same rank already ran FORWARD. The old implementation iteratively compared the candidate op to the previous ops. The new implementation uses set lookups to reduce complexity, and it maintains the set of previous ops as ops are scheduled rather than constructing a set on demand. I did not save benchmark results, but this gives a 10-100x speedup that is most noticeable for unit tests with artificially huge schedule IR: the largest of these took longer than 20 minutes before (I never let it finish) but now takes less than 14 seconds. Most schedules take less than 10 ms.

Pull Request resolved: #138924
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928, #131762
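A minimal sketch of the set-based bookkeeping described above (the action encoding and names are simplified for illustration; the real pass operates on the schedule IR's action objects):

```python
# Toy version of the set-based ready_to_schedule idea: previously scheduled ops
# are kept in a set that is updated as ops are scheduled, so each readiness
# check is a handful of set lookups instead of a scan over all prior ops.
class ReadyTracker:
    def __init__(self, stages_on_this_rank):
        self.stages_on_this_rank = set(stages_on_this_rank)
        self.prev_ops = set()  # (stage, op_type, microbatch) tuples

    def ready_to_schedule(self, stage, op_type, mb):
        if op_type == "F":
            # FORWARD needs a prior RECV_F, unless this is stage 0 or the
            # previous stage already ran FORWARD for this microbatch on this rank.
            return (
                stage == 0
                or (stage, "RECV_F", mb) in self.prev_ops
                or (
                    stage - 1 in self.stages_on_this_rank
                    and (stage - 1, "F", mb) in self.prev_ops
                )
            )
        return True  # other op types elided in this sketch

    def schedule(self, stage, op_type, mb):
        assert self.ready_to_schedule(stage, op_type, mb)
        self.prev_ops.add((stage, op_type, mb))

tracker = ReadyTracker(stages_on_this_rank=[0, 1])
tracker.schedule(0, "F", 0)                   # stage 0 needs no RECV_F
assert tracker.ready_to_schedule(1, "F", 0)   # unblocked by stage 0's FORWARD
```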
Stack from ghstack (oldest at bottom):
Separate dI / dW:
PipelineScheduleRuntime now supports executing either merged FULL_BACKWARD ops or separate dI / dW operations. Separating B and W can add execution overhead and may be suboptimal in cases where B and W are 'fused', but it is worthwhile when splitting them lets the schedule fill in bubbles and run more efficiently. In some cases the schedule still issues B immediately followed by W; at those points we merge them back into a BW op and execute a full backward rather than a separate B followed by W (see the sketch below).
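To make the merging step concrete, a toy post-processing pass over a simplified action encoding could look like the sketch below; the real runtime works on schedule IR actions, not `(stage, type, microbatch)` tuples.

```python
# Toy merge pass: when a stage's dI (I) is immediately followed by its dW (W)
# for the same microbatch, execute one full backward (B) instead of two ops.
def merge_adjacent_backward(ops):
    merged = []
    i = 0
    while i < len(ops):
        stage, kind, mb = ops[i]
        if kind == "I" and i + 1 < len(ops) and ops[i + 1] == (stage, "W", mb):
            merged.append((stage, "B", mb))
            i += 2
        else:
            merged.append(ops[i])
            i += 1
    return merged

# Stage 0's I0 directly followed by W0 becomes one full backward:
assert merge_adjacent_backward([(0, "I", 0), (0, "W", 0), (1, "I", 0)]) == [
    (0, "B", 0),
    (1, "I", 0),
]
```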
V-schedules:
V-schedules have a special case where the last rank holds 2 adjacent stages. For example, if rank 3 holds stage 3 and stage 4, we transfer stage 3's outputs directly to stage 4's inputs without a send/recv. In the scheduling logic, we must also allow scheduling the stage 4 forward after running the stage 3 forward, without expecting a stage 4 RECV_F.
In the runtime, we pass activations between adjacent stages without SEND/RECV ops, since the stages live on the same rank/process. We add new APIs to the PipelineStage abstraction for passing activations during both forward and backward. Currently the implementation directly modifies the 'recv buffers' the stage is managing, so the forward/backward execution logic does not need to know the difference.
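The sketch below illustrates the same-rank fast path in toy form; the class and method names here are made up for illustration and are not the actual PipelineStage APIs.

```python
import torch

# Toy illustration of adjacent stages on one rank: stage 3's output is written
# directly into stage 4's receive buffer, so no SEND_F / RECV_F is needed and
# the forward execution path does not change.
class ToyStage:
    def __init__(self, module: torch.nn.Module):
        self.module = module
        self.recv_buffer = None  # normally populated by a RECV op

    def set_local_fwd_input(self, activation: torch.Tensor) -> None:
        # Same-rank path: detach so this stage starts a fresh autograd graph.
        self.recv_buffer = activation.detach().requires_grad_(True)

    def forward_one_chunk(self) -> torch.Tensor:
        return self.module(self.recv_buffer)

stage3 = ToyStage(torch.nn.Linear(4, 4))
stage4 = ToyStage(torch.nn.Linear(4, 4))

stage3.recv_buffer = torch.randn(2, 4)   # pretend this arrived via RECV_F
out3 = stage3.forward_one_chunk()        # stage 3 forward
stage4.set_local_fwd_input(out3)         # direct hand-off, no SEND/RECV
out4 = stage4.forward_one_chunk()        # stage 4 forward
```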
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @d4l3k @c-p-i-o