[Pipelining] Make PipelineStage support meta initialization #136243
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136243. Note: links to docs will display an error until the docs builds have completed. ✅ No failures as of commit 74cfb12 (failed to retrieve the merge base; please contact dev infra). This comment was automatically generated by Dr. CI and updates every 15 minutes.
Avoid allocating memory or dry-running the submodule during stage init. Save user-provided input/output metadata during stage init, to allow lazily initializing the buffers before the first step call. Later, we plan to build on top of this to add lazy shape inference (#130856) so that no input/output shapes are required at stage init. For now, we require input/output tensors for stage init, but these should be on meta device and the stage should not allocate any real memory. Note: this needs more thorough testing and review, but it worked on the torchtitan 3d test. TODO: (1) delete the 'device' arg from the PipelineStage ctor? (move to inferring it from the args tensors passed to the first step call? separate PR); (2) delete 'output_args' from the PipelineStage ctor? We don't actually need it, but we use it to do shape validation, which is why I didn't remove it in this PR. Proposal: leave it until we add lazy shape inference? Fixes #136225, #136226. ghstack-source-id: 8a359b5. Pull Request resolved: #136243
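As a hedged illustration of the intended usage (not code from this PR), a stage could be constructed entirely from meta-device tensors so that nothing real is allocated until the first step. The module, shapes, and stage/device values below are assumptions for the sketch, and the keyword arguments follow the PipelineStage signature as of this PR; a process group must already be initialized for the construction to succeed.

import torch
import torch.nn as nn
from torch.distributed.pipelining import PipelineStage

# Build this rank's stage submodule on meta: parameters carry only shape/dtype.
with torch.device("meta"):
    stage_mod = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())

# Example per-microbatch input/output tensors, also on meta (shapes are made up).
input_args = torch.empty(8, 1024, device="meta")
output_args = torch.empty(8, 1024, device="meta")

stage = PipelineStage(
    stage_mod,
    stage_index=0,                    # assumed: this rank owns stage 0
    num_stages=2,                     # assumed: 2-stage pipeline
    device=torch.device("cuda", 0),   # real device; buffers are created lazily
    input_args=input_args,
    output_args=output_args,
)
# Real parameters are materialized later (e.g. an existing .to(device) /
# init-weights path) before the first schedule step.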
Uses meta device for tensors/model used before pipeline splitting. *Important:* Relies on pytorch/pytorch#136243 to make PipelineStage avoid materializing the model and the input/output buffers eagerly. Relies on existing .to(device) in train.py to finally materialize the model. ghstack-source-id: 66fa9f1 Pull Request resolved: #582
Thanks for the quick fix!
self.inputs_meta = (
    (input_args,) if isinstance(input_args, torch.Tensor) else input_args
)
self._configure_outputs_meta(
This will make output_args required (right now it is optional); otherwise initialization will fail, right? For a smoother transition, can we still run self.submod() with the input args and leave output_args optional? This will fail if the model is not on the same device as the input, but that is okay; we can say we expect them both to be on meta device.
You're right that this makes output_args required. I like your suggestion. If you're willing to let it be an error when users pass real inputs, I would propose asserting that the module and inputs are on the same device, and then computing 'outputs_meta' based on the inputs.
I was weighing whether to require passing the input/module on meta device, but I suppose that is too restrictive. I do want to ensure that outputs_meta is stored on meta to avoid wasting memory, so maybe if the user passes a cuda model/inputs I will convert the output to meta. Wdyt?
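A minimal sketch of that proposal, with a hypothetical helper name (this is not the PR's actual code): dry-run the submodule on the provided inputs under no_grad, then convert any resulting output tensors to meta so only shape/dtype information is kept.

import torch

def infer_outputs_meta(submod: torch.nn.Module, inputs_meta):
    # Run the module once on whatever device the inputs/module share
    # (ideally meta), without building an autograd graph.
    with torch.no_grad():
        outputs = submod(*inputs_meta)
    if isinstance(outputs, torch.Tensor):
        outputs = (outputs,)
    # Keep only shape/dtype information: convert any real outputs to meta.
    return tuple(
        t.to("meta") if isinstance(t, torch.Tensor) else t for t in outputs
    )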
OK, I added output-shape inference back in, but I made it refuse to do inference on a non-meta device, so some user code might still have to switch their inputs to meta. Is this OK, or should I make it work for cuda?
It might be easier to just let it work for cuda if users already have things on cuda; then you wouldn't have to fix all the tests you mentioned above. I think the main fix of this PR can just be removing the .to(device) in init, which you did, and with that removal it is now up to users to make sure that model(input_args) works if they pass it in.
Hmm, can I say that I prefer the previous version, i.e. removal?
(See also my comments about relaxing the device constraint, which requires removing the dry run.)
Hmm, I'm in the other boat: I think keeping the dry run makes more sense (as a temporary measure) until we get lazy init working; otherwise we have to update all the tests to pass in both input_args and output_args, and require users to do so as well.
Yeah, that's hard. Maybe at least add with torch.no_grad()?
I am actually not sure what side effects running a module has. Saving grad context, marking requires_grad?
Also, in forward_maybe_with_... we have special logic for DDP/FSDP modules; would this dry run escaping that logic be an issue? (Or are we just lucky?)
I'm going to leave this for later. It's not any worse with this PR, and we plan to remove it when adding lazy shape inference. Sound OK?
Yep, sounds good.
Lgtm.
We have modified the single-stage schedule at this point, but not the multi-stage ones?
Oh, I should include the multi-stage schedules in this PR; I'll fix it before landing.
Confirm fixes #136225
    target: target for the loss function.
    losses: a list to store the losses for each microbatch.
"""
if not self._stages_initialized:
Hm, should I move this logic inside _step_microbatches? It seems some of our tests call _step_microbatches directly. Do we allow this, or do we require calling step()?
If we don't require calling step(), then should we also move this code inside _step_microbatches?
self._stage.clear_runtime_states()
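For illustration only, a hypothetical skeleton of the guard being discussed; the class, method, and attribute names mirror the snippet above but are assumptions, not the actual schedule code.

class _ScheduleSketch:
    def __init__(self, stage):
        self._stage = stage
        self._stages_initialized = False

    def _initialize_stages(self, args, kwargs):
        # Lazily set up the stage's send/recv buffers from the first microbatch.
        self._stage.prepare_infra(args, kwargs)  # hypothetical stage hook
        self._stages_initialized = True

    def step(self, *args, **kwargs):
        # step() splits inputs into microbatches and delegates; the guard could
        # live here...
        self._step_microbatches([args], [kwargs])

    def _step_microbatches(self, arg_mbs, kwarg_mbs):
        # ...or here, so tests that call _step_microbatches directly still
        # trigger lazy initialization.
        if not self._stages_initialized:
            self._initialize_stages(arg_mbs[0], kwarg_mbs[0])
        self._stage.clear_runtime_states()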
Why don't I just do this init inside stage._forward_one_chunk...?
Yeah. That would be a good place.
We may have to ask CUDA Graph to come capture after the first step(...) though.
But that may not be a big deal? And the current change moved the prepare from init to step anyway.
Annoying: the Stage class doesn't know "n_microbatches", which may have been intentional?
I am thinking about whether it would be bad to let the schedule 'register' the number of microbatches during Schedule.__init__, so the stage can use this value later when it performs initialization inside forward_one_chunk / backward_one_chunk.
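A toy sketch of that idea (class and method names are hypothetical, not the actual Stage/Schedule classes): the schedule records its microbatch count on the stage at construction time, and the stage reads it later when it lazily sizes its buffers.

class _StageSketch:
    def __init__(self):
        self.num_microbatches = None  # unknown until a schedule registers it

    def _lazy_init_buffers(self):
        assert self.num_microbatches is not None, "no schedule registered"
        # ...allocate per-microbatch recv/grad buffers here...

class _ScheduleSketch:
    def __init__(self, stage, n_microbatches):
        self._stage = stage
        stage.num_microbatches = n_microbatches  # the 'registration' step

stage = _StageSketch()
_ScheduleSketch(stage, n_microbatches=8)
stage._lazy_init_buffers()  # now knows to size buffers for 8 microbatches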
group: Optional[dist.ProcessGroup] = None,
dw_builder: Optional[Callable[[], Callable[..., None]]] = None,
):
assert submodule.device == torch.device(
This assertion breaks tests; I guess it'll break some existing usages. Is it a good thing to do, to prevent bad practices going forward, or should I relax it and make it work if the model and inputs are on cuda?
I think we should relax this. "meta" device is still a high-end thing for most users.
And, is nn.Module.device a real thing?
Hmm, it's not always possible to assert this. Maybe there is another way to reliably check the device of the module?
AttributeError: 'FSDPSequential' object has no attribute 'device'
Yeah, maybe I should just drop the checks and perform the shape inference blindly, but then do a .to(meta) on the outputs. That should put things on par with what Howard mentioned, and then we can delete it once we add lazy inference.
FSDPSequential is a superset of 'nn.Sequential', and even vanilla nn.Sequential does not have a .device attribute (see below for a repro). I guess what we really want to check is whether model.parameters() are on meta? That should cover all cases.
>>> import torch
>>> model = torch.nn.Sequential(torch.nn.Linear(2,2))
>>> model.device
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/users/weif/pytorch/torch/nn/modules/module.py", line 1931, in __getattr__
raise AttributeError(
AttributeError: 'Sequential' object has no attribute 'device'
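A sketch of that suggested check, with a hypothetical helper name; it inspects parameters and buffers rather than a nonexistent .device attribute, which also covers wrappers such as FSDPSequential.

import torch

def module_on_meta(mod: torch.nn.Module) -> bool:
    # nn.Module has no .device attribute, so inspect parameters and buffers;
    # note this returns True for modules with no parameters at all.
    return all(
        t.device.type == "meta"
        for t in list(mod.parameters()) + list(mod.buffers())
    )

with torch.device("meta"):
    m = torch.nn.Sequential(torch.nn.Linear(2, 2))
print(module_on_meta(m))  # True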
# TODO, (1) are we deleting output validation when we move to shape inference?
# (2) if not, we should support multiple outputs
assert (
    len(outputs_meta) == 1
), f"validation logic assumes single output, got {len(outputs_meta)} outputs "
Oh, then should we disable the validation logic (which is an add-on protection) until it supports the multi-output case? Multi-output is pretty common; this assert may break a few tracer cases.
If this is a new regression, I can remove the assert, but I thought the assert would just make an existing error more explicit: the old code assumes 'output' is a single value and tries to validate against it. Let's see if CI fails, and if not, make another PR to fix validation?
I see. Yep, if it is not a new assert, then let's see what CI says. :)
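If the validation does stay, a follow-up could generalize it along these lines; this is a hypothetical helper, not part of this PR.

import torch

def validate_outputs(outputs, outputs_meta):
    # Compare each produced output against the recorded meta tensor instead of
    # assuming a single output.
    if isinstance(outputs, torch.Tensor):
        outputs = (outputs,)
    assert len(outputs) == len(outputs_meta), (
        f"expected {len(outputs_meta)} outputs, got {len(outputs)}"
    )
    for i, (out, meta) in enumerate(zip(outputs, outputs_meta)):
        assert out.shape == meta.shape, (
            f"output {i}: shape {out.shape} does not match expected {meta.shape}"
        )
        assert out.dtype == meta.dtype, (
            f"output {i}: dtype {out.dtype} does not match expected {meta.dtype}"
        )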
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Uses meta device for tensors/model used before pipeline splitting. *Important:* Relies on pytorch/pytorch#136243 to make PipelineStage avoid materializing the model and the input/output buffers eagerly. Relies on existing .to(device) in train.py to finally materialize the model. ghstack-source-id: c15282c Pull Request resolved: #588
Stack from ghstack (oldest at bottom):
Avoid allocating memory or dry-running the submodule during stage init.
Save user-provided input/output metadata during stage init, to allow
lazily initializing the buffers before the first step call.
Later, we plan to build on top of this to add lazy shape inference
(#130856) so that no input/output shapes are required at stage init.
For now, we require input/output tensors for stage init, but these
should be on meta device and stage should not allocate any real memory.
Note: this needs more thorough testing and review, but it worked on the
torchtitan 3d test.
TODO:
- delete 'device' arg from PipelineStage ctor? (move to inferring it from the
args tensors passed to the first step call? separate PR)
- delete 'output_args' from PipelineStage ctor? We don't actually need
it, but we use it to do shape validation, which is why I didn't remove
it in this PR. Proposal: leave it until we add lazy shape inference?
Fixes #136225, #136226
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @d4l3k @c-p-i-o