[reland] unflatten_tensor on compute stream for DTensorExtension by wanchaol · Pull Request #117020 · pytorch/pytorch · GitHub

Conversation

@wanchaol
Collaborator

@wanchaol wanchaol commented Jan 9, 2024

Stack from ghstack (oldest at bottom):

Reland of #116559, which was reverted internally.

The underlying reason for the revert is that torch._dynamo.disable (used as a decorator) can't be used in the PyTorch codebase, because it conflicts with torch.deploy. Although the latter only runs inference, it somehow takes a dependency on FSDP.

We have seen the same issue with our functional collectives: we can't use any dynamo components, otherwise torch.deploy complains.

Verified internally that the test passes again after removing the torch.dynamo.disable decorator.
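
As a minimal sketch of the workaround (the class name and method body below are illustrative, not the PR's actual diff): rebind the method through torch._dynamo.disable at construction time, rather than decorating it at class-definition time, so nothing dynamo-related is evaluated when torch.deploy imports the module.

import torch
import torch._dynamo

class DTensorExtensionSketch:
    def __init__(self) -> None:
        # Wrapping the bound method here, instead of using @torch._dynamo.disable
        # on the definition, keeps dynamo out of the import/definition path that
        # torch.deploy builds trip over.
        self.post_unflatten_transform = torch._dynamo.disable(
            self.post_unflatten_transform
        )  # type: ignore[method-assign]

    def post_unflatten_transform(self, tensor: torch.Tensor) -> torch.Tensor:
        # Placeholder body: the real extension reconstructs a DTensor from the
        # unflattened local shard here.
        return tensor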

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @fduwjj @wz337 @tianyu-l @wconstab @yf225

@pytorch-bot pytorch-bot bot added the release notes: distributed (fsdp) label Jan 9, 2024
@pytorch-bot

pytorch-bot bot commented Jan 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/117020

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 386885d with merge base 6cf1fc6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the oncall: distributed and ciflow/inductor labels Jan 9, 2024
@wanchaol wanchaol added the ciflow/trunk label Jan 9, 2024
@wanchaol wanchaol marked this pull request as draft January 9, 2024 06:24
wanchaol added a commit that referenced this pull request Jan 9, 2024
ghstack-source-id: 722d056
Pull Request resolved: #117020
@wanchaol wanchaol marked this pull request as ready for review January 9, 2024 06:48
wanchaol added a commit that referenced this pull request Jan 9, 2024
ghstack-source-id: 5ea9d0f
Pull Request resolved: #117020
Collaborator

@awgu awgu left a comment


LGTM since PR has already been verified internally!

def test_fsdp_tp_extension_grad(self):
    """
    Tests TP + FSDP extension with consistent gradient layout
    Tests TP + FSDP extension with correct gradient (i.e. no ACT)
Collaborator


What does ACT stand for?

Collaborator Author


AsyncCollectiveTensor
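
In other words (a hedged sketch; reduce_grad below is a hypothetical helper, and the funcol signatures may differ across versions): functional collectives return an AsyncCollectiveTensor, a tensor subclass standing in for a not-yet-completed collective, and calling .wait() materializes it into a plain tensor. The docstring change says gradients coming out of the TP + FSDP extension should already be plain tensors, not ACTs.

import torch
import torch.distributed._functional_collectives as funcol

def reduce_grad(grad: torch.Tensor, group) -> torch.Tensor:
    # The functional all_reduce returns an AsyncCollectiveTensor that wraps an
    # in-flight collective instead of blocking immediately.
    out = funcol.all_reduce(grad, reduceOp="sum", group=group)
    if isinstance(out, funcol.AsyncCollectiveTensor):
        # .wait() syncs the collective and unwraps to a plain torch.Tensor,
        # which is what the test expects gradients to be.
        out = out.wait()
    return out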

self.device_handle = device_handle
# We have to disable dynamo this way (by wrapping the bound method) because
# the decorator form would trigger a build failure with torch.deploy.
self.post_unflatten_transform = torch._dynamo.disable(self.post_unflatten_transform)  # type: ignore[method-assign]
Collaborator


IIUC, this is the main change?

Collaborator Author


yep

@wanchaol
Collaborator Author

wanchaol commented Jan 9, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

xadupre pushed a commit to xadupre/pytorch that referenced this pull request Jan 10, 2024
Pull Request resolved: pytorch#117020
Approved by: https://github.com/awgu
@facebook-github-bot facebook-github-bot deleted the gh/wanchaol/422/head branch January 13, 2024 15:23
@atalman atalman added this to the 2.2.1 milestone Jan 16, 2024
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Feb 12, 2024
Pull Request resolved: pytorch#117020
Approved by: https://github.com/awgu
atalman pushed a commit that referenced this pull request Feb 14, 2024
Co-authored-by: Wanchao Liang <wanchaol@users.noreply.github.com>
resolved: #116122
resolved: #117020
fixes #117126
resolved: #117336

Labels

ciflow/inductor
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
release notes: distributed (fsdp) (release notes category)

4 participants