[reland] unflatten_tensor on compute stream for DTensorExtension by wanchaol · Pull Request #117020 · pytorch/pytorch · GitHub

Conversation

@wanchaol
Collaborator

@wanchaol wanchaol commented Jan 9, 2024

Stack from ghstack (oldest at bottom):

Reland of #116559, which was reverted internally.

The underlying reason for the revert is that torch._dynamo.disable (used as a decorator) can't be used in the PyTorch codebase, because it conflicts with torch.deploy. Although the latter only runs inference, it somehow takes a dependency on FSDP.

We have seen the same issue with our functional collectives: we can't use any dynamo components, otherwise torch.deploy complains.

Verified internally that the test passes again after removing the torch.dynamo.disable decorator.
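
As a minimal sketch of the workaround (the class name and method body below are illustrative, not the PR's actual diff): rebind the method through torch._dynamo.disable at construction time, rather than decorating it at class-definition time, so nothing dynamo-related is evaluated when torch.deploy imports the module.

import torch
import torch._dynamo

class DTensorExtensionSketch:
    def __init__(self) -> None:
        # Wrapping the bound method here, instead of using @torch._dynamo.disable
        # on the definition, keeps dynamo out of the import/definition path that
        # torch.deploy builds trip over.
        self.post_unflatten_transform = torch._dynamo.disable(
            self.post_unflatten_transform
        )  # type: ignore[method-assign]

    def post_unflatten_transform(self, tensor: torch.Tensor) -> torch.Tensor:
        # Placeholder body: the real extension reconstructs a DTensor from the
        # unflattened local shard here.
        return tensor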

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @fduwjj @wz337 @tianyu-l @wconstab @yf225

@pytorch-bot pytorch-bot bot added the release notes: distributed (fsdp) label Jan 9, 2024
@pytorch-bot

pytorch-bot bot commented Jan 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/117020

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 386885d with merge base 6cf1fc6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the oncall: distributed and ciflow/inductor labels Jan 9, 2024
@wanchaol wanchaol added the ciflow/trunk label Jan 9, 2024
@wanchaol wanchaol marked this pull request as draft January 9, 2024 06:24
wanchaol added a commit that referenced this pull request Jan 9, 2024
ghstack-source-id: 722d056
Pull Request resolved: #117020
@wanchaol wanchaol marked this pull request as ready for review January 9, 2024 06:48
wanchaol added a commit that referenced this pull request Jan 9, 2024
ghstack-source-id: 5ea9d0f
Pull Request resolved: #117020
Collaborator

@awgu awgu left a comment


LGTM since PR has already been verified internally!

def test_fsdp_tp_extension_grad(self):
    """
    Tests TP + FSDP extension with consistent gradient layout
    Tests TP + FSDP extension with correct gradient (i.e. no ACT)
Collaborator


What does ACT stand for?

Collaborator Author


AsyncCollectiveTensor
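
In other words (a hedged sketch; reduce_grad below is a hypothetical helper, and the funcol signatures may differ across versions): functional collectives return an AsyncCollectiveTensor, a tensor subclass standing in for a not-yet-completed collective, and calling .wait() materializes it into a plain tensor. The docstring change says gradients coming out of the TP + FSDP extension should already be plain tensors, not ACTs.

import torch
import torch.distributed._functional_collectives as funcol

def reduce_grad(grad: torch.Tensor, group) -> torch.Tensor:
    # The functional all_reduce returns an AsyncCollectiveTensor that wraps an
    # in-flight collective instead of blocking immediately.
    out = funcol.all_reduce(grad, reduceOp="sum", group=group)
    if isinstance(out, funcol.AsyncCollectiveTensor):
        # .wait() syncs the collective and unwraps to a plain torch.Tensor,
        # which is what the test expects gradients to be.
        out = out.wait()
    return out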

self.device_handle = device_handle
# We have to disable dynamo this way (by wrapping the bound method) because
# the decorator form would trigger a build failure with torch.deploy.
self.post_unflatten_transform = torch._dynamo.disable(self.post_unflatten_transform)  # type: ignore[method-assign]
Collaborator


IIUC, this is the main change?

Collaborator Author


yep

@wanchaol
Collaborator Author

wanchaol commented Jan 9, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

xadupre pushed a commit to xadupre/pytorch that referenced this pull request Jan 10, 2024
Pull Request resolved: pytorch#117020
Approved by: https://github.com/awgu
@facebook-github-bot facebook-github-bot deleted the gh/wanchaol/422/head branch January 13, 2024 15:23
@atalman atalman added this to the 2.2.1 milestone Jan 16, 2024
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Feb 12, 2024
Pull Request resolved: pytorch#117020
Approved by: https://github.com/awgu
atalman pushed a commit that referenced this pull request Feb 14, 2024
Co-authored-by: Wanchao Liang <wanchaol@users.noreply.github.com>
resolved: #116122
resolved: #117020
fixes #117126
resolved: #117336

Labels

ciflow/inductor
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
release notes: distributed (fsdp) (release notes category)

4 participants