[tp] fix torch compile regression by wanchaol · Pull Request #111521 · pytorch/pytorch · GitHub

Conversation

@wanchaol wanchaol commented Oct 18, 2023

Stack from ghstack (oldest at bottom):

The most recent refactor of TP
(#111160) breaks the torch.compile
path, so this reverts the behavior by:

  1. using the old default prepare_input/output
  2. adding the colwise/rowwise parallel test instead


[ghstack-poisoned]

pytorch-bot bot commented Oct 18, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111521

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit 7f29dec with merge base 5c39552:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Comment on lines 485 to 486
_prepare_input=None,
_prepare_output=None,
Contributor

But this somehow reverts input_layouts and output_layouts? So does this mean that when users want to use input_layouts and output_layouts, they have to set _prepare_input to None?

Collaborator Author

This is just a quick fix; I'll do a big rewrite on top to make sure user behavior won't change.

Contributor

I see.

Collaborator Author

OK, I gave up on the big rewrite and found a quicker solution for now. However, it does not fundamentally fix the torch.compile problem in the new-style input_layouts and output_layouts path (i.e., torch.compile still fails when configuring sequence parallel). We should do a full rewrite soon to make the TP code cleaner and tracing-friendly.

Collaborator Author

This "hack" preserves the intended behavior, but I dislike it.

wanchaol added a commit that referenced this pull request Oct 19, 2023

ghstack-source-id: 0865d50
Pull Request resolved: #111521
@wanchaol wanchaol added ciflow/trunk Trigger trunk jobs on your pull request release notes: distributed (dtensor) release notes category labels Oct 19, 2023
@wanchaol
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@wanchaol wanchaol added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Oct 19, 2023
@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

wanchaol added a commit that referenced this pull request Oct 19, 2023

ghstack-source-id: 2dbb05c
Pull Request resolved: #111521
@wanchaol
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@facebook-github-bot facebook-github-bot deleted the gh/wanchaol/376/head branch October 22, 2023 14:23
xuhancn pushed a commit to xuhancn/pytorch that referenced this pull request Nov 7, 2023
Pull Request resolved: pytorch#111521
Approved by: https://github.com/fduwjj