Fix ConvolutionBinaryInplace using target node by oulgen · Pull Request #114436 · pytorch/pytorch

Conversation

@oulgen
Contributor

@oulgen oulgen commented Nov 23, 2023

@pytorch-bot

pytorch-bot bot commented Nov 23, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/114436

Note: Links to docs will display an error until the docs builds have been completed.

❌ 35 New Failures, 39 Unrelated Failures

As of commit e75592d with merge base 8f8722e:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

oulgen added a commit that referenced this pull request Nov 23, 2023
Fixes #113440

ghstack-source-id: d056bda
Pull Request resolved: #114436
@oulgen
Contributor Author

oulgen commented Nov 23, 2023

There's a bug in convolution_unary where it uses the target of ConvolutionBinaryInplace even though it needs to use `inputs[1]`. I am having a hard time finding where this happens, so I am partially reverting this part until I can find it.

cc: @Chillee

EDIT: updated the PR with the correct fix
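
For illustration, here is a minimal sketch of the bug pattern being described, using hypothetical stand-in classes rather than the actual Inductor IR types: an in-place binary op's result must alias the argument it mutates, not the op node itself.

```python
# Minimal sketch of the bug pattern, using hypothetical stand-in classes,
# not the actual torch._inductor IR types.
class FakeNode:
    def __init__(self, target, inputs):
        self.target = target  # the op being lowered
        self.inputs = inputs  # inputs[1] is the tensor mutated in place


def lower_binary_inplace(node):
    # Buggy version returned the target, i.e. the op node itself:
    #     return node.target
    # Fixed version: an in-place op's result is the argument it mutated.
    return node.inputs[1]
```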

@oulgen oulgen added the ciflow/trunk (Trigger trunk jobs on your pull request) and topic: not user facing (topic category) labels Nov 23, 2023
@oulgen oulgen changed the title from "Partially revert #112925 to unblock mkldnn" to "Fix ConvolutionBinaryInplace using target node" Nov 23, 2023
This IR node mutates in place; it needs to use the argument, not the
target.

Fixes #113440

[ghstack-poisoned]
oulgen added a commit that referenced this pull request Nov 23, 2023
This IR node mutates in place; it needs to use the argument, not the
target.

Fixes #113440

Pull Request resolved: #114436
ghstack-source-id: f631332
Collaborator

@leslie-fang-intel leslie-fang-intel left a comment


Thanks for the fix. The UT passes after this fix, but the model still fails. For example:

    python benchmarks/dynamo/torchbench.py --performance --float32 -dcpu -n50 --no-skip --dashboard --only resnet50 --inference --freezing --timeout 9000 --backend=inductor --output=test.csv

- Error message:

      return getattr(self, n.op)(n.target, args, kwargs)
        File "/home/leslie/inductor/pytorch/torch/_inductor/graph.py", line 589, in call_function
          return target(*args, **kwargs)
        File "/home/leslie/inductor/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 397, in fn
          assert isinstance(other, ir.TensorBox)
      torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: AssertionError:

- I think this is because we need a TensorBox instead of an ExternKernelAlloc. Instead of the current fix, how about changing `return packed` here to `return packed.inputs[0]`, since the return value will be wrapped into a TensorBox.
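
A toy model of the failure mode being described, assuming hypothetical stand-in classes for Inductor's TensorBox and the packed kernel node (not the real `torch._inductor.ir` API): the lowering machinery boxes whatever the fusion returns, so the fusion should hand back the raw buffer and let it be boxed exactly once.

```python
# Toy model only; TensorBoxSketch and PackedSketch are hypothetical
# stand-ins, not the real torch._inductor.ir classes.
class TensorBoxSketch:
    def __init__(self, inner):
        self.inner = inner


class PackedSketch:
    def __init__(self, inputs):
        self.inputs = inputs  # inputs[0]: the raw buffer being mutated


def caller_wraps(result):
    # The lowering machinery boxes whatever the fusion function returns.
    return TensorBoxSketch(result)


packed = PackedSketch(inputs=["buf0"])
out = caller_wraps(packed.inputs[0])  # box the raw buffer exactly once
assert isinstance(out, TensorBoxSketch)
```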

@leslie-fang-intel
Collaborator

leslie-fang-intel commented Nov 23, 2023

cc @chuanqi129: since more than one model failed in the fp32/bf16 testing, please help confirm that all the failures are fixed by this PR.

This IR node mutates in place; it needs to use the argument, not the
target.

Fixes #113440

[ghstack-poisoned]
oulgen added a commit that referenced this pull request Nov 23, 2023
This IR node mutates in place; it needs to use the argument, not the
target.

Fixes #113440

Pull Request resolved: #114436
ghstack-source-id: bc5a7ab
@oulgen
Contributor Author

oulgen commented Nov 23, 2023

@leslie-fang-intel I updated the PR; can you check whether this fixes the other issue? If not, please share the full assertion failure.

@leslie-fang-intel
Collaborator

> @leslie-fang-intel I updated the PR; can you check whether this fixes the other issue? If not, please share the full assertion failure.

cc @chuanqi129, please help to confirm

@leslie-fang-intel
Collaborator

leslie-fang-intel commented Nov 23, 2023

> @leslie-fang-intel I updated the PR; can you check whether this fixes the other issue? If not, please share the full assertion failure.

Thanks @oulgen, hmm... I think we should not use `inputs[1]`. Per my understanding, `inputs[1]` and `packed.inputs[0]` might be different types, since `inputs[1]` might be a TensorBox that gets unwrapped into `packed.inputs[0]` (a Buffer) here:

    None, layout, self.unwrap_storage(inputs), constant_args, kwargs or {}
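
A hedged illustration of the unwrapping being described, with toy classes in place of the real ones: the boxed inputs are stripped down to raw buffers when they are stored on the kernel node, which is why `inputs[1]` and `packed.inputs[0]` end up as different types.

```python
# Toy classes only; the real unwrapping lives in torch/_inductor/ir.py.
class BufferSketch:
    pass


class StorageBoxSketch:
    def __init__(self, data):
        self.data = data


class TensorBoxSketch:
    def __init__(self, data):
        self.data = data


def unwrap_storage(inputs):
    # Strip the TensorBox/StorageBox wrappers so the kernel node
    # holds raw buffers rather than boxed handles.
    return [box.data.data for box in inputs]


boxed = TensorBoxSketch(StorageBoxSketch(BufferSketch()))
(buf,) = unwrap_storage([boxed])
assert isinstance(buf, BufferSketch)  # the node stores the buffer, not the box
```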

@oulgen
Contributor Author

oulgen commented Nov 23, 2023

@leslie-fang-intel I don't quite follow what you mean. The op mutates `inputs[1]`, and previously, via MutationLayout, it was returning `inputs[1]`. So returning `inputs[1]` is the correct result.

@leslie-fang-intel
Collaborator

leslie-fang-intel commented Nov 24, 2023

> @leslie-fang-intel I don't quite follow what you mean. The op mutates `inputs[1]`, and previously, via MutationLayout, it was returning `inputs[1]`. So returning `inputs[1]` is the correct result.

I think `inputs[1]` equals `TensorBox(StorageBox(packed.inputs[0]))`, and we should return the buffer object (same as previously) instead of the TensorBox here. Otherwise the model-level testing (and also this test case) will still fail. So maybe return `packed.inputs[0]` here instead of `inputs[1]`.
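
Continuing the toy model from above (hypothetical classes, not the real API): returning the already-boxed `inputs[1]` would get boxed a second time, whereas returning the raw buffer yields the single level of boxing the callers expect.

```python
# Continuing the toy model: demonstrate the double-boxing hazard.
class TensorBoxSketch:
    def __init__(self, inner):
        self.inner = inner


buf = object()                    # stands in for packed.inputs[0], a Buffer
inputs_1 = TensorBoxSketch(buf)   # stands in for inputs[1], already boxed

wrong = TensorBoxSketch(inputs_1)  # returning inputs[1]: a box in a box
right = TensorBoxSketch(buf)       # returning packed.inputs[0]: one box

assert isinstance(wrong.inner, TensorBoxSketch)      # double-wrapped
assert not isinstance(right.inner, TensorBoxSketch)  # single-wrapped
```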

This IR node mutates in place; it needs to use the argument, not the
target.

Fixes #113440

[ghstack-poisoned]
oulgen added a commit that referenced this pull request Nov 24, 2023
This IR node mutates in place; it needs to use the argument, not the
target.

Fixes #113440

Pull Request resolved: #114436
ghstack-source-id: 53e247a
@oulgen
Contributor Author

oulgen commented Nov 24, 2023

@leslie-fang-intel Thanks! I missed the buffer wrapping when I moved the code around. Updated the code.

Copy link
Contributor

@jansel jansel left a comment


Please double-check that ConvolutionBinaryInplace sets get_mutation_names() properly.
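
A hedged sketch of what this check amounts to, with illustrative names only (not the actual ConvolutionBinaryInplace code): an in-place kernel node should report the name of the buffer it mutates, so the scheduler can order reads and writes around it.

```python
# Illustrative only; this is not the actual ConvolutionBinaryInplace code.
class InplaceKernelSketch:
    MUTATED_ARG_INDEX = 1  # must match the argument order after __init__!

    def __init__(self, inputs):
        self.inputs = inputs

    def get_mutation_names(self):
        # Report the buffer this kernel writes in place; pointing at the
        # wrong index silently breaks the scheduler's dependency tracking
        # (exactly what the follow-up discussion below uncovers).
        return [self.inputs[self.MUTATED_ARG_INDEX].get_name()]
```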

@leslie-fang-intel
Collaborator

@chuanqi129 is working on verifying this fix against the failures we saw in the fp32/bf16 static/dynamic shape tests. I expect he will post the results early next week.

@oulgen
Contributor Author

oulgen commented Nov 24, 2023

@pytorchbot merge

@oulgen
Contributor Author

oulgen commented Nov 24, 2023

@pytorchbot abort

@pytorch-bot

pytorch-bot bot commented Nov 24, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'abort' (choose from 'merge', 'revert', 'rebase', 'label', 'drci')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci} ...

Try @pytorchbot --help for more info.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@oulgen
Contributor Author

oulgen commented Nov 24, 2023

Actually, @jansel is correct: since `__init__` reorders the arguments, `get_mutation_names` is incorrect. I will put up a follow-up PR to fix this. In general, reordering arguments will lead to bugs like this.
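
A minimal sketch of the reordering hazard being described, with hypothetical names rather than the real Inductor code: if `__init__` swaps the argument order, any code that remembers the pre-swap index of the mutated tensor now points at the wrong buffer.

```python
# Hypothetical sketch of the reordering hazard; not the real Inductor code.
class ReorderingKernelSketch:
    def __init__(self, inputs):
        # Suppose __init__ swaps the two arguments for kernel convenience:
        # the tensor that was inputs[1] at the call site becomes inputs[0].
        self.inputs = [inputs[1], inputs[0]]

    def get_mutation_names(self):
        # Correct: after the swap, the mutated tensor lives at index 0.
        # The pre-fix bug was to keep using the pre-swap index 1 here.
        return [self.inputs[0]]
```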

@oulgen oulgen reopened this Nov 24, 2023
@oulgen oulgen closed this Nov 24, 2023
oulgen added a commit that referenced this pull request Nov 24, 2023
The init function reorders the arguments, so the mutation actually happens on
argument inputs[0].

Unfortunately, I am not sure if there's a good way to test this. Added
tests in #114436.

[ghstack-poisoned]
oulgen added a commit that referenced this pull request Nov 24, 2023
The init function reorders the arguments, so the mutation actually happens on
argument inputs[0].

Unfortunately, I am not sure if there's a good way to test this. Added
tests in #114436.

ghstack-source-id: 41d8a07
Pull Request resolved: #114501
@oulgen
Contributor Author

oulgen commented Nov 24, 2023

#114501 fixes the mutation tracking bug. My apologies. I tried to cancel the merge but couldn't figure out how to do it.

@aakhundov
Contributor

> I tried to cancel the merge but couldn't figure out how to do it.

@eellison once mentioned that closing and reopening the PR should stop the merge, though I never tried it myself.

pytorchmergebot pushed a commit that referenced this pull request Nov 24, 2023
The init function reorders the arguments, so the mutation actually happens on
argument inputs[0].

Unfortunately, I am not sure if there's a good way to test this. Added
tests in #114436.

Pull Request resolved: #114501
Approved by: https://github.com/leslie-fang-intel, https://github.com/aakhundov
@facebook-github-bot facebook-github-bot deleted the gh/oulgen/38/head branch November 27, 2023 15:29
