[inductor] Fix int64 from MutationOutput Buffer by yushangdi · Pull Request #162020 · pytorch/pytorch · GitHub
@yushangdi
Contributor

@yushangdi yushangdi commented Sep 2, 2025

Summary:
When we have a user-defined Triton kernel, its mutated outputs are marked as MutationOutput with a NoneLayout. This MutationOutput may later be used as input to another Inductor-generated Triton kernel.

When we determine whether to use int32 or int64 indexing for the Inductor-generated Triton kernel, we need to look at the number of elements of every buffer involved. If one of the buffers is a MutationOutput, we should still consider its number of elements instead of skipping it.
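
To make the int32 limit concrete, here is a minimal sketch of the arithmetic in plain Python. The helper names `fits_int32` and `wrap_int32` and the row count are illustrative assumptions, not Inductor code; the real check may differ in detail.

```python
INT32_MAX = 2**31 - 1


def fits_int32(numel: int) -> bool:
    """Conservative check: treat a buffer as int32-safe only if numel <= INT32_MAX."""
    return numel <= INT32_MAX


def wrap_int32(value: int) -> int:
    """Simulate two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


# Hypothetical shape: the row count is made up, 512 columns as in this thread.
rows, cols = 4_500_000, 512
numel = rows * cols
last_offset = numel - 1

print(fits_int32(numel))        # False: the buffer needs 64-bit indexing
print(wrap_int32(last_offset))  # a negative number: the offset int32 would produce
```

If a buffer this large is skipped when choosing the index type, offsets into it can wrap around like this and the kernel would read the wrong memory.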

To get a hint on the MutationOutput size, we look at the buffers corresponding to mutation_names in MutationOutput.
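
The idea can be sketched with a toy model. Everything below is hypothetical: `ToyBuffer`, `numel_hint`, and `toy_select_index_dtype` are stand-ins invented for illustration and do not mirror the actual Inductor classes or signatures. A `size` of `None` models a layout that carries no usable size.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

INT32_MAX = 2**31 - 1


@dataclass
class ToyBuffer:
    """Hypothetical stand-in for an IR buffer; not the real Inductor class."""
    name: str
    size: Optional[Tuple[int, ...]]  # None models a layout with no usable size
    mutation_names: List[str] = field(default_factory=list)


def numel_hint(buf: ToyBuffer, by_name: Dict[str, ToyBuffer],
               resolve_mutations: bool) -> int:
    """Best-effort element count for one buffer."""
    if buf.size is not None:
        return math.prod(buf.size)
    if not resolve_mutations:
        return 0  # models skipping the buffer entirely
    # Fall back to the sizes of the buffers this one mutates.
    return max(
        (numel_hint(by_name[n], by_name, resolve_mutations)
         for n in buf.mutation_names),
        default=0,
    )


def toy_select_index_dtype(buffers: List[ToyBuffer],
                           by_name: Dict[str, ToyBuffer],
                           resolve_mutations: bool = True) -> str:
    """Pick the index type from the largest buffer involved."""
    largest = max(
        (numel_hint(b, by_name, resolve_mutations) for b in buffers),
        default=0,
    )
    return "int32" if largest <= INT32_MAX else "int64"


arg0 = ToyBuffer("arg0_1", (4_500_000, 512))  # large tensor mutated by the user kernel
buf1 = ToyBuffer("buf1", None, mutation_names=["arg0_1"])  # the mutation output
buf2 = ToyBuffer("buf2", (1024, 512))  # small output of the generated kernel
by_name = {b.name: b for b in (arg0, buf1, buf2)}

print(toy_select_index_dtype([buf2, buf1], by_name, resolve_mutations=False))  # int32
print(toy_select_index_dtype([buf2, buf1], by_name, resolve_mutations=True))   # int64
```

In this toy setup, skipping `buf1` leaves only the small `buf2` to inform the decision, so int32 is chosen even though the kernel reads a buffer with more than 2**31 - 1 elements.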

Test Plan:

buck run mode/opt  fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_autotune_int64_user_defined_triton_kernel

Differential Revision: D81530083

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

@pytorch-bot

pytorch-bot bot commented Sep 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162020

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job, 2 Unrelated Failures

As of commit 7f3734a with merge base 1aa7476 (image):

CANCELLED JOB - The following job was cancelled. Please retry:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D81530083

@yushangdi yushangdi added the topic: not user facing topic category label Sep 2, 2025
@yushangdi yushangdi changed the title Fix int64 from MutationOutput Buffer [inductor] Fix int64 from MutationOutput Buffer Sep 2, 2025
@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 2, 2025
Contributor

@davidberard98 davidberard98 left a comment

Awesome, thanks! LGTM with some caveats:

  • CI failures look legit - could you try to shrink the size of the test so that it doesn't OOM? (e.g. use int8 tensors) Also, you may need to do some skips depending on whether triton is available, etc.
  • maybe give @eellison ~a day to review in case there's something I'm missing in the IR/SIMD changes?
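
As a rough guide to the test-size suggestion above, the sketch below computes the smallest allocation that still pushes a single tensor past the int32 element limit. The helper name `min_bytes_to_exceed_int32` is an assumption for illustration; the element sizes are the standard ones for these dtypes.

```python
INT32_MAX = 2**31 - 1

# Bytes per element for a few common dtypes.
BYTES_PER_ELEMENT = {"int8": 1, "bfloat16": 2, "float32": 4}


def min_bytes_to_exceed_int32(dtype: str) -> int:
    """Bytes needed for the smallest tensor with more than INT32_MAX elements."""
    numel = INT32_MAX + 1
    return numel * BYTES_PER_ELEMENT[dtype]


for dtype in BYTES_PER_ELEMENT:
    gib = min_bytes_to_exceed_int32(dtype) / 2**30
    print(f"{dtype}: {gib:.0f} GiB")
# int8: 2 GiB
# bfloat16: 4 GiB
# float32: 8 GiB
```

Using int8 keeps each oversized tensor at about 2 GiB, a quarter of what float32 would need, which is why it helps the test avoid running out of memory.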

if buf.has_tensor_output()
]

for buf in buffers:
Contributor

Any buffer that is a mutation output should also be an input. Do you know why the mutated input buffer was not accounted for among the buffers iterated over here?

Contributor Author

@eellison I'm not sure. We get just two buffers in select_index_dtype: buf2 is the output buffer, and buf1 is the input buffer of the Inductor-generated Triton kernel, which is the MutationOutput buffer.

(Pdb) self.scheduler_nodes()
(SchedulerNode(name='op2'),)
(Pdb) self.scheduler_nodes()[0].get_buffer_names()
OrderedSet(['buf2'])
(Pdb) self.scheduler_nodes()[0].used_buffer_names()
OrderedSet(['buf1', 'buf2'])
(Pdb) buffers
[ComputedBuffer(name='buf2', layout=FixedLayout('cuda:0', torch.bfloat16, size=[s97, 512], stride=[512, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.bfloat16, inner_fn=<function make_pointwise.<locals>.inner.<locals>.inner_fn at 0x7f545d964d60>, ranges=[s97, 512])), MutationOutput(name='buf1', layout=NoneLayout(device=device(type='cuda', index=0), size=[0], stride=[0]))]

Contributor Author

@yushangdi yushangdi Sep 3, 2025

The mutation output is an input to the UserDefinedTritonKernel, but here the Inductor-generated Triton kernel takes the output of the UserDefinedTritonKernel, so it has the mutation output as an input.

More specifically, buf2 is what the kernel writes and buf1 (the MutationOutput) is what it reads.

Contributor Author

An alternative fix is to change the line here:

MutationOutput(NoneLayout(device=self.device), buf, self)

to

self.mutation_outputs = [
    MutationOutput(buf.layout, buf, self)
    for buf in self.mutable_args
]

I don't know why it is set to NoneLayout initially. Maybe because the output's layout can change?

yushangdi added a commit to yushangdi/pytorch that referenced this pull request Sep 3, 2025
Reviewed By: davidberard98

Differential Revision: D81530083
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D81530083

yushangdi added a commit to yushangdi/pytorch that referenced this pull request Sep 3, 2025
Reviewed By: davidberard98

Differential Revision: D81530083
@yushangdi yushangdi added ciflow/mps Run MPS tests (subset of trunk) ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 labels Sep 3, 2025
@yushangdi yushangdi requested a review from eellison September 3, 2025 16:36
yushangdi added a commit to yushangdi/pytorch that referenced this pull request Sep 3, 2025
Reviewed By: davidberard98

Differential Revision: D81530083
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D81530083

Reviewed By: davidberard98, eellison

Differential Revision: D81530083
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D81530083

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: linux-binary-manywheel-rocm / manywheel-py3_9-rocm6_4-test

@jeanschmidt
Contributor

@pytorchbot merge -f "codev should not fail merges if landed internally. If we want to prevent errors to be introduced in main, we should prevent the landing, not the merging"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.


markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
Pull Request resolved: pytorch#162020
Approved by: https://github.com/davidberard98, https://github.com/eellison
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
Pull Request resolved: pytorch#162020
Approved by: https://github.com/davidberard98, https://github.com/eellison
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
Pull Request resolved: pytorch#162020
Approved by: https://github.com/davidberard98, https://github.com/eellison
Labels

ciflow/inductor ciflow/mps Run MPS tests (subset of trunk) ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/trunk Trigger trunk jobs on your pull request fb-exported Merged module: inductor topic: not user facing topic category

6 participants