[inductor] Fix int64 from MutationOutput Buffer #162020

yushangdi · 2025-09-02T22:43:57Z

Summary:
When we have a user defined triton kernel, it marks the mutated outputs as MutationOutput with a NoneLayout. This MutationOutput may later be used as input to another inductor-generated triton kernel.

When we determine whether to use int32 or int64 for the inductor-generated triton kernel, we need to look at the number of elements for all buffers involved. If one of the buffers is a MutationOutput, we should still consider its number of elements instead of skipping it.

To get a hint on the MutationOutput size, we look at the buffers corresponding to mutation_names in MutationOutput.
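As a rough illustration of the idea in this summary, here is a minimal toy model in plain Python. It is not the inductor implementation: `ToyBuffer`, `size_hint`, and `pick_index_dtype` are hypothetical names made up for this sketch, and the only thing borrowed from the PR is the idea that a buffer without a usable layout takes its size from the buffers named in `mutation_names`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

INT32_MAX = 2**31 - 1  # largest index a signed 32-bit integer can hold


@dataclass
class ToyBuffer:
    """Hypothetical stand-in for a buffer; not an inductor class."""
    name: str
    numel: Optional[int]  # None models a buffer whose layout carries no size
    mutation_names: List[str] = field(default_factory=list)


def size_hint(buf: ToyBuffer, by_name: Dict[str, ToyBuffer]) -> int:
    """Return the element count, following mutation_names when no size is stored."""
    if buf.numel is not None:
        return buf.numel
    hints = [size_hint(by_name[n], by_name) for n in buf.mutation_names if n in by_name]
    return max(hints, default=0)


def pick_index_dtype(buffers: List[ToyBuffer], by_name: Dict[str, ToyBuffer]) -> str:
    """Use int64 as soon as any involved buffer is too large for int32 indexing."""
    largest = max((size_hint(b, by_name) for b in buffers), default=0)
    return "int64" if largest > INT32_MAX else "int32"


mutated = ToyBuffer("arg0", numel=2**31 + 8)
mutation_out = ToyBuffer("buf1", numel=None, mutation_names=["arg0"])
pointwise_out = ToyBuffer("buf2", numel=1024)
by_name = {b.name: b for b in (mutated, mutation_out, pointwise_out)}

# Skipping buf1 would leave only buf2 and pick int32; including it picks int64.
print(pick_index_dtype([pointwise_out], by_name))                # int32
print(pick_index_dtype([pointwise_out, mutation_out], by_name))  # int64
```

The sizes are chosen only to make the two outcomes differ; in the real kernel the buffers may well have the same shape.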

Test Plan:

buck run mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_autotune_int64_user_defined_triton_kernel

Differential Revision: D81530083

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

pytorch-bot · 2025-09-02T22:44:00Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162020

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job, 2 Unrelated Failures

As of commit 7f3734a with merge base 1aa7476:

CANCELLED JOB - The following job was cancelled. Please retry:

linux-binary-manywheel-rocm / manywheel-py3_9-rocm6_4-test (gh)

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

linux-binary-manywheel-rocm / manywheel-py3_9-rocm6_4-build / build (gh) (trunk failure)
##[error]The operation was canceled.
rocm-mi300 / linux-noble-rocm-py3.12-mi300 / test (default, 4, 6, linux.rocm.gpu.gfx942.1) (gh) (trunk failure)
inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_contiguous_transform_with_epilogue

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-09-02T22:44:14Z

This pull request was exported from Phabricator. Differential Revision: D81530083

davidberard98

Awesome, thanks! LGTM with some caveats:

CI failures look legit - could you try to shrink the size of the test so that it doesn't OOM? (e.g. use int8 tensors) Also, you may need to do some skips depending on whether triton is available, etc.
maybe give @eellison ~a day to review in case there's something I'm missing in the IR/SIMD changes?
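To make the int8 suggestion concrete, the arithmetic below estimates how much memory one tensor needs once its element count passes the int32 limit. This is only a back-of-the-envelope sketch: `footprint_gib` is a hypothetical helper, and the element count is an assumption, since the thread does not state the actual test shapes.

```python
INT32_MAX = 2**31 - 1


def footprint_gib(numel: int, bytes_per_element: int) -> float:
    """Memory needed for one tensor of `numel` elements, in GiB."""
    return numel * bytes_per_element / 2**30


# Smallest element count that no longer fits in int32 indexing.
numel = INT32_MAX + 1

for dtype_name, itemsize in {"int8": 1, "bfloat16": 2, "float32": 4}.items():
    print(f"{dtype_name}: {footprint_gib(numel, itemsize):.1f} GiB")
# int8: 2.0 GiB
# bfloat16: 4.0 GiB
# float32: 8.0 GiB
```

A one-byte dtype keeps each such tensor at about 2 GiB, which is the smallest footprint that still exercises the int64 path.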

eellison · 2025-09-03T13:44:00Z

torch/_inductor/codegen/simd.py

            if buf.has_tensor_output()
        ]

+        for buf in buffers:


Any buffer that is a mutation output should also be an input. Do you know why the mutated input buffer was not being accounted for in the buffers that are iterated over here?

@eellison I'm not sure. We get just two buffers in select_index_dtype: buf2 is the output buffer, and buf1 is the input buffer for the inductor-generated triton kernel, which is also the MutationOutput buffer.

(Pdb) self.scheduler_nodes()
(SchedulerNode(name='op2'),)
(Pdb) self.scheduler_nodes()[0].get_buffer_names()
OrderedSet(['buf2'])
(Pdb) self.scheduler_nodes()[0].used_buffer_names()
OrderedSet(['buf1', 'buf2'])
(Pdb) buffers
[ComputedBuffer(name='buf2', layout=FixedLayout('cuda:0', torch.bfloat16, size=[s97, 512], stride=[512, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.bfloat16, inner_fn=<function make_pointwise.<locals>.inner.<locals>.inner_fn at 0x7f545d964d60>, ranges=[s97, 512])),
 MutationOutput(name='buf1', layout=NoneLayout(device=device(type='cuda', index=0), size=[0], stride=[0]))]

The mutation output is an input for the UserDefinedTritonKernel, but here the inductor-generated triton kernel takes the output of the UserDefinedTritonKernel and has the mutation output as an input.

More specifically, buf2 is the write and buf1 (the MutationOutput) is the read.

An alternative fix is to change the line here:

pytorch/torch/_inductor/ir.py

Line 6862 in 9491d28

MutationOutput(NoneLayout(device=self.device), buf, self)

to

self.mutation_outputs = [
    MutationOutput(buf.layout, buf, self) for buf in self.mutable_args
]

I don't know why it's set to NoneLayout initially, maybe because the output's layout can potentially change?
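The difference between the two options can be sketched with toy classes. Everything here (`ToyLayout`, `ToyArg`, `ToyMutationOutput`, and the two helper functions) is hypothetical and only mimics the shape of the discussion; it does not reproduce the real `MutationOutput` or `NoneLayout` behavior, and it does not answer why `NoneLayout` was chosen.

```python
from dataclasses import dataclass
from math import prod
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyLayout:
    """Hypothetical layout that only records a shape."""
    size: Tuple[int, ...]

    def numel(self) -> int:
        return prod(self.size)


@dataclass(frozen=True)
class ToyArg:
    """Hypothetical mutated kernel argument with a known layout."""
    name: str
    layout: ToyLayout


@dataclass(frozen=True)
class ToyMutationOutput:
    """Hypothetical mutation marker; layout=None models 'no size information'."""
    name: str
    mutated: ToyArg
    layout: Optional[ToyLayout] = None


def numel_via_lookup(out: ToyMutationOutput) -> int:
    # Option 1: keep the marker size-free and ask the mutated buffer for its size.
    return out.mutated.layout.numel()


def numel_via_own_layout(out: ToyMutationOutput) -> Optional[int]:
    # Option 2: rely on a layout copied onto the marker at construction time.
    return None if out.layout is None else out.layout.numel()


arg = ToyArg("arg0", ToyLayout((4, 512)))
size_free = ToyMutationOutput("buf1", arg)
with_layout = ToyMutationOutput("buf1", arg, layout=arg.layout)

print(numel_via_own_layout(size_free))    # None
print(numel_via_lookup(size_free))        # 2048
print(numel_via_own_layout(with_layout))  # 2048
```

In this toy setting both routes give the same count; the copied layout would only diverge if the mutated buffer's layout changed after the marker was created, which is the concern raised in the question above.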


facebook-github-bot · 2025-09-03T16:33:43Z

This pull request was exported from Phabricator. Differential Revision: D81530083


facebook-github-bot · 2025-09-03T17:10:42Z

This pull request was exported from Phabricator. Differential Revision: D81530083


facebook-github-bot · 2025-09-03T20:32:27Z

This pull request was exported from Phabricator. Differential Revision: D81530083

facebook-github-bot · 2025-09-04T07:04:14Z

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorchmergebot · 2025-09-04T07:06:53Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorchmergebot · 2025-09-04T07:07:10Z

Merge failed

Reason: 1 jobs have failed, first few of them are: linux-binary-manywheel-rocm / manywheel-py3_9-rocm6_4-test

Details for Dev Infra team

Raised by workflow job

jeanschmidt · 2025-09-04T09:46:11Z

@pytorchbot merge -f "codev should not fail merges if landed internally. If we want to prevent errors to be introduced in main, we should prevent the landing, not the merging"

pytorchmergebot · 2025-09-04T09:47:43Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

Summary:
When we have a user-defined triton kernel, it marks the mutated outputs as `MutationOutput` with a NoneLayout. This MutationOutput may later be used as input to another inductor-generated triton kernel.

When we determine whether to use int32 or int64 for the inductor-generated triton kernel, we need to look at the number of elements for all buffers involved. If one of the buffers is a MutationOutput, we should still consider its number of elements instead of skipping it.

To get a hint on the MutationOutput size, we look at the buffers corresponding to `mutation_names` in MutationOutput.

Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_autotune_int64_user_defined_triton_kernel
```

Differential Revision: D81530083

Pull Request resolved: pytorch#162020
Approved by: https://github.com/davidberard98, https://github.com/eellison

pytorch-bot bot added ciflow/inductor module: inductor labels Sep 2, 2025

facebook-github-bot added the fb-exported label Sep 2, 2025

yushangdi requested review from davidberard98 and eellison September 2, 2025 22:44

yushangdi added the topic: not user facing topic category label Sep 2, 2025

yushangdi changed the title ~~Fix int64 from MutationOutput Buffer~~ [inductor] Fix int64 from MutationOutput Buffer Sep 2, 2025

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 2, 2025

davidberard98 approved these changes Sep 3, 2025

View reviewed changes

eellison reviewed Sep 3, 2025

View reviewed changes

yushangdi force-pushed the export-D81530083 branch from 74ba46d to e4c70fe Compare September 3, 2025 16:28

yushangdi force-pushed the export-D81530083 branch from e4c70fe to 7529e27 Compare September 3, 2025 16:33

yushangdi added ciflow/mps Run MPS tests (subset of trunk) ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 labels Sep 3, 2025

yushangdi requested a review from eellison September 3, 2025 16:36

yushangdi force-pushed the export-D81530083 branch from 7529e27 to a735aea Compare September 3, 2025 17:10

eellison approved these changes Sep 3, 2025

View reviewed changes

yushangdi force-pushed the export-D81530083 branch from a735aea to 7f3734a Compare September 3, 2025 20:32

pytorchmergebot added the merging label Sep 4, 2025

pytorchmergebot removed the merging label Sep 4, 2025

pytorchmergebot added the merging label Sep 4, 2025

pytorchmergebot added the Merged label Sep 4, 2025

pytorchmergebot closed this in d67c29a Sep 4, 2025

pytorchmergebot removed the merging label Sep 4, 2025


6 participants
