[inductor] Fix int64 from MutationOutput Buffer #162020
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162020
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 Cancelled Job, 2 Unrelated Failures as of commit 7f3734a with merge base 1aa7476.
CANCELLED JOB - The following job was cancelled. Please retry.
BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D81530083
Awesome, thanks! LGTM with some caveats:
- CI failures look legit - could you try to shrink the size of the test so that it doesn't OOM? (e.g. use int8 tensors) Also, you may need to do some skips depending on whether triton is available, etc.
- maybe give @eellison ~a day to review in case there's something I'm missing in the IR/SIMD changes?
Inline comment on this diff context:
```
        if buf.has_tensor_output()
    ]

for buf in buffers:
```
Any buffer that is a mutation output should also be an input. Do you know why the mutated input buffer was not being accounted for in the buffers that are iterated over here?
@eellison I'm not sure. We just get two buffers in select_index_dtype: buf2 is the output buffer and buf1 is the input buffer for the inductor-generated triton kernel, and buf1 is the MutationOutput buffer.
```
(Pdb) self.scheduler_nodes()
(SchedulerNode(name='op2'),)
(Pdb) self.scheduler_nodes()[0].get_buffer_names()
OrderedSet(['buf2'])
(Pdb) self.scheduler_nodes()[0].used_buffer_names()
OrderedSet(['buf1', 'buf2'])
(Pdb) buffers
[ComputedBuffer(name='buf2', layout=FixedLayout('cuda:0', torch.bfloat16, size=[s97, 512], stride=[512, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.bfloat16, inner_fn=<function make_pointwise.<locals>.inner.<locals>.inner_fn at 0x7f545d964d60>, ranges=[s97, 512])), MutationOutput(name='buf1', layout=NoneLayout(device=device(type='cuda', index=0), size=[0], stride=[0]))]
```
The mutation output is an input for the UserDefinedTritonKernel, but here the inductor-generated triton kernel takes the output of the UserDefinedTritonKernel and has the mutation output as an input.
More specifically, buf2 is in the writes and buf1 (the MutationOutput) is in the reads.
An alternative fix is to change this line (line 6862 at 9491d28):
```
MutationOutput(NoneLayout(device=self.device), buf, self)
```
to
```
self.mutation_outputs = [
    MutationOutput(buf.layout, buf, self)
    for buf in self.mutable_args
]
```
I don't know why it's set to NoneLayout initially, maybe because the output's layout can potentially change?
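As an illustration of that alternative (a toy sketch only; `Layout`, `Buffer`, `MutationOutput`, and `make_mutation_outputs` are hypothetical stand-ins, and, per the caveat above, copying the layout may be unsafe if the mutated buffer's layout can still change):

```python
from dataclasses import dataclass


@dataclass
class Layout:
    size: tuple


@dataclass
class Buffer:
    name: str
    layout: Layout


@dataclass
class MutationOutput:
    layout: Layout
    mutates: Buffer


def make_mutation_outputs(mutable_args):
    # Alternative fix: copy the mutated buffer's layout instead of a size-0
    # NoneLayout, so downstream passes can read the real size directly.
    return [MutationOutput(buf.layout, buf) for buf in mutable_args]


src = Buffer("buf1", Layout(size=(2**32,)))
outs = make_mutation_outputs([src])
print(outs[0].layout.size)  # -> (4294967296,)
```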
Summary: When we have a user-defined triton kernel, it marks the mutated outputs as `MutationOutput` with an empty layout. This MutationOutput may later be used as input to another inductor-generated triton kernel. When we determine whether to use int32 or int64 for the inductor-generated triton kernel, we need to look at the number of elements of all buffers involved. If one of the buffers is a MutationOutput, we should still consider its number of elements instead of skipping it. To get a hint on the MutationOutput size, we look at the buffers corresponding to `mutation_names` in the MutationOutput.

Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_autotune_int64_user_defined_triton_kernel
```

Reviewed By: davidberard98

Differential Revision: D81530083
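The fix described in the summary can be sketched as follows. This is a toy model, not the real inductor pass: `Buffer`, `MutationOutput`, the `graph_buffers` dict, and `select_index_dtype` are hypothetical stand-ins; only the idea of resolving `mutation_names` to the underlying buffers for a size hint comes from the PR.

```python
from dataclasses import dataclass

INT32_MAX = 2**31 - 1


@dataclass
class Buffer:
    name: str
    numel: int


@dataclass
class MutationOutput:
    name: str
    mutation_names: list  # names of the buffers this output mutates


def select_index_dtype(buffers, graph_buffers):
    """Fixed behavior: a MutationOutput borrows the numel of the buffer it mutates."""
    largest = 0
    for buf in buffers:
        if isinstance(buf, MutationOutput):
            # Resolve mutation_names to real buffers instead of skipping.
            hint = max(graph_buffers[n].numel for n in buf.mutation_names)
        else:
            hint = buf.numel
        largest = max(largest, hint)
    return "tl.int64" if largest > INT32_MAX else "tl.int32"


graph = {"mutated_src": Buffer("mutated_src", numel=2**32)}
buf2 = Buffer("buf2", numel=512 * 1024)
buf1 = MutationOutput("buf1", mutation_names=["mutated_src"])
print(select_index_dtype([buf2, buf1], graph))  # -> tl.int64
```

With the mutated buffer's size accounted for, the kernel correctly falls back to int64 indexing whenever any involved buffer exceeds the int32 range.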
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first of them: linux-binary-manywheel-rocm / manywheel-py3_9-rocm6_4-test. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -f "codev should not fail merges if landed internally. If we want to prevent errors to be introduced in main, we should prevent the landing, not the merging"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary: When we have a user-defined triton kernel, it marks the mutated outputs as `MutationOutput` with a NoneLayout. This MutationOutput may later be used as input to another inductor-generated triton kernel. When we determine whether to use int32 or int64 for the inductor-generated triton kernel, we need to look at the number of elements of all buffers involved. If one of the buffers is a MutationOutput, we should still consider its number of elements instead of skipping it. To get a hint on the MutationOutput size, we look at the buffers corresponding to `mutation_names` in the MutationOutput.

Test Plan:
```
buck run mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_autotune_int64_user_defined_triton_kernel
```

Differential Revision: D81530083

Pull Request resolved: pytorch#162020
Approved by: https://github.com/davidberard98, https://github.com/eellison
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben