dont let partitioner think it can fuse pointwise ops into user triton kernels #136878
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136878
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit db9375a with merge base f0fa460.
FLAKY - The following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D63551393
torch/_functorch/partitioners.py
Outdated
Can you write a test?
Separately, what did the failure mode look like?
test added, also updated the description.
The failure mode was that the compiled forward from inductor contained 2 kernels (user_triton, dedicated_inductor_kernel_for_sigmoid), whereas the "better" outcome would have been to move the sigmoid() to the backward (so it could be fused into an existing inductor kernel in the backward).
Force-pushed from 0d75d3a to 8311f22
Force-pushed from 8311f22 to 63e502d
Force-pushed from 63e502d to 4651417
Force-pushed from 4651417 to 130bdfe
Force-pushed from 130bdfe to a3bf5a6
        return True
    if can_fuse_into_triton_kernel_wrapper_functional(a, b):
        return True
    if (
Actually, I don't think this is the right thing to do 🤔
I think the root of the problem here is how we treat operator.getitem (we've run into other issues with views in the past). Basically, we're currently treating operator.getitem as a "fusible" op, but it's actually a "free" op/view, and I think that's actually morally different.
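For context, a minimal standalone illustration (not from the PR) of where these operator.getitem nodes come from: any op that returns multiple tensors gets indexed via operator.getitem in the traced graph, and that node performs no compute of its own.
import operator

import torch
from torch.fx import symbolic_trace


def g(x):
    vm = torch.var_mean(x)  # an op returning a tuple of tensors
    return vm[0] + vm[1]    # each index becomes an operator.getitem node


gm = symbolic_trace(g)
# The getitem nodes are "free" indexing into var_mean's output; the node that
# matters for fusion decisions is var_mean itself.
print([n for n in gm.graph.nodes if n.target is operator.getitem])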
Totally agreed - I sent this to Richard but I was going to check with you - what do you think of something like this instead?
a = recursively_remove_getitems(a)
b = recursively_remove_getitems(b)
return op_types.is_fusible(a) and op_types.is_fusible(b)
since as you pointed out, we treat operator.getitem as "always fusible", which seems bad (aka any other ops that return tuples of tensors but are not themselves fusible might suffer in a similar way).
In terms of landing order, I was thinking of landing this change first since it's needed to unblock internal and is a bit less risky.
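For concreteness, a minimal sketch of what that recursively_remove_getitems helper could look like (the helper is hypothetical, floated in the comment above; it is not an existing API in partitioners.py): it looks through operator.getitem fx nodes to the node that actually produced the multi-output value, so is_fusible is evaluated on the real producer.
import operator


def recursively_remove_getitems(node):
    # operator.getitem nodes are "free" indexing into a multi-output value;
    # walk up to the underlying producer (node.args[0]) before asking whether
    # the op is fusible.
    while node.op == "call_function" and node.target is operator.getitem:
        node = node.args[0]
    return node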
I'm fine with landing this first. Out of curiosity, does #126446 also solve this? Perhaps while also including operator.getitem in the view list?
> Out of curiosity, does #126446 also solve this? Perhaps while also including operator.getitem in the view list?
Hmm it doesn't look like it (my local copy already has that change to always recompute views, and I also tried tweaking the list to include operator.getitem in the list of views). My local change:
diff --git a/torch/_functorch/partitioners.py b/torch/_functorch/partitioners.py
index 81e2f297f6f..8c02fb68211 100644
--- a/torch/_functorch/partitioners.py
+++ b/torch/_functorch/partitioners.py
@@ -1293,6 +1293,7 @@ def get_default_op_list() -> OpTypes:
         aten.as_strided,
         aten.permute,
         aten.select,
+        operator.getitem,
     ]
     view_ops = recomputable_view_ops
     default_recomputable_ops += [
I tried running the same test locally that I have in my PR, and I'm still seeing sigmoid() get saved as an activation (even though it could be fused into an existing inductor backward kernel).
(my local test):
import torch
import triton
import triton.language as tl


def test_triton_kernel_not_fusable_with_users():
    @triton.jit
    def _sin_kernel(
        in_ptr0,
        out_ptr,
        out2_ptr,
        n_elements,
        BLOCK_SIZE: "tl.constexpr",
    ):
        pid = tl.program_id(axis=0)
        block_start = pid * BLOCK_SIZE
        offsets = block_start + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(in_ptr0 + offsets, mask=mask)
        output = tl.sin(x)
        tl.store(out_ptr + offsets, output, mask=mask)
        tl.store(out2_ptr + offsets, output, mask=mask)

    from typing import List

    from torch._library import capture_triton, triton_op

    @triton_op("mylib::sin_kernel", mutates_args={})
    def sin_kernel(x: torch.Tensor) -> List[torch.Tensor]:
        n_elements = x.numel()
        out = torch.empty_like(x)
        out2 = torch.empty_like(x)
        capture_triton(_sin_kernel)[(n_elements,)](
            x, out, out2, n_elements, BLOCK_SIZE=4
        )
        return [out, out2]

    class MySin(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            out, saved = tuple(torch.ops.mylib.sin_kernel(x))
            ctx.save_for_backward(x, saved)
            return out

        @staticmethod
        def backward(ctx, grad):
            (x, saved) = ctx.saved_tensors
            return grad * saved.sigmoid() * x

    @torch.compile(backend="aot_eager")
    def f(x):
        return MySin.apply(x)

    x = torch.randn(4, 4, requires_grad=True, device='cuda')
    out = f(x)


test_triton_kernel_not_fusable_with_users()
Force-pushed from a3bf5a6 to f994cef
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased |
Force-pushed from f994cef to db9375a
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Previously, if we had a graph where a user triton kernel's output is passed through a pointwise sigmoid() that is later multiplied with a tangent in the backward, the partitioner would assume that the sigmoid() could be fused into either its user (the pointwise mul) or its producer (the user triton kernel). This could lead to a bad partitioning:
(1) If the partitioner thinks we can fuse the sigmoid with its producer triton kernel, we keep the sigmoid compute in the forward and have to generate two separate kernels in the forward (the user triton kernel and a dedicated sigmoid kernel).
(2) If the partitioner instead puts the sigmoid in the backward, it can be fused into an existing backward kernel (the mul with a tangent); see the sketch below.
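As a compact illustration of that pattern (a sketch only; the op name "mylib::sin_kernel" is borrowed from the test earlier in this thread and is not part of the PR itself):
import torch


class MyOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # "mylib::sin_kernel" stands in for any user-defined triton_op that
        # returns multiple tensors (see the test in the review thread above).
        out, saved = tuple(torch.ops.mylib.sin_kernel(x))
        ctx.save_for_backward(saved)
        return out

    @staticmethod
    def backward(ctx, grad):
        (saved,) = ctx.saved_tensors
        # sigmoid() is pointwise. The partitioner should recompute it here,
        # where it fuses into this mul with the tangent, rather than keep it
        # in the forward, where inductor would emit a dedicated sigmoid kernel
        # next to the user triton kernel it cannot fuse into.
        return grad * saved.sigmoid()
With this PR, the partitioner no longer treats the user triton kernel as a fusion target for the sigmoid, so recomputing the sigmoid in the backward is the plan it chooses.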
Reviewed By: embg
Differential Revision: D63551393
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang