[Inductor] Support user defined triton kernels in inductor #111434
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/111434
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 4752953 with merge base bf01a7b.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
test/dynamo/test_functions.py (Outdated)

    @common_utils.parametrize("backend", ["eager", "aot_eager", "inductor"])
    @patch.object(torch._inductor.config, "implicit_fallbacks", False)
    def test_triton_kernel_native(self, grad, backend):
        if backend == "inductor" and grad is True:
torch/_inductor/codecache.py (Outdated)

    device_interface.Worker.set_device(device.index)
    kernel = TritonCodeCache.load(kernel_name, source_code)
    kernel.precompile(warm_cache_only_with_cc=cc)
    if hasattr(kernel, "precompile"):
@jansel It looks like all the existing kernels are CachingAutotuner, but in my case I end up with a JITFunction from Triton. I assume that in order to support @triton.autotune I'm going to need to make this work with CachingAutotuner, but for the time being is this fine?
We should not use JITFunction from Triton; we want parallel ahead-of-time compiles.
You should be able to do something similar to:
pytorch/torch/_inductor/triton_heuristics.py, lines 1115 to 1125 in c84c86f:

    def template(num_stages, num_warps, meta, filename=None):
        """
        Compile a triton template
        """
        return cached_autotune(
            None,
            [triton.Config({}, num_stages=num_stages, num_warps=num_warps)],
            meta=meta,
            heuristic_type=HeuristicType.TEMPLATE,
            filename=filename,
        )
And put a @template above the generated Triton kernel. You will need to generate proper meta.
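For readers less familiar with the distinction being drawn here: `@triton.jit` returns a `JITFunction` that compiles lazily on first launch, while the kernels Inductor generates are wrapped in objects exposing `precompile()` so binaries can be built ahead of time, possibly in parallel worker processes. Below is a minimal sketch of that difference using a hypothetical wrapper class, not Inductor's actual `CachingAutotuner`; it only illustrates why the codecache change above checks `hasattr(kernel, "precompile")`.

```python
# Sketch only: EagerlyCompiledKernel is a hypothetical stand-in for a
# CachingAutotuner-style wrapper; the real one would drive Triton's
# ahead-of-time compile path per autotune config.
import triton
import triton.language as tl

@triton.jit
def scale_kernel(in_ptr0, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    tl.store(out_ptr + offsets, tl.load(in_ptr0 + offsets, mask=mask) * 2.0, mask=mask)

class EagerlyCompiledKernel:
    """Hypothetical wrapper around a JITFunction that supports eager compilation."""

    def __init__(self, jit_fn):
        self.jit_fn = jit_fn
        self.precompiled = False

    def precompile(self, warm_cache_only_with_cc=None):
        # The real wrapper compiles each config here, possibly in a worker
        # process keyed by compute capability; this sketch only records it.
        self.precompiled = True

    def run(self, *args, grid, **kwargs):
        if not self.precompiled:
            self.precompile()
        return self.jit_fn[grid](*args, **kwargs)

kernel = EagerlyCompiledKernel(scale_kernel)
assert hasattr(kernel, "precompile")  # a bare JITFunction has no such hook
```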
    patterns.apply(gm.graph)
    if is_inference:
        inference_patterns.apply(gm.graph)
    triton_patterns.apply(gm.graph)
Clones are removed as part of remove_noop_ops. Looking at the output code, I think the clones are still there, so I want to check whether to run remove_noop_ops again or move this check above.
I think it's better to do this as part of reinplace_scatters.
- The clone removal pass doesn't handle in-place mutations.
- reinplace_scatters is exactly for this purpose - it converts scatter (i.e. clone + scatter_) into just scatter_. It also converts scatter + copy_ into just scatter_ as well.
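For anyone skimming the thread, here is a tiny eager-mode illustration of the reinplacement idea (this is not the FX-level pass itself, just the pattern it rewrites): functionalization produces a clone followed by a mutation of the clone, and the pass may drop the clone when the original tensor is not read afterwards.

```python
# Illustration only: the real reinplace_scatters pass rewrites FX graph
# nodes; this shows the eager equivalent of the clone + scatter_ pattern.
import torch

def functionalized(x, index, src):
    out = x.clone()              # clone inserted so x itself is not mutated
    out.scatter_(0, index, src)  # mutation applied to the clone
    return out

def reinplaced(x, index, src):
    # Legal rewrite only when x is not read again after this point.
    x.scatter_(0, index, src)
    return x

idx = torch.tensor([0, 2])
src = torch.tensor([1.0, 2.0])
assert torch.equal(functionalized(torch.zeros(4), idx, src),
                   reinplaced(torch.zeros(4), idx, src))
```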
Putting up mostly as an RFC for now; in the next PR I will implement multiple kernels in the same function as well as kernels calling each other. Example output: P857575342
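Since the paste referenced above is not reproduced here, a minimal sketch of the user-facing pattern this PR targets (kernel and function names are made up for illustration; actually executing it needs a CUDA device with Triton installed):

```python
# Illustrative user code: a hand-written Triton kernel launched from inside
# a function that is then compiled with the inductor backend.
import torch
import triton
import triton.language as tl

@triton.jit
def add_one_kernel(in_ptr0, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(in_ptr0 + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + 1.0, mask=mask)

def call_kernel(x):
    output = torch.empty_like(x)
    n_elements = x.numel()
    grid = (triton.cdiv(n_elements, 16),)
    add_one_kernel[grid](x, output, n_elements, BLOCK_SIZE=16)
    return output

compiled = torch.compile(call_kernel, backend="inductor")
# out = compiled(torch.randn(5, device="cuda"))  # needs a GPU to run
```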
    # It might be better to move this decomposition into lowering after some
    # sort of clone removal pass at the IR level is implemented. For the time
    # being, decomposition is done at this level in order to take advantage of
    # existing clone removal.
oh I see - is inductor's clone removal pass done during lowering time, so if we desugar the functional triton_kernel op into mutable op + clones, it will be "too late"?
Chatting with @Chillee, we decided to move the decomposition to the reinplace_scatters pass. I'm moving that out of this PR and will do it as a follow-up. This PR should be ready for review now.
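For context on what is being deferred, a rough sketch of the functional-to-mutable decomposition under discussion; `run_mutating_kernel` is a hypothetical stand-in for the mutation variant of the wrapper HOP, and the real implementation lives alongside `torch._higher_order_ops.triton_kernel_wrap`:

```python
# Sketch only: functional semantics are recovered by cloning every tensor
# argument before handing it to the mutating kernel, which is exactly why a
# later clone-removal / reinplacing pass matters for performance.
import torch

def run_mutating_kernel(kernel, grid, kwargs):
    # Hypothetical stand-in: launch the user kernel, mutating its tensor args.
    kernel[grid](**kwargs)

def functional_wrapper_decomp(kernel, grid, kwargs):
    cloned = {
        name: (arg.clone() if isinstance(arg, torch.Tensor) else arg)
        for name, arg in kwargs.items()
    }
    run_mutating_kernel(kernel, grid, cloned)
    # Callers read results out of the returned dict (compare the getitem
    # calls in the FX graph shown further down), leaving inputs untouched.
    return cloned
```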
    def codegen(self, wrapper):
        from torch._higher_order_ops.triton_kernel_wrap import kernel_side_table

        kernel = kernel_side_table.get_kernel(self.kernel_idx)
hmm... maybe a little late, but @Chillee I don't think that inductor's fx graph caching TORCHINDUCTOR_FX_GRAPH_CACHE=1 will play well with this kernel_side_table.
What do you think - should we just make sure not to add to the cache if our graph uses any user triton kernels, to unblock, or fully revisit the side table?
It seems to me like the FX graph that we get (both from dynamo and AOTAutograd) will burn in a random index that's supposed to map to a user triton kernel that dynamo saw:
    def forward(self, x_1, output_1):
        triton_kernel_wrapper_functional_proxy = torch._higher_order_ops.triton_kernel_wrap.triton_kernel_wrapper_functional(kernel_idx = 0, grid = (5,), kwargs = {'in_ptr0': x_1, 'out_ptr': output_1, 'n_elements': 5, 'BLOCK_SIZE': 16}); x_1 = #
        getitem = triton_kernel_wrapper_functional_proxy['in_ptr0']
        getitem_1 = triton_kernel_wrapper_functional_proxy['out_ptr']
        getitem_2 = triton_kernel_wrapper_functional_proxy['n_elements']
        getitem_3 = triton_kernel_wrapper_functional_proxy['BLOCK_SIZE']; triton_kernel_wrapper_functional_proxy = None
        return getitem_1
But if we were to map this graph to a cached, compiled inductor graph, we have no guarantee that that index will map to the same triton kernel.
Maybe we can have the cache key also include the kernel_side_table's mapping, from index to triton kernel (or a hash of its source code)?
Yeah this is why I was suggesting that it might be better to just include the Triton source code in the operator definition.
We would not only need to store the source code of one function but also every single dependency. Another downside of that is that the FX graph would be impossible to read.
If this is the preferred solution, I'm happy to change it but IMO caching kernel_side_table's mapping seems like a cleaner solution.
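To make that concrete, a toy sketch of what folding the side table's mapping into the cache key could look like; all names here are hypothetical, and plain Python callables stand in for real JITFunctions so that `inspect.getsource` works without Triton:

```python
# Sketch only: the real table lives in torch/_higher_order_ops/triton_kernel_wrap.py.
import hashlib
import inspect

class ToyKernelSideTable:
    def __init__(self):
        self.kernels = []

    def add_kernel(self, kernel):
        self.kernels.append(kernel)
        return len(self.kernels) - 1  # this index is what gets burned into the FX graph

    def get_kernel(self, idx):
        return self.kernels[idx]

def side_table_cache_key(table, kernel_indices):
    # Hash the source of every user kernel the graph references so a cached
    # compiled graph is only reused when idx -> kernel source is unchanged.
    parts = []
    for idx in sorted(kernel_indices):
        src = inspect.getsource(table.get_kernel(idx))
        parts.append(f"{idx}:{hashlib.sha256(src.encode()).hexdigest()}")
    return "|".join(parts)
```

A digest like this could be appended to whatever key the FX graph cache already computes, so a stale burned-in index can never hit an entry built from different kernel source.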
@pytorchbot merge

Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
… reinplace_scatters (#111627) Pull Request resolved: #111627 Approved by: https://github.com/jansel ghstack dependencies: #111434
…11434) Pull Request resolved: pytorch#111434 Approved by: https://github.com/jansel
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler