[ROCm][Inductor][CK] Add ck-tile based universal gemm kernels to torch.mm autotune choices #152341

tenpercent · 2025-04-28T17:00:26Z

This PR adds code generation for CK-tile based universal gemm kernels to the CK backend for Inductor, and adds these kernels to autotune choices.

Unlike legacy-CK based kernels (which are generated by parsing the CK instances from CK library), we generate the set of instances by manually specifying the tuning parameters.

This PR introduces a new template for code generation, and compilation/autotuning is handled by the existing infrastructure.

Points of discussion:

For simplicity and reduced coupling with CK, the instance filter checks only data type and layout, and doesn't check the alignment requirement - meaning that more instances will be compiled than necessary - while keeping the code generation independent from internal CK logic which checks the alignment validity at runtime
CK-tile instances are enabled whenever legacy-CK instances are enabled. A config knob could be introduced to differentiate between the instance types if that's needed
Whether gemm problem size K is ever dynamic, since whenever it's not a compile-time constant, we need to perform a runtime dispatch between several kernels

** Testing **

Use the existing tests in test/inductor/test_ck_backend.py

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

cc @zjing14 @coconutruben @ThomasNing @amd-khushbu

pytorch-bot · 2025-04-28T17:00:30Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152341

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit d36ad37 with merge base 72a3c8d ():

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

rocm / linux-jammy-rocm-py3.10 / test (default, 3, 6, linux.rocm.gpu.2) (gh) (trunk failure)
inductor/test_quantization.py::TestQuantization::test_activation_quantization_aten_without_scaling

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copilot

Pull Request Overview

This PR introduces initial support for CK tile universal GEMM by adding new template functions and integrating them into the GEMM selection flow.

Added new utility functions (use_ck_tile_template and use_ck_tile_gemm_template) in torch/_inductor/utils.py.
Updated torch/_inductor/kernel/mm.py to include CK tile choices in the GEMM tuning flow.
Introduced the CKTileTemplate class in torch/_inductor/codegen/rocm/ck_tile_template.py to generate CK tile–specific code.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File	Description
torch/_inductor/utils.py	Added utility functions to enable CK tile templates.
torch/_inductor/kernel/mm.py	Updated GEMM choice logic to consider CK tile templates.
torch/_inductor/codegen/rocm/ck_tile_template.py	New file implementing the CKTileTemplate for ROCm.

tenpercent · 2025-04-30T20:45:51Z

@pytorchbot label 'topic: not user facing'

Copilot

Pull Request Overview

This PR adds support for CK-tile based universal GEMM kernels to the ROCm CK backend for Inductor by extending the autotuning choices and introducing a new code generation template.

Adds a new CKTileGemmTemplate choice in the mm kernel dispatch.
Updates CK universal GEMM template to handle dynamic input nodes in kBatch computation.
Introduces a new file (ck_tile_template.py) for CK tile-based kernel code generation.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
torch/_inductor/kernel/mm.py	Adds import and invocation of CKTileGemmTemplate to extend autotune choices.
torch/_inductor/codegen/rocm/ck_universal_gemm_template.py	Incorporates a dynamic input check in _get_kBatch to enforce a kBatch value of [1] for dynamic inputs.
torch/_inductor/codegen/rocm/ck_tile_template.py	Provides a new template for generating CK-tile based universal GEMM kernel code.

Comments suppressed due to low confidence (1)

torch/_inductor/codegen/rocm/ck_universal_gemm_template.py:890

The early return for dynamic input nodes in _get_kBatch ensures a kBatch of [1]. Confirm that this is the intended behavior for all dynamic cases.

if is_dynamic(*self.input_nodes):

Copilot · 2025-05-05T16:41:25Z

torch/_inductor/kernel/mm.py

+        CKTileGemmTemplate.add_choices(choices, layout, [mat1, mat2])



[nitpick] Consider adding a brief comment explaining the usage and behavior of CKTileGemmTemplate.add_choices to help maintainers understand its relationship to the legacy CKGemmTemplate.

Suggested change

CKTileGemmTemplate.add_choices(choices, layout, [mat1, mat2])

# Add GEMM configurations using CKTileGemmTemplate. This template builds on the

# legacy CKGemmTemplate by introducing tiled GEMM operations for better performance

# in certain scenarios. It complements the configurations added by CKGemmTemplate.

CKTileGemmTemplate.add_choices(choices, layout, [mat1, mat2])

Copilot · 2025-05-05T16:41:25Z

torch/_inductor/codegen/rocm/ck_tile_template.py

@@ -0,0 +1,60 @@
+import torch


[nitpick] Consider adding a file-level comment describing the purpose of CKTileTemplate to improve maintainability and assist new contributors.

coconutruben · 2025-05-05T18:37:38Z

For simplicity and reduced coupling with CK, the instance filter checks only data type and layout, and doesn't check the alignment requirement - meaning that more instances will be compiled than necessary - while keeping the code generation independent from internal CK logic which checks the alignment validity at runtime

hmm, I'd actually rather filter this out in torch rather than having to fail at runtime, but maybe the "check" can live in the rocm codebase?

CK-tile instances are enabled whenever legacy-CK instances are enabled. A config knob could be introduced to differentiate between the instance types if that's needed

can we make this another backend alltogether? I'd like to do something like = "TRITON,CK,CKTILE" if possible over time we can see whether that makes sense to keep but for now the extra knob will be helpful

Copilot

Pull Request Overview

This PR integrates CK-tile based universal gemm kernels into Inductor’s CK backend by adding a new code generation template and updating tuning configurations. The changes include:

Adding a new CKTileGemmTemplate (in mm.py and a new file ck_tile_template.py) to expand autotune choices.
Updating configuration keys (e.g. replacing n_max_profiling_configs with ck_max_profiling_configs) and compile command flags.
Adjusting test settings (e.g. compile_threads and profiling configs) to validate the new kernel implementations.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
torch/_inductor/kernel/mm.py	Added CKTileGemmTemplate to autotune choices alongside legacy CK kernel selection.
torch/_inductor/codegen/rocm/compile_command.py	Added the "-fvisibility=hidden" flag to the ROCm compiler options.
torch/_inductor/codegen/rocm/ck_universal_gemm_template.py	Updated dynamic problem handling and replaced profiling configuration keys with ck_max_profiling_configs.
torch/_inductor/codegen/rocm/ck_tile_template.py	New template for CK-tile based gemm kernels, providing necessary header and global setup.
torch/_inductor/codegen/rocm/ck_conv_template.py	Updated profiling configuration keys for CK conv instances.
test/inductor/test_ck_backend.py	Modified test parameters to leverage new configuration keys and increased compile threads.

Comments suppressed due to low confidence (4)

torch/_inductor/kernel/mm.py:654

[nitpick] Confirm that the ordering between the legacy CKGemmTemplate.add_ck_gemm_choices and the new CKTileGemmTemplate.add_choices is intentional and does not inadvertently affect the autotuning selection.

CKTileGemmTemplate.add_choices(choices, layout, [mat1, mat2])

torch/_inductor/codegen/rocm/ck_universal_gemm_template.py:890

[nitpick] The early return of [1] for dynamic input cases is a clear design decision; please ensure that this behavior aligns with overall runtime dispatch expectations.

if is_dynamic(*self.input_nodes):

torch/_inductor/codegen/rocm/ck_universal_gemm_template.py:950

[nitpick] Ensure that replacing 'n_max_profiling_configs' with 'ck_max_profiling_configs' is consistently applied throughout the codebase and that related documentation is updated accordingly.

min(len(filtered_instances), config.rocm.ck_max_profiling_configs),

test/inductor/test_ck_backend.py:102

[nitpick] The increased compile_threads value and updated profiling configuration keys in tests should be verified against typical test hardware to avoid potential resource over-allocation.

"compile_threads": 16,

Copilot · 2025-05-07T02:06:37Z

torch/_inductor/codegen/rocm/compile_command.py

[nitpick] Please document the rationale and potential implications of adding the "-fvisibility=hidden" flag in this context to aid future maintainability.

Suggested change

"-fPIC",

"-fPIC",

# The -fvisibility=hidden flag is used to hide symbols by default in shared libraries.

# This reduces symbol export overhead, improves load times, and minimizes potential

# symbol conflicts. Symbols that need to be exported must be explicitly marked.

tenpercent · 2025-05-07T16:05:28Z

@pytorchbot rebase -s

pytorchmergebot · 2025-05-07T16:07:01Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2025-05-07T16:07:04Z

Successfully rebased ck-tile-universal-gemm onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout ck-tile-universal-gemm && git pull --rebase)

ThomasNing · 2025-05-08T06:46:24Z

torch/_inductor/codegen/rocm/ck_tile_universal_gemm_template.py

+                using Col = ck_tile::tensor_layout::gemm::ColumnMajor;
+
+                template <ck_tile::index_t PrefetchStages, typename Dispatcher>
+                void dispatch_memory_pipeline_hot_loop(const ck_tile::TailNumber tail_num, Dispatcher dispatch)


We have a new way of the TailNumber configuration in the CK Tile Engine. You could use that one to decrease the complexity of the code here.

Thanks! I'd rather skip it for now, as relying on c++ metaprogramming doesn't quite make it simpler

ThomasNing · 2025-05-08T06:55:48Z

torch/_inductor/codegen/rocm/ck_tile_universal_gemm_template.py

+            ("Row", "Col", "Row"),
+        ]
+        for (datatype_a, datatype_b, datatype_c) in [("FP16",) * 3, ("BF16",) * 3]
+        for (tile_m, tile_n, tile_k) in [(256, 256, 32), (256, 256, 64)]


We are preparing a search space for the ck tile gemm. Do we need to make it dynamic?

ThomasNing · 2025-05-08T07:05:02Z

torch/_inductor/codegen/rocm/ck_tile_universal_gemm_template.py

+
+        return True
+
+    def filter_op(self, op: "CKTileGemmOperation"):


I think for the filter function. We also need to have the following filter feature:

Filter the total size of the A and B Block Tile should be less than the 64KB.

Filter that if we split the warps in the certain dimension, we need to make sure that the Block Tile in one dimension is the multiple warp dim (e.g. 2 warps on M/N) * warp size in that dimension.

done, thanks!

ThomasNing · 2025-05-08T07:06:30Z

@tenpercent Thank you for the work Max! Just a few comments on the difference between the current in-progress ck tile codgen and profiler.

tenpercent · 2025-05-12T15:56:37Z

@pytorchbot rebase -s

pytorchmergebot · 2025-05-12T15:58:08Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

tenpercent · 2025-05-21T20:31:29Z

@pytorchbot rebase -s

pytorchmergebot · 2025-05-21T20:33:13Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

max-autotune torch.mm

to make more similar to cutlass and distinguish between ck and ck-tile

pytorchmergebot · 2025-05-21T20:33:16Z

Successfully rebased ck-tile-universal-gemm onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout ck-tile-universal-gemm && git pull --rebase)

tenpercent · 2025-05-21T20:34:11Z

@pytorchbot merge

pytorchmergebot · 2025-05-21T20:36:13Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

pytorch-bot bot added ciflow/inductor module: inductor labels Apr 28, 2025

facebook-github-bot added the module: rocm AMD GPU support for Pytorch label Apr 28, 2025

tenpercent requested a review from Copilot April 28, 2025 17:00

Copilot AI reviewed Apr 28, 2025

View reviewed changes

pytorchbot added the open source label Apr 28, 2025

tenpercent force-pushed the ck-tile-universal-gemm branch 2 times, most recently from d52baca to 54ee7d8 Compare April 30, 2025 20:44

pytorch-bot bot added the topic: not user facing topic category label Apr 30, 2025

tenpercent changed the title ~~[Draft] Ck tile universal gemm~~ [ROCm][Inductor][CK] Add ck-tile based universal gemm kernels to torch.mm autotune choices May 1, 2025

pytorch-bot bot added the ciflow/rocm Trigger "default" config CI on ROCm label May 1, 2025

tenpercent marked this pull request as ready for review May 5, 2025 16:40

tenpercent requested a review from Copilot May 5, 2025 16:40

Copilot AI reviewed May 5, 2025

View reviewed changes

zou3519 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 6, 2025

tenpercent requested a review from Copilot May 7, 2025 02:05

Copilot AI reviewed May 7, 2025

View reviewed changes

pytorchmergebot force-pushed the ck-tile-universal-gemm branch from 1e126fa to 859c6fe Compare May 7, 2025 16:07

ThomasNing reviewed May 8, 2025

View reviewed changes

tenpercent added 19 commits May 21, 2025 20:33

render ck-tile based universal gemms and add them as choices to

4f3cb2d

max-autotune torch.mm

try to fix dynamic shape case

0920c59

pipeline codegen fixes

71368f4

lint import order

81d5488

add compute-v4 pipeline

d70e3d2

add block tile size filter for instances

7a61052

partition instances by the pipeline

808f5cb

refactor filter_op

9cb13c0

add alignment check

917d398

add docstring

69f0a9a

rename the config field for the max number of instances

7da2ed0

to make more similar to cutlass and distinguish between ck and ck-tile

config change

26cf6a3

remove redundant type aliases

9d148ab

lint

c310afb

lint

1e14778

add epilogue check to filter

db98916

add comment

2134680

add more static checks

b355e22

empty commit to retrigger ci

d36ad37

pytorchmergebot force-pushed the ck-tile-universal-gemm branch from ca51f78 to d36ad37 Compare May 21, 2025 20:33

pytorchmergebot added the merging label May 21, 2025

pytorchmergebot closed this in 7ef2c62 May 21, 2025

pytorchmergebot added Merged and removed merging labels May 21, 2025

		CKTileGemmTemplate.add_choices(choices, layout, [mat1, mat2])

-        CKTileGemmTemplate.add_choices(choices, layout, [mat1, mat2])
+        # Add GEMM configurations using CKTileGemmTemplate. This template builds on the
+        # legacy CKGemmTemplate by introducing tiled GEMM operations for better performance
+        # in certain scenarios. It complements the configurations added by CKGemmTemplate.
+        CKTileGemmTemplate.add_choices(choices, layout, [mat1, mat2])

-        "-fPIC",
+        "-fPIC",
+        # The -fvisibility=hidden flag is used to hide symbols by default in shared libraries.
+        # This reduces symbol export overhead, improves load times, and minimizes potential
+        # symbol conflicts. Symbols that need to be exported must be explicitly marked.

[ROCm][Inductor][CK] Add ck-tile based universal gemm kernels to torch.mm autotune choices #152341

[ROCm][Inductor][CK] Add ck-tile based universal gemm kernels to torch.mm autotune choices #152341

Uh oh!

Conversation

tenpercent commented Apr 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot bot commented Apr 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152341

✅ You can merge normally! (1 Unrelated Failure)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

tenpercent commented Apr 30, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI May 5, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 5, 2025

Choose a reason for hiding this comment

Uh oh!

coconutruben commented May 5, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI May 7, 2025

Choose a reason for hiding this comment

Uh oh!

tenpercent commented May 7, 2025

Uh oh!

pytorchmergebot commented May 7, 2025

Uh oh!

pytorchmergebot commented May 7, 2025

Uh oh!

ThomasNing May 8, 2025

Choose a reason for hiding this comment

Uh oh!

tenpercent May 15, 2025

Choose a reason for hiding this comment

Uh oh!

ThomasNing May 8, 2025

Choose a reason for hiding this comment

Uh oh!

ThomasNing May 8, 2025

Choose a reason for hiding this comment

Uh oh!

tenpercent May 15, 2025

Choose a reason for hiding this comment

Uh oh!

ThomasNing commented May 8, 2025

Uh oh!

tenpercent commented May 12, 2025

Uh oh!

pytorchmergebot commented May 12, 2025

Uh oh!

tenpercent commented May 21, 2025

Uh oh!

pytorchmergebot commented May 21, 2025

Uh oh!

pytorchmergebot commented May 21, 2025

Uh oh!

tenpercent commented May 21, 2025

Uh oh!

pytorchmergebot commented May 21, 2025

Merge started

Uh oh!

Reviewers

tenpercent commented Apr 28, 2025 •

edited

Loading

pytorch-bot bot commented Apr 28, 2025 •

edited

Loading