Description
🚀 The feature, motivation and pitch
During the lowering of Matmul/Conv into Inductor IR, a group of candidates is formed, including external library calls as well as Triton template kernels. These candidates are benchmarked and tuned in the later fusion scheduling stage.
If we want to add a potentially faster third-party or custom kernel to these candidates, the only option today is to modify the PyTorch source code. This feature proposes exposing an out-of-tree registration API for adding new candidates.
A user will be able to do:
from torch._inductor.external_kernels import register_external_matmul
# user specified callable
def custom_matmul_kernel(a, b, c):
    ...
# out-of-tree registration
register_external_matmul(custom_matmul_kernel)
After the registration, when the user calls the Inductor backend, this custom kernel will be one of the tuning candidates.
Several kernels support tuning, including Convolution, variants of Matmul, and Flex attention. I'd like to hear whether we can consider adding a registration API for all of these kernels.
A simple draft PR for this proposal is #130774.
Jason @jansel, I don't know who to CC, so please help me assign CCs.
Alternatives
No response
Additional context
No response
cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @aakhundov @ezyang @anijain2305 @peterbell10 @ColinPeppler @desertfire