[Inductor CUTLASS backend] Step 3: autotune_process, and CUDABenchmarkRequest by ipiszy · Pull Request #107901 · pytorch/pytorch · GitHub

Conversation

CUDABenchmarkRequest.

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Aug 24, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/107901

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 73c7297 with merge base f9a250c (image):

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ipiszy ipiszy changed the title [Inductor CUTLASS backend] Step 3: autotune_process, and [Inductor CUTLASS backend] Step 3: autotune_process, and CUDABenchmarkRequest Aug 24, 2023
ipiszy added 10 commits August 25, 2023 11:00
…UDABenchmarkRequest"

CUDABenchmarkRequest.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ngimel yf225 chenyang78 kadeng muchulee8 aakhundov

[ghstack-poisoned]
@ipiszy ipiszy requested a review from jansel August 27, 2023 01:38
@ipiszy ipiszy marked this pull request as ready for review August 27, 2023 01:38
…UDABenchmarkRequest"

This is step 3 of adding CUTLASS as an alternative Inductor backend.
Full tests can be found in the last PR in the stack.

Feature request: #106991.


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ngimel yf225 chenyang78 kadeng muchulee8 aakhundov

[ghstack-poisoned]

@ipiszy ipiszy left a comment


Thanks @jansel @aakhundov @kadeng , ptal~

Comment on lines 339 to 342
assert (
self.workspace_size == 0
), "Autotune cache needs to be updated to support non-zero workspace_size!"
self.workspace = torch.empty(

Fixed.

VarRanges = Dict[sympy.Expr, sympy.Expr]


def do_bench_using_profiling(fn: Callable[[], Any], warmup=25, rep=100) -> float:

do_bench also measures CPU-side overhead.

For example, for the event sequence:
CUDA event begin -> CUDA launch kernel -> CUDA kernel execution -> CUDA event end
it measures the time between CUDA event begin and CUDA event end, which includes the "CUDA launch kernel" part. That part can take non-trivial CPU time, which is especially bad for CUTLASS kernels because they rely on ctypes to invoke C++ functions from Python.

do_bench_using_profiling, on the other hand, relies on the profiler to collect kernel device time only.
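A CPU-only sketch of the difference (made-up sleep durations stand in for launch overhead and kernel time; nothing here is the actual do_bench implementation):

```python
import time

LAUNCH_OVERHEAD_S = 0.0005  # pretend CPU-side launch cost (e.g. ctypes dispatch)
KERNEL_TIME_S = 0.002       # pretend device-side kernel time

def launch_kernel(device_times):
    time.sleep(LAUNCH_OVERHEAD_S)   # CPU-side "launch": falls inside the event span
    t0 = time.perf_counter()
    time.sleep(KERNEL_TIME_S)       # the "kernel" itself
    device_times.append(time.perf_counter() - t0)

def bench_event_style(n=20):
    """Like CUDA-event timing: measures the whole span, launch overhead included."""
    device_times = []
    start = time.perf_counter()
    for _ in range(n):
        launch_kernel(device_times)
    return (time.perf_counter() - start) / n

def bench_profiler_style(n=20):
    """Like profiler-based timing: sums only the per-kernel device time."""
    device_times = []
    for _ in range(n):
        launch_kernel(device_times)
    return sum(device_times) / n
```

The event-style number is always larger because the launch overhead sits inside the timed span.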

Comment on lines +73 to +74
n_warmup = max(1, int(warmup / estimate_ms))
n_repeat = max(1, int(rep / estimate_ms))

This is copied from Triton's bench utility. I think it makes sure that enough time is spent warming up the device (so that the GPU frequency ramps up to max). If we only passed an iteration count, the warm-up time might be too short for small kernels and too long for big ones. n_repeat similarly keeps the total measurement time roughly constant across small and big kernels.
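The scaling can be illustrated with a small sketch (compute_warmup_repeat is a hypothetical helper name; the defaults mirror the warmup=25 and rep=100 milliseconds from the snippet above):

```python
def compute_warmup_repeat(estimate_ms: float, warmup_ms: float = 25, rep_ms: float = 100):
    """Pick iteration counts so the total warm-up and measurement wall time
    stays roughly constant regardless of the per-call latency estimate."""
    n_warmup = max(1, int(warmup_ms / estimate_ms))
    n_repeat = max(1, int(rep_ms / estimate_ms))
    return n_warmup, n_repeat

# A fast kernel (0.5 ms/call) gets many iterations; a slow one (50 ms/call) gets few:
# compute_warmup_repeat(0.5) -> (50, 200)
# compute_warmup_repeat(50)  -> (1, 2)
```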


def test_do_bench(self):
res = do_bench(self._bench_fn)
log.error("do_bench result: %s", res)

Maybe use log.warning instead?


def test_do_bench_using_profiling(self):
res = do_bench_using_profiling(self._bench_fn)
log.error("do_bench_using_profiling result: %s", res)

assert something about the return value

cls._bench_fn = functools.partial(torch.nn.functional.linear, x, w)

def test_do_bench(self):
res = do_bench(self._bench_fn)

tests should assert something


There is a check that throws an exception inside do_bench_using_profiling(), so the test below makes sure that no exception is thrown. I also use these two tests to compare the latency collected by the two functions. We could skip the test_do_bench() test in CI though...
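For instance, the test could assert on the returned latency instead of only logging it (do_bench_stub is a hypothetical stand-in for the real do_bench, so the snippet is self-contained and CPU-only):

```python
import time

def do_bench_stub(fn, n_repeat=10):
    """Stand-in for do_bench: returns the mean latency of fn in milliseconds."""
    start = time.perf_counter()
    for _ in range(n_repeat):
        fn()
    return (time.perf_counter() - start) / n_repeat * 1e3

def test_do_bench():
    res = do_bench_stub(lambda: sum(range(1000)))
    # Assert on the result instead of just logging it.
    assert isinstance(res, float)
    assert res > 0.0
```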


ipiszy commented Sep 9, 2023

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Sep 9, 2023
pytorchmergebot pushed a commit that referenced this pull request Sep 12, 2023
This is step 4 of adding CUTLASS as an alternative Inductor backend.
Full tests can be found in the last PR in the stack.

Feature request: #106991.

Pull Request resolved: #107931
Approved by: https://github.com/aakhundov, https://github.com/jansel, https://github.com/kadeng
ghstack dependencies: #107802, #107847, #107901
pytorchmergebot pushed a commit that referenced this pull request Sep 12, 2023
This is step 5 of adding CUTLASS as an alternative Inductor backend.

Feature request: #106991.

Pull Request resolved: #108015
Approved by: https://github.com/kadeng, https://github.com/jansel, https://github.com/aakhundov
ghstack dependencies: #107802, #107847, #107901, #107931
ipiszy added a commit that referenced this pull request Sep 15, 2023
In #107901, CUDA-event-based profiling was changed to profiler-based
profiling to avoid counting CPU-side kernel launch overhead in the final
latency numbers. However, it turns out that torch.profiler is significantly
slower than CUDA events, which slows down model compilation quite
significantly. This PR changes back to CUDA-event-based profiling.

Follow-ups:
* Try CUDA event profiling with CUDAGraphs;
* Multi-GPU profiling;




cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ngimel yf225 chenyang78 kadeng muchulee8 aakhundov

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot deleted the gh/ipiszy@gmail.com/3/head branch September 16, 2023 14:23
pytorchmergebot pushed a commit that referenced this pull request Sep 17, 2023
Pull Request resolved: #109338
Approved by: https://github.com/frank-wei
