[Inductor CUTLASS backend] Step 3: autotune_process, and CUDABenchmarkRequest #107901
Conversation
🔗 Helpful links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/107901
Note: links to docs will display an error until the doc builds have completed.
✅ You can merge normally! (2 unrelated failures) As of commit 73c7297 with merge base f9a250c:
BROKEN TRUNK - The following job failed but was also present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks @jansel @aakhundov @kadeng, ptal~
torch/_inductor/autotune_process.py
Outdated
```python
assert (
    self.workspace_size == 0
), "Autotune cache needs to be updated to support non-zero workspace_size!"
self.workspace = torch.empty(
```
Fixed.
```python
VarRanges = Dict[sympy.Expr, sympy.Expr]


def do_bench_using_profiling(fn: Callable[[], Any], warmup=25, rep=100) -> float:
```
do_bench also measures CPU-side overhead. E.g., for the event sequence:
CUDA event begin -> CUDA launch kernel -> CUDA kernel execution -> CUDA event end
it measures the time between CUDA event begin and CUDA event end, which contains the "CUDA launch kernel" part. That part can take noticeable CPU time, which is especially bad for CUTLASS kernels because they rely on ctypes to invoke C++ functions from Python.
do_bench_using_profiling, on the other hand, relies on the profiler to collect kernel device time.
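For illustration, a minimal sketch of profiler-based device timing along these lines (a hypothetical helper, not the exact PR implementation; the per-event attribute is `self_device_time_total` in recent PyTorch releases and was `self_cuda_time_total` in older ones):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def bench_device_time(fn, n_repeat=100):
    # Warm up so GPU clocks settle before measuring.
    for _ in range(5):
        fn()
    torch.cuda.synchronize()
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(n_repeat):
            fn()
        torch.cuda.synchronize()
    # Sum only device-side kernel time; the CPU-side launch overhead
    # (e.g. the ctypes call into CUTLASS) is excluded by construction.
    total_us = sum(evt.self_device_time_total for evt in prof.key_averages())
    return total_us / n_repeat / 1e3  # milliseconds per call
```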
```python
n_warmup = max(1, int(warmup / estimate_ms))
n_repeat = max(1, int(rep / estimate_ms))
```
This is copied from the Triton bench utility. I think it makes sure that enough time is spent warming up the device (so that the GPU frequency is raised to its maximum). If we only passed a fixed count, the warm-up time might be too short for small kernels and too long for big ones. n_repeat likewise keeps the total measurement time roughly constant across small and big kernels.
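As a sketch, the adaptive counts work roughly like this (`repeat_counts` is a hypothetical helper name; `estimate_ms` is assumed to be the measured time of a single call):

```python
def repeat_counts(estimate_ms: float, warmup: float = 25, rep: float = 100):
    # Target ~`warmup` ms of total warm-up and ~`rep` ms of total
    # measurement, independent of kernel size: fast kernels run many
    # iterations, slow kernels run few, so wall time stays bounded.
    n_warmup = max(1, int(warmup / estimate_ms))
    n_repeat = max(1, int(rep / estimate_ms))
    return n_warmup, n_repeat
```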
test/inductor/test_inductor_utils.py
Outdated
```python
def test_do_bench(self):
    res = do_bench(self._bench_fn)
    log.error("do_bench result: %s", res)
```
Maybe use log.warning instead?
test/inductor/test_inductor_utils.py
Outdated
```python
def test_do_bench_using_profiling(self):
    res = do_bench_using_profiling(self._bench_fn)
    log.error("do_bench_using_profiling result: %s", res)
```
Assert something about the return value.
```python
cls._bench_fn = functools.partial(torch.nn.functional.linear, x, w)


def test_do_bench(self):
    res = do_bench(self._bench_fn)
```
Tests should assert something.
There is a check that throws an exception inside do_bench_using_profiling(), so the test below makes sure no exception is thrown. I also use these two tests to compare the latency collected by the two functions. We could skip the test_do_bench() test in CI, though...
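A sketch of how the test could combine both suggestions (log.warning plus an assertion on the result; the exact bounds are a judgment call, not part of the PR):

```python
import math

def test_do_bench_using_profiling(self):
    res = do_bench_using_profiling(self._bench_fn)
    # The measured latency of a real kernel must be positive and finite.
    self.assertTrue(math.isfinite(res))
    self.assertGreater(res, 0.0)
    log.warning("do_bench_using_profiling result: %s", res)
```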
@pytorchbot label "topic: not user facing"
This is step 4 of adding CUTLASS as an alternative Inductor backend. Full tests can be found in the last PR of the stack. Feature request: #106991. Pull Request resolved: #107931. Approved by: https://github.com/aakhundov, https://github.com/jansel, https://github.com/kadeng. ghstack dependencies: #107802, #107847, #107901.
This is step 5 of adding CUTLASS as an alternative Inductor backend. Feature request: #106991. Pull Request resolved: #108015. Approved by: https://github.com/kadeng, https://github.com/jansel, https://github.com/aakhundov. ghstack dependencies: #107802, #107847, #107901, #107931.
In #107901, CUDA event based profiling was changed to profiler based profiling to avoid counting CPU-side kernel launch overhead in the final latency numbers. However, it turns out that torch.profiler is significantly slower than CUDA events, which affects model compilation speed quite significantly. This PR changes back to CUDA event based profiling. Follow-ups: * Try CUDA event profiling with CUDAGraphs; * Multi-GPU profiling. Pull Request resolved: #109338. Approved by: https://github.com/frank-wei.
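For reference, CUDA-event timing of the kind this PR reverts to looks roughly like this (a minimal sketch with a hypothetical name, not the exact implementation):

```python
import torch

def bench_with_cuda_events(fn, n_repeat=100):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(5):  # warm-up
        fn()
    torch.cuda.synchronize()
    start.record()
    for _ in range(n_repeat):
        fn()
    end.record()
    torch.cuda.synchronize()
    # elapsed_time() returns milliseconds between the recorded events.
    # It includes CPU-side launch overhead, but is far cheaper than
    # running the profiler, which is why this PR switches back.
    return start.elapsed_time(end) / n_repeat
```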
This is step 3 of adding CUTLASS as an alternative Inductor backend.
Full tests can be found in the last PR of the stack.
Feature request: #106991.
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov