-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Add opmath_gpu_kernel_with_scalars and port add to use it #63884
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a really grody conditional and multiple functors and it turns out none of it is necessary: the internal computation type for all variations of add is always the same (accscalar_t), and so all you need to do is fix the argument types of the functor to be the internal computation test, and the preexisting machinery will insert the conversions as necessary. There is one big problem with this, however: this disables launch_vectorized_kernel for the add kernel. This is because we advertise the kernel as taking accscalar_t argument, but scalar_t is what is in the tensor, so this triggers the dynamic_casting codepath in CUDALoops.cuh. I think it is possible to vectorize with casts but it will be a more involved change. Signed-off-by: Edward Z. Yang <ezyang@fb.com> [ghstack-poisoned]
🔗 Helpful links
💊 CI failures summary and remediationsAs of commit 167c9ab (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions to the (internal) Dr. CI Users group. |
There was a really grody conditional and multiple functors and it turns out none of it is necessary: the internal computation type for all variations of add is always the same (accscalar_t), and so all you need to do is fix the argument types of the functor to be the internal computation test, and the preexisting machinery will insert the conversions as necessary. There is one big problem with this, however: this disables launch_vectorized_kernel for the add kernel. This is because we advertise the kernel as taking accscalar_t argument, but scalar_t is what is in the tensor, so this triggers the dynamic_casting codepath in CUDALoops.cuh. I think it is possible to vectorize with casts but it will be a more involved change. Signed-off-by: Edward Z. Yang <ezyang@fb.com> ghstack-source-id: 8d14939 Pull Request resolved: #63884
dynamic_casting codepath comes with what's close to catastrophic perf penalty (close to 2x). cc @zasdfgbnm |
Yes, vectorization is critical for Pascal and Ampere to achieve high memory bandwidth for small data types like half, bfloat16, etc. |
@ngimel I kind of feel like it should be possible to optimize this. We're memory bound right? It shouldn't be that expensive to work out the correct cast... |
… it" See https://dev-discuss.pytorch.org/t/cuda-loops-case-study-code-generation-vs-templates/302 for explanation of what's going on here. Signed-off-by: Edward Z. Yang <ezyang@fb.com> [ghstack-poisoned]
… it" See https://dev-discuss.pytorch.org/t/cuda-loops-case-study-code-generation-vs-templates/302 for explanation of what's going on here. Signed-off-by: Edward Z. Yang <ezyang@fb.com> [ghstack-poisoned]
See https://dev-discuss.pytorch.org/t/cuda-loops-case-study-code-generation-vs-templates/302 for explanation of what's going on here. Signed-off-by: Edward Z. Yang <ezyang@fb.com> ghstack-source-id: ecae09a Pull Request resolved: #63884
CI Flow Status⚛️ CI FlowRuleset - Version:
You can add a comment to the PR and tag @pytorchbot with the following commands: # ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun
# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow For more information, please take a look at the CI Flow Wiki. |
@ngimel @zasdfgbnm I came up with an updated strategy which I think actually works without perf hit. Tomorrow I'll benchmark and make sure for certain; code here has been updated. |
oops but it doesn't work LOL |
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Turns out to have been a simple brain-o |
See https://dev-discuss.pytorch.org/t/cuda-loops-case-study-code-generation-vs-templates/302 for explanation of what's going on here. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D30545296](https://our.internmc.facebook.com/intern/diff/D30545296) [ghstack-poisoned]
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
See https://dev-discuss.pytorch.org/t/cuda-loops-case-study-code-generation-vs-templates/302 for explanation of what's going on here. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D30545296](https://our.internmc.facebook.com/intern/diff/D30545296) [ghstack-poisoned]
See https://dev-discuss.pytorch.org/t/cuda-loops-case-study-code-generation-vs-templates/302 for explanation of what's going on here. Signed-off-by: Edward Z. Yang <ezyang@fb.com> ghstack-source-id: 93578a1 Pull Request resolved: #63884
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
See https://dev-discuss.pytorch.org/t/cuda-loops-case-study-code-generation-vs-templates/302 for explanation of what's going on here. Benchmark script: ``` import torch import torch.utils.benchmark as benchmark results = [] for dtype in [torch.half, torch.float]: for size in [1000, 1000000, 1000000000]: print(f'{dtype} {size}') results.append( benchmark.Timer( stmt='x.add(y)', globals={ 'x': torch.randn(size, dtype=dtype, device='cuda'), 'y': torch.randn(size, dtype=dtype, device='cuda') }, label='perf', sub_label=str(dtype), description=str(size), ).blocked_autorange(min_run_time=1) ) compare = benchmark.Compare(results) compare.print() ``` Quadro GP100 Using the old gpu_kernel_with_scalars: ``` | 1000 | 1000000 | 1000000000 1 threads: -------------------------------------------- torch.float16 | 8.7 | 13.3 | 10796.9 torch.float32 | 8.5 | 24.1 | 21651.8 ``` Quadro GP100 Using the new gpu_kernel_with_scalars (no acc): ``` | 1000 | 1000000 | 1000000000 1 threads: -------------------------------------------- torch.float16 | 8.9 | 13.5 | 10794.7 torch.float32 | 8.8 | 24.0 | 21650.5 Times are in microseconds (us). ``` Quadro GP100 Using the new acc_gpu_kernel_with_scalars: ``` | 1000 | 1000000 | 1000000000 1 threads: -------------------------------------------- torch.float16 | 7.7 | 13.4 | 10795.7 torch.float32 | 7.8 | 23.9 | 21651.6 Times are in microseconds (us). ``` V100 old code: ``` | 100 | 100000 | 100000000 1 threads: ------------------------------------------ torch.int8 | 10.1 | 9.5 | 554.2 torch.int32 | 9.4 | 9.3 | 1411.2 torch.int64 | 9.4 | 9.7 | 3521.1 torch.float16 | 9.5 | 9.5 | 712.1 torch.float32 | 9.6 | 9.6 | 1410.7 ``` V100 new code: ``` | 100 | 100000 | 100000000 1 threads: ------------------------------------------ torch.int8 | 10.0 | 9.6 | 554.1 torch.int32 | 10.1 | 9.9 | 1411.2 torch.int64 | 10.5 | 10.5 | 3520.9 torch.float16 | 10.4 | 10.2 | 712.1 torch.float32 | 10.1 | 10.0 | 1411.3 ``` Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D30545296](https://our.internmc.facebook.com/intern/diff/D30545296) [ghstack-poisoned]
See https://dev-discuss.pytorch.org/t/cuda-loops-case-study-code-generation-vs-templates/302 for explanation of what's going on here. Benchmark script: ``` import torch import torch.utils.benchmark as benchmark results = [] for dtype in [torch.half, torch.float]: for size in [1000, 1000000, 1000000000]: print(f'{dtype} {size}') results.append( benchmark.Timer( stmt='x.add(y)', globals={ 'x': torch.randn(size, dtype=dtype, device='cuda'), 'y': torch.randn(size, dtype=dtype, device='cuda') }, label='perf', sub_label=str(dtype), description=str(size), ).blocked_autorange(min_run_time=1) ) compare = benchmark.Compare(results) compare.print() ``` Quadro GP100 Using the old gpu_kernel_with_scalars: ``` | 1000 | 1000000 | 1000000000 1 threads: -------------------------------------------- torch.float16 | 8.7 | 13.3 | 10796.9 torch.float32 | 8.5 | 24.1 | 21651.8 ``` Quadro GP100 Using the new gpu_kernel_with_scalars (no acc): ``` | 1000 | 1000000 | 1000000000 1 threads: -------------------------------------------- torch.float16 | 8.9 | 13.5 | 10794.7 torch.float32 | 8.8 | 24.0 | 21650.5 Times are in microseconds (us). ``` Quadro GP100 Using the new acc_gpu_kernel_with_scalars: ``` | 1000 | 1000000 | 1000000000 1 threads: -------------------------------------------- torch.float16 | 7.7 | 13.4 | 10795.7 torch.float32 | 7.8 | 23.9 | 21651.6 Times are in microseconds (us). ``` V100 old code: ``` | 100 | 100000 | 100000000 1 threads: ------------------------------------------ torch.int8 | 10.1 | 9.5 | 554.2 torch.int32 | 9.4 | 9.3 | 1411.2 torch.int64 | 9.4 | 9.7 | 3521.1 torch.float16 | 9.5 | 9.5 | 712.1 torch.float32 | 9.6 | 9.6 | 1410.7 ``` V100 new code: ``` | 100 | 100000 | 100000000 1 threads: ------------------------------------------ torch.int8 | 10.0 | 9.6 | 554.1 torch.int32 | 10.1 | 9.9 | 1411.2 torch.int64 | 10.5 | 10.5 | 3520.9 torch.float16 | 10.4 | 10.2 | 712.1 torch.float32 | 10.1 | 10.0 | 1411.3 ``` Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: [D30545296](https://our.internmc.facebook.com/intern/diff/D30545296) [ghstack-poisoned]
@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Can you check if binary size has changed? |
In fact, binary size goes down:
|
See https://dev-discuss.pytorch.org/t/cuda-loops-case-study-code-generation-vs-templates/302 for explanation of what's going on here. Signed-off-by: Edward Z. Yang <ezyang@fb.com> ghstack-source-id: 6715caa Pull Request resolved: pytorch#63884
Stack from ghstack:
See https://dev-discuss.pytorch.org/t/cuda-loops-case-study-code-generation-vs-templates/302
for explanation of what's going on here. (NB: the discuss post uses
accscalar_t
as the compute type, but we have since split it off into its own typeopmath_t
.)Benchmark script:
Quadro GP100 Using the old gpu_kernel_with_scalars:
Quadro GP100 Using the new gpu_kernel_with_scalars (no acc):
Quadro GP100 Using the new acc_gpu_kernel_with_scalars:
V100 old code:
V100 new code:
Signed-off-by: Edward Z. Yang ezyang@fb.com
Differential Revision: D30545296