Add opmath_gpu_kernel_with_scalars and port add to use it by ezyang · Pull Request #63884 · pytorch/pytorch · GitHub

Conversation

@ezyang
Contributor

@ezyang ezyang commented Aug 24, 2021

Stack from ghstack:

See https://dev-discuss.pytorch.org/t/cuda-loops-case-study-code-generation-vs-templates/302
for explanation of what's going on here. (NB: the discuss post uses accscalar_t as the compute type, but we have since split it off into its own type opmath_t.)

Benchmark script:

```
import torch
import torch.utils.benchmark as benchmark

results = []
for dtype in [torch.half, torch.float]:
    for size in [1000, 1000000, 1000000000]:
        print(f'{dtype} {size}')
        results.append(
            benchmark.Timer(
                stmt='x.add(y)',
                globals={
                    'x': torch.randn(size, dtype=dtype, device='cuda'),
                    'y': torch.randn(size, dtype=dtype, device='cuda')
                },
                label='perf',
                sub_label=str(dtype),
                description=str(size),
            ).blocked_autorange(min_run_time=1)
        )

compare = benchmark.Compare(results)
compare.print()
```

Quadro GP100, using the old gpu_kernel_with_scalars:

```
                     |  1000  |  1000000  |  1000000000
1 threads: --------------------------------------------
      torch.float16  |  8.7   |    13.3   |   10796.9
      torch.float32  |  8.5   |    24.1   |   21651.8
```

Quadro GP100, using the new gpu_kernel_with_scalars (no acc):

```
                     |  1000  |  1000000  |  1000000000
1 threads: --------------------------------------------
      torch.float16  |  8.9   |    13.5   |   10794.7
      torch.float32  |  8.8   |    24.0   |   21650.5

Times are in microseconds (us).
```

Quadro GP100, using the new acc_gpu_kernel_with_scalars:

```
                     |  1000  |  1000000  |  1000000000
1 threads: --------------------------------------------
      torch.float16  |  7.7   |    13.4   |   10795.7
      torch.float32  |  7.8   |    23.9   |   21651.6

Times are in microseconds (us).
```

V100 old code:

```
                     |  100   |  100000  |  100000000
1 threads: ------------------------------------------
      torch.int8     |  10.1  |   9.5    |     554.2
      torch.int32    |   9.4  |   9.3    |    1411.2
      torch.int64    |   9.4  |   9.7    |    3521.1
      torch.float16  |   9.5  |   9.5    |     712.1
      torch.float32  |   9.6  |   9.6    |    1410.7
```

V100 new code:

```
                     |  100   |  100000  |  100000000
1 threads: ------------------------------------------
      torch.int8     |  10.0  |    9.6   |     554.1
      torch.int32    |  10.1  |    9.9   |    1411.2
      torch.int64    |  10.5  |   10.5   |    3520.9
      torch.float16  |  10.4  |   10.2   |     712.1
      torch.float32  |  10.1  |   10.0   |    1411.3
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D30545296

There was a really grody conditional and multiple functors, and it turns
out none of it is necessary: the internal computation type for all
variations of add is always the same (accscalar_t), so all you need to do
is fix the argument types of the functor to be the internal computation
type, and the preexisting machinery will insert the conversions as
necessary.

There is one big problem with this, however: it disables
launch_vectorized_kernel for the add kernel. This is because we advertise
the kernel as taking accscalar_t arguments, but scalar_t is what is in the
tensor, so this triggers the dynamic_casting codepath in CUDALoops.cuh. I
think it is possible to vectorize with casts, but it will be a more
involved change.
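
To illustrate the idea, here is a minimal, hypothetical sketch in plain C++ (not the actual TensorIterator/gpu_kernel machinery; `elementwise_with_upcast`, `scalar_t`, and `opmath_t` are illustrative stand-ins): the functor is written once against the wider compute type, and the surrounding loop converts values up from scalar_t before calling it and back down before storing.

```
#include <cstdio>
#include <vector>

// Sketch only: the loop, not the functor, owns the scalar_t <-> opmath_t
// conversions, so one functor works for every input dtype.
template <typename scalar_t, typename opmath_t, typename func_t>
void elementwise_with_upcast(const std::vector<scalar_t>& a,
                             const std::vector<scalar_t>& b,
                             std::vector<scalar_t>& out,
                             func_t f) {
  for (std::size_t i = 0; i < a.size(); ++i) {
    // load as scalar_t, compute as opmath_t, store as scalar_t
    out[i] = static_cast<scalar_t>(
        f(static_cast<opmath_t>(a[i]), static_cast<opmath_t>(b[i])));
  }
}

int main() {
  using scalar_t = float;   // stand-in for the tensor dtype (e.g. half)
  using opmath_t = double;  // stand-in for the wider compute type
  std::vector<scalar_t> x{1.5f, 2.5f}, y{0.25f, 0.75f}, z(2);
  const opmath_t alpha = 2.0;  // the captured scalar is already opmath_t
  elementwise_with_upcast<scalar_t, opmath_t>(
      x, y, z, [alpha](opmath_t a, opmath_t b) { return a + alpha * b; });
  std::printf("%g %g\n", z[0], z[1]);  // prints: 2 4
  return 0;
}
```

In this shape a single functor can cover all input dtypes; the loop machinery owns the per-dtype conversions.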

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
@facebook-github-bot
Contributor

facebook-github-bot commented Aug 24, 2021


💊 CI failures summary and remediations

As of commit 167c9ab (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

ezyang added a commit that referenced this pull request Aug 24, 2021
ghstack-source-id: 8d14939
Pull Request resolved: #63884
@ezyang ezyang requested a review from ngimel August 24, 2021 20:40
@ngimel
Collaborator

ngimel commented Aug 24, 2021

dynamic_casting codepath comes with what's close to catastrophic perf penalty (close to 2x). cc @zasdfgbnm

@zasdfgbnm
Collaborator

Yes, vectorization is critical for Pascal and Ampere to achieve high memory bandwidth for small data types like half, bfloat16, etc.

@ezyang
Contributor Author

ezyang commented Aug 25, 2021

dynamic_casting codepath comes with what's close to catastrophic perf penalty (close to 2x)

@ngimel I kind of feel like it should be possible to optimize this. We're memory bound right? It shouldn't be that expensive to work out the correct cast...

@ezyang ezyang changed the title from "[POC] Greatly simplify CUDA add kernel" to "Add gpu_kernel_with_scalars_and_upcast and port add to use it" Aug 25, 2021
@pytorch-probot pytorch-probot bot assigned pytorchbot and unassigned pytorchbot Aug 25, 2021
ezyang added a commit that referenced this pull request Aug 25, 2021
ghstack-source-id: ecae09a
Pull Request resolved: #63884
@pytorch-probot

pytorch-probot bot commented Aug 25, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/167c9ab95a06ab74e58af4b1b61acbaf8c7fea2f/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

| Workflows | Labels (bold enabled) | Status |
| --- | --- | --- |
| **Triggered Workflows** | | |
| linux-bionic-py3.8-gcc9-coverage | ciflow/all, ciflow/coverage, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/win | ✅ triggered |
| win-vs2019-cuda10.1-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/win | ✅ triggered |
| **Skipped Workflows** | | |
| libtorch-linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | 🚫 skipped |
| libtorch-linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | 🚫 skipped |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | 🚫 skipped |
| linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | 🚫 skipped |
| periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/win | 🚫 skipped |

You can add a comment to the PR and tag @pytorchbot with the following commands:

```
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow
```

For more information, please take a look at the CI Flow Wiki.

@ezyang ezyang requested a review from zasdfgbnm August 25, 2021 03:43
@ezyang
Contributor Author

ezyang commented Aug 25, 2021

@ngimel @zasdfgbnm I came up with an updated strategy which I think actually works without a perf hit. Tomorrow I'll benchmark and make sure for certain; the code here has been updated.

@ezyang ezyang changed the title from "Add gpu_kernel_with_scalars_and_upcast and port add to use it" to "Add acc_gpu_kernel_with_scalars and port add to use it" Aug 25, 2021
@ezyang
Contributor Author

ezyang commented Aug 25, 2021

oops but it doesn't work LOL

@ezyang
Contributor Author

ezyang commented Aug 25, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang
Contributor Author

ezyang commented Aug 25, 2021

Turns out to have been a simple brain-o

@ezyang
Contributor Author

ezyang commented Aug 25, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ezyang added a commit that referenced this pull request Aug 25, 2021
ghstack-source-id: 93578a1
Pull Request resolved: #63884
@pytorch-probot pytorch-probot bot assigned pytorchbot and unassigned pytorchbot Aug 25, 2021
@ezyang
Contributor Author

ezyang commented Aug 25, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@pytorch-probot pytorch-probot bot assigned pytorchbot and unassigned pytorchbot Aug 26, 2021
@ezyang ezyang changed the title from "Add acc_gpu_kernel_with_scalars and port add to use it" to "Add opmath_gpu_kernel_with_scalars and port add to use it" Aug 26, 2021
@ezyang
Contributor Author

ezyang commented Aug 26, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel
Collaborator

ngimel commented Aug 26, 2021

Can you check if binary size has changed?

@ezyang
Contributor Author

ezyang commented Aug 31, 2021

In fact, binary size goes down:

```
# BEFORE
(/scratch/ezyang/pytorch-tmp-env) ezyang@devfair040:/scratch/ezyang/pytorch-tmp$ ls -l ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryAddSubKernel.cu.o
-rw-rw-r-- 1 ezyang ezyang 1944728 Aug 30 18:57 ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryAddSubKernel.cu.o
# AFTER
(/scratch/ezyang/pytorch-tmp-env) ezyang@devfair040:/scratch/ezyang/pytorch-tmp$ ls -l ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryAddSubKernel.cu.o
-rw-rw-r-- 1 ezyang ezyang 1493312 Aug 30 19:07 ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryAddSubKernel.cu.o
```

@facebook-github-bot
Contributor

@ezyang merged this pull request in ffc2612.

ezyang added a commit to ezyang/pytorch that referenced this pull request Sep 1, 2021
ghstack-source-id: 6715caa
Pull Request resolved: pytorch#63884
@facebook-github-bot facebook-github-bot deleted the gh/ezyang/1065/head branch September 3, 2021 14:19