Add opmath_gpu_kernel_with_scalars and port add to use it by ezyang · Pull Request #63884 · pytorch/pytorch · GitHub

Conversation

@ezyang
Contributor

@ezyang ezyang commented Aug 24, 2021

Stack from ghstack:

See https://dev-discuss.pytorch.org/t/cuda-loops-case-study-code-generation-vs-templates/302
for explanation of what's going on here. (NB: the discuss post uses accscalar_t as the compute type, but we have since split it off into its own type opmath_t.)

Benchmark script:

```
import torch
import torch.utils.benchmark as benchmark

results = []
for dtype in [torch.half, torch.float]:
    for size in [1000, 1000000, 1000000000]:
        print(f'{dtype} {size}')
        results.append(
            benchmark.Timer(
                stmt='x.add(y)',
                globals={
                    'x': torch.randn(size, dtype=dtype, device='cuda'),
                    'y': torch.randn(size, dtype=dtype, device='cuda')
                },
                label='perf',
                sub_label=str(dtype),
                description=str(size),
            ).blocked_autorange(min_run_time=1)
        )

compare = benchmark.Compare(results)
compare.print()
```

Quadro GP100, using the old gpu_kernel_with_scalars:

```
                     |  1000  |  1000000  |  1000000000
1 threads: --------------------------------------------
      torch.float16  |  8.7   |    13.3   |   10796.9
      torch.float32  |  8.5   |    24.1   |   21651.8
```

Quadro GP100, using the new gpu_kernel_with_scalars (no acc):

```
                     |  1000  |  1000000  |  1000000000
1 threads: --------------------------------------------
      torch.float16  |  8.9   |    13.5   |   10794.7
      torch.float32  |  8.8   |    24.0   |   21650.5

Times are in microseconds (us).
```

Quadro GP100, using the new acc_gpu_kernel_with_scalars:

```
                     |  1000  |  1000000  |  1000000000
1 threads: --------------------------------------------
      torch.float16  |  7.7   |    13.4   |   10795.7
      torch.float32  |  7.8   |    23.9   |   21651.6

Times are in microseconds (us).
```

V100 old code:

```
                     |  100   |  100000  |  100000000
1 threads: ------------------------------------------
      torch.int8     |  10.1  |   9.5    |     554.2
      torch.int32    |   9.4  |   9.3    |    1411.2
      torch.int64    |   9.4  |   9.7    |    3521.1
      torch.float16  |   9.5  |   9.5    |     712.1
      torch.float32  |   9.6  |   9.6    |    1410.7
```

V100 new code:

```
                     |  100   |  100000  |  100000000
1 threads: ------------------------------------------
      torch.int8     |  10.0  |    9.6   |     554.1
      torch.int32    |  10.1  |    9.9   |    1411.2
      torch.int64    |  10.5  |   10.5   |    3520.9
      torch.float16  |  10.4  |   10.2   |     712.1
      torch.float32  |  10.1  |   10.0   |    1411.3
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D30545296

There was a really grody conditional and multiple functors, and it turns
out none of it is necessary: the internal computation type for all
variations of add is always the same (accscalar_t), so all you need to do
is fix the argument types of the functor to be the internal computation
type, and the preexisting machinery will insert the conversions as
necessary.

There is one big problem with this, however: it disables
launch_vectorized_kernel for the add kernel. This is because we advertise
the kernel as taking accscalar_t arguments, but scalar_t is what is in the
tensor, so this triggers the dynamic_casting codepath in CUDALoops.cuh. I
think it is possible to vectorize with casts, but it will be a more
involved change.
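
To illustrate the idea, here is a minimal, hypothetical sketch in plain C++ (not the actual TensorIterator/gpu_kernel machinery; `elementwise_with_upcast`, `scalar_t`, and `opmath_t` are illustrative stand-ins): the functor is written once against the wider compute type, and the surrounding loop converts values up from scalar_t before calling it and back down before storing.

```
#include <cstdio>
#include <vector>

// Sketch only: the loop, not the functor, owns the scalar_t <-> opmath_t
// conversions, so one functor works for every input dtype.
template <typename scalar_t, typename opmath_t, typename func_t>
void elementwise_with_upcast(const std::vector<scalar_t>& a,
                             const std::vector<scalar_t>& b,
                             std::vector<scalar_t>& out,
                             func_t f) {
  for (std::size_t i = 0; i < a.size(); ++i) {
    // load as scalar_t, compute as opmath_t, store as scalar_t
    out[i] = static_cast<scalar_t>(
        f(static_cast<opmath_t>(a[i]), static_cast<opmath_t>(b[i])));
  }
}

int main() {
  using scalar_t = float;   // stand-in for the tensor dtype (e.g. half)
  using opmath_t = double;  // stand-in for the wider compute type
  std::vector<scalar_t> x{1.5f, 2.5f}, y{0.25f, 0.75f}, z(2);
  const opmath_t alpha = 2.0;  // the captured scalar is already opmath_t
  elementwise_with_upcast<scalar_t, opmath_t>(
      x, y, z, [alpha](opmath_t a, opmath_t b) { return a + alpha * b; });
  std::printf("%g %g\n", z[0], z[1]);  // prints: 2 4
  return 0;
}
```

In this shape a single functor can cover all input dtypes; the loop machinery owns the per-dtype conversions.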

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
@facebook-github-bot
Contributor

facebook-github-bot commented Aug 24, 2021


💊 CI failures summary and remediations

As of commit 167c9ab (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

ezyang added a commit that referenced this pull request Aug 24, 2021
ghstack-source-id: 8d14939
Pull Request resolved: #63884
@ezyang ezyang requested a review from ngimel August 24, 2021 20:40
@ngimel
Collaborator

ngimel commented Aug 24, 2021

dynamic_casting codepath comes with what's close to catastrophic perf penalty (close to 2x). cc @zasdfgbnm

@zasdfgbnm
Collaborator

Yes, vectorization is critical for Pascal and Ampere to achieve high memory bandwidth for small data types like half, bfloat16, etc.

@ezyang
Contributor Author

ezyang commented Aug 25, 2021

dynamic_casting codepath comes with what's close to catastrophic perf penalty (close to 2x)

@ngimel I kind of feel like it should be possible to optimize this. We're memory bound right? It shouldn't be that expensive to work out the correct cast...

@ezyang ezyang changed the title from "[POC] Greatly simplify CUDA add kernel" to "Add gpu_kernel_with_scalars_and_upcast and port add to use it" Aug 25, 2021
@pytorch-probot pytorch-probot bot assigned pytorchbot and unassigned pytorchbot Aug 25, 2021
ezyang added a commit that referenced this pull request Aug 25, 2021
ghstack-source-id: ecae09a
Pull Request resolved: #63884
@pytorch-probot

pytorch-probot bot commented Aug 25, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/167c9ab95a06ab74e58af4b1b61acbaf8c7fea2f/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

| Workflows | Labels (bold enabled) | Status |
| --- | --- | --- |
| **Triggered Workflows** | | |
| linux-bionic-py3.8-gcc9-coverage | ciflow/all, ciflow/coverage, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/win | ✅ triggered |
| win-vs2019-cuda10.1-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/win | ✅ triggered |
| **Skipped Workflows** | | |
| libtorch-linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | 🚫 skipped |
| libtorch-linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | 🚫 skipped |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | 🚫 skipped |
| linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | 🚫 skipped |
| periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/win | 🚫 skipped |

You can add a comment to the PR and tag @pytorchbot with the following commands:

```
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow
```

For more information, please take a look at the CI Flow Wiki.

@ezyang ezyang requested a review from zasdfgbnm August 25, 2021 03:43
@ezyang
Contributor Author

ezyang commented Aug 25, 2021

@ngimel @zasdfgbnm I came up with an updated strategy which I think actually works without a perf hit. Tomorrow I'll benchmark and make sure for certain; the code here has been updated.

@ezyang ezyang changed the title from "Add gpu_kernel_with_scalars_and_upcast and port add to use it" to "Add acc_gpu_kernel_with_scalars and port add to use it" Aug 25, 2021
@ezyang
Contributor Author

ezyang commented Aug 25, 2021

oops but it doesn't work LOL

@ezyang
Contributor Author

ezyang commented Aug 25, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang
Contributor Author

ezyang commented Aug 25, 2021

Turns out to have been a simple brain-o

@ezyang
Contributor Author

ezyang commented Aug 25, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ezyang added a commit that referenced this pull request Aug 25, 2021
ghstack-source-id: 93578a1
Pull Request resolved: #63884
@pytorch-probot pytorch-probot bot assigned pytorchbot and unassigned pytorchbot Aug 25, 2021
@ezyang
Contributor Author

ezyang commented Aug 25, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@pytorch-probot pytorch-probot bot assigned pytorchbot and unassigned pytorchbot Aug 26, 2021
@ezyang ezyang changed the title from "Add acc_gpu_kernel_with_scalars and port add to use it" to "Add opmath_gpu_kernel_with_scalars and port add to use it" Aug 26, 2021
@ezyang
Contributor Author

ezyang commented Aug 26, 2021

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel
Collaborator

ngimel commented Aug 26, 2021

Can you check if binary size has changed?

@ezyang
Contributor Author

ezyang commented Aug 31, 2021

In fact, binary size goes down:

```
# BEFORE
(/scratch/ezyang/pytorch-tmp-env) ezyang@devfair040:/scratch/ezyang/pytorch-tmp$ ls -l ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryAddSubKernel.cu.o
-rw-rw-r-- 1 ezyang ezyang 1944728 Aug 30 18:57 ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryAddSubKernel.cu.o
# AFTER
(/scratch/ezyang/pytorch-tmp-env) ezyang@devfair040:/scratch/ezyang/pytorch-tmp$ ls -l ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryAddSubKernel.cu.o
-rw-rw-r-- 1 ezyang ezyang 1493312 Aug 30 19:07 ./build/caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_BinaryAddSubKernel.cu.o
```

@facebook-github-bot
Contributor

@ezyang merged this pull request in ffc2612.

ezyang added a commit to ezyang/pytorch that referenced this pull request Sep 1, 2021
ghstack-source-id: 6715caa
Pull Request resolved: pytorch#63884
@facebook-github-bot facebook-github-bot deleted the gh/ezyang/1065/head branch September 3, 2021 14:19