[inductor] Improved grid_sampler_2d decomposition for cuda by vfdev-5 · Pull Request #104710 · pytorch/pytorch · GitHub

Conversation

@vfdev-5 vfdev-5 commented Jul 6, 2023

Stack from ghstack (oldest at bottom):

Description:

  • Improved the grid_sampler_2d decomposition so that it generates a single CUDA kernel instead of two

Related to #104296

Perfs:

  • roughly 5x speed-up on CUDA and 2x on CPU for bicubic mode (compiled PR vs compiled Nightly)
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git52598e9) PR" and "Compiled (2.1.0a0+gitcf76938) Nightly"

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cpu -------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+gitcf76938) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitcf76938) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         38.010 (+-0.118)        |          51.466 (+-1.257)          |             47.867 (+-0.124)            |     0.930 (+-0.000)      |           33.654 (+-0.411)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         35.532 (+-0.236)        |          52.189 (+-0.093)          |             58.979 (+-0.206)            |     1.130 (+-0.000)      |           32.543 (+-0.198)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |         38.187 (+-0.112)        |          47.892 (+-0.117)          |             45.833 (+-0.081)            |     0.957 (+-0.000)      |           33.752 (+-0.116)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |         36.708 (+-0.244)        |          51.680 (+-0.104)          |             58.360 (+-0.108)            |     1.129 (+-0.000)      |           32.576 (+-0.751)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         24.201 (+-0.088)        |          27.451 (+-0.059)          |             27.937 (+-0.081)            |     1.018 (+-0.000)      |           24.367 (+-0.074)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         19.266 (+-0.105)        |          26.070 (+-0.085)          |             26.092 (+-0.054)            |     1.001 (+-0.000)      |           20.144 (+-0.064)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |         24.293 (+-0.125)        |          26.085 (+-0.064)          |             26.575 (+-0.061)            |     1.019 (+-0.000)      |           24.515 (+-0.095)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |         19.440 (+-0.075)        |          25.252 (+-0.059)          |             25.259 (+-0.051)            |     1.000 (+-0.000)      |           19.770 (+-0.070)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |        114.900 (+-0.508)        |         113.416 (+-1.271)          |            248.679 (+-1.431)            |     2.193 (+-0.000)      |          114.609 (+-0.515)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |        115.973 (+-0.555)        |         124.711 (+-1.596)          |            282.187 (+-2.418)            |     2.263 (+-0.000)      |          115.368 (+-0.652)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        111.730 (+-0.562)        |         110.914 (+-0.865)          |            253.899 (+-2.226)            |     2.289 (+-0.000)      |          111.285 (+-1.226)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        112.859 (+-0.487)        |         131.696 (+-1.298)          |            294.124 (+-1.963)            |     2.233 (+-0.000)      |          110.910 (+-0.969)         

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cuda ------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+gitcf76938) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitcf76938) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |        228.811 (+-0.037)        |          92.990 (+-0.446)          |             92.648 (+-0.286)            |     0.996 (+-0.000)      |          228.274 (+-0.067)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |        222.107 (+-0.076)        |          93.247 (+-0.387)          |             92.528 (+-0.423)            |     0.992 (+-0.000)      |          221.922 (+-0.297)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |        235.654 (+-0.055)        |          75.781 (+-0.566)          |            115.865 (+-0.419)            |     1.529 (+-0.000)      |          236.032 (+-0.111)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |        226.752 (+-0.088)        |          76.312 (+-0.328)          |            116.468 (+-0.477)            |     1.526 (+-0.000)      |          226.950 (+-0.027)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |        225.540 (+-0.013)        |          75.638 (+-0.341)          |             72.621 (+-0.292)            |     0.960 (+-0.000)      |          225.937 (+-0.017)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |        217.425 (+-0.024)        |          75.484 (+-0.545)          |             73.518 (+-0.296)            |     0.974 (+-0.000)      |          217.793 (+-0.008)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |        231.474 (+-0.020)        |          75.972 (+-0.339)          |             73.030 (+-0.387)            |     0.961 (+-0.000)      |          231.991 (+-0.184)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |        223.408 (+-0.016)        |          75.622 (+-0.279)          |             73.542 (+-0.336)            |     0.973 (+-0.000)      |          223.893 (+-0.021)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |        319.382 (+-0.023)        |         149.060 (+-0.190)          |            772.116 (+-0.266)            |     5.180 (+-0.000)      |          320.549 (+-0.387)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |        319.987 (+-0.134)        |         154.443 (+-0.014)          |            797.651 (+-0.232)            |     5.165 (+-0.000)      |          320.665 (+-0.397)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        326.138 (+-0.439)        |         149.092 (+-0.036)          |            772.508 (+-0.259)            |     5.181 (+-0.000)      |          325.751 (+-0.398)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        326.024 (+-0.118)        |         154.452 (+-0.209)          |            797.756 (+-0.229)            |     5.165 (+-0.000)      |          326.870 (+-0.372)         

Times are in microseconds (us).
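
As a quick sanity check on the "speed-up PR vs Nightly" column, the values in the tables above appear consistent with dividing the compiled Nightly time by the compiled PR time, so a value above 1 means the PR is faster. The sketch below recomputes two entries from timings copied out of the tables; the helper name `speedup` is hypothetical and is not part of the benchmark script.

```python
# Illustrative only: recompute the "speed-up PR vs Nightly" column from
# timings copied out of the tables above. `speedup` is a hypothetical helper.

def speedup(compiled_pr: float, compiled_nightly: float) -> float:
    """Return Nightly time divided by PR time; > 1 means the PR is faster."""
    return compiled_nightly / compiled_pr

# cuda, bicubic, contiguous_format, align_corners=True (times in microseconds)
cuda_bicubic = speedup(compiled_pr=149.060, compiled_nightly=772.116)

# cpu, bilinear, contiguous_format, align_corners=True (times in milliseconds)
cpu_bilinear = speedup(compiled_pr=51.466, compiled_nightly=47.867)

print(f"cuda bicubic: {cuda_bicubic:.3f}")
print(f"cpu bilinear: {cpu_bilinear:.3f}")
```

Because the ratio is unitless, it can be compared across the cpu (ms) and cuda (us) tables even though the raw timings cannot.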

Source

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov


pytorch-bot bot commented Jul 6, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/104710

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 74b97a3 with merge base bcda859 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

vfdev-5 added a commit that referenced this pull request Jul 6, 2023
Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

ghstack-source-id: b444b36
Pull Request resolved: #104710
Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

Related to #104296

Perfs:
- speed-up on CUDA for bilinear, nearest and bicubic modes
- slowdown on CPU for bicubic and bilinear modes (the bilinear difference may be noise)



```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git421fe7b) PR-afgg" and "Compiled (2.1.0a0+gitd3ba890) Nightly"

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cpu -------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git421fe7b) PR  |  Compiled (2.1.0a0+git421fe7b) PR  |  Compiled (2.1.0a0+gitd3ba890) Nightly  |  Speed-up PR vs Nightly  |  Eager (2.1.0a0+gitd3ba890) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         7.489 (+-0.287)         |          16.801 (+-0.138)          |             13.330 (+-0.028)            |     0.793 (+-0.000)      |           7.522 (+-0.044)          
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         7.587 (+-0.031)         |          12.494 (+-0.066)          |             16.024 (+-0.128)            |     1.283 (+-0.000)      |           7.530 (+-0.086)          
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |         7.808 (+-0.038)         |          20.410 (+-1.616)          |             13.149 (+-0.200)            |     0.644 (+-0.000)      |           7.672 (+-0.124)          
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |         7.989 (+-0.034)         |          12.130 (+-0.033)          |             15.698 (+-0.118)            |     1.294 (+-0.000)      |           7.745 (+-0.078)          
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         4.593 (+-0.030)         |          5.848 (+-0.012)           |             6.471 (+-0.087)             |     1.106 (+-0.000)      |           4.388 (+-0.066)          
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         4.190 (+-0.008)         |          5.979 (+-0.008)           |             6.490 (+-0.069)             |     1.085 (+-0.000)      |           4.397 (+-0.021)          
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |         4.582 (+-0.011)         |          5.465 (+-0.024)           |             6.464 (+-0.193)             |     1.183 (+-0.000)      |           4.793 (+-0.024)          
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |         4.403 (+-0.004)         |          5.866 (+-0.007)           |             6.688 (+-0.196)             |     1.140 (+-0.000)      |           4.370 (+-0.013)          
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |         26.106 (+-0.138)        |         104.156 (+-3.881)          |             64.199 (+-0.402)            |     0.616 (+-0.000)      |           26.645 (+-0.173)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |         26.192 (+-0.141)        |         102.890 (+-1.249)          |             71.674 (+-0.679)            |     0.697 (+-0.000)      |           26.498 (+-0.220)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |         25.752 (+-0.133)        |          99.068 (+-3.399)          |             66.274 (+-0.172)            |     0.669 (+-0.000)      |           26.758 (+-0.081)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |         26.366 (+-0.082)        |         103.052 (+-1.758)          |             72.297 (+-0.398)            |     0.702 (+-0.000)      |           26.535 (+-0.145)         

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cuda ------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git421fe7b) PR  |  Compiled (2.1.0a0+git421fe7b) PR  |  Compiled (2.1.0a0+gitd3ba890) Nightly  |  Speed-up PR vs Nightly  |  Eager (2.1.0a0+gitd3ba890) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         88.257 (+-0.462)        |         125.216 (+-0.401)          |            136.807 (+-0.636)            |     1.093 (+-0.000)      |           86.542 (+-0.393)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         87.649 (+-0.440)        |         125.382 (+-3.365)          |            136.184 (+-0.356)            |     1.086 (+-0.000)      |           86.255 (+-0.361)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |        111.428 (+-0.511)        |         108.644 (+-0.338)          |            221.696 (+-0.820)            |     2.041 (+-0.000)      |          110.533 (+-0.721)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |        110.717 (+-0.458)        |         108.719 (+-0.427)          |            222.547 (+-0.218)            |     2.047 (+-0.000)      |          110.922 (+-0.567)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         77.541 (+-0.317)        |         108.937 (+-0.301)          |            142.400 (+-0.258)            |     1.307 (+-0.000)      |           76.351 (+-0.443)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         77.313 (+-0.341)        |         108.872 (+-0.421)          |            142.147 (+-0.709)            |     1.306 (+-0.000)      |           76.435 (+-0.390)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |        110.669 (+-0.475)        |         109.328 (+-0.345)          |            178.589 (+-0.474)            |     1.634 (+-0.000)      |          110.797 (+-0.527)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |        110.605 (+-0.521)        |         109.049 (+-0.401)          |            178.601 (+-0.382)            |     1.638 (+-0.000)      |          109.887 (+-0.417)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |         92.652 (+-0.097)        |         333.377 (+-0.011)          |            1800.770 (+-0.552)           |     5.402 (+-0.000)      |           92.892 (+-0.220)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |         92.280 (+-0.373)        |         334.606 (+-0.026)          |            1463.572 (+-0.644)           |     4.374 (+-0.000)      |           92.596 (+-0.489)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        110.864 (+-0.533)        |         333.195 (+-0.016)          |            1806.567 (+-0.456)           |     5.422 (+-0.000)      |          110.970 (+-0.444)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        112.047 (+-0.700)        |         334.676 (+-0.028)          |            1470.514 (+-1.586)           |     4.394 (+-0.000)      |          110.857 (+-0.506)         

Times are in microseconds (us).

```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230706-162617-affine-grid-sampler-PR-vs-Nightly-speedup.md)

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Jul 6, 2023
Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

ghstack-source-id: ab74cf4
Pull Request resolved: #104710
vfdev-5 added a commit that referenced this pull request Jul 11, 2023
Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

ghstack-source-id: 38d904a
Pull Request resolved: #104710
@vfdev-5 vfdev-5 requested a review from lezcano July 11, 2023 10:14
Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

Related to #104296

Perfs:
- speed-up on CUDA for bilinear, nearest and bicubic modes
- slowdown on CPU for bilinear mode with torch.contiguous_format inputs (CF)


```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git1c48419) PR" and "Compiled (2.1.0a0+gitbcdd413) Nightly"

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cpu -------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1c48419) PR  |  Compiled (2.1.0a0+git1c48419) PR  |  Compiled (2.1.0a0+gitbcdd413) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitbcdd413) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         8.466 (+-0.072)         |          15.557 (+-0.054)          |             13.292 (+-0.113)            |     0.854 (+-0.000)      |           7.567 (+-0.037)          
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         8.685 (+-0.035)         |          11.384 (+-0.024)          |             15.798 (+-0.036)            |     1.388 (+-0.000)      |           7.489 (+-0.114)          
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |         8.572 (+-0.085)         |          15.867 (+-0.046)          |             12.964 (+-0.050)            |     0.817 (+-0.000)      |           7.623 (+-0.126)          
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |         8.834 (+-0.169)         |          11.447 (+-0.030)          |             15.386 (+-0.061)            |     1.344 (+-0.000)      |           7.647 (+-0.030)          
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         5.039 (+-0.011)         |          4.569 (+-0.016)           |             6.383 (+-0.038)             |     1.397 (+-0.000)      |           4.504 (+-0.028)          
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         4.326 (+-0.008)         |          4.867 (+-0.013)           |             6.393 (+-0.067)             |     1.314 (+-0.000)      |           4.270 (+-0.066)          
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |         5.085 (+-0.031)         |          4.220 (+-0.006)           |             6.426 (+-0.126)             |     1.523 (+-0.000)      |           4.780 (+-0.204)          
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |         4.411 (+-0.004)         |          4.619 (+-0.005)           |             6.283 (+-0.114)             |     1.360 (+-0.000)      |           4.315 (+-0.028)          
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |         26.061 (+-0.083)        |          28.477 (+-0.026)          |             63.423 (+-0.464)            |     2.227 (+-0.000)      |           25.943 (+-0.299)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |         26.358 (+-0.086)        |          30.660 (+-0.328)          |             71.692 (+-0.282)            |     2.338 (+-0.000)      |           26.143 (+-0.299)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |         26.172 (+-0.124)        |          28.072 (+-0.039)          |             65.312 (+-0.478)            |     2.327 (+-0.000)      |           25.810 (+-0.344)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |         26.522 (+-0.065)        |          30.480 (+-0.060)          |             71.560 (+-0.606)            |     2.348 (+-0.000)      |           26.105 (+-1.344)         

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cuda ------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1c48419) PR  |  Compiled (2.1.0a0+git1c48419) PR  |  Compiled (2.1.0a0+gitbcdd413) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitbcdd413) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         88.726 (+-0.344)        |          88.732 (+-0.194)          |            141.983 (+-0.551)            |     1.600 (+-0.000)      |           89.228 (+-0.300)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         88.873 (+-0.366)        |          88.690 (+-0.196)          |            141.351 (+-0.456)            |     1.594 (+-0.000)      |           89.257 (+-0.326)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |        110.747 (+-0.742)        |          69.262 (+-0.174)          |            228.701 (+-8.460)            |     3.302 (+-0.000)      |          112.709 (+-0.746)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |        110.729 (+-0.421)        |          68.543 (+-0.096)          |            230.542 (+-0.656)            |     3.363 (+-0.000)      |          112.994 (+-0.644)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         78.248 (+-0.323)        |          68.913 (+-0.227)          |            148.836 (+-0.244)            |     2.160 (+-0.000)      |           79.004 (+-0.973)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         77.898 (+-0.362)        |          68.819 (+-0.218)          |            149.036 (+-0.566)            |     2.166 (+-0.000)      |           78.681 (+-0.309)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |        111.041 (+-0.404)        |          69.329 (+-0.100)          |            184.097 (+-0.673)            |     2.655 (+-0.000)      |          113.252 (+-0.585)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |        110.903 (+-0.391)        |          70.003 (+-0.271)          |            183.848 (+-1.566)            |     2.626 (+-0.000)      |          113.787 (+-0.943)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |         92.796 (+-0.536)        |          69.966 (+-0.218)          |            1793.246 (+-0.481)           |     25.630 (+-0.000)     |           92.416 (+-0.072)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |         92.744 (+-0.439)        |          87.140 (+-0.101)          |            1457.581 (+-0.599)           |     16.727 (+-0.000)     |           92.510 (+-0.557)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        110.599 (+-0.331)        |          70.036 (+-0.280)          |            1800.172 (+-0.422)           |     25.704 (+-0.000)     |          112.876 (+-0.498)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        111.289 (+-0.346)        |          86.682 (+-0.020)          |            1463.788 (+-0.566)           |     16.887 (+-0.000)     |          112.987 (+-0.358)         

Times are in microseconds (us).
```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230711-113600-affine-grid-sampler-PR-vs-Nightly-speedup.md)
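
For reference, the "speed-up PR vs Nightly" column can be reproduced from the two "Compiled" columns. The sketch below assumes the ratio is Nightly time divided by PR time (so values above 1 mean the PR is faster), which matches the numbers in the tables; the helper name `speedup` is hypothetical and not part of the benchmark script.

```python
def speedup(nightly_compiled: float, pr_compiled: float) -> float:
    """Ratio of Nightly compiled time to PR compiled time (> 1 means the PR is faster)."""
    return nightly_compiled / pr_compiled


# Values copied from the tables above (same units within each row).
cpu_bilinear_cf = speedup(13.292, 15.557)    # cpu, contiguous_format, align_corners=True, bilinear
cuda_bicubic_cf = speedup(1793.246, 69.966)  # cuda, contiguous_format, align_corners=True, bicubic

print(f"cpu bilinear CF:  {cpu_bilinear_cf:.3f}")
print(f"cuda bicubic CF: {cuda_bicubic_cf:.3f}")
```

A value below 1, as in the cpu bilinear contiguous row, corresponds to the slowdown mentioned in the description.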

vfdev-5 added a commit that referenced this pull request Jul 11, 2023
Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

ghstack-source-id: 52d3638
Pull Request resolved: #104710
@vfdev-5 vfdev-5 marked this pull request as draft July 11, 2023 16:23
@vfdev-5
Contributor Author

vfdev-5 commented Jul 11, 2023

Set this PR as draft, as the perf of the compiled function for bicubic mode on cuda depends on the output memory format:

      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |         92.651 (+-0.390)        |          70.458 (+-0.018)        
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |         92.567 (+-0.493)        |         1236.525 (+-0.301)       
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        113.788 (+-0.407)        |          68.739 (+-0.085)        
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        114.094 (+-0.303)        |         1235.920 (+-0.349)       

Times are in microseconds (us).

Also, the aten grid sampler version does not preserve the memory format: if the input is channels last, the output will be channels first.
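
To make the two memory formats concrete, here is a small pure-Python sketch of the strides a dense `(N, C, H, W)` tensor has in `torch.contiguous_format` versus `torch.channels_last`, using the benchmark input shape. The helper names are hypothetical; the stride formulas are the standard ones for dense tensors and are not taken from this PR.

```python
def contiguous_strides(n: int, c: int, h: int, w: int) -> tuple:
    """Strides (in elements) of a dense NCHW tensor: W is the fastest-moving dim."""
    return (c * h * w, h * w, w, 1)


def channels_last_strides(n: int, c: int, h: int, w: int) -> tuple:
    """Strides (in elements) of a dense channels-last tensor: C is the fastest-moving dim."""
    return (h * w * c, 1, w * c, c)


shape = (2, 3, 345, 456)  # input shape used in the benchmarks above
print("contiguous_format:", contiguous_strides(*shape))
print("channels_last:    ", channels_last_strides(*shape))
```

With a channels-last layout, neighbouring pixels along W are `C` elements apart, so a kernel written for one layout can access memory very differently when its output is allocated in the other.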

vfdev-5 added a commit that referenced this pull request Aug 1, 2023
Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

ghstack-source-id: f91d008
Pull Request resolved: #104710
Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

Related to #104296

Perfs:
- speed-up on cuda with bilinear, nearest and bicubic modes
- slowdown on cpu with bilinear mode, contiguous format (CF)


```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git1c48419) PR" and "Compiled (2.1.0a0+gitbcdd413) Nightly"

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cpu -------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1c48419) PR  |  Compiled (2.1.0a0+git1c48419) PR  |  Compiled (2.1.0a0+gitbcdd413) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitbcdd413) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         8.466 (+-0.072)         |          15.557 (+-0.054)          |             13.292 (+-0.113)            |     0.854 (+-0.000)      |           7.567 (+-0.037)          
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         8.685 (+-0.035)         |          11.384 (+-0.024)          |             15.798 (+-0.036)            |     1.388 (+-0.000)      |           7.489 (+-0.114)          
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |         8.572 (+-0.085)         |          15.867 (+-0.046)          |             12.964 (+-0.050)            |     0.817 (+-0.000)      |           7.623 (+-0.126)          
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |         8.834 (+-0.169)         |          11.447 (+-0.030)          |             15.386 (+-0.061)            |     1.344 (+-0.000)      |           7.647 (+-0.030)          
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         5.039 (+-0.011)         |          4.569 (+-0.016)           |             6.383 (+-0.038)             |     1.397 (+-0.000)      |           4.504 (+-0.028)          
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         4.326 (+-0.008)         |          4.867 (+-0.013)           |             6.393 (+-0.067)             |     1.314 (+-0.000)      |           4.270 (+-0.066)          
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |         5.085 (+-0.031)         |          4.220 (+-0.006)           |             6.426 (+-0.126)             |     1.523 (+-0.000)      |           4.780 (+-0.204)          
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |         4.411 (+-0.004)         |          4.619 (+-0.005)           |             6.283 (+-0.114)             |     1.360 (+-0.000)      |           4.315 (+-0.028)          
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |         26.061 (+-0.083)        |          28.477 (+-0.026)          |             63.423 (+-0.464)            |     2.227 (+-0.000)      |           25.943 (+-0.299)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |         26.358 (+-0.086)        |          30.660 (+-0.328)          |             71.692 (+-0.282)            |     2.338 (+-0.000)      |           26.143 (+-0.299)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |         26.172 (+-0.124)        |          28.072 (+-0.039)          |             65.312 (+-0.478)            |     2.327 (+-0.000)      |           25.810 (+-0.344)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |         26.522 (+-0.065)        |          30.480 (+-0.060)          |             71.560 (+-0.606)            |     2.348 (+-0.000)      |           26.105 (+-1.344)         

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cuda ------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1c48419) PR  |  Compiled (2.1.0a0+git1c48419) PR  |  Compiled (2.1.0a0+gitbcdd413) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitbcdd413) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         88.726 (+-0.344)        |          88.732 (+-0.194)          |            141.983 (+-0.551)            |     1.600 (+-0.000)      |           89.228 (+-0.300)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         88.873 (+-0.366)        |          88.690 (+-0.196)          |            141.351 (+-0.456)            |     1.594 (+-0.000)      |           89.257 (+-0.326)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |        110.747 (+-0.742)        |          69.262 (+-0.174)          |            228.701 (+-8.460)            |     3.302 (+-0.000)      |          112.709 (+-0.746)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |        110.729 (+-0.421)        |          68.543 (+-0.096)          |            230.542 (+-0.656)            |     3.363 (+-0.000)      |          112.994 (+-0.644)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         78.248 (+-0.323)        |          68.913 (+-0.227)          |            148.836 (+-0.244)            |     2.160 (+-0.000)      |           79.004 (+-0.973)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         77.898 (+-0.362)        |          68.819 (+-0.218)          |            149.036 (+-0.566)            |     2.166 (+-0.000)      |           78.681 (+-0.309)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |        111.041 (+-0.404)        |          69.329 (+-0.100)          |            184.097 (+-0.673)            |     2.655 (+-0.000)      |          113.252 (+-0.585)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |        110.903 (+-0.391)        |          70.003 (+-0.271)          |            183.848 (+-1.566)            |     2.626 (+-0.000)      |          113.787 (+-0.943)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |         92.796 (+-0.536)        |          69.966 (+-0.218)          |            1793.246 (+-0.481)           |     25.630 (+-0.000)     |           92.416 (+-0.072)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |         92.744 (+-0.439)        |          87.140 (+-0.101)          |            1457.581 (+-0.599)           |     16.727 (+-0.000)     |           92.510 (+-0.557)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        110.599 (+-0.331)        |          70.036 (+-0.280)          |            1800.172 (+-0.422)           |     25.704 (+-0.000)     |          112.876 (+-0.498)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        111.289 (+-0.346)        |          86.682 (+-0.020)          |            1463.788 (+-0.566)           |     16.887 (+-0.000)     |          112.987 (+-0.358)         

Times are in microseconds (us).
```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230711-113600-affine-grid-sampler-PR-vs-Nightly-speedup.md)

[ghstack-poisoned]
Description:
- Improved grid_sampler_2d decomposition code to generate a single cuda kernel instead of two

Related to #104296

Perfs:
- speed-up on cuda with bilinear (~x1.6 to ~x3.4), nearest (~x2.2 to ~x2.7) and bicubic (~x17 to ~x26) modes
- slowdown on cpu with bilinear mode and contiguous format (CF): ~x0.82 to ~x0.85


```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git1c48419) PR" and "Compiled (2.1.0a0+gitbcdd413) Nightly"

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cpu -------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1c48419) PR  |  Compiled (2.1.0a0+git1c48419) PR  |  Compiled (2.1.0a0+gitbcdd413) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitbcdd413) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         8.466 (+-0.072)         |          15.557 (+-0.054)          |             13.292 (+-0.113)            |     0.854 (+-0.000)      |           7.567 (+-0.037)          
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         8.685 (+-0.035)         |          11.384 (+-0.024)          |             15.798 (+-0.036)            |     1.388 (+-0.000)      |           7.489 (+-0.114)          
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |         8.572 (+-0.085)         |          15.867 (+-0.046)          |             12.964 (+-0.050)            |     0.817 (+-0.000)      |           7.623 (+-0.126)          
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |         8.834 (+-0.169)         |          11.447 (+-0.030)          |             15.386 (+-0.061)            |     1.344 (+-0.000)      |           7.647 (+-0.030)          
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         5.039 (+-0.011)         |          4.569 (+-0.016)           |             6.383 (+-0.038)             |     1.397 (+-0.000)      |           4.504 (+-0.028)          
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         4.326 (+-0.008)         |          4.867 (+-0.013)           |             6.393 (+-0.067)             |     1.314 (+-0.000)      |           4.270 (+-0.066)          
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |         5.085 (+-0.031)         |          4.220 (+-0.006)           |             6.426 (+-0.126)             |     1.523 (+-0.000)      |           4.780 (+-0.204)          
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |         4.411 (+-0.004)         |          4.619 (+-0.005)           |             6.283 (+-0.114)             |     1.360 (+-0.000)      |           4.315 (+-0.028)          
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |         26.061 (+-0.083)        |          28.477 (+-0.026)          |             63.423 (+-0.464)            |     2.227 (+-0.000)      |           25.943 (+-0.299)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |         26.358 (+-0.086)        |          30.660 (+-0.328)          |             71.692 (+-0.282)            |     2.338 (+-0.000)      |           26.143 (+-0.299)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |         26.172 (+-0.124)        |          28.072 (+-0.039)          |             65.312 (+-0.478)            |     2.327 (+-0.000)      |           25.810 (+-0.344)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |         26.522 (+-0.065)        |          30.480 (+-0.060)          |             71.560 (+-0.606)            |     2.348 (+-0.000)      |           26.105 (+-1.344)         

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cuda ------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1c48419) PR  |  Compiled (2.1.0a0+git1c48419) PR  |  Compiled (2.1.0a0+gitbcdd413) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitbcdd413) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         88.726 (+-0.344)        |          88.732 (+-0.194)          |            141.983 (+-0.551)            |     1.600 (+-0.000)      |           89.228 (+-0.300)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         88.873 (+-0.366)        |          88.690 (+-0.196)          |            141.351 (+-0.456)            |     1.594 (+-0.000)      |           89.257 (+-0.326)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |        110.747 (+-0.742)        |          69.262 (+-0.174)          |            228.701 (+-8.460)            |     3.302 (+-0.000)      |          112.709 (+-0.746)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |        110.729 (+-0.421)        |          68.543 (+-0.096)          |            230.542 (+-0.656)            |     3.363 (+-0.000)      |          112.994 (+-0.644)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         78.248 (+-0.323)        |          68.913 (+-0.227)          |            148.836 (+-0.244)            |     2.160 (+-0.000)      |           79.004 (+-0.973)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         77.898 (+-0.362)        |          68.819 (+-0.218)          |            149.036 (+-0.566)            |     2.166 (+-0.000)      |           78.681 (+-0.309)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |        111.041 (+-0.404)        |          69.329 (+-0.100)          |            184.097 (+-0.673)            |     2.655 (+-0.000)      |          113.252 (+-0.585)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |        110.903 (+-0.391)        |          70.003 (+-0.271)          |            183.848 (+-1.566)            |     2.626 (+-0.000)      |          113.787 (+-0.943)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |         92.796 (+-0.536)        |          69.966 (+-0.218)          |            1793.246 (+-0.481)           |     25.630 (+-0.000)     |           92.416 (+-0.072)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |         92.744 (+-0.439)        |          87.140 (+-0.101)          |            1457.581 (+-0.599)           |     16.727 (+-0.000)     |           92.510 (+-0.557)         
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        110.599 (+-0.331)        |          70.036 (+-0.280)          |            1800.172 (+-0.422)           |     25.704 (+-0.000)     |          112.876 (+-0.498)         
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        111.289 (+-0.346)        |          86.682 (+-0.020)          |            1463.788 (+-0.566)           |     16.887 (+-0.000)     |          112.987 (+-0.358)         

Times are in microseconds (us).
```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230711-113600-affine-grid-sampler-PR-vs-Nightly-speedup.md)
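
The "speed-up PR vs Nightly" column can be re-derived from the two "Compiled" columns. The reported values are consistent with dividing the Nightly time by the PR time, so values above 1 mean the PR is faster. The sketch below is only a sanity check on a few rows copied from the tables above; the helper name `speedup` and the row labels are illustrative and not part of the benchmark script.

```python
def speedup(compiled_pr, compiled_nightly):
    """Nightly time divided by PR time; values above 1 mean the PR is faster."""
    return compiled_nightly / compiled_pr


# (Compiled PR, Compiled Nightly) values copied from the tables above.
rows = {
    "cuda, contiguous, align_corners=True, bilinear (us)": (88.732, 141.983),
    "cpu, contiguous, align_corners=True, bilinear (ms)": (15.557, 13.292),
    "cuda, contiguous, align_corners=True, bicubic (us)": (69.966, 1793.246),
}

# Round to three decimals, as in the "speed-up PR vs Nightly" column.
ratios = {
    name: round(speedup(pr, nightly), 3) for name, (pr, nightly) in rows.items()
}

for name, ratio in ratios.items():
    print(f"{name}: x{ratio:.3f}")
```

Because the ratio is unitless, rows measured in milliseconds (cpu) and microseconds (cuda) can be compared directly.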

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov

[ghstack-poisoned]
vfdev-5 marked this pull request as ready for review August 18, 2023 08:07
vfdev-5 added a commit that referenced this pull request Aug 28, 2023
Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

ghstack-source-id: b7d669b
Pull Request resolved: #104710
vfdev-5 requested a review from lezcano August 28, 2023 14:52
@vfdev-5
Contributor Author

vfdev-5 commented Aug 28, 2023

@pytorchbot merge

pytorch-bot added the `ciflow/trunk` (Trigger trunk jobs on your pull request) label Aug 28, 2023
@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a `release notes:` label
If your changes are user facing and intended to be a part of release notes, please use a label starting with `release notes:`.

If not, please add the `topic: not user facing` label.

To add a label, you can comment to pytorchbot, for example:
`@pytorchbot label "topic: not user facing"`

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team Raised by workflow job

@vfdev-5
Contributor Author

vfdev-5 commented Aug 28, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: Command `git -C /home/runner/work/pytorch/pytorch cherry-pick -x 52598e95500417e2328246166b773c128feed100` returned non-zero exit code 1

Auto-merging test/expect/HasDecompTest.test_aten_core_operators.expect
Auto-merging torch/_decomp/__init__.py
Auto-merging torch/_decomp/decompositions.py
Auto-merging torch/_inductor/decomposition.py
CONFLICT (content): Merge conflict in torch/_inductor/decomposition.py
error: could not apply 52598e95500... [inductor] Improved grid_sampler_2d decomposition for cuda
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
Details for Dev Infra team Raised by workflow job

Description:
- Improved grid_sampler_2d decomposition code to generate a single cuda kernel instead of two

Related to #104296

Perfs:
- speed-up on cuda (~x5) and cpu (~x2) for bicubic mode

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git52598e9) PR" and "Compiled (2.1.0a0+gitcf76938) Nightly"

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cpu -------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+gitcf76938) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitcf76938) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         38.010 (+-0.118)        |          51.466 (+-1.257)          |             47.867 (+-0.124)            |     0.930 (+-0.000)      |           33.654 (+-0.411)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         35.532 (+-0.236)        |          52.189 (+-0.093)          |             58.979 (+-0.206)            |     1.130 (+-0.000)      |           32.543 (+-0.198)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |         38.187 (+-0.112)        |          47.892 (+-0.117)          |             45.833 (+-0.081)            |     0.957 (+-0.000)      |           33.752 (+-0.116)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |         36.708 (+-0.244)        |          51.680 (+-0.104)          |             58.360 (+-0.108)            |     1.129 (+-0.000)      |           32.576 (+-0.751)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         24.201 (+-0.088)        |          27.451 (+-0.059)          |             27.937 (+-0.081)            |     1.018 (+-0.000)      |           24.367 (+-0.074)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         19.266 (+-0.105)        |          26.070 (+-0.085)          |             26.092 (+-0.054)            |     1.001 (+-0.000)      |           20.144 (+-0.064)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |         24.293 (+-0.125)        |          26.085 (+-0.064)          |             26.575 (+-0.061)            |     1.019 (+-0.000)      |           24.515 (+-0.095)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |         19.440 (+-0.075)        |          25.252 (+-0.059)          |             25.259 (+-0.051)            |     1.000 (+-0.000)      |           19.770 (+-0.070)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |        114.900 (+-0.508)        |         113.416 (+-1.271)          |            248.679 (+-1.431)            |     2.193 (+-0.000)      |          114.609 (+-0.515)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |        115.973 (+-0.555)        |         124.711 (+-1.596)          |            282.187 (+-2.418)            |     2.263 (+-0.000)      |          115.368 (+-0.652)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        111.730 (+-0.562)        |         110.914 (+-0.865)          |            253.899 (+-2.226)            |     2.289 (+-0.000)      |          111.285 (+-1.226)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        112.859 (+-0.487)        |         131.696 (+-1.298)          |            294.124 (+-1.963)            |     2.233 (+-0.000)      |          110.910 (+-0.969)         

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cuda ------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+gitcf76938) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitcf76938) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |        228.811 (+-0.037)        |          92.990 (+-0.446)          |             92.648 (+-0.286)            |     0.996 (+-0.000)      |          228.274 (+-0.067)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |        222.107 (+-0.076)        |          93.247 (+-0.387)          |             92.528 (+-0.423)            |     0.992 (+-0.000)      |          221.922 (+-0.297)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |        235.654 (+-0.055)        |          75.781 (+-0.566)          |            115.865 (+-0.419)            |     1.529 (+-0.000)      |          236.032 (+-0.111)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |        226.752 (+-0.088)        |          76.312 (+-0.328)          |            116.468 (+-0.477)            |     1.526 (+-0.000)      |          226.950 (+-0.027)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |        225.540 (+-0.013)        |          75.638 (+-0.341)          |             72.621 (+-0.292)            |     0.960 (+-0.000)      |          225.937 (+-0.017)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |        217.425 (+-0.024)        |          75.484 (+-0.545)          |             73.518 (+-0.296)            |     0.974 (+-0.000)      |          217.793 (+-0.008)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |        231.474 (+-0.020)        |          75.972 (+-0.339)          |             73.030 (+-0.387)            |     0.961 (+-0.000)      |          231.991 (+-0.184)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |        223.408 (+-0.016)        |          75.622 (+-0.279)          |             73.542 (+-0.336)            |     0.973 (+-0.000)      |          223.893 (+-0.021)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |        319.382 (+-0.023)        |         149.060 (+-0.190)          |            772.116 (+-0.266)            |     5.180 (+-0.000)      |          320.549 (+-0.387)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |        319.987 (+-0.134)        |         154.443 (+-0.014)          |            797.651 (+-0.232)            |     5.165 (+-0.000)      |          320.665 (+-0.397)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        326.138 (+-0.439)        |         149.092 (+-0.036)          |            772.508 (+-0.259)            |     5.181 (+-0.000)      |          325.751 (+-0.398)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        326.024 (+-0.118)        |         154.452 (+-0.209)          |            797.756 (+-0.229)            |     5.165 (+-0.000)      |          326.870 (+-0.372)         

Times are in microseconds (us).

```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230828-134459-affine-grid-sampler-PR-vs-Nightly-speedup.md)
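As a sanity check on the "speed-up PR vs Nightly" column, the ratio can be recomputed from the two "Compiled" columns. The snippet below is only a minimal sketch: the `speedup` helper and the `rows` dict are illustrative names that are not part of the benchmark script, and the timings are copied by hand from a few rows of the table above.

```python
# Hypothetical helper (not from the benchmark script): recompute the
# "speed-up PR vs Nightly" column as Compiled Nightly / Compiled PR.
def speedup(nightly_us: float, pr_us: float) -> float:
    return nightly_us / pr_us


# (Compiled PR, Compiled Nightly) median times in microseconds,
# copied from selected rows of the table above.
rows = {
    "contiguous, align_corners=False, bilinear": (75.781, 115.865),
    "contiguous, align_corners=True, nearest": (75.638, 72.621),
    "contiguous, align_corners=True, bicubic": (149.060, 772.116),
    "channels_last, align_corners=True, bicubic": (154.443, 797.651),
}

for name, (pr_us, nightly_us) in rows.items():
    print(f"{name}: {speedup(nightly_us, pr_us):.3f}")
```

Values above 1 mean the PR's compiled code is faster than Nightly; values slightly below 1 (as in the nearest rows) indicate a small slowdown.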

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov

[ghstack-poisoned]
vfdev-5 added a commit that referenced this pull request Aug 28, 2023
Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

ghstack-source-id: 9c506db
Pull Request resolved: #104710
@vfdev-5
Contributor Author

vfdev-5 commented Aug 28, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@vfdev-5
Contributor Author

vfdev-5 commented Aug 29, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@vfdev-5 vfdev-5 deleted the gh/vfdev-5/14/head branch August 29, 2023 07:18