[inductor] Improved grid_sampler_2d decomposition for cuda #104710

vfdev-5 · 2023-07-06T14:55:33Z

Stack from ghstack (oldest at bottom):

-> [inductor] Improved grid_sampler_2d decomposition for cuda #104710

Description:

Improved the grid_sampler_2d decomposition code so that it generates a single CUDA kernel instead of two.

Related to #104296

Perfs:

Speed-up on CUDA (~5x) and CPU (~2x) for bicubic mode.

Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git52598e9) PR" and "Compiled (2.1.0a0+gitcf76938) Nightly"

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cpu -------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+gitcf76938) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitcf76938) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         38.010 (+-0.118)        |          51.466 (+-1.257)          |             47.867 (+-0.124)            |     0.930 (+-0.000)      |           33.654 (+-0.411)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         35.532 (+-0.236)        |          52.189 (+-0.093)          |             58.979 (+-0.206)            |     1.130 (+-0.000)      |           32.543 (+-0.198)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |         38.187 (+-0.112)        |          47.892 (+-0.117)          |             45.833 (+-0.081)            |     0.957 (+-0.000)      |           33.752 (+-0.116)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |         36.708 (+-0.244)        |          51.680 (+-0.104)          |             58.360 (+-0.108)            |     1.129 (+-0.000)      |           32.576 (+-0.751)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         24.201 (+-0.088)        |          27.451 (+-0.059)          |             27.937 (+-0.081)            |     1.018 (+-0.000)      |           24.367 (+-0.074)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         19.266 (+-0.105)        |          26.070 (+-0.085)          |             26.092 (+-0.054)            |     1.001 (+-0.000)      |           20.144 (+-0.064)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |         24.293 (+-0.125)        |          26.085 (+-0.064)          |             26.575 (+-0.061)            |     1.019 (+-0.000)      |           24.515 (+-0.095)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |         19.440 (+-0.075)        |          25.252 (+-0.059)          |             25.259 (+-0.051)            |     1.000 (+-0.000)      |           19.770 (+-0.070)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |        114.900 (+-0.508)        |         113.416 (+-1.271)          |            248.679 (+-1.431)            |     2.193 (+-0.000)      |          114.609 (+-0.515)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |        115.973 (+-0.555)        |         124.711 (+-1.596)          |            282.187 (+-2.418)            |     2.263 (+-0.000)      |          115.368 (+-0.652)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        111.730 (+-0.562)        |         110.914 (+-0.865)          |            253.899 (+-2.226)            |     2.289 (+-0.000)      |          111.285 (+-1.226)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        112.859 (+-0.487)        |         131.696 (+-1.298)          |            294.124 (+-1.963)            |     2.233 (+-0.000)      |          110.910 (+-0.969)         

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cuda ------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+git52598e9) PR  |  Compiled (2.1.0a0+gitcf76938) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitcf76938) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |        228.811 (+-0.037)        |          92.990 (+-0.446)          |             92.648 (+-0.286)            |     0.996 (+-0.000)      |          228.274 (+-0.067)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |        222.107 (+-0.076)        |          93.247 (+-0.387)          |             92.528 (+-0.423)            |     0.992 (+-0.000)      |          221.922 (+-0.297)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |        235.654 (+-0.055)        |          75.781 (+-0.566)          |            115.865 (+-0.419)            |     1.529 (+-0.000)      |          236.032 (+-0.111)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |        226.752 (+-0.088)        |          76.312 (+-0.328)          |            116.468 (+-0.477)            |     1.526 (+-0.000)      |          226.950 (+-0.027)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |        225.540 (+-0.013)        |          75.638 (+-0.341)          |             72.621 (+-0.292)            |     0.960 (+-0.000)      |          225.937 (+-0.017)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |        217.425 (+-0.024)        |          75.484 (+-0.545)          |             73.518 (+-0.296)            |     0.974 (+-0.000)      |          217.793 (+-0.008)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |        231.474 (+-0.020)        |          75.972 (+-0.339)          |             73.030 (+-0.387)            |     0.961 (+-0.000)      |          231.991 (+-0.184)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |        223.408 (+-0.016)        |          75.622 (+-0.279)          |             73.542 (+-0.336)            |     0.973 (+-0.000)      |          223.893 (+-0.021)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |        319.382 (+-0.023)        |         149.060 (+-0.190)          |            772.116 (+-0.266)            |     5.180 (+-0.000)      |          320.549 (+-0.387)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |        319.987 (+-0.134)        |         154.443 (+-0.014)          |            797.651 (+-0.232)            |     5.165 (+-0.000)      |          320.665 (+-0.397)         
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        326.138 (+-0.439)        |         149.092 (+-0.036)          |            772.508 (+-0.259)            |     5.181 (+-0.000)      |          325.751 (+-0.398)         
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        326.024 (+-0.118)        |         154.452 (+-0.209)          |            797.756 (+-0.229)            |     5.165 (+-0.000)      |          326.870 (+-0.372)         

Times are in microseconds (us).
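For reference, the "speed-up PR vs Nightly" column appears to be the compiled Nightly time divided by the compiled PR time, so values above 1 mean the PR is faster. A minimal sketch that recomputes two bicubic rows from the tables above (the helper name `speed_up` is hypothetical, not part of the benchmark script):

```python
def speed_up(nightly_time: float, pr_time: float) -> float:
    """Return how many times faster the PR build is compared to Nightly."""
    return nightly_time / pr_time


# (Compiled Nightly, Compiled PR) values copied from the
# contiguous_format, align_corners=True, mode=bicubic rows above.
cpu_bicubic = speed_up(248.679, 113.416)   # times in ms
cuda_bicubic = speed_up(772.116, 149.060)  # times in us

print(f"cpu bicubic:  {cpu_bicubic:.3f}")
print(f"cuda bicubic: {cuda_bicubic:.3f}")
```

Both values agree with the table to three decimals (2.193 and 5.180). The ratio is unit-free, so the ms and us tables can be compared directly.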

Source

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov

Description: - Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two [ghstack-poisoned]

pytorch-bot · 2023-07-06T14:55:35Z


✅ No Failures

As of commit 74b97a3 with merge base bcda859:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Description: - Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two ghstack-source-id: b444b36 Pull Request resolved: #104710

Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

Related to #104296

Perfs:
- speed-up on cuda with bilinear, nearest and bicubic modes
- slowdown on cpu with bicubic and bilinear modes (bilinear may be some noise)

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git421fe7b) PR-afgg" and "Compiled (2.1.0a0+gitd3ba890) Nightly"

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cpu -------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git421fe7b) PR  |  Compiled (2.1.0a0+git421fe7b) PR  |  Compiled (2.1.0a0+gitd3ba890) Nightly  |  Speed-up PR vs Nightly  |  Eager (2.1.0a0+gitd3ba890) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |          7.489 (+-0.287)        |          16.801 (+-0.138)          |             13.330 (+-0.028)            |     0.793 (+-0.000)      |            7.522 (+-0.044)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |          7.587 (+-0.031)        |          12.494 (+-0.066)          |             16.024 (+-0.128)            |     1.283 (+-0.000)      |            7.530 (+-0.086)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |          7.808 (+-0.038)        |          20.410 (+-1.616)          |             13.149 (+-0.200)            |     0.644 (+-0.000)      |            7.672 (+-0.124)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |          7.989 (+-0.034)        |          12.130 (+-0.033)          |             15.698 (+-0.118)            |     1.294 (+-0.000)      |            7.745 (+-0.078)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |          4.593 (+-0.030)        |           5.848 (+-0.012)          |              6.471 (+-0.087)            |     1.106 (+-0.000)      |            4.388 (+-0.066)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |          4.190 (+-0.008)        |           5.979 (+-0.008)          |              6.490 (+-0.069)            |     1.085 (+-0.000)      |            4.397 (+-0.021)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |          4.582 (+-0.011)        |           5.465 (+-0.024)          |              6.464 (+-0.193)            |     1.183 (+-0.000)      |            4.793 (+-0.024)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |          4.403 (+-0.004)        |           5.866 (+-0.007)          |              6.688 (+-0.196)            |     1.140 (+-0.000)      |            4.370 (+-0.013)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |         26.106 (+-0.138)        |         104.156 (+-3.881)          |             64.199 (+-0.402)            |     0.616 (+-0.000)      |           26.645 (+-0.173)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |         26.192 (+-0.141)        |         102.890 (+-1.249)          |             71.674 (+-0.679)            |     0.697 (+-0.000)      |           26.498 (+-0.220)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |         25.752 (+-0.133)        |          99.068 (+-3.399)          |             66.274 (+-0.172)            |     0.669 (+-0.000)      |           26.758 (+-0.081)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |         26.366 (+-0.082)        |         103.052 (+-1.758)          |             72.297 (+-0.398)            |     0.702 (+-0.000)      |           26.535 (+-0.145)

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cuda ------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git421fe7b) PR  |  Compiled (2.1.0a0+git421fe7b) PR  |  Compiled (2.1.0a0+gitd3ba890) Nightly  |  Speed-up PR vs Nightly  |  Eager (2.1.0a0+gitd3ba890) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         88.257 (+-0.462)        |         125.216 (+-0.401)          |            136.807 (+-0.636)            |     1.093 (+-0.000)      |           86.542 (+-0.393)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         87.649 (+-0.440)        |         125.382 (+-3.365)          |            136.184 (+-0.356)            |     1.086 (+-0.000)      |           86.255 (+-0.361)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |        111.428 (+-0.511)        |         108.644 (+-0.338)          |            221.696 (+-0.820)            |     2.041 (+-0.000)      |          110.533 (+-0.721)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |        110.717 (+-0.458)        |         108.719 (+-0.427)          |            222.547 (+-0.218)            |     2.047 (+-0.000)      |          110.922 (+-0.567)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         77.541 (+-0.317)        |         108.937 (+-0.301)          |            142.400 (+-0.258)            |     1.307 (+-0.000)      |           76.351 (+-0.443)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         77.313 (+-0.341)        |         108.872 (+-0.421)          |            142.147 (+-0.709)            |     1.306 (+-0.000)      |           76.435 (+-0.390)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |        110.669 (+-0.475)        |         109.328 (+-0.345)          |            178.589 (+-0.474)            |     1.634 (+-0.000)      |          110.797 (+-0.527)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |        110.605 (+-0.521)        |         109.049 (+-0.401)          |            178.601 (+-0.382)            |     1.638 (+-0.000)      |          109.887 (+-0.417)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |         92.652 (+-0.097)        |         333.377 (+-0.011)          |           1800.770 (+-0.552)            |     5.402 (+-0.000)      |           92.892 (+-0.220)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |         92.280 (+-0.373)        |         334.606 (+-0.026)          |           1463.572 (+-0.644)            |     4.374 (+-0.000)      |           92.596 (+-0.489)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        110.864 (+-0.533)        |         333.195 (+-0.016)          |           1806.567 (+-0.456)            |     5.422 (+-0.000)      |          110.970 (+-0.444)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        112.047 (+-0.700)        |         334.676 (+-0.028)          |           1470.514 (+-1.586)            |     4.394 (+-0.000)      |          110.857 (+-0.506)

Times are in microseconds (us).
```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230706-162617-affine-grid-sampler-PR-vs-Nightly-speedup.md)

[ghstack-poisoned]

Description: - Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two ghstack-source-id: ab74cf4 Pull Request resolved: #104710

torch/_decomp/decompositions.py

Description: - Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two ghstack-source-id: 38d904a Pull Request resolved: #104710

Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

Related to #104296

Perfs:
- speed-up on cuda with bilinear, nearest and bicubic modes
- slowdown on cpu with bilinear mode, CF

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git1c48419) PR" and "Compiled (2.1.0a0+gitbcdd413) Nightly"

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cpu -------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1c48419) PR  |  Compiled (2.1.0a0+git1c48419) PR  |  Compiled (2.1.0a0+gitbcdd413) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitbcdd413) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |          8.466 (+-0.072)        |          15.557 (+-0.054)          |             13.292 (+-0.113)            |     0.854 (+-0.000)      |            7.567 (+-0.037)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |          8.685 (+-0.035)        |          11.384 (+-0.024)          |             15.798 (+-0.036)            |     1.388 (+-0.000)      |            7.489 (+-0.114)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |          8.572 (+-0.085)        |          15.867 (+-0.046)          |             12.964 (+-0.050)            |     0.817 (+-0.000)      |            7.623 (+-0.126)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |          8.834 (+-0.169)        |          11.447 (+-0.030)          |             15.386 (+-0.061)            |     1.344 (+-0.000)      |            7.647 (+-0.030)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |          5.039 (+-0.011)        |           4.569 (+-0.016)          |              6.383 (+-0.038)            |     1.397 (+-0.000)      |            4.504 (+-0.028)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |          4.326 (+-0.008)        |           4.867 (+-0.013)          |              6.393 (+-0.067)            |     1.314 (+-0.000)      |            4.270 (+-0.066)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |          5.085 (+-0.031)        |           4.220 (+-0.006)          |              6.426 (+-0.126)            |     1.523 (+-0.000)      |            4.780 (+-0.204)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |          4.411 (+-0.004)        |           4.619 (+-0.005)          |              6.283 (+-0.114)            |     1.360 (+-0.000)      |            4.315 (+-0.028)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |         26.061 (+-0.083)        |          28.477 (+-0.026)          |             63.423 (+-0.464)            |     2.227 (+-0.000)      |           25.943 (+-0.299)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |         26.358 (+-0.086)        |          30.660 (+-0.328)          |             71.692 (+-0.282)            |     2.338 (+-0.000)      |           26.143 (+-0.299)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |         26.172 (+-0.124)        |          28.072 (+-0.039)          |             65.312 (+-0.478)            |     2.327 (+-0.000)      |           25.810 (+-0.344)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |         26.522 (+-0.065)        |          30.480 (+-0.060)          |             71.560 (+-0.606)            |     2.348 (+-0.000)      |           26.105 (+-1.344)

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------- Affine grid sampling, cuda ------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1c48419) PR  |  Compiled (2.1.0a0+git1c48419) PR  |  Compiled (2.1.0a0+gitbcdd413) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+gitbcdd413) Nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |         88.726 (+-0.344)        |          88.732 (+-0.194)          |            141.983 (+-0.551)            |     1.600 (+-0.000)      |           89.228 (+-0.300)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |         88.873 (+-0.366)        |          88.690 (+-0.196)          |            141.351 (+-0.456)            |     1.594 (+-0.000)      |           89.257 (+-0.326)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |        110.747 (+-0.742)        |          69.262 (+-0.174)          |            228.701 (+-8.460)            |     3.302 (+-0.000)      |          112.709 (+-0.746)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |        110.729 (+-0.421)        |          68.543 (+-0.096)          |            230.542 (+-0.656)            |     3.363 (+-0.000)      |          112.994 (+-0.644)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |         78.248 (+-0.323)        |          68.913 (+-0.227)          |            148.836 (+-0.244)            |     2.160 (+-0.000)      |           79.004 (+-0.973)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |         77.898 (+-0.362)        |          68.819 (+-0.218)          |            149.036 (+-0.566)            |     2.166 (+-0.000)      |           78.681 (+-0.309)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |        111.041 (+-0.404)        |          69.329 (+-0.100)          |            184.097 (+-0.673)            |     2.655 (+-0.000)      |          113.252 (+-0.585)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |        110.903 (+-0.391)        |          70.003 (+-0.271)          |            183.848 (+-1.566)            |     2.626 (+-0.000)      |          113.787 (+-0.943)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |         92.796 (+-0.536)        |          69.966 (+-0.218)          |           1793.246 (+-0.481)            |    25.630 (+-0.000)      |           92.416 (+-0.072)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |         92.744 (+-0.439)        |          87.140 (+-0.101)          |           1457.581 (+-0.599)            |    16.727 (+-0.000)      |           92.510 (+-0.557)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        110.599 (+-0.331)        |          70.036 (+-0.280)          |           1800.172 (+-0.422)            |    25.704 (+-0.000)      |          112.876 (+-0.498)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        111.289 (+-0.346)        |          86.682 (+-0.020)          |           1463.788 (+-0.566)            |    16.887 (+-0.000)      |          112.987 (+-0.358)

Times are in microseconds (us).
```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230711-113600-affine-grid-sampler-PR-vs-Nightly-speedup.md)

[ghstack-poisoned]

Description: - Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two ghstack-source-id: 52d3638 Pull Request resolved: #104710

vfdev-5 · 2023-07-11T16:25:26Z

Setting this PR as draft, since the performance of the compiled bicubic function on CUDA depends on the output memory format:

      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |         92.651 (+-0.390)        |          70.458 (+-0.018)        
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |         92.567 (+-0.493)        |         1236.525 (+-0.301)       
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |        113.788 (+-0.407)        |          68.739 (+-0.085)        
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |        114.094 (+-0.303)        |         1235.920 (+-0.349)       

Times are in microseconds (us).

Also, the ATen grid sampler does not preserve the memory format: if the input is channels-last, the output is channels-first (contiguous).
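The memory-format behaviour described above can be checked in eager mode with a small sketch like the following. It runs on CPU; the 8x8 input and the identity `theta` are assumptions chosen only to keep the example short, not the benchmark configuration.

```python
import torch
import torch.nn.functional as F

# Small channels-last input (hypothetical shape, for illustration only).
x = torch.rand(2, 3, 8, 8).to(memory_format=torch.channels_last)

# Identity affine transform, one 2x3 matrix per batch element.
theta = torch.eye(2, 3).repeat(2, 1, 1)
grid = F.affine_grid(theta, x.shape, align_corners=False)

# Eager (ATen) grid sampler in bicubic mode.
out = F.grid_sample(x, grid, mode="bicubic", align_corners=False)

print("input channels_last: ", x.is_contiguous(memory_format=torch.channels_last))
print("output channels_last:", out.is_contiguous(memory_format=torch.channels_last))
print("output contiguous:   ", out.is_contiguous())
```

If the eager output reports a contiguous (channels-first) layout for a channels-last input, then a compiled function that produces channels-last output differs from eager in layout. That layout difference is what the timings above correlate with.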

Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

Related to #104296

Perfs:
- speed-up on cuda with bilinear, nearest and bicubic modes
- slowdown on cpu with bilinear mode, CF

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git1c48419) PR" and "Compiled (2.1.0a0+gitbcdd413) Nightly"

[------------------------------ Affine grid sampling, cpu ------------------------------]
                                                                                                   | Eager (2.1.0a0+git1c48419) PR | Compiled (2.1.0a0+git1c48419) PR | Compiled (2.1.0a0+gitbcdd413) Nightly | speed-up PR vs Nightly | Eager (2.1.0a0+gitbcdd413) Nightly
1 threads: ------------------------------------------------------------------------------
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear  | 8.466 (+-0.072)               | 15.557 (+-0.054)                 | 13.292 (+-0.113)                      | 0.854 (+-0.000)        | 7.567 (+-0.037)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear      | 8.685 (+-0.035)               | 11.384 (+-0.024)                 | 15.798 (+-0.036)                      | 1.388 (+-0.000)        | 7.489 (+-0.114)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear | 8.572 (+-0.085)               | 15.867 (+-0.046)                 | 12.964 (+-0.050)                      | 0.817 (+-0.000)        | 7.623 (+-0.126)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear     | 8.834 (+-0.169)               | 11.447 (+-0.030)                 | 15.386 (+-0.061)                      | 1.344 (+-0.000)        | 7.647 (+-0.030)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest   | 5.039 (+-0.011)               | 4.569 (+-0.016)                  | 6.383 (+-0.038)                       | 1.397 (+-0.000)        | 4.504 (+-0.028)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest       | 4.326 (+-0.008)               | 4.867 (+-0.013)                  | 6.393 (+-0.067)                       | 1.314 (+-0.000)        | 4.270 (+-0.066)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest  | 5.085 (+-0.031)               | 4.220 (+-0.006)                  | 6.426 (+-0.126)                       | 1.523 (+-0.000)        | 4.780 (+-0.204)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest      | 4.411 (+-0.004)               | 4.619 (+-0.005)                  | 6.283 (+-0.114)                       | 1.360 (+-0.000)        | 4.315 (+-0.028)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic   | 26.061 (+-0.083)              | 28.477 (+-0.026)                 | 63.423 (+-0.464)                      | 2.227 (+-0.000)        | 25.943 (+-0.299)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic       | 26.358 (+-0.086)              | 30.660 (+-0.328)                 | 71.692 (+-0.282)                      | 2.338 (+-0.000)        | 26.143 (+-0.299)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic  | 26.172 (+-0.124)              | 28.072 (+-0.039)                 | 65.312 (+-0.478)                      | 2.327 (+-0.000)        | 25.810 (+-0.344)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic      | 26.522 (+-0.065)              | 30.480 (+-0.060)                 | 71.560 (+-0.606)                      | 2.348 (+-0.000)        | 26.105 (+-1.344)

Times are in milliseconds (ms).

[------------------------------ Affine grid sampling, cuda ------------------------------]
                                                                                                   | Eager (2.1.0a0+git1c48419) PR | Compiled (2.1.0a0+git1c48419) PR | Compiled (2.1.0a0+gitbcdd413) Nightly | speed-up PR vs Nightly | Eager (2.1.0a0+gitbcdd413) Nightly
1 threads: ------------------------------------------------------------------------------
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear  | 88.726 (+-0.344)              | 88.732 (+-0.194)                 | 141.983 (+-0.551)                     | 1.600 (+-0.000)        | 89.228 (+-0.300)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear      | 88.873 (+-0.366)              | 88.690 (+-0.196)                 | 141.351 (+-0.456)                     | 1.594 (+-0.000)        | 89.257 (+-0.326)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear | 110.747 (+-0.742)             | 69.262 (+-0.174)                 | 228.701 (+-8.460)                     | 3.302 (+-0.000)        | 112.709 (+-0.746)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear     | 110.729 (+-0.421)             | 68.543 (+-0.096)                 | 230.542 (+-0.656)                     | 3.363 (+-0.000)        | 112.994 (+-0.644)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest   | 78.248 (+-0.323)              | 68.913 (+-0.227)                 | 148.836 (+-0.244)                     | 2.160 (+-0.000)        | 79.004 (+-0.973)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest       | 77.898 (+-0.362)              | 68.819 (+-0.218)                 | 149.036 (+-0.566)                     | 2.166 (+-0.000)        | 78.681 (+-0.309)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest  | 111.041 (+-0.404)             | 69.329 (+-0.100)                 | 184.097 (+-0.673)                     | 2.655 (+-0.000)        | 113.252 (+-0.585)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest      | 110.903 (+-0.391)             | 70.003 (+-0.271)                 | 183.848 (+-1.566)                     | 2.626 (+-0.000)        | 113.787 (+-0.943)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic   | 92.796 (+-0.536)              | 69.966 (+-0.218)                 | 1793.246 (+-0.481)                    | 25.630 (+-0.000)       | 92.416 (+-0.072)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic       | 92.744 (+-0.439)              | 87.140 (+-0.101)                 | 1457.581 (+-0.599)                    | 16.727 (+-0.000)       | 92.510 (+-0.557)
Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic  | 110.599 (+-0.331)             | 70.036 (+-0.280)                 | 1800.172 (+-0.422)                    | 25.704 (+-0.000)       | 112.876 (+-0.498)
Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic      | 111.289 (+-0.346)             | 86.682 (+-0.020)                 | 1463.788 (+-0.566)                    | 16.887 (+-0.000)       | 112.987 (+-0.358)

Times are in microseconds (us).
```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230711-113600-affine-grid-sampler-PR-vs-Nightly-speedup.md)

[ghstack-poisoned]
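The "speed-up PR vs Nightly" column can be reproduced from the two "Compiled" columns: it is the Nightly time divided by the PR time, so a value above 1 means the PR is faster and a value below 1 is a slowdown. A minimal sketch of that arithmetic, using three rows copied from the tables above; the helper name `speed_up` and the row labels are hypothetical and not part of the benchmark script:

```python
def speed_up(nightly_time: float, pr_time: float) -> float:
    """Nightly compiled time divided by PR compiled time (hypothetical helper)."""
    return nightly_time / pr_time


# (Compiled Nightly, Compiled PR) pairs copied from the benchmark tables.
rows = {
    "cpu, contiguous_format, align_corners=True, bilinear (ms)": (13.292, 15.557),
    "cpu, channels_last, align_corners=True, bilinear (ms)": (15.798, 11.384),
    "cuda, contiguous_format, align_corners=True, bicubic (us)": (1793.246, 69.966),
}

# Round to three decimals, matching the precision used in the tables.
ratios = {name: round(speed_up(nightly, pr), 3) for name, (nightly, pr) in rows.items()}

for name, ratio in ratios.items():
    verdict = "faster" if ratio > 1 else "slower"
    print(f"{name}: {ratio:.3f} ({verdict} than Nightly)")
```

The first row gives a ratio below 1, which is the CPU slowdown for bilinear mode with contiguous format noted in the description.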

Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

ghstack-source-id: f91d008
Pull Request resolved: #104710

Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

ghstack-source-id: b7d669b
Pull Request resolved: #104710

torch/_inductor/decomposition.py

vfdev-5 · 2023-08-28T16:40:55Z

@pytorchbot merge

pytorchmergebot · 2023-08-28T16:42:34Z

Merge failed

Reason: This PR needs a `release notes:` label.
If your changes are user facing and intended to be a part of release notes, please use a label starting with `release notes:`.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team

Raised by workflow job

vfdev-5 · 2023-08-28T16:44:54Z

@pytorchbot merge

pytorchmergebot · 2023-08-28T16:47:09Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2023-08-28T19:29:55Z

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 52598e95500417e2328246166b773c128feed100 returned non-zero exit code 1

Auto-merging test/expect/HasDecompTest.test_aten_core_operators.expect
Auto-merging torch/_decomp/__init__.py
Auto-merging torch/_decomp/decompositions.py
Auto-merging torch/_inductor/decomposition.py
CONFLICT (content): Merge conflict in torch/_inductor/decomposition.py
error: could not apply 52598e95500... [inductor] Improved grid_sampler_2d decomposition for cuda
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".

Details for Dev Infra team

Raised by workflow job

Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

Related to #104296

Perfs:
- speed-up on cuda (~x5) and cpu (~x2) for bicubic mode

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git52598e9) PR" and "Compiled (2.1.0a0+gitcf76938) Nightly"

[------------------------------ Affine grid sampling, cpu ------------------------------]
                                                                                                   | Eager (2.1.0a0+git52598e9) PR | Compiled (2.1.0a0+git52598e9) PR | Compiled (2.1.0a0+gitcf76938) Nightly | speed-up PR vs Nightly | Eager (2.1.0a0+gitcf76938) Nightly
1 threads: ------------------------------------------------------------------------------
Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear  | 38.010 (+-0.118)              | 51.466 (+-1.257)                 | 47.867 (+-0.124)                      | 0.930 (+-0.000)        | 33.654 (+-0.411)
Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear      | 35.532 (+-0.236)              | 52.189 (+-0.093)                 | 58.979 (+-0.206)                      | 1.130 (+-0.000)        | 32.543 (+-0.198)
Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear | 38.187 (+-0.112)              | 47.892 (+-0.117)                 | 45.833 (+-0.081)                      | 0.957 (+-0.000)        | 33.752 (+-0.116)
Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear     | 36.708 (+-0.244)              | 51.680 (+-0.104)                 | 58.360 (+-0.108)                      | 1.129 (+-0.000)        | 32.576 (+-0.751)
Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest   | 24.201 (+-0.088)              | 27.451 (+-0.059)                 | 27.937 (+-0.081)                      | 1.018 (+-0.000)        | 24.367 (+-0.074)
Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest       | 19.266 (+-0.105)              | 26.070 (+-0.085)                 | 26.092 (+-0.054)                      | 1.001 (+-0.000)        | 20.144 (+-0.064)
Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest  | 24.293 (+-0.125)              | 26.085 (+-0.064)                 | 26.575 (+-0.061)                      | 1.019 (+-0.000)        | 24.515 (+-0.095)
Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest      | 19.440 (+-0.075)              | 25.252 (+-0.059)                 | 25.259 (+-0.051)                      | 1.000 (+-0.000)        | 19.770 (+-0.070)
Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic   | 114.900 (+-0.508)             | 113.416 (+-1.271)                | 248.679 (+-1.431)                     | 2.193 (+-0.000)        | 114.609 (+-0.515)
Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic       | 115.973 (+-0.555)             | 124.711 (+-1.596)                | 282.187 (+-2.418)                     | 2.263 (+-0.000)        | 115.368 (+-0.652)
Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic  | 111.730 (+-0.562)             | 110.914 (+-0.865)                | 253.899 (+-2.226)                     | 2.289 (+-0.000)        | 111.285 (+-1.226)
Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic      | 112.859 (+-0.487)             | 131.696 (+-1.298)                | 294.124 (+-1.963)                     | 2.233 (+-0.000)        | 110.910 (+-0.969)

Times are in milliseconds (ms).

[------------------------------ Affine grid sampling, cuda ------------------------------]
                                                                                                   | Eager (2.1.0a0+git52598e9) PR | Compiled (2.1.0a0+git52598e9) PR | Compiled (2.1.0a0+gitcf76938) Nightly | speed-up PR vs Nightly | Eager (2.1.0a0+gitcf76938) Nightly
1 threads: ------------------------------------------------------------------------------
Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear  | 228.811 (+-0.037)             | 92.990 (+-0.446)                 | 92.648 (+-0.286)                      | 0.996 (+-0.000)        | 228.274 (+-0.067)
Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear      | 222.107 (+-0.076)             | 93.247 (+-0.387)                 | 92.528 (+-0.423)                      | 0.992 (+-0.000)        | 221.922 (+-0.297)
Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear | 235.654 (+-0.055)             | 75.781 (+-0.566)                 | 115.865 (+-0.419)                     | 1.529 (+-0.000)        | 236.032 (+-0.111)
Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear     | 226.752 (+-0.088)             | 76.312 (+-0.328)                 | 116.468 (+-0.477)                     | 1.526 (+-0.000)        | 226.950 (+-0.027)
Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest   | 225.540 (+-0.013)             | 75.638 (+-0.341)                 | 72.621 (+-0.292)                      | 0.960 (+-0.000)        | 225.937 (+-0.017)
Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest       | 217.425 (+-0.024)             | 75.484 (+-0.545)                 | 73.518 (+-0.296)                      | 0.974 (+-0.000)        | 217.793 (+-0.008)
Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest  | 231.474 (+-0.020)             | 75.972 (+-0.339)                 | 73.030 (+-0.387)                      | 0.961 (+-0.000)        | 231.991 (+-0.184)
Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest      | 223.408 (+-0.016)             | 75.622 (+-0.279)                 | 73.542 (+-0.336)                      | 0.973 (+-0.000)        | 223.893 (+-0.021)
Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic   | 319.382 (+-0.023)             | 149.060 (+-0.190)                | 772.116 (+-0.266)                     | 5.180 (+-0.000)        | 320.549 (+-0.387)
Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic       | 319.987 (+-0.134)             | 154.443 (+-0.014)                | 797.651 (+-0.232)                     | 5.165 (+-0.000)        | 320.665 (+-0.397)
Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic  | 326.138 (+-0.439)             | 149.092 (+-0.036)                | 772.508 (+-0.259)                     | 5.181 (+-0.000)        | 325.751 (+-0.398)
Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic      | 326.024 (+-0.118)             | 154.452 (+-0.209)                | 797.756 (+-0.229)                     | 5.165 (+-0.000)        | 326.870 (+-0.372)

Times are in microseconds (us).
```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230828-134459-affine-grid-sampler-PR-vs-Nightly-speedup.md)

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov

[ghstack-poisoned]

Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

ghstack-source-id: 9c506db
Pull Request resolved: #104710

vfdev-5 · 2023-08-28T20:04:25Z

@pytorchbot merge

pytorchmergebot · 2023-08-28T20:05:59Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


pytorchmergebot · 2023-08-28T20:56:32Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 1, 5, linux.4xlarge.nvidia.gpu)

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

vfdev-5 · 2023-08-29T05:52:38Z

@pytorchbot merge

pytorchmergebot · 2023-08-29T05:54:17Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


[inductor] Improved grid_sampler_2d decomposition for cuda

1df4b9f

Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

[ghstack-poisoned]

vfdev-5 mentioned this pull request Jul 6, 2023

[inductor] Added affine_grid_generator decomposition #104709

Closed

github-actions bot added the ciflow/inductor label Jul 6, 2023

lezcano approved these changes Jul 7, 2023

torch/_decomp/decompositions.py (outdated, resolved)

pytorchbot added the open source label Jul 7, 2023

vfdev-5 requested a review from lezcano July 11, 2023 10:14

lezcano approved these changes Jul 11, 2023

vfdev-5 marked this pull request as draft July 11, 2023 16:23

lezcano mentioned this pull request Jul 24, 2023

affine_grid and grid_sample operators merge/accelleration #104296

Open

github-actions bot added the module: inductor label Aug 18, 2023

vfdev-5 marked this pull request as ready for review August 18, 2023 08:07

vfdev-5 requested a review from lezcano August 28, 2023 14:52

lezcano reviewed Aug 28, 2023

torch/_inductor/decomposition.py (resolved)

lezcano approved these changes Aug 28, 2023

pytorch-bot bot added the ciflow/trunk label Aug 28, 2023

pytorchmergebot added the merging label Aug 28, 2023

pytorchmergebot removed the merging label Aug 28, 2023

vfdev-5 added the topic: not user facing label Aug 28, 2023

pytorchmergebot added the merging label Aug 28, 2023

pytorchmergebot removed the merging label Aug 28, 2023

pytorchmergebot added the merging label Aug 28, 2023

pytorchmergebot removed the merging label Aug 28, 2023

pytorchmergebot added the merging label Aug 29, 2023

pytorchmergebot added Merged and removed merging labels Aug 29, 2023

pytorchmergebot closed this in 0cfc589 Aug 29, 2023

vfdev-5 deleted the gh/vfdev-5/14/head branch August 29, 2023 07:18

pytorch-bot bot commented Jul 6, 2023 (edited)

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/104710

✅ No Failures
