[inductor] Improved grid_sampler_2d decomposition for cuda #104710
Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/104710

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures
As of commit 74b97a3 with merge base bcda859.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

Related to #104296

Perfs:
- speed-up on cuda with bilinear, nearest and bicubic modes
- slowdown on cpu with bicubic and bilinear modes (bilinear may be some noise)

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git421fe7b) PR-afgg" and "Compiled (2.1.0a0+gitd3ba890) Nightly"

[------------------------------------------------------------------------------------------------------------------------ Affine grid sampling, cpu ------------------------------------------------------------------------------------------------------------------------]
                                                                                                     | Eager (2.1.0a0+git421fe7b) PR | Compiled (2.1.0a0+git421fe7b) PR | Compiled (2.1.0a0+gitd3ba890) Nightly | Speed-up PR vs Nightly | Eager (2.1.0a0+gitd3ba890) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear  | 7.489 (+-0.287)               | 16.801 (+-0.138)                 | 13.330 (+-0.028)                      | 0.793 (+-0.000)        | 7.522 (+-0.044)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear      | 7.587 (+-0.031)               | 12.494 (+-0.066)                 | 16.024 (+-0.128)                      | 1.283 (+-0.000)        | 7.530 (+-0.086)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear | 7.808 (+-0.038)               | 20.410 (+-1.616)                 | 13.149 (+-0.200)                      | 0.644 (+-0.000)        | 7.672 (+-0.124)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear     | 7.989 (+-0.034)               | 12.130 (+-0.033)                 | 15.698 (+-0.118)                      | 1.294 (+-0.000)        | 7.745 (+-0.078)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest   | 4.593 (+-0.030)               | 5.848 (+-0.012)                  | 6.471 (+-0.087)                       | 1.106 (+-0.000)        | 4.388 (+-0.066)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest       | 4.190 (+-0.008)               | 5.979 (+-0.008)                  | 6.490 (+-0.069)                       | 1.085 (+-0.000)        | 4.397 (+-0.021)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest  | 4.582 (+-0.011)               | 5.465 (+-0.024)                  | 6.464 (+-0.193)                       | 1.183 (+-0.000)        | 4.793 (+-0.024)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest      | 4.403 (+-0.004)               | 5.866 (+-0.007)                  | 6.688 (+-0.196)                       | 1.140 (+-0.000)        | 4.370 (+-0.013)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic   | 26.106 (+-0.138)              | 104.156 (+-3.881)                | 64.199 (+-0.402)                      | 0.616 (+-0.000)        | 26.645 (+-0.173)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic       | 26.192 (+-0.141)              | 102.890 (+-1.249)                | 71.674 (+-0.679)                      | 0.697 (+-0.000)        | 26.498 (+-0.220)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic  | 25.752 (+-0.133)              | 99.068 (+-3.399)                 | 66.274 (+-0.172)                      | 0.669 (+-0.000)        | 26.758 (+-0.081)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic      | 26.366 (+-0.082)              | 103.052 (+-1.758)                | 72.297 (+-0.398)                      | 0.702 (+-0.000)        | 26.535 (+-0.145)

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------ Affine grid sampling, cuda -----------------------------------------------------------------------------------------------------------------------]
                                                                                                     | Eager (2.1.0a0+git421fe7b) PR | Compiled (2.1.0a0+git421fe7b) PR | Compiled (2.1.0a0+gitd3ba890) Nightly | Speed-up PR vs Nightly | Eager (2.1.0a0+gitd3ba890) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear  | 88.257 (+-0.462)              | 125.216 (+-0.401)                | 136.807 (+-0.636)                     | 1.093 (+-0.000)        | 86.542 (+-0.393)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear      | 87.649 (+-0.440)              | 125.382 (+-3.365)                | 136.184 (+-0.356)                     | 1.086 (+-0.000)        | 86.255 (+-0.361)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear | 111.428 (+-0.511)             | 108.644 (+-0.338)                | 221.696 (+-0.820)                     | 2.041 (+-0.000)        | 110.533 (+-0.721)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear     | 110.717 (+-0.458)             | 108.719 (+-0.427)                | 222.547 (+-0.218)                     | 2.047 (+-0.000)        | 110.922 (+-0.567)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest   | 77.541 (+-0.317)              | 108.937 (+-0.301)                | 142.400 (+-0.258)                     | 1.307 (+-0.000)        | 76.351 (+-0.443)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest       | 77.313 (+-0.341)              | 108.872 (+-0.421)                | 142.147 (+-0.709)                     | 1.306 (+-0.000)        | 76.435 (+-0.390)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest  | 110.669 (+-0.475)             | 109.328 (+-0.345)                | 178.589 (+-0.474)                     | 1.634 (+-0.000)        | 110.797 (+-0.527)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest      | 110.605 (+-0.521)             | 109.049 (+-0.401)                | 178.601 (+-0.382)                     | 1.638 (+-0.000)        | 109.887 (+-0.417)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic   | 92.652 (+-0.097)              | 333.377 (+-0.011)                | 1800.770 (+-0.552)                    | 5.402 (+-0.000)        | 92.892 (+-0.220)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic       | 92.280 (+-0.373)              | 334.606 (+-0.026)                | 1463.572 (+-0.644)                    | 4.374 (+-0.000)        | 92.596 (+-0.489)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic  | 110.864 (+-0.533)             | 333.195 (+-0.016)                | 1806.567 (+-0.456)                    | 5.422 (+-0.000)        | 110.970 (+-0.444)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic      | 112.047 (+-0.700)             | 334.676 (+-0.028)                | 1470.514 (+-1.586)                    | 4.394 (+-0.000)        | 110.857 (+-0.506)

Times are in microseconds (us).
```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230706-162617-affine-grid-sampler-PR-vs-Nightly-speedup.md)
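The "Speed-up PR vs Nightly" column can be cross-checked from the two "Compiled" columns. The sketch below assumes the ratio is Nightly time divided by PR time (so values above 1 mean the PR is faster), which is consistent with the reported numbers; the helper name `speedup` and the short row labels are hypothetical and only used for illustration.

```python
# Rows copied from the tables above:
# (label, Compiled PR time, Compiled Nightly time, reported speed-up).
# cpu times are in ms, cuda times in us; the ratio is unit-free.
rows = [
    ("cpu, contiguous_format, align_corners=True, bilinear", 16.801, 13.330, 0.793),
    ("cpu, channels_last, align_corners=True, bilinear", 12.494, 16.024, 1.283),
    ("cuda, contiguous_format, align_corners=False, bilinear", 108.644, 221.696, 2.041),
    ("cuda, contiguous_format, align_corners=True, bicubic", 333.377, 1800.770, 5.402),
]


def speedup(compiled_pr: float, compiled_nightly: float) -> float:
    """Assumed definition: Nightly time / PR time (> 1 means the PR is faster)."""
    return compiled_nightly / compiled_pr


for label, pr, nightly, reported in rows:
    ratio = speedup(pr, nightly)
    verdict = "speed-up" if ratio > 1.0 else "slowdown"
    print(f"{label}: {ratio:.3f} (reported {reported:.3f}) -> {verdict}")
```

Under this reading, the cpu contiguous_format bilinear row (0.793) is a slowdown, while the cuda rows are all speed-ups.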
Description:
- Improved grid_sampler_2d decomposition code to generate single cuda kernel instead of two

Related to #104296

Perfs:
- speed-up on cuda with bilinear, nearest and bicubic modes
- slowdown on cpu with bilinear mode and channels-first (contiguous_format) input

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git1c48419) PR" and "Compiled (2.1.0a0+gitbcdd413) Nightly"

[------------------------------------------------------------------------------------------------------------------------ Affine grid sampling, cpu ------------------------------------------------------------------------------------------------------------------------]
                                                                                                     | Eager (2.1.0a0+git1c48419) PR | Compiled (2.1.0a0+git1c48419) PR | Compiled (2.1.0a0+gitbcdd413) Nightly | speed-up PR vs Nightly | Eager (2.1.0a0+gitbcdd413) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear  | 8.466 (+-0.072)               | 15.557 (+-0.054)                 | 13.292 (+-0.113)                      | 0.854 (+-0.000)        | 7.567 (+-0.037)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear      | 8.685 (+-0.035)               | 11.384 (+-0.024)                 | 15.798 (+-0.036)                      | 1.388 (+-0.000)        | 7.489 (+-0.114)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear | 8.572 (+-0.085)               | 15.867 (+-0.046)                 | 12.964 (+-0.050)                      | 0.817 (+-0.000)        | 7.623 (+-0.126)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear     | 8.834 (+-0.169)               | 11.447 (+-0.030)                 | 15.386 (+-0.061)                      | 1.344 (+-0.000)        | 7.647 (+-0.030)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest   | 5.039 (+-0.011)               | 4.569 (+-0.016)                  | 6.383 (+-0.038)                       | 1.397 (+-0.000)        | 4.504 (+-0.028)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest       | 4.326 (+-0.008)               | 4.867 (+-0.013)                  | 6.393 (+-0.067)                       | 1.314 (+-0.000)        | 4.270 (+-0.066)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest  | 5.085 (+-0.031)               | 4.220 (+-0.006)                  | 6.426 (+-0.126)                       | 1.523 (+-0.000)        | 4.780 (+-0.204)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest      | 4.411 (+-0.004)               | 4.619 (+-0.005)                  | 6.283 (+-0.114)                       | 1.360 (+-0.000)        | 4.315 (+-0.028)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic   | 26.061 (+-0.083)              | 28.477 (+-0.026)                 | 63.423 (+-0.464)                      | 2.227 (+-0.000)        | 25.943 (+-0.299)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic       | 26.358 (+-0.086)              | 30.660 (+-0.328)                 | 71.692 (+-0.282)                      | 2.338 (+-0.000)        | 26.143 (+-0.299)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic  | 26.172 (+-0.124)              | 28.072 (+-0.039)                 | 65.312 (+-0.478)                      | 2.327 (+-0.000)        | 25.810 (+-0.344)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic      | 26.522 (+-0.065)              | 30.480 (+-0.060)                 | 71.560 (+-0.606)                      | 2.348 (+-0.000)        | 26.105 (+-1.344)

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------ Affine grid sampling, cuda -----------------------------------------------------------------------------------------------------------------------]
                                                                                                     | Eager (2.1.0a0+git1c48419) PR | Compiled (2.1.0a0+git1c48419) PR | Compiled (2.1.0a0+gitbcdd413) Nightly | speed-up PR vs Nightly | Eager (2.1.0a0+gitbcdd413) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear  | 88.726 (+-0.344)              | 88.732 (+-0.194)                 | 141.983 (+-0.551)                     | 1.600 (+-0.000)        | 89.228 (+-0.300)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear      | 88.873 (+-0.366)              | 88.690 (+-0.196)                 | 141.351 (+-0.456)                     | 1.594 (+-0.000)        | 89.257 (+-0.326)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear | 110.747 (+-0.742)             | 69.262 (+-0.174)                 | 228.701 (+-8.460)                     | 3.302 (+-0.000)        | 112.709 (+-0.746)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear     | 110.729 (+-0.421)             | 68.543 (+-0.096)                 | 230.542 (+-0.656)                     | 3.363 (+-0.000)        | 112.994 (+-0.644)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest   | 78.248 (+-0.323)              | 68.913 (+-0.227)                 | 148.836 (+-0.244)                     | 2.160 (+-0.000)        | 79.004 (+-0.973)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest       | 77.898 (+-0.362)              | 68.819 (+-0.218)                 | 149.036 (+-0.566)                     | 2.166 (+-0.000)        | 78.681 (+-0.309)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest  | 111.041 (+-0.404)             | 69.329 (+-0.100)                 | 184.097 (+-0.673)                     | 2.655 (+-0.000)        | 113.252 (+-0.585)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest      | 110.903 (+-0.391)             | 70.003 (+-0.271)                 | 183.848 (+-1.566)                     | 2.626 (+-0.000)        | 113.787 (+-0.943)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic   | 92.796 (+-0.536)              | 69.966 (+-0.218)                 | 1793.246 (+-0.481)                    | 25.630 (+-0.000)       | 92.416 (+-0.072)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic       | 92.744 (+-0.439)              | 87.140 (+-0.101)                 | 1457.581 (+-0.599)                    | 16.727 (+-0.000)       | 92.510 (+-0.557)
  Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic  | 110.599 (+-0.331)             | 70.036 (+-0.280)                 | 1800.172 (+-0.422)                    | 25.704 (+-0.000)       | 112.876 (+-0.498)
  Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic      | 111.289 (+-0.346)             | 86.682 (+-0.020)                 | 1463.788 (+-0.566)                    | 16.887 (+-0.000)       | 112.987 (+-0.358)

Times are in microseconds (us).
```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230711-113600-affine-grid-sampler-PR-vs-Nightly-speedup.md)
Set this PR as draft: on cuda, the perf of the compiled function in bicubic mode depends on the output memory format, and the aten grid sampler does not respect the memory format (if the input is channels last, the output will be channels first).
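To make the memory-format remark concrete, here is a minimal pure-Python sketch of the strides that a 4-D `(N, C, H, W)` tensor has in the two layouts, using the benchmark input shape `(2, 3, 345, 456)`. The helper names are hypothetical; the formulas follow the usual PyTorch convention for `torch.contiguous_format` (channels first) and `torch.channels_last`, and this is not code from the PR.

```python
def contiguous_strides(n: int, c: int, h: int, w: int) -> tuple:
    """Element strides for channels-first (NCHW) storage: W varies fastest."""
    return (c * h * w, h * w, w, 1)


def channels_last_strides(n: int, c: int, h: int, w: int) -> tuple:
    """Element strides for channels-last (NHWC) storage: C varies fastest."""
    return (h * w * c, 1, w * c, c)


shape = (2, 3, 345, 456)  # input shape used in the benchmarks above
print("contiguous_format:", contiguous_strides(*shape))    # (471960, 157320, 456, 1)
print("channels_last:   ", channels_last_strides(*shape))  # (471960, 1, 1368, 3)
```

With a channels-last input and a channels-first output, reads and writes walk memory in different orders, which is one plausible reason the compiled kernel's timing depends on the output memory format.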
Description:
- Improved the grid_sampler_2d decomposition so that it generates a single CUDA kernel instead of two

Related to #104296

Perfs:
- speed-up on CUDA with bilinear, nearest and bicubic modes
- slowdown on CPU with bilinear mode in contiguous format (CF)

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git1c48419) PR" and "Compiled (2.1.0a0+gitbcdd413) Nightly"

[------------------------------------ Affine grid sampling, cpu -------------------------------------]
                                                                                                             | Eager (2.1.0a0+git1c48419) PR | Compiled (2.1.0a0+git1c48419) PR | Compiled (2.1.0a0+gitbcdd413) Nightly | speed-up PR vs Nightly | Eager (2.1.0a0+gitbcdd413) Nightly
1 threads: -------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear      | 8.466 (+-0.072) | 15.557 (+-0.054) | 13.292 (+-0.113) | 0.854 (+-0.000) | 7.567 (+-0.037)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear          | 8.685 (+-0.035) | 11.384 (+-0.024) | 15.798 (+-0.036) | 1.388 (+-0.000) | 7.489 (+-0.114)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear     | 8.572 (+-0.085) | 15.867 (+-0.046) | 12.964 (+-0.050) | 0.817 (+-0.000) | 7.623 (+-0.126)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear         | 8.834 (+-0.169) | 11.447 (+-0.030) | 15.386 (+-0.061) | 1.344 (+-0.000) | 7.647 (+-0.030)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest       | 5.039 (+-0.011) | 4.569 (+-0.016) | 6.383 (+-0.038) | 1.397 (+-0.000) | 4.504 (+-0.028)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest           | 4.326 (+-0.008) | 4.867 (+-0.013) | 6.393 (+-0.067) | 1.314 (+-0.000) | 4.270 (+-0.066)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest      | 5.085 (+-0.031) | 4.220 (+-0.006) | 6.426 (+-0.126) | 1.523 (+-0.000) | 4.780 (+-0.204)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest          | 4.411 (+-0.004) | 4.619 (+-0.005) | 6.283 (+-0.114) | 1.360 (+-0.000) | 4.315 (+-0.028)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic       | 26.061 (+-0.083) | 28.477 (+-0.026) | 63.423 (+-0.464) | 2.227 (+-0.000) | 25.943 (+-0.299)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic           | 26.358 (+-0.086) | 30.660 (+-0.328) | 71.692 (+-0.282) | 2.338 (+-0.000) | 26.143 (+-0.299)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic      | 26.172 (+-0.124) | 28.072 (+-0.039) | 65.312 (+-0.478) | 2.327 (+-0.000) | 25.810 (+-0.344)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic          | 26.522 (+-0.065) | 30.480 (+-0.060) | 71.560 (+-0.606) | 2.348 (+-0.000) | 26.105 (+-1.344)

Times are in milliseconds (ms).

[------------------------------------ Affine grid sampling, cuda ------------------------------------]
                                                                                                             | Eager (2.1.0a0+git1c48419) PR | Compiled (2.1.0a0+git1c48419) PR | Compiled (2.1.0a0+gitbcdd413) Nightly | speed-up PR vs Nightly | Eager (2.1.0a0+gitbcdd413) Nightly
1 threads: -------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear      | 88.726 (+-0.344) | 88.732 (+-0.194) | 141.983 (+-0.551) | 1.600 (+-0.000) | 89.228 (+-0.300)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear          | 88.873 (+-0.366) | 88.690 (+-0.196) | 141.351 (+-0.456) | 1.594 (+-0.000) | 89.257 (+-0.326)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear     | 110.747 (+-0.742) | 69.262 (+-0.174) | 228.701 (+-8.460) | 3.302 (+-0.000) | 112.709 (+-0.746)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear         | 110.729 (+-0.421) | 68.543 (+-0.096) | 230.542 (+-0.656) | 3.363 (+-0.000) | 112.994 (+-0.644)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest       | 78.248 (+-0.323) | 68.913 (+-0.227) | 148.836 (+-0.244) | 2.160 (+-0.000) | 79.004 (+-0.973)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest           | 77.898 (+-0.362) | 68.819 (+-0.218) | 149.036 (+-0.566) | 2.166 (+-0.000) | 78.681 (+-0.309)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest      | 111.041 (+-0.404) | 69.329 (+-0.100) | 184.097 (+-0.673) | 2.655 (+-0.000) | 113.252 (+-0.585)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest          | 110.903 (+-0.391) | 70.003 (+-0.271) | 183.848 (+-1.566) | 2.626 (+-0.000) | 113.787 (+-0.943)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic       | 92.796 (+-0.536) | 69.966 (+-0.218) | 1793.246 (+-0.481) | 25.630 (+-0.000) | 92.416 (+-0.072)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic           | 92.744 (+-0.439) | 87.140 (+-0.101) | 1457.581 (+-0.599) | 16.727 (+-0.000) | 92.510 (+-0.557)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic      | 110.599 (+-0.331) | 70.036 (+-0.280) | 1800.172 (+-0.422) | 25.704 (+-0.000) | 112.876 (+-0.498)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic          | 111.289 (+-0.346) | 86.682 (+-0.020) | 1463.788 (+-0.566) | 16.887 (+-0.000) | 112.987 (+-0.358)

Times are in microseconds (us).
```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230711-113600-affine-grid-sampler-PR-vs-Nightly-speedup.md)

[ghstack-poisoned]
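As a quick sanity check, the speed-up column can be re-derived from the two compiled columns. The sketch below assumes the ratio is the Nightly time divided by the PR time, which is consistent with the reported values; `speedup` is a hypothetical helper name, and the numbers are copied from the benchmark output above.

```python
def speedup(nightly_time, pr_time):
    """Return Nightly / PR; a value above 1 means the PR build is faster."""
    return nightly_time / pr_time


# (label, Compiled PR, Compiled Nightly), values copied from the benchmark output
rows = [
    ("cpu, contiguous_format, align_corners=True, bilinear (ms)", 15.557, 13.292),
    ("cuda, contiguous_format, align_corners=True, bicubic (us)", 69.966, 1793.246),
]

for label, pr_time, nightly_time in rows:
    print(f"{label}: {speedup(nightly_time, pr_time):.3f}")
```

A ratio below 1, as in the first row, is a slowdown; that is how the CPU bilinear regression in contiguous format shows up in the table.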
@pytorchbot merge
Merge failed
Reason: This PR needs a If not, please add the To add a label, you can comment to pytorchbot, for example For more information, see
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed
Reason: Command
Description:
- Improved the grid_sampler_2d decomposition so that it generates a single CUDA kernel instead of two

Related to #104296

Perfs:
- speed-up on CUDA (~x5) and CPU (~x2) for bicubic mode

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git52598e9) PR" and "Compiled (2.1.0a0+gitcf76938) Nightly"

[------------------------------------ Affine grid sampling, cpu -------------------------------------]
                                                                                                             | Eager (2.1.0a0+git52598e9) PR | Compiled (2.1.0a0+git52598e9) PR | Compiled (2.1.0a0+gitcf76938) Nightly | speed-up PR vs Nightly | Eager (2.1.0a0+gitcf76938) Nightly
1 threads: -------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear      | 38.010 (+-0.118) | 51.466 (+-1.257) | 47.867 (+-0.124) | 0.930 (+-0.000) | 33.654 (+-0.411)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear          | 35.532 (+-0.236) | 52.189 (+-0.093) | 58.979 (+-0.206) | 1.130 (+-0.000) | 32.543 (+-0.198)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear     | 38.187 (+-0.112) | 47.892 (+-0.117) | 45.833 (+-0.081) | 0.957 (+-0.000) | 33.752 (+-0.116)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear         | 36.708 (+-0.244) | 51.680 (+-0.104) | 58.360 (+-0.108) | 1.129 (+-0.000) | 32.576 (+-0.751)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest       | 24.201 (+-0.088) | 27.451 (+-0.059) | 27.937 (+-0.081) | 1.018 (+-0.000) | 24.367 (+-0.074)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest           | 19.266 (+-0.105) | 26.070 (+-0.085) | 26.092 (+-0.054) | 1.001 (+-0.000) | 20.144 (+-0.064)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest      | 24.293 (+-0.125) | 26.085 (+-0.064) | 26.575 (+-0.061) | 1.019 (+-0.000) | 24.515 (+-0.095)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest          | 19.440 (+-0.075) | 25.252 (+-0.059) | 25.259 (+-0.051) | 1.000 (+-0.000) | 19.770 (+-0.070)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic       | 114.900 (+-0.508) | 113.416 (+-1.271) | 248.679 (+-1.431) | 2.193 (+-0.000) | 114.609 (+-0.515)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic           | 115.973 (+-0.555) | 124.711 (+-1.596) | 282.187 (+-2.418) | 2.263 (+-0.000) | 115.368 (+-0.652)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic      | 111.730 (+-0.562) | 110.914 (+-0.865) | 253.899 (+-2.226) | 2.289 (+-0.000) | 111.285 (+-1.226)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic          | 112.859 (+-0.487) | 131.696 (+-1.298) | 294.124 (+-1.963) | 2.233 (+-0.000) | 110.910 (+-0.969)

Times are in milliseconds (ms).

[------------------------------------ Affine grid sampling, cuda ------------------------------------]
                                                                                                             | Eager (2.1.0a0+git52598e9) PR | Compiled (2.1.0a0+git52598e9) PR | Compiled (2.1.0a0+gitcf76938) Nightly | speed-up PR vs Nightly | Eager (2.1.0a0+gitcf76938) Nightly
1 threads: -------------------------------------------------------------------------------------------
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear      | 228.811 (+-0.037) | 92.990 (+-0.446) | 92.648 (+-0.286) | 0.996 (+-0.000) | 228.274 (+-0.067)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear          | 222.107 (+-0.076) | 93.247 (+-0.387) | 92.528 (+-0.423) | 0.992 (+-0.000) | 221.922 (+-0.297)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear     | 235.654 (+-0.055) | 75.781 (+-0.566) | 115.865 (+-0.419) | 1.529 (+-0.000) | 236.032 (+-0.111)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear         | 226.752 (+-0.088) | 76.312 (+-0.328) | 116.468 (+-0.477) | 1.526 (+-0.000) | 226.950 (+-0.027)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest       | 225.540 (+-0.013) | 75.638 (+-0.341) | 72.621 (+-0.292) | 0.960 (+-0.000) | 225.937 (+-0.017)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest           | 217.425 (+-0.024) | 75.484 (+-0.545) | 73.518 (+-0.296) | 0.974 (+-0.000) | 217.793 (+-0.008)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest      | 231.474 (+-0.020) | 75.972 (+-0.339) | 73.030 (+-0.387) | 0.961 (+-0.000) | 231.991 (+-0.184)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest          | 223.408 (+-0.016) | 75.622 (+-0.279) | 73.542 (+-0.336) | 0.973 (+-0.000) | 223.893 (+-0.021)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic       | 319.382 (+-0.023) | 149.060 (+-0.190) | 772.116 (+-0.266) | 5.180 (+-0.000) | 320.549 (+-0.387)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic           | 319.987 (+-0.134) | 154.443 (+-0.014) | 797.651 (+-0.232) | 5.165 (+-0.000) | 320.665 (+-0.397)
      Input: (8, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic      | 326.138 (+-0.439) | 149.092 (+-0.036) | 772.508 (+-0.259) | 5.181 (+-0.000) | 325.751 (+-0.398)
      Input: (8, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic          | 326.024 (+-0.118) | 154.452 (+-0.209) | 797.756 (+-0.229) | 5.165 (+-0.000) | 326.870 (+-0.372)

Times are in microseconds (us).
```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230828-134459-affine-grid-sampler-PR-vs-Nightly-speedup.md)

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 ipiszy ngimel yf225 chenyang78 kadeng muchulee8 aakhundov

[ghstack-poisoned]
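For readers unfamiliar with the op being benchmarked, here is a minimal pure-Python sketch of what bilinear grid sampling computes for one output pixel. It is not the decomposition code from this PR; it assumes a single-channel image stored as a list of rows, zeros padding, and the coordinate convention documented for `torch.nn.functional.grid_sample` (normalized coordinates in [-1, 1]). The names `unnormalize` and `bilinear_sample` are hypothetical.

```python
import math


def unnormalize(coord, size, align_corners):
    """Map a normalized coordinate in [-1, 1] to a pixel index (may be fractional)."""
    if align_corners:
        # -1 and 1 refer to the centers of the corner pixels
        return (coord + 1.0) / 2.0 * (size - 1)
    # -1 and 1 refer to the outer edges of the corner pixels
    return ((coord + 1.0) * size - 1.0) / 2.0


def bilinear_sample(image, x, y, align_corners=False):
    """Sample `image` at normalized (x, y); out-of-bounds taps contribute zero."""
    height, width = len(image), len(image[0])
    ix = unnormalize(x, width, align_corners)
    iy = unnormalize(y, height, align_corners)
    x0, y0 = math.floor(ix), math.floor(iy)
    wx1, wy1 = ix - x0, iy - y0  # weights of the right / bottom neighbours
    total = 0.0
    for yy, wy in ((y0, 1.0 - wy1), (y0 + 1, wy1)):
        for xx, wx in ((x0, 1.0 - wx1), (x0 + 1, wx1)):
            if 0 <= xx < width and 0 <= yy < height:
                total += wx * wy * image[yy][xx]
    return total


image = [[1.0, 2.0], [3.0, 4.0]]
center = bilinear_sample(image, 0.0, 0.0, align_corners=True)
corner = bilinear_sample(image, -1.0, -1.0, align_corners=False)
print(center, corner)
```

The `align_corners` flag only changes the coordinate mapping, which is why the benchmark rows vary it independently of the interpolation mode.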
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed
Reason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov