[MPS] Add searchsorted op #112829
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112829
Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures
As of commit 96a4912 with merge base 3a284da.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Overall LGTM, but please add a description (one that says it implements the operator as a Metal kernel, closely following Bucketization.cu).
Also, it would be good to add some perf numbers (to show that it's faster than CPU for large enough tensors).
The Metal kernels implemented here closely follow `Bucketization.cu`.
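For readers unfamiliar with the operator, the following is a minimal pure-Python sketch of the semantics a searchsorted kernel computes for a single sorted row: a lower-bound or upper-bound binary search, optionally indirected through a `sorter` permutation. It is an illustration only, not the Metal kernel from this PR; the function names (`lower_bound`, `upper_bound`, `searchsorted_row`) and the list-based interface are assumptions made for the example.

```python
from typing import List, Optional, Sequence


def lower_bound(data: Sequence[float], val: float,
                sorter: Optional[Sequence[int]] = None) -> int:
    """Return the first position whose (sorted) element is >= val."""
    start, end = 0, len(data)
    while start < end:
        mid = start + (end - start) // 2
        # With a sorter, `data` is unsorted and sorter[mid] points at the
        # element that would sit at position `mid` in sorted order.
        mid_val = data[sorter[mid]] if sorter is not None else data[mid]
        if mid_val < val:
            start = mid + 1
        else:
            end = mid
    return start


def upper_bound(data: Sequence[float], val: float,
                sorter: Optional[Sequence[int]] = None) -> int:
    """Return the first position whose (sorted) element is > val."""
    start, end = 0, len(data)
    while start < end:
        mid = start + (end - start) // 2
        mid_val = data[sorter[mid]] if sorter is not None else data[mid]
        if mid_val <= val:
            start = mid + 1
        else:
            end = mid
    return start


def searchsorted_row(boundaries: Sequence[float], values: Sequence[float],
                     right: bool = False,
                     sorter: Optional[Sequence[int]] = None) -> List[int]:
    """Hypothetical helper: one insertion index per value.

    A GPU kernel would typically assign one thread per value; here the
    values are simply processed in a loop.
    """
    bound = upper_bound if right else lower_bound
    return [bound(boundaries, v, sorter) for v in values]
```

With `right=False` the result matches `bisect.bisect_left` on the sorted data, and with `right=True` it matches `bisect.bisect_right`.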
Benchmark:
```
[----------------------------- searchsorted ----------------------------]
                                                  |  cpu |  mps
1 threads: --------------------------------------------------------------
Batch size: 8; In features: 64; Sorter: True      |   44 |  530
Batch size: 8; In features: 64; Sorter: False     |   31 |   12
Batch size: 8; In features: 256; Sorter: True     |  131 |  520
Batch size: 8; In features: 256; Sorter: False    |  107 |   12
Batch size: 8; In features: 1024; Sorter: True    |  499 |  590
Batch size: 8; In features: 1024; Sorter: False   |  398 |   12
Batch size: 16; In features: 64; Sorter: True     |   71 |  540
Batch size: 16; In features: 64; Sorter: False    |   57 |   12
Batch size: 16; In features: 256; Sorter: True    |  242 |  610
Batch size: 16; In features: 256; Sorter: False   |  200 |   12
Batch size: 16; In features: 1024; Sorter: True   |  999 |  720
Batch size: 16; In features: 1024; Sorter: False  |  842 |   12
Batch size: 32; In features: 64; Sorter: True     |  124 |  509
Batch size: 32; In features: 64; Sorter: False    |  103 |   12
Batch size: 32; In features: 256; Sorter: True    |  477 |  650
Batch size: 32; In features: 256; Sorter: False   |  407 |   12
Batch size: 32; In features: 1024; Sorter: True   | 1940 |  833
Batch size: 32; In features: 1024; Sorter: False  | 1710 |   12
Batch size: 64; In features: 64; Sorter: True     |  231 |  590
Batch size: 64; In features: 64; Sorter: False    |  194 |   12
Batch size: 64; In features: 256; Sorter: True    |  937 |  710
Batch size: 64; In features: 256; Sorter: False   |  800 |   13
Batch size: 64; In features: 1024; Sorter: True   | 3980 | 1290
Batch size: 64; In features: 1024; Sorter: False  | 3330 |   12
Batch size: 128; In features: 64; Sorter: True    |  448 |  650
Batch size: 128; In features: 64; Sorter: False   |  390 |   13
Batch size: 128; In features: 256; Sorter: True   | 1830 |  850
Batch size: 128; In features: 256; Sorter: False  | 1590 |   12
Batch size: 128; In features: 1024; Sorter: True  | 7790 | 2850
Batch size: 128; In features: 1024; Sorter: False | 6670 |   13
```
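To make the table easier to read against the reviewer's request (showing where MPS beats CPU), here is a small sketch that computes the CPU/MPS time ratio per row and lists the configurations where MPS is faster. The numbers are copied from the table above; the helper names are hypothetical, and using batch size × in features as a proxy for problem size is an assumption, since the benchmark script itself is not shown. The ratios are unit-free, so the missing time unit does not matter here.

```python
# (batch_size, in_features, sorter, cpu_time, mps_time), copied from the table.
ROWS = [
    (8, 64, True, 44, 530), (8, 64, False, 31, 12),
    (8, 256, True, 131, 520), (8, 256, False, 107, 12),
    (8, 1024, True, 499, 590), (8, 1024, False, 398, 12),
    (16, 64, True, 71, 540), (16, 64, False, 57, 12),
    (16, 256, True, 242, 610), (16, 256, False, 200, 12),
    (16, 1024, True, 999, 720), (16, 1024, False, 842, 12),
    (32, 64, True, 124, 509), (32, 64, False, 103, 12),
    (32, 256, True, 477, 650), (32, 256, False, 407, 12),
    (32, 1024, True, 1940, 833), (32, 1024, False, 1710, 12),
    (64, 64, True, 231, 590), (64, 64, False, 194, 12),
    (64, 256, True, 937, 710), (64, 256, False, 800, 13),
    (64, 1024, True, 3980, 1290), (64, 1024, False, 3330, 12),
    (128, 64, True, 448, 650), (128, 64, False, 390, 13),
    (128, 256, True, 1830, 850), (128, 256, False, 1590, 12),
    (128, 1024, True, 7790, 2850), (128, 1024, False, 6670, 13),
]


def speedups(rows, sorter):
    """CPU time divided by MPS time for every row with the given sorter flag."""
    return [cpu / mps for _, _, s, cpu, mps in rows if s == sorter]


def mps_wins(rows, sorter):
    """(batch_size, in_features) pairs where MPS is strictly faster than CPU."""
    return [(b, f) for b, f, s, cpu, mps in rows if s == sorter and mps < cpu]


if __name__ == "__main__":
    no_sorter = speedups(ROWS, sorter=False)
    print(f"no sorter: {min(no_sorter):.1f}x to {max(no_sorter):.1f}x faster on MPS")
    print("sorter, MPS faster for:", mps_wins(ROWS, sorter=True))
```

On these numbers, MPS is faster in every configuration without a sorter, while with a sorter it is faster only in the six configurations where batch size × in features is at least 16384.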
Pull Request resolved: #112830
Approved by: https://github.com/kulinseth, https://github.com/malfet
ghstack dependencies: #112829
Pull Request resolved: pytorch#112829
Approved by: https://github.com/malfet