[MPS] Speedup `argmax`/`argmin` #159524

malfet · 2025-07-30T22:56:53Z

Stack from ghstack (oldest at bottom):

-> [MPS] Speedup argmax/argmin #159524

By using efficient threadgroup_arg[max|min] primitives.
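A rough way to picture a threadgroup-level arg-reduction is as two stages: each SIMD group reduces its lanes to one (value, index) pair, then the partial results are reduced again across the threadgroup. The sketch below emulates that on the CPU in plain Python. It is not the Metal code from this PR: the function names only mirror the ones mentioned above, and the 32-lane SIMD width, the non-empty input, and the first-occurrence tie-break are assumptions made for illustration.

```python
SIMD_WIDTH = 32  # assumed number of lanes per SIMD group


def simd_argmax(values, base):
    """Reduce one SIMD-group-sized chunk to (max value, global index).

    `base` is the global offset of the chunk's first element.
    """
    best = 0
    for lane in range(1, len(values)):
        # Strict '>' keeps the first occurrence when values tie.
        if values[lane] > values[best]:
            best = lane
    return values[best], base + best


def threadgroup_argmax(values):
    """Two-stage argmax: per-SIMD-group partials, then a final reduction."""
    partials = [
        simd_argmax(values[start:start + SIMD_WIDTH], start)
        for start in range(0, len(values), SIMD_WIDTH)
    ]
    best_val, best_idx = partials[0]  # assumes a non-empty input
    for val, idx in partials[1:]:
        if val > best_val:
            best_val, best_idx = val, idx
    return best_idx


data = [float(i % 50) for i in range(100)]
print(threadgroup_argmax(data))
```

The same structure gives `argmin` by flipping the comparisons; NaN handling is deliberately left out of this sketch.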

- Fixed a bug in simd_argmax where the result of simd_ballot was prematurely cast to ushort, and adjusted the unit test
- Fixed NaN handling in compiled argmax, but this can't be reliably tested because the MPS (eager) implementation of argmax is buggy
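The two fixes above can be illustrated with a small CPU-side emulation. This is only a sketch of the failure classes, not the kernel code from this PR: `ballot`, `first_lane`, and both argmax helpers are hypothetical names, a 32-lane SIMD group is assumed, and the NaN rule shown (the first NaN wins) is an assumption about the intended semantics.

```python
import math

SIMD_WIDTH = 32  # assumed number of lanes per SIMD group


def ballot(flags):
    """Pack one boolean per lane into an integer bitmask (bit i = lane i)."""
    mask = 0
    for lane, flag in enumerate(flags):
        if flag:
            mask |= 1 << lane
    return mask


def first_lane(mask):
    """Index of the lowest set bit, or None if no lane voted."""
    return (mask & -mask).bit_length() - 1 if mask else None


# The maximum lives in lane 20, i.e. above the low 16 bits of the mask.
values = [0.0] * SIMD_WIDTH
values[20] = 5.0
mask = ballot(v == max(values) for v in values)

full_width = first_lane(mask)           # uses all 32 bits of the ballot
truncated = first_lane(mask & 0xFFFF)   # emulates an early cast to a 16-bit ushort


def naive_argmax(xs):
    """Plain '>' comparison: NaN never wins, so it is silently skipped."""
    best = 0
    for i in range(1, len(xs)):
        if xs[i] > xs[best]:
            best = i
    return best


def nan_aware_argmax(xs):
    """Assumed semantics: the first NaN encountered is reported as the result."""
    best = 0
    for i in range(1, len(xs)):
        if math.isnan(xs[best]):
            break  # a NaN already won; nothing can replace it
        if math.isnan(xs[i]) or xs[i] > xs[best]:
            best = i
    return best


print(full_width, truncated)
print(naive_argmax([1.0, math.nan, 3.0]), nan_aware_argmax([1.0, math.nan, 3.0]))
```

With the truncated mask, any vote from lanes 16 to 31 is lost, which is the kind of error a test using only small inputs would not catch.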

Now, according to bench_mps_ops.py, compiled max(x, dim=0) is reliably faster than the eager implementation:

[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      max (torch.float16)  |      285.8      |       272.2       |       422.3       |        354.5        |       721.6       |        683.5        |       2224.0      |        1979.1     
      max (torch.float32)  |      300.2      |       267.0       |       389.6       |        342.5        |       769.4       |        682.6        |       2995.7      |        2609.8     
      max (torch.int32)    |      299.6      |       275.4       |       390.0       |        361.7        |       758.7       |        686.1        |       3103.4      |        2646.5     
      max (torch.int64)    |      297.5      |       275.5       |       417.0       |        382.1        |       856.1       |        722.6        |       5467.7      |        3156.8
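To make the size of the improvement easier to read, the eager/compile ratios can be computed directly from the table. The snippet below only restates the numbers reported above (in whatever time unit `bench_mps_ops.py` prints); the dictionary layout and the `speedups` helper are illustrative choices, not part of the benchmark script.

```python
SIZES = ["512x512", "1024x1024", "2048x2048", "4096x4096"]

# (eager, compile) timings copied from the table above, one pair per size.
TIMINGS = {
    "float16": [(285.8, 272.2), (422.3, 354.5), (721.6, 683.5), (2224.0, 1979.1)],
    "float32": [(300.2, 267.0), (389.6, 342.5), (769.4, 682.6), (2995.7, 2609.8)],
    "int32": [(299.6, 275.4), (390.0, 361.7), (758.7, 686.1), (3103.4, 2646.5)],
    "int64": [(297.5, 275.5), (417.0, 382.1), (856.1, 722.6), (5467.7, 3156.8)],
}


def speedups(timings):
    """Return eager/compile ratios per dtype; values above 1 mean compile is faster."""
    return {
        dtype: [round(eager / compiled, 2) for eager, compiled in pairs]
        for dtype, pairs in timings.items()
    }


ratios = speedups(TIMINGS)
for dtype, row in ratios.items():
    print(dtype, dict(zip(SIZES, row)))
```

Every ratio comes out above 1, and the largest gap is for `int64` at 4096x4096.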

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

pytorch-bot · 2025-07-30T22:56:56Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159524

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 72 Pending

As of commit adf1276 with merge base 25343b3 ():
💚 Looks good so far! There are no failures yet. 💚

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

⏳ pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable) (gh) (#158876)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

malfet · 2025-07-31T16:16:42Z

@pytorchbot merge -f "Lint + MPS are green"

pytorchmergebot · 2025-07-31T16:18:15Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

By using efficient `threadgroup_arg[max|min]` primitives.

- Fixed a bug in `simd_argmax` where the result of `simd_ballot` was prematurely cast to `ushort`, and adjusted the unit test
- Fixed NaN handling in compiled argmax, but this can't be reliably tested because the MPS (eager) implementation of argmax is buggy

Now, according to `bench_mps_ops.py`, compiled `max(x, dim=0)` is reliably faster than the eager implementation:

```
[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      max (torch.float16)  |      285.8      |       272.2       |       422.3       |        354.5        |       721.6       |        683.5        |       2224.0      |        1979.1
      max (torch.float32)  |      300.2      |       267.0       |       389.6       |        342.5        |       769.4       |        682.6        |       2995.7      |        2609.8
      max (torch.int32)    |      299.6      |       275.4       |       390.0       |        361.7        |       758.7       |        686.1        |       3103.4      |        2646.5
      max (torch.int64)    |      297.5      |       275.5       |       417.0       |        382.1        |       856.1       |        722.6        |       5467.7      |        3156.8
```

Pull Request resolved: #159524
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #158990

Update

90ec9be

[ghstack-poisoned]

pytorch-bot bot added the ciflow/inductor, ciflow/mps, and module: inductor labels Jul 30, 2025

malfet added a commit that referenced this pull request Jul 30, 2025

[MPS] Speedup argmax/armin

6421bd1

By using efficient `threadgroup_arg[max|min]` primitives ghstack-source-id: 134819e Pull Request resolved: #159524

malfet requested a review from dcci July 30, 2025 22:57

malfet added the topic: improvements label Jul 30, 2025

Skylion007 changed the title ~~[MPS] Speedup argmax/armin~~ [MPS] Speedup argmax/argmin Jul 31, 2025

Skylion007 approved these changes Jul 31, 2025

dcci approved these changes Jul 31, 2025

Update

adf1276

[ghstack-poisoned]

malfet requested a review from kulinseth as a code owner July 31, 2025 15:43

malfet added a commit that referenced this pull request Jul 31, 2025

[MPS] Speedup argmax/armin

25ceddc

By using efficient `threadgroup_arg[max|min]` primitives ghstack-source-id: bcdd0a1 Pull Request resolved: #159524

malfet added the release notes: mps label Jul 31, 2025

pytorchmergebot added the merging label Jul 31, 2025

pytorchmergebot closed this in f946b25 Jul 31, 2025

pytorchmergebot added Merged and removed merging labels Jul 31, 2025

github-actions bot deleted the gh/malfet/462/head branch August 31, 2025 02:15
