[MPS] Compile kernels into Metallib #138636

malfet · 2024-10-22T20:32:19Z

The PyTorch MPS backend relies mostly on MPSGraph to provide specific operations, but more and more often one has had to implement custom kernels, which were simply embedded in the operator codebase and compiled directly using - id<MTLLibrary>newLibraryWithSource:options:error: (the first Metal kernel was added to the MPS backend in #82307).
Later on, as the number of operators grew, those were refactored into the MetalShaderLibrary convenience class (see #125550).

But as the number of kernels keeps growing, it's time to take the next step and properly compile them into a .metallib.

This PR does exactly that by:

- Moving shader sources into separate .metal files
- Adding a check for whether full Xcode is installed or just DeveloperTools
- If full Xcode is installed, compiling and linking shaders into a .metallib for the Metal-3.0 (available on macOS 13) and Metal-3.1 (available on macOS 14, can use bfloat) standards, and bundling both using the -sectcreate linker option and the getsectiondata API call. The metallib_dummy.cpp file is used to properly express the dependency between the metallib build and torch_cpu link stages. The logic for generating metallibraries is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
- If only the DeveloperTools CLI is installed, automatically wrapping each .metal file into an _metallib.h header that contains the shader source wrapped in MetalShaderLibrary
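
To make the last bullet more concrete, here is a minimal sketch of what wrapping a .metal source into a header could look like. This is not the PR's generator (that logic lives in the CMake scripts). The helper name `wrap_metal_source`, the `METAL` raw-string delimiter, and the assumption that `MetalShaderLibrary` can be constructed directly from a source string are all illustrative.

```cpp
#include <string>

// Hypothetical helper: turns Metal shader source into the text of a C++ header
// that defines a static `lib` object from a raw string literal.
// Assumption: MetalShaderLibrary is constructible from the shader source string.
std::string wrap_metal_source(const std::string& metal_source) {
  // A custom raw-string delimiter keeps a `)"` sequence inside the shader
  // from terminating the literal early.
  const std::string delim = "METAL";
  std::string header;
  header += "// Generated file, do not edit.\n";
  header += "static MetalShaderLibrary lib(R\"" + delim + "(\n";
  header += metal_source;
  // Make sure the closing delimiter starts on its own line.
  if (!metal_source.empty() && metal_source.back() != '\n') {
    header += '\n';
  }
  header += ")" + delim + "\");\n";
  return header;
}
```

The returned text would then be written to a file such as OpName_metallib.h, so that the operator source can include it when the shaders are compiled at runtime.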

The bulk of the changes introduced in this PR just move code around, i.e. for every file that contains a non-templated shader definition in the aten/src/ATen/native/mps/operators folder, a corresponding .metal file is created in the aten/src/ATen/native/mps/kernels folder and the embedded shader definition is replaced with the following:

#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif

Some historical stats:

| PyTorch Version | Number of shaders in MPS | Ops added |
| --- | --- | --- |
| 1.12 | 0 | |
| 1.13 | 2 | bitwise_ops and index.out |
| 2.0 | 4 | cross, repeat and view |
| 2.1 | 9 | unary_ops, histogram, renorm, binary_ops |
| 2.2 | 11 | gamma and bucketization |
| 2.3 | 12 | naive_matmul (to work around a crash) |
| 2.4 | 13 | quantized_mm |
| 2.5 | 14 | fused_adam |

Pros:

- Better code structure/readability
- Eventually allows one to use shared headers (and implement something like TensorIterator)
- Faster runtime (as compilation is done ahead of time) and perhaps better optimized compiled kernels

Cons:

- The build process is a bit more complicated than it used to be
- Need to maintain two code paths (as our CI builders only have DeveloperTools installed)

cc @albanD

pytorch-bot · 2024-10-22T20:32:23Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138636

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 01db75e with merge base 5e4c8b6:

NEW FAILURE - The following job has failed:

linux-binary-manywheel / manywheel-py3_12-xpu-build / build (gh)
ModuleNotFoundError: No module named 'setuptools'

This comment was automatically generated by Dr. CI and updates every 15 minutes.

qqaatw · 2024-10-24T01:21:21Z

Nice improvement! We rely on existing tests?

malfet · 2024-10-24T01:27:43Z

> Nice improvement! We rely on existing tests?

Yes, existing tests should be sufficient, though I would need to figure out how to install Xcode on our runners; otherwise they'll have to use the embedded code path.

manuelcandales · 2024-10-25T20:36:58Z

cmake/Metal.cmake

    return()
 endif()

+set(BFLOAT_METAL_CODE "
Why do we have Metal code in the .cmake file? Can we move that into some bfloat_utils.metal file?

malfet · 2024-10-25

This one is not a real kernel, it just checks whether the host can compile something. It is pretty standard practice to embed trivial test code in .cmake files, see for example

pytorch/cmake/Modules/FindAVX.cmake, lines 5 to 14 in 86b45bd:

    SET(AVX_CODE "
    #include <immintrin.h>

    int main()
    {
    __m256 a;
    a = _mm256_set1_ps(0);
    return 0;
    }
    ")


malfet · 2024-11-01T21:45:37Z

@pytorchbot merge -f "Builds and tests are green"

pytorchmergebot · 2024-11-01T21:47:06Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

swolchok · 2024-11-02T17:39:22Z

started getting build errors on my mac about missing aten/src/ATen/metallib_dummy.cpp locally on mac running python setup.py develop, even when running with --cmake or after python setup.py clean. working around via USE_MPS=OFF python setup.py develop

wm901115nwpu · 2024-11-04T06:20:02Z

> started getting build errors on my mac about missing aten/src/ATen/metallib_dummy.cpp locally on mac running python setup.py develop, even when running with --cmake or after python setup.py clean. working around via USE_MPS=OFF python setup.py develop

pytorch/caffe2/CMakeLists.txt

Line 688 in 3179eb1

list(APPEND Caffe2_CPU_SRCS aten/src/ATen/metallib_dummy.cpp)

Not sure how it works on some machines, but a clean build fails for me after #138636 landed, even though it works fine on another machine. The solution is to create an empty file when the dependency is added; the build rule later updates that file.

Pull Request resolved: #139651
Approved by: https://github.com/atalman
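
The idea behind that fix can be illustrated outside of CMake. The sketch below is not the code from #139651 (that change was made in the build scripts); `ensure_placeholder` is a hypothetical helper that only shows the "create an empty file if the listed source does not exist yet" step.

```cpp
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Hypothetical helper: make sure a file that is listed as a build input exists
// before the build graph is evaluated. Returns true if an empty placeholder
// was created, false if the file was already there.
bool ensure_placeholder(const fs::path& file) {
  if (fs::exists(file)) {
    return false;  // keep existing content, e.g. written by the real build rule
  }
  if (file.has_parent_path()) {
    fs::create_directories(file.parent_path());
  }
  std::ofstream out(file);  // opening for writing creates an empty file
  return out.good();
}
```

A clean checkout then always has something at the path of the listed source (aten/src/ATen/metallib_dummy.cpp in this PR), and the rule that regenerates it can overwrite the placeholder later.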


pytorch-bot bot added ciflow/mps Run MPS tests (subset of trunk) release notes: mps Release notes category labels Oct 22, 2024

malfet requested a review from albanD October 23, 2024 00:38

malfet added the topic: improvements topic category label Oct 23, 2024

malfet marked this pull request as ready for review October 23, 2024 23:50

malfet requested a review from kulinseth as a code owner October 23, 2024 23:50

malfet added the ciflow/binaries_wheel Trigger binary build and upload jobs for wheel on the PR label Oct 24, 2024

malfet requested a review from manuelcandales October 25, 2024 02:23

malfet force-pushed the malfet/mps-compile-shaders branch from 94e7c46 to 3a0a1f2 Compare October 25, 2024 16:59

malfet added the skip-pr-sanity-checks label Oct 25, 2024

manuelcandales approved these changes Oct 25, 2024


malfet added 16 commits November 1, 2024 10:11

[MPS] Compile kernels into Metallib

87d5940

Pros:
- Faster runtime/validation, allows one to use shared headers
Cons:
- Needs a bit of build system work
- Needs Xcode rather than just CommandLineTools to make it work

And add option to load it from bundle

30f91c9

And bundle bfloat kernels as well

2c38dd7

And fix lint

9cfef1b

And this one as well

a6a72f9

Add check for full Xcode

1103b32

And add JIT fallback

c62298e

Fix detection again

7dd5f25

Move upsample shader to its own file

49d6dca

Add SpecialOps

f4995e4

Add BinaryOps

1a92f43

Add Bucketization

948aebd

Add Im2Col

1d45f80

Move quantized ops

7cffc79

Add historgam ops

8a90fd1

Fix lint

54beb1a

malfet added 7 commits November 1, 2024 10:11

Keep names aligned

18fc9c4

And those two

cab1d30

Add Renorm

2f5d4cf

And LinearAlgebra, the last one that could be moved trivially

2780594

And move functions to Metal.cmake

541d062

Unrelated change

1c934ba

Move repeat

9f3fd2e

malfet force-pushed the malfet/mps-compile-shaders branch from 05e37d3 to 9f3fd2e Compare November 1, 2024 17:11

Add proper dependency generation

01db75e

pytorchmergebot added the merging label Nov 1, 2024

pytorchmergebot added the Merged label Nov 1, 2024

pytorchmergebot closed this in a1f854f Nov 1, 2024

pytorchmergebot removed the merging label Nov 1, 2024

malfet mentioned this pull request Nov 4, 2024

[CMake] Fix local MPS builds #139651

Closed

github-actions bot deleted the malfet/mps-compile-shaders branch December 5, 2024 02:13

manuelcandales mentioned this pull request Jan 23, 2025

Add fused rms_norm implementation for MPS backend #145301

Closed
