KEMBAR78
[MPS] Compile kernels into Metallib by malfet · Pull Request #138636 · pytorch/pytorch · GitHub
Skip to content

Conversation

@malfet
Copy link
Contributor

@malfet malfet commented Oct 22, 2024

PyTorch MPS backend for the most part relies on MPSGraph to provide specific operations, but recently more and more often one had to implement custom kernel here that were simply embedded in the operator codebase and were compiled directly using - id<MTLLibrary>newLibraryWithSource:options:error: (first metal kernel to MPS backend was added in #82307 )
Later on, as number of operator grew, those were refactored into MetalShaderLibrary convenience class (see #125550 )

But as number of kernels keeps growing, it's time to make a next step and properly compile them into .metalib

This PR does exactly that by:

  • Moving shader sources into separate .metal files
  • Adds check on whether full Xcode installed or just DeveloperTools
  • If full Xcode is installed, compiles and links shaders into .metallib for Metal-3.0(Available on MacOS 13) and Metal-3.1 standard (available on MacOS 14, can use bfloat) and bundles both using -sectcreate linker option and getsectiondata API call. metallib_dummy.cpp file is used to properly express dependencies between metallib build and torch_cpu link stages. Logic for generating metallibraries is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
  • If only DeveloperTools CLI is installed, automatically wraps .metal into _metallib.h that contains shader source wrapped in MetalShaderLibrary

Bulk of changes introduced in this PR are just moving code around. I.e. for every file that contains non-templated shader definition in aten/src/ATen/native/mps/operators folder, corresponding .metal file is created in aten/src/ATen/native/mps/kernels folder and embedded shader definition is replaced with the following

#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif

Some historical stats:

PyTorch Version Number of shaders in MPS Ops added
1.12 0
1.13 2 bitwise_ops and index.out
2.0 4 cross repeat and view)
2.1 9 unary_ops, histogram, renorm, binary_ops
2.2 11 gamma and bucketization
2.3 12 naive_matmul (to workaround crash)
2.4 13 quantized_mm
2.5 14 fused_adam

Pros:

  • Better code structure/readability
  • Eventually allows one to use shared headers (and implement something like TensorIterator)
  • Faster runtime (as compilation is done ahead of time) and perhaps better optimized compiled kernels

Cons:

  • Build process is a bit more complicated that it used to be
  • Need to maintain two codepath (as our CI builders only has DeveloperTools installed)

cc @albanD

@pytorch-bot
Copy link

pytorch-bot bot commented Oct 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138636

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 01db75e with merge base 5e4c8b6 (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/mps Run MPS tests (subset of trunk) release notes: mps Release notes category labels Oct 22, 2024
@malfet malfet requested a review from albanD October 23, 2024 00:38
@malfet malfet added the topic: improvements topic category label Oct 23, 2024
@malfet malfet marked this pull request as ready for review October 23, 2024 23:50
@malfet malfet requested a review from kulinseth as a code owner October 23, 2024 23:50
@qqaatw
Copy link
Collaborator

qqaatw commented Oct 24, 2024

Nice improvement! We rely on existing tests?

@malfet
Copy link
Contributor Author

malfet commented Oct 24, 2024

Nice improvement! We rely on existing tests?

Yes, existing tests should be sufficient, though I would need to figure out how to install Xcode on our runners, otherwise they'll have to use embedded codepath

@malfet malfet added the ciflow/binaries_wheel Trigger binary build and upload jobs for wheel on the PR label Oct 24, 2024
@malfet malfet requested a review from manuelcandales October 25, 2024 02:23
@malfet malfet force-pushed the malfet/mps-compile-shaders branch from 94e7c46 to 3a0a1f2 Compare October 25, 2024 16:59
return()
endif()

set(BFLOAT_METAL_CODE "
Copy link
Contributor

@manuelcandales manuelcandales Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have metal code in the .cmake file? Can me move that inside some bfloat_utils.metal file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is not a real kernels, it just checks if host can compile something, this is a pretty standard practice to embed trivial test code in .cmake files, see

SET(AVX_CODE "
#include <immintrin.h>
int main()
{
__m256 a;
a = _mm256_set1_ps(0);
return 0;
}
")
for example

@malfet malfet force-pushed the malfet/mps-compile-shaders branch from 05e37d3 to 9f3fd2e Compare November 1, 2024 17:11
@malfet
Copy link
Contributor Author

malfet commented Nov 1, 2024

@pytorchbot merge -f "Builds and tests are green"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@swolchok
Copy link
Contributor

swolchok commented Nov 2, 2024

started getting build errors on my mac about missing aten/src/ATen/metallib_dummy.cpp locally on mac running python setup.py develop, even when running with --cmake or after python setup.py clean. working around via USE_MPS=OFF python setup.py develop

@wm901115nwpu
Copy link

started getting build errors on my mac about missing aten/src/ATen/metallib_dummy.cpp locally on mac running python setup.py develop, even when running with --cmake or after python setup.py clean. working around via USE_MPS=OFF python setup.py develop

list(APPEND Caffe2_CPU_SRCS aten/src/ATen/metallib_dummy.cpp)

malfet added a commit that referenced this pull request Nov 4, 2024
Not sure how it works on some machines, but clean build fails for me after #138636 was landed, even though it works fine on another machine.

Solution is to create an empty file when one adds a dependency, but later this dependency will be updated by the build rule
pytorchmergebot pushed a commit that referenced this pull request Nov 4, 2024
Not sure how it works on some machines, but clean build fails for me after #138636 was landed, even though it works fine on another machine.

Solution is to create an empty file when one adds a dependency, but later this dependency will be updated by the build rule

Pull Request resolved: #139651
Approved by: https://github.com/atalman
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
PyTorch MPS backend for the most part relies on MPSGraph to provide specific operations, but recently more and more often one had to implement custom kernel here that were simply embedded in the operator codebase and were compiled directly using [`- id<MTLLibrary>newLibraryWithSource:options:error:`](https://developer.apple.com/documentation/metal/mtldevice/1433431-newlibrarywithsource) (first metal kernel to MPS backend was added in pytorch#82307 )
Later on, as number of operator grew, those were refactored into `MetalShaderLibrary` convenience class (see  pytorch#125550 )

But as number of kernels keeps growing, it's time to make a next step and properly compile them into `.metalib`

This PR does exactly that by:
 - Moving shader sources into separate .metal files
 - Adds check on whether full Xcode installed or just DeveloperTools
 - If full Xcode is installed, compiles and links shaders into .metallib for Metal-3.0(Available on MacOS 13) and Metal-3.1 standard (available on MacOS 14, can use bfloat) and bundles both using `-sectcreate` linker option and `getsectiondata` API call. `metallib_dummy.cpp` file is used to properly express dependencies between metallib build and torch_cpu link stages. Logic for generating metallibraries is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
 - If only DeveloperTools CLI is installed, automatically wraps .metal into `_metallib.h` that contains shader source wrapped in `MetalShaderLibrary`

Bulk of changes introduced in this PR are just moving code around. I.e. for every file that contains non-templated shader definition in `aten/src/ATen/native/mps/operators` folder, corresponding `.metal` file is created in `aten/src/ATen/native/mps/kernels` folder and embedded shader definition is replaced with the following
```cpp
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif
```

Some historical stats:
| PyTorch Version  | Number of shaders in MPS | Ops added |
| ------------- | ------------- | ---- |
| 1.12  | 0  | |
| 1.13  | 2  | bitwise_ops and  index.out |
| 2.0  | 4  | cross repeat and view)  |
| 2.1  | 9   | unary_ops, histogram, renorm, binary_ops |
| 2.2  | 11   | gamma and bucketization |
| 2.3  | 12  | naive_matmul (to workaround crash) |
| 2.4 | 13 | quantized_mm |
| 2.5 | 14 | fused_adam |

Pros:
  - Better code structure/readability
  - Eventually allows one to use shared headers (and implement something like `TensorIterator`)
  - Faster runtime (as compilation is done ahead of time) and perhaps better optimized compiled kernels

Cons:
  - Build process is a bit more complicated that it used to be
  - Need to maintain two codepath (as our CI builders only has DeveloperTools installed)

Pull Request resolved: pytorch#138636
Approved by: https://github.com/manuelcandales
@github-actions github-actions bot deleted the malfet/mps-compile-shaders branch December 5, 2024 02:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/binaries_wheel Trigger binary build and upload jobs for wheel on the PR ciflow/mps Run MPS tests (subset of trunk) Merged release notes: mps Release notes category skip-pr-sanity-checks topic: improvements topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants