[PyTorch] Port ExecuTorch bfdot improvement back to ATen BlasKernel by swolchok · Pull Request #136331 · pytorch/pytorch · GitHub
Conversation

@swolchok
Contributor

@swolchok swolchok commented Sep 19, 2024

Stack from ghstack (oldest at bottom):

ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 .

Differential Revision: D63045939

@pytorch-bot

pytorch-bot bot commented Sep 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136331

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 1 Unrelated Failure

As of commit 63d29fa with merge base 99eb47f (image):

NEW FAILURES - The following jobs have failed:

  • linux-aarch64 / linux-jammy-aarch64-py3.10 / build (gh)
    /var/lib/jenkins/workspace/aten/src/ATen/native/BlasKernel.cpp:582:25: error: inlining failed in call to ‘always_inline’ ‘float at::native::blas_impl::{anonymous}::dot_with_fp32_arith_tail_after_main_loop(const T*, const T*, int64_t, float) [with T = c10::BFloat16]’: target specific option mismatch
  • linux-binary-libtorch-pre-cxx11 / libtorch-cpu-shared-with-deps-pre-cxx11-test / test (gh)
    RuntimeError: recursive_directory_iterator in used pre-CXX11 binaries, see; ['std::filesystem::recursive_directory_iterator::recursion_pending() const', 'std::filesystem::recursive_directory_iterator::depth() const', 'std::filesystem::recursive_directory_iterator::options() const', 'std::filesystem::recursive_directory_iterator::operator*() const', 'std::filesystem::recursive_directory_iterator::disable_recursion_pending()', 'std::filesystem::recursive_directory_iterator::pop(std::error_code&)', 'std::filesystem::recursive_directory_iterator::pop()', 'std::filesystem::recursive_directory_iterator::pop() [clone .cold]', 'std::filesystem::recursive_directory_iterator::increment(std::error_code&)', 'std::filesystem::recursive_directory_iterator::increment(std::error_code&) [clone .cold]', 'std::filesystem::recursive_directory_iterator::operator=(std::filesystem::recursive_directory_iterator&&)', 'std::filesystem::recursive_directory_iterator::operator=(std::filesystem::recursive_directory_iterator const&)', 'std::filesystem::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::path const&, std::filesystem::directory_options, std::error_code*)', 'std::filesystem::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::path const&, std::filesystem::directory_options, std::error_code*)', 'std::filesystem::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::path const&, std::filesystem::directory_options, std::error_code*) [clone .cold]', 'std::filesystem::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::recursive_directory_iterator::operator++()', 'std::filesystem::recursive_directory_iterator::operator++() [clone .cold]']
  • linux-binary-manywheel / manywheel-py3_9-cuda11_8-test / test (gh)
  • linux-binary-manywheel / manywheel-py3_9-cuda12_1-test / test (gh)
  • linux-binary-manywheel / manywheel-py3_9-cuda12_4-test / test (gh)
    RuntimeError: recursive_directory_iterator in used pre-CXX11 binaries (all three jobs report the same error and symbol list as the libtorch-cpu-shared-with-deps-pre-cxx11-test job above)

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D63045939

swolchok added a commit that referenced this pull request Sep 19, 2024
ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 .

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)

ghstack-source-id: 243596406
Pull Request resolved: #136331
@swolchok swolchok requested review from albanD and malfet September 19, 2024 19:54
@swolchok
Contributor Author

Note that this code is gated off under C10_MOBILE, but we have ExecuTorch for that anyway, right?

Comment on lines 247 to 253
#if defined(_MSC_VER)
#define C10_ALWAYS_INLINE_ATTRIBUTE
#elif __has_attribute(always_inline) || defined(__GNUC__)
#define C10_ALWAYS_INLINE_ATTRIBUTE __attribute__((__always_inline__))
#else
#define C10_ALWAYS_INLINE_ATTRIBUTE
#endif
Contributor

Can you please move it to a separate PR? Also, why is this needed (any perf numbers)? And why exclude clang?

Contributor Author

@swolchok swolchok Sep 23, 2024

why is this needed?(any perf numbers)

not having it prevents ForcedUnroll from working properly under -Oz. See pytorch/executorch#5247 .

why exclude clang

clang defines __GNUC__.
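
To illustrate the point about forced unrolling, here is a minimal sketch of a compile-time unroll helper whose call operator carries an always-inline attribute. The macro name `SKETCH_ALWAYS_INLINE` and the helper `sum4` are made up for this example, and the `ForcedUnroll` shape shown is an assumption about the general technique, not a copy of the code in the PR. Because clang also defines `__GNUC__`, the single `defined(__GNUC__)` branch covers both GCC and clang.

```cpp
// Hypothetical macro for this sketch only; it mirrors the idea of the
// snippet under review but is not the macro from the PR.
#if defined(_MSC_VER)
#define SKETCH_ALWAYS_INLINE
#elif defined(__GNUC__)  // true for GCC and also for clang
#define SKETCH_ALWAYS_INLINE __attribute__((__always_inline__))
#else
#define SKETCH_ALWAYS_INLINE
#endif

// Recursively expands to f(0), f(1), ..., f(N - 1) at compile time.
// Without forced inlining, a size-optimizing build may emit real calls
// for each level instead of a flat, unrolled sequence.
template <int N>
struct ForcedUnroll {
  template <typename Func>
  SKETCH_ALWAYS_INLINE void operator()(const Func& f) const {
    ForcedUnroll<N - 1>{}(f);
    f(N - 1);
  }
};

// Base case: a single iteration.
template <>
struct ForcedUnroll<1> {
  template <typename Func>
  SKETCH_ALWAYS_INLINE void operator()(const Func& f) const {
    f(0);
  }
};

// Example use (hypothetical helper): sum four consecutive floats.
inline float sum4(const float* x) {
  float acc = 0.0f;
  ForcedUnroll<4>{}([&](int i) { acc += x[i]; });
  return acc;
}
```

Whether the attribute changes the generated code depends on the compiler and optimization level, so any benefit should be checked on the actual build configuration.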

Contributor Author

separate PR

done. #136445

swolchok added a commit that referenced this pull request Sep 23, 2024
Pull Request resolved: #136331

ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes not-yet-reviewed pytorch/executorch#5444 .
ghstack-source-id: 244187352

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
@swolchok swolchok requested a review from malfet September 23, 2024 22:08
swolchok added a commit that referenced this pull request Sep 30, 2024
Pull Request resolved: #136331

ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 .
ghstack-source-id: 245455630

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
@pytorch-bot pytorch-bot bot added the ciflow/trunk label Sep 30, 2024
Collaborator

@albanD albanD left a comment

CI failures look very much relevant!

static_assert(kF16RegistersPerIteration == kF16ElementsPerIteration / kF16ElementsPerRegister);

- static inline double reduce(float16x8_t x[kF16RegistersPerIteration]) {
+ static inline float reduce(float16x8_t x[kF16RegistersPerIteration]) {
Collaborator

Wasn't this done on purpose to preserve accuracy?

Contributor Author

I believe @malfet asked for it. Perhaps I got confused, though.

Collaborator

OK, in any case, the numerical error on macOS ARM seems relevant and needs investigation, right?

Contributor Author

OK, in any case, the numerical error on macOS ARM seems relevant and needs investigation, right?

Yes, it's fixed. I accidentally replaced the float16 kernel with the bfloat16 kernel due to name changes.
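
On the accuracy question about `double` versus `float`, the following sketch shows how the accumulator type alone can change the result of a long reduction. It is a generic illustration, not the `reduce` function from the PR; `sum_repeated` and `abs_error_vs_exact` are hypothetical helper names.

```cpp
#include <cmath>

// Adds `value` to an accumulator of type Acc `count` times.
// The only thing that varies between instantiations is the accumulator width.
template <typename Acc>
Acc sum_repeated(float value, int count) {
  Acc acc = 0;
  for (int i = 0; i < count; ++i) {
    acc += value;
  }
  return acc;
}

// Absolute difference between an accumulated sum and value * count
// computed in double precision.
inline double abs_error_vs_exact(double got, float value, int count) {
  return std::fabs(got - static_cast<double>(value) * count);
}
```

With a large `count` (for example ten million additions of `0.1f`), the `float` accumulator drifts far from the reference while the `double` accumulator stays close. For short reductions, such as summing a handful of vector lanes, the difference is typically much smaller, so the trade-off depends on the reduction length.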

swolchok added a commit that referenced this pull request Oct 2, 2024
Pull Request resolved: #136331

ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes #127488 . Includes pytorch/executorch#5444 .
ghstack-source-id: 245903000

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
@swolchok swolchok requested a review from albanD October 2, 2024 15:40
@swolchok
Contributor Author

swolchok commented Oct 4, 2024

I will probably reapply this and just disable the BF16 option for GCC.

@abhishek-iitmadras
Collaborator

I will probably reapply this and just disable the BF16 option for GCC.

Hi @swolchok (cc @malfet )

If I add the code below:

TARGET_ARM_BF16_ATTRIBUTE
static C10_ALWAYS_INLINE float dot_with_fp32_arith_tail_after_main_loop_bf16(
    const BFloat16* vec1,
    const BFloat16* vec2,
    int64_t len,
    float reducedSum) {
  const auto len_aligned = len & ~(kF32ElementsPerIteration - 1);
  for (int64_t j = len_aligned; j < len; ++j) {
    reducedSum += static_cast<float>(vec1[j]) * static_cast<float>(vec2[j]);
  }
  return reducedSum;
}

and modify dot_with_fp32_arith_bfdot to use the bf16-specific dot_with_fp32_arith_tail_after_main_loop_bf16, then the build passes successfully.

#if COMPILER_SUPPORTS_BF16_TARGET
TARGET_ARM_BF16_ATTRIBUTE float
dot_with_fp32_arith_bfdot(const BFloat16* vec1, const BFloat16* vec2, int64_t len) {
  auto reducedSum = dot_with_fp32_arith_main_loop_bfdot(vec1, vec2, len);
  return dot_with_fp32_arith_tail_after_main_loop_bf16(vec1, vec2, len, reducedSum);
}
#endif // COMPILER_SUPPORTS_BF16_TARGET

So the issue might be that dot_with_fp32_arith_bfdot returns the result of dot_with_fp32_arith_tail_after_main_loop, and in dot_with_fp32_arith_tail_after_main_loop you use dot_with_fp32_arith_vectorized_tail_inner_loop, which is not specific to bf16.

Correct me if I am wrong.
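
For readers unfamiliar with the `len & ~(k - 1)` idiom in the snippet above, here is a minimal scalar sketch of the main-loop/tail split. The constant `kElementsPerIteration` and the functions `round_down_to_block` and `dot_with_tail` are hypothetical names for this example; the real kernel runs vector instructions in the main loop, which this sketch replaces with a plain inner loop.

```cpp
#include <cstdint>

// Hypothetical block size for this sketch; the mask trick below only
// works when the block size is a power of two.
constexpr int64_t kElementsPerIteration = 8;
static_assert((kElementsPerIteration & (kElementsPerIteration - 1)) == 0,
              "block size must be a power of two");

// Rounds len down to the nearest multiple of the block size by clearing
// the low bits.
inline int64_t round_down_to_block(int64_t len) {
  return len & ~(kElementsPerIteration - 1);
}

inline float dot_with_tail(const float* a, const float* b, int64_t len) {
  const int64_t len_aligned = round_down_to_block(len);
  float sum = 0.0f;
  // Main loop: whole blocks only. A vectorized kernel would process
  // each block with SIMD instructions here.
  for (int64_t j = 0; j < len_aligned; j += kElementsPerIteration) {
    for (int64_t k = 0; k < kElementsPerIteration; ++k) {
      sum += a[j + k] * b[j + k];
    }
  }
  // Tail: the remaining len - len_aligned elements, one at a time.
  for (int64_t j = len_aligned; j < len; ++j) {
    sum += a[j] * b[j];
  }
  return sum;
}
```

The tail loop starts at `len_aligned`, so it must be computed with the same block size that the main loop uses; otherwise elements are skipped or counted twice.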

@swolchok
Contributor Author

swolchok commented Oct 4, 2024

you use dot_with_fp32_arith_vectorized_tail_inner_loop which is not specific to bf16.

Correct me if I am wrong.

It should be fine to inline a not-specific-to-bf16 function in a specific-to-bf16 function. In fact, clang does it correctly. I would prefer not to duplicate code to work around compiler bugs.

@swolchok
Contributor Author

swolchok commented Oct 4, 2024

prefer not to duplicate code

it will be slightly ugly, but we can put the body of the one function we'd need to duplicate in a macro so as not to punish people for using ARM GCC. I will try that.
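
One way the macro idea could look is sketched below. This is an assumption about the general approach, not the code that landed in #137377. `SKETCH_TARGET_BF16_ATTRIBUTE`, `SKETCH_TAIL_LOOP_BODY`, `tail_generic`, and `tail_bf16_target` are hypothetical names, and the target attribute is left empty so the sketch builds on any platform; in real code it would expand to a compiler-specific target attribute.

```cpp
#include <cstdint>

// Hypothetical stand-in for a per-function target attribute. It is empty
// here so the sketch compiles everywhere.
#define SKETCH_TARGET_BF16_ATTRIBUTE

// The loop body lives in one macro so it is written only once, even
// though it is expanded into two separately compiled functions.
#define SKETCH_TAIL_LOOP_BODY(vec1, vec2, start, len, sum)                \
  for (int64_t j = (start); j < (len); ++j) {                             \
    (sum) += static_cast<float>((vec1)[j]) * static_cast<float>((vec2)[j]); \
  }

// Version compiled with the default target options.
template <typename T>
inline float tail_generic(
    const T* vec1, const T* vec2, int64_t start, int64_t len, float sum) {
  SKETCH_TAIL_LOOP_BODY(vec1, vec2, start, len, sum)
  return sum;
}

// Version carrying the target attribute, so a caller with the same
// attribute can inline it without a target-option mismatch.
template <typename T>
SKETCH_TARGET_BF16_ATTRIBUTE inline float tail_bf16_target(
    const T* vec1, const T* vec2, int64_t start, int64_t len, float sum) {
  SKETCH_TAIL_LOOP_BODY(vec1, vec2, start, len, sum)
  return sum;
}
```

The design trade-off is that the macro hides the loop from debuggers and tooling, which is the "slightly ugly" part, but it avoids maintaining two hand-written copies of the same loop.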

swolchok added a commit that referenced this pull request Oct 4, 2024
…Try #2

ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. First attempt was #136331 .

Differential Revision: [D63923166](https://our.internmc.facebook.com/intern/diff/D63923166/)

[ghstack-poisoned]
swolchok added a commit that referenced this pull request Oct 4, 2024
…Try #2

ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. First attempt was #136331 .

Differential Revision: [D63923166](https://our.internmc.facebook.com/intern/diff/D63923166/)

ghstack-source-id: 246411194
Pull Request resolved: #137377
@swolchok
Contributor Author

swolchok commented Oct 4, 2024

#137377 is the second attempt

@swolchok swolchok closed this Oct 4, 2024
pytorchmergebot pushed a commit that referenced this pull request Oct 10, 2024
…Try #2 (#137377)

ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. First attempt was #136331 .

Differential Revision: [D63923166](https://our.internmc.facebook.com/intern/diff/D63923166/)
Pull Request resolved: #137377
Approved by: https://github.com/malfet
@github-actions github-actions bot deleted the gh/swolchok/647/head branch November 6, 2024 02:07
Labels

ciflow/linux-aarch64, ciflow/trunk, fb-exported, Merged, Reverted, topic: performance
