Enable TF32 as fp32 internal precision for matmul/linear/conv by yanbing-j · Pull Request #157520 · pytorch/pytorch · GitHub

Conversation

@yanbing-j
Collaborator

@yanbing-j yanbing-j commented Jul 3, 2025

Description

This PR enables TF32 as the fp32 internal precision for matmul/linear/conv in the mkldnn backend. Since we refined the fp32 precision API in #125888, we can easily extend it to support TF32 for the mkldnn backend.

torch.backends.mkldnn.matmul.fp32_precision = 'tf32'
torch.backends.mkldnn.conv.fp32_precision = "tf32"
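
As a quick usage sketch (illustrative only; the reset value "ieee", the tensor shapes, and the conv module are my own choices, not part of this PR):

import torch

# Allow TF32 as the internal compute type for fp32 matmul/linear and conv
# in the mkldnn (oneDNN) CPU backend.
torch.backends.mkldnn.matmul.fp32_precision = "tf32"
torch.backends.mkldnn.conv.fp32_precision = "tf32"

x = torch.randn(256, 256)                      # fp32 inputs; only the internal math changes
y = x @ torch.randn(256, 256)                  # may use TF32 internally on supporting CPUs
out = torch.nn.Conv2d(8, 16, 3)(torch.randn(1, 8, 32, 32))

# Restore strict fp32 computation ("ieee" disallows reduced-precision internal math).
torch.backends.mkldnn.matmul.fp32_precision = "ieee"
torch.backends.mkldnn.conv.fp32_precision = "ieee"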

The related kernel updates and UT updates are done. The wrapper bf32_on_and_off is renamed to reduced_f32_on_and_off; it runs each test 3 times: once with reduced_f32 OFF and twice with reduced_f32 ON (bf32 ON and tf32 ON).
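
For intuition, a hypothetical sketch of such a wrapper (not the actual PyTorch test-infra implementation; the precision list, the attribute used, and the restore logic are assumptions):

import functools
import torch

def reduced_f32_on_and_off(func):
    # Hypothetical: run the wrapped test three times, once with reduced
    # fp32 precision off ("ieee") and twice with it on ("bf16", "tf32").
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for precision in ("ieee", "bf16", "tf32"):
            saved = torch.backends.mkldnn.matmul.fp32_precision
            torch.backends.mkldnn.matmul.fp32_precision = precision
            try:
                func(*args, **kwargs)
            finally:
                torch.backends.mkldnn.matmul.fp32_precision = saved
    return wrapper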

Stack from ghstack (oldest at bottom):

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @gujinghui @PenghuiCheng @jianyuh @min-jean-cho @Guobing-Chen @Xia-Weiwen @snadampal @voznesenskym @penguinwu @EikanWang @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

@pytorch-bot

pytorch-bot bot commented Jul 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157520

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (12 Unrelated Failures)

As of commit 342dc1f with merge base d7e1b8b:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor ciflow/linux-aarch64 linux aarch64 CI workflow module: cpu CPU specific problem (e.g., perf, algorithm) module: inductor module: mkldnn Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration release notes: linalg_frontend release notes category labels Jul 3, 2025
@yanbing-j yanbing-j marked this pull request as draft July 3, 2025 01:32
@yanbing-j yanbing-j changed the title from "Enable TF32 as fp32 internal precision for matmul" to "Enable TF32 as fp32 internal precision for matmul/linear/conv" Jul 3, 2025
[ghstack-poisoned]
[ghstack-poisoned]
@yanbing-j yanbing-j added ciflow/trunk Trigger trunk jobs on your pull request ciflow/rocm Trigger "default" config CI on ROCm ci-no-td Do not run TD on this PR ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/inductor-rocm Trigger "inductor" config CI on ROCm labels Jul 3, 2025
@yanbing-j yanbing-j requested a review from mingfeima July 3, 2025 01:56
float32_matmul_precision = at::Float32MatmulPrecision::HIGH;
setFloat32Precision("cuda", "matmul", "tf32");
setFloat32Precision("mkldnn", "matmul", "ieee");
setFloat32Precision("mkldnn", "matmul", "tf32");
Collaborator

I don't quite understand what this change means, "ieee" to "tf32".

Collaborator Author

In the description of #125888, it says:

We provide 3 fp32 compute precisions that can be set:
"ieee": Not allowed to use any other internal computation data type.
"tf32": Allowed to use tf32 as an internal computation data type.
"bf16": Allowed to use bf16 as an internal computation data type.
"none": Precisions are not set. Can be overridden by its parent node.

"HIGHEST, HIGH, MEDIUM" is a legacy representation that maps to ieee, tf32, and bf16 respectively.

So without this PR, the mkldnn backend only supports ieee, bf16 and none. If the precision is set to HIGH, tf32 is not supported in mkldnn, so ieee is used instead. With this PR, tf32 is supported in mkldnn, so we can use tf32 directly.
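
Put differently (a sketch of the expected behavior after this PR; whether the per-backend value can be read back from Python like this is my assumption):

import torch

torch.set_float32_matmul_precision("high")            # legacy "HIGH" setting
# Before this PR: mkldnn has no tf32 option, so "high" fell back to "ieee" there.
# After this PR:  "high" maps to "tf32" for the mkldnn backend as well.
print(torch.backends.mkldnn.matmul.fp32_precision)    # expected: "tf32"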

Collaborator

@mingfeima mingfeima left a comment

Just some suggestions for simplifying the code.

mat2.numel() != 0 &&
checksize(mat1, mat2));
}

Collaborator

This file has multiple functions with similar usage:
use_mkldnn_bf16_matmul, use_mkldnn_fp16_matmul, use_mkldnn_bf32_matmul and use_mkldnn_tf32_matmul

Can we templatize them to simplify the code?

template <typename T>
bool use_mkldnn_matmul();

#if defined(__aarch64__)
template <>
bool use_mkldnn_matmul<at::BFloat16>();
#endif

const Tensor& result) {
return (use_mkldnn_bf16_matmul(mat1, mat2, result) || use_mkldnn_fp16_matmul(mat1, mat2, result) || use_mkldnn_bf32_matmul(mat1, mat2, result));
return (use_mkldnn_bf16_matmul(mat1, mat2, result) || use_mkldnn_fp16_matmul(mat1, mat2, result) || use_mkldnn_bf32_matmul(mat1, mat2, result) || use_mkldnn_tf32_matmul(mat1, mat2, result));
}
Collaborator

If you can template use_mkldnn_matmul<>, then you can do something like:

AT_DISPATCH_FLOATING_TYPES_AND2(kBFloat16, kHalf, ..., [&] {
  return use_mkldnn_matmul<scalar_t>(...);
});

Collaborator Author

Done. Please take a look again.

yanbing-j added a commit that referenced this pull request Jul 3, 2025
Enable TF32 as fp32 internal precision for Linear

Enable TF32 as fp32 internal precision for conv

ghstack-source-id: fc7fed1
Pull Request resolved: #157520
[ghstack-poisoned]
const std::map<std::string, std::vector<std::string>> _fp32_precisions = {
{"generic", {{"ieee", "tf32", "bf16", "none"}}},
{"mkldnn", {{"ieee", "bf16", "none"}}},
{"mkldnn", {{"ieee", "bf16", "tf32", "none"}}},
Collaborator

Why is the ordering different from "generic"?

Collaborator Author

No specific reason, I will change it to be the same as generic.

yanbing-j added a commit that referenced this pull request Jul 3, 2025
Enable TF32 as fp32 internal precision for Linear

Enable TF32 as fp32 internal precision for conv

ghstack-source-id: bd019ac
Pull Request resolved: #157520
[ghstack-poisoned]
yanbing-j added a commit that referenced this pull request Jul 8, 2025
Enable TF32 as fp32 internal precision for Linear

Enable TF32 as fp32 internal precision for conv

ghstack-source-id: 4901b16
Pull Request resolved: #157520
[ghstack-poisoned]
@yanbing-j yanbing-j requested a review from jansel July 8, 2025 08:32
@yanbing-j yanbing-j marked this pull request as ready for review July 8, 2025 08:32
@yanbing-j
Collaborator Author

Hi @jansel, could you please also take a look at this PR? Similar to the previous PRs that enabled BF32, we can easily extend the API to support TF32 for the mkldnn backend. Thanks!

yanbing-j added a commit that referenced this pull request Jul 14, 2025
Enable TF32 as fp32 internal precision for Linear

Enable TF32 as fp32 internal precision for conv

ghstack-source-id: 5365a8c
Pull Request resolved: #157520
yanbing-j added a commit to yanbing-j/pytorch that referenced this pull request Jul 14, 2025
Enable TF32 as fp32 internal precision for Linear

Enable TF32 as fp32 internal precision for conv

ghstack-source-id: 5365a8c
Pull Request resolved: pytorch#157520
[ghstack-poisoned]
Contributor

@jansel jansel left a comment

Test failures?

@yanbing-j
Collaborator Author

yanbing-j commented Jul 15, 2025

Test failures?

The failures appeared after I rebased yesterday, and it is interesting that they are the same as in #158209, which only updates a warning log. Let me try to find the cause.

Update: The failures are caused by #150762. See #150762 (comment)
I have done the rebase.

[ghstack-poisoned]
yanbing-j added a commit that referenced this pull request Jul 16, 2025
Enable TF32 as fp32 internal precision for Linear

Enable TF32 as fp32 internal precision for conv

ghstack-source-id: 17e8040
Pull Request resolved: #157520
[ghstack-poisoned]
@yanbing-j
Collaborator Author

Hi @jansel, all CI passes after the rebase. Could you please take a look again? Thanks!

yanbing-j added a commit that referenced this pull request Jul 17, 2025
Enable TF32 as fp32 internal precision for Linear

Enable TF32 as fp32 internal precision for conv

ghstack-source-id: cc560bf
Pull Request resolved: #157520
[ghstack-poisoned]
@yanbing-j
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Aug 4, 2025
…9024)

This PR fixes the performance downgrade by reverting the template use in `use_mkldnn_matmul` from #157520. Fixes #159031 and #159551.

Pull Request resolved: #159024
Approved by: https://github.com/mingfeima
@github-actions github-actions bot deleted the gh/yanbing-j/39/head branch August 17, 2025 02:20