Enable TF32 as fp32 internal precision for matmul/linear/conv by yanbing-j · Pull Request #157520 · pytorch/pytorch · GitHub

Conversation

@yanbing-j
Collaborator

@yanbing-j yanbing-j commented Jul 3, 2025

Description

This PR enables TF32 as the fp32 internal precision for matmul/linear/conv in the mkldnn backend. Since we refined the fp32 precision API in #125888, we can easily extend it to support TF32 for the mkldnn backend.

torch.backends.mkldnn.matmul.fp32_precision = 'tf32'
torch.backends.mkldnn.conv.fp32_precision = "tf32"
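
As a quick usage sketch (illustrative only; the reset value "ieee", the tensor shapes, and the conv module are my own choices, not part of this PR):

import torch

# Allow TF32 as the internal compute type for fp32 matmul/linear and conv
# in the mkldnn (oneDNN) CPU backend.
torch.backends.mkldnn.matmul.fp32_precision = "tf32"
torch.backends.mkldnn.conv.fp32_precision = "tf32"

x = torch.randn(256, 256)                      # fp32 inputs; only the internal math changes
y = x @ torch.randn(256, 256)                  # may use TF32 internally on supporting CPUs
out = torch.nn.Conv2d(8, 16, 3)(torch.randn(1, 8, 32, 32))

# Restore strict fp32 computation ("ieee" disallows reduced-precision internal math).
torch.backends.mkldnn.matmul.fp32_precision = "ieee"
torch.backends.mkldnn.conv.fp32_precision = "ieee"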

The related kernel updates and UT updates are done. The wrapper bf32_on_and_off is renamed to reduced_f32_on_and_off; it runs each test 3 times: once with reduced_f32 OFF and twice with reduced_f32 ON (bf32 ON and tf32 ON).
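
For intuition, a hypothetical sketch of such a wrapper (not the actual PyTorch test-infra implementation; the precision list, the attribute used, and the restore logic are assumptions):

import functools
import torch

def reduced_f32_on_and_off(func):
    # Hypothetical: run the wrapped test three times, once with reduced
    # fp32 precision off ("ieee") and twice with it on ("bf16", "tf32").
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for precision in ("ieee", "bf16", "tf32"):
            saved = torch.backends.mkldnn.matmul.fp32_precision
            torch.backends.mkldnn.matmul.fp32_precision = precision
            try:
                func(*args, **kwargs)
            finally:
                torch.backends.mkldnn.matmul.fp32_precision = saved
    return wrapper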

Stack from ghstack (oldest at bottom):

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @gujinghui @PenghuiCheng @jianyuh @min-jean-cho @Guobing-Chen @Xia-Weiwen @snadampal @voznesenskym @penguinwu @EikanWang @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

@pytorch-bot

pytorch-bot bot commented Jul 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157520

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (12 Unrelated Failures)

As of commit 342dc1f with merge base d7e1b8b:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor ciflow/linux-aarch64 linux aarch64 CI workflow module: cpu CPU specific problem (e.g., perf, algorithm) module: inductor module: mkldnn Related to Intel IDEEP or oneDNN (a.k.a. mkldnn) integration release notes: linalg_frontend release notes category labels Jul 3, 2025
@yanbing-j yanbing-j marked this pull request as draft July 3, 2025 01:32
@yanbing-j yanbing-j changed the title from "Enable TF32 as fp32 internal precision for matmul" to "Enable TF32 as fp32 internal precision for matmul/linear/conv" Jul 3, 2025
[ghstack-poisoned]
[ghstack-poisoned]
@yanbing-j yanbing-j added ciflow/trunk Trigger trunk jobs on your pull request ciflow/rocm Trigger "default" config CI on ROCm ci-no-td Do not run TD on this PR ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/inductor-rocm Trigger "inductor" config CI on ROCm labels Jul 3, 2025
@yanbing-j yanbing-j requested a review from mingfeima July 3, 2025 01:56
float32_matmul_precision = at::Float32MatmulPrecision::HIGH;
setFloat32Precision("cuda", "matmul", "tf32");
setFloat32Precision("mkldnn", "matmul", "ieee");
setFloat32Precision("mkldnn", "matmul", "tf32");
Collaborator

I don't quite understand what this change means, "ieee" to "tf32".

Collaborator Author

In the description of #125888, it says:

We provide 3 fp32 compute precisions that can be set:
"ieee": Not allowed to use any other internal computation data type.
"tf32": Allowed to use tf32 as an internal computation data type.
"bf16": Allowed to use bf16 as an internal computation data type.
"none": Precisions are not set. Can be overridden by its parent node.

"HIGHEST, HIGH, MEDIUM" is a legacy representation that maps to ieee, tf32, and bf16 respectively.

So without this PR, the mkldnn backend only supports ieee, bf16 and none. If the precision is set to HIGH, tf32 is not supported in mkldnn, so ieee is used instead. With this PR, tf32 is supported in mkldnn, so we can use tf32 directly.
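
Put differently (a sketch of the expected behavior after this PR; whether the per-backend value can be read back from Python like this is my assumption):

import torch

torch.set_float32_matmul_precision("high")            # legacy "HIGH" setting
# Before this PR: mkldnn has no tf32 option, so "high" fell back to "ieee" there.
# After this PR:  "high" maps to "tf32" for the mkldnn backend as well.
print(torch.backends.mkldnn.matmul.fp32_precision)    # expected: "tf32"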

Collaborator

@mingfeima mingfeima left a comment

Just some suggestions for simplifying the code.

mat2.numel() != 0 &&
checksize(mat1, mat2));
}

Collaborator

This file has multiple functions with similar usage:
use_mkldnn_bf16_matmul, use_mkldnn_fp16_matmul, use_mkldnn_bf32_matmul and use_mkldnn_tf32_matmul

Can we templatize them to simplify the code?

template <typename T>
bool use_mkldnn_matmul();

#if defined(__aarch64__)
template <>
bool use_mkldnn_matmul<at::BFloat16>();
#endif

const Tensor& result) {
return (use_mkldnn_bf16_matmul(mat1, mat2, result) || use_mkldnn_fp16_matmul(mat1, mat2, result) || use_mkldnn_bf32_matmul(mat1, mat2, result));
return (use_mkldnn_bf16_matmul(mat1, mat2, result) || use_mkldnn_fp16_matmul(mat1, mat2, result) || use_mkldnn_bf32_matmul(mat1, mat2, result) || use_mkldnn_tf32_matmul(mat1, mat2, result));
}
Collaborator

If you can template use_mkldnn_matmul<>, then you can do something like:

AT_DISPATCH_FLOATING_TYPES_AND2(kBFloat16, kHalf, ..., [&] {
  return use_mkldnn_matmul<scalar_t>(...);
});

Collaborator Author

Done. Please take a look again.

yanbing-j added a commit that referenced this pull request Jul 3, 2025
Enable TF32 as fp32 internal precision for Linear

Enable TF32 as fp32 internal precision for conv

ghstack-source-id: fc7fed1
Pull Request resolved: #157520
[ghstack-poisoned]
const std::map<std::string, std::vector<std::string>> _fp32_precisions = {
{"generic", {{"ieee", "tf32", "bf16", "none"}}},
{"mkldnn", {{"ieee", "bf16", "none"}}},
{"mkldnn", {{"ieee", "bf16", "tf32", "none"}}},
Collaborator

Why is the ordering different from "generic"?

Collaborator Author

No specific reason, I will change it to be the same as generic.

yanbing-j added a commit that referenced this pull request Jul 3, 2025
Enable TF32 as fp32 internal precision for Linear

Enable TF32 as fp32 internal precision for conv

ghstack-source-id: bd019ac
Pull Request resolved: #157520
[ghstack-poisoned]
yanbing-j added a commit that referenced this pull request Jul 8, 2025
Enable TF32 as fp32 internal precision for Linear

Enable TF32 as fp32 internal precision for conv

ghstack-source-id: 4901b16
Pull Request resolved: #157520
[ghstack-poisoned]
@yanbing-j yanbing-j requested a review from jansel July 8, 2025 08:32
@yanbing-j yanbing-j marked this pull request as ready for review July 8, 2025 08:32
@yanbing-j
Collaborator Author

Hi @jansel, could you please also take a look at this PR? Similar to the previous PRs that enabled BF32, we can easily extend the API to support TF32 for the mkldnn backend. Thanks!

yanbing-j added a commit that referenced this pull request Jul 14, 2025
Enable TF32 as fp32 internal precision for Linear

Enable TF32 as fp32 internal precision for conv

ghstack-source-id: 5365a8c
Pull Request resolved: #157520
yanbing-j added a commit to yanbing-j/pytorch that referenced this pull request Jul 14, 2025
Enable TF32 as fp32 internal precision for Linear

Enable TF32 as fp32 internal precision for conv

ghstack-source-id: 5365a8c
Pull Request resolved: pytorch#157520
[ghstack-poisoned]
Contributor

@jansel jansel left a comment

Test failures?

@yanbing-j
Collaborator Author

yanbing-j commented Jul 15, 2025

Test failures?

The failures appeared after I rebased yesterday, and it is interesting that they are the same as in #158209, which only updates a warning log. Let me try to find the cause.

Update: The failures are caused by #150762. See #150762 (comment)
I have done the rebase.

[ghstack-poisoned]
yanbing-j added a commit that referenced this pull request Jul 16, 2025
Enable TF32 as fp32 internal precision for Linear

Enable TF32 as fp32 internal precision for conv

ghstack-source-id: 17e8040
Pull Request resolved: #157520
[ghstack-poisoned]
@yanbing-j
Collaborator Author

Hi @jansel, all CI passes after the rebase. Could you please take a look again? Thanks!

yanbing-j added a commit that referenced this pull request Jul 17, 2025
Enable TF32 as fp32 internal precision for Linear

Enable TF32 as fp32 internal precision for conv

ghstack-source-id: cc560bf
Pull Request resolved: #157520
[ghstack-poisoned]
@yanbing-j
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Aug 4, 2025
…9024)

This PR fixes the performance downgrade by reverting the template use in `use_mkldnn_matmul` from #157520. Fixes #159031 and #159551.

Pull Request resolved: #159024
Approved by: https://github.com/mingfeima
@github-actions github-actions bot deleted the gh/yanbing-j/39/head branch August 17, 2025 02:20