[CPU][Inductor] Improve performance of A16W8 GEMM template by Xia-Weiwen · Pull Request #161148 · pytorch/pytorch · GitHub

Conversation

@Xia-Weiwen
Collaborator

@Xia-Weiwen Xia-Weiwen commented Aug 21, 2025

Summary
This PR improves the performance of the A16W8 GEMM template by:

  • Removing the config with block_n=48 & block_m=16, as it is not very efficient.
  • Using the AMX microkernel when M >= 5, so that AMX is used instead of AVX512 for M = 5~31.
  • Converting int8 values to bf16 with intrinsics instead of `at::vec::convert`, as the latter does not have an optimized implementation for this case (see the illustrative sketch below).

We saw performance gains of over 10% in various cases when running Llama-3.1-8b-instruct.
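As an illustration of the third point, here is a minimal, self-contained sketch of the int8 -> bf16 widening path using AVX512 intrinsics. It is not the exact kernel code in the template; it assumes a compiler with AVX512F and AVX512-BF16 support (e.g. built with `-mavx512f -mavx512bf16`), and the helper name is made up for this example.

```cpp
#include <immintrin.h>
#include <cstdint>

// Illustrative helper (not the template's actual code): widen 16 int8 values
// to bf16 via int8 -> int32 (sign extend) -> f32 -> bf16 (round-to-nearest-even).
// Requires AVX512F and AVX512-BF16 (-mavx512f -mavx512bf16).
static inline void int8_to_bf16_x16(const int8_t* src, uint16_t* dst) {
  // 1) Load 16 int8 values.
  __m128i v8 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
  // 2) Sign-extend to 16 x int32.
  __m512i v32 = _mm512_cvtepi8_epi32(v8);
  // 3) Convert to 16 x f32.
  __m512 f32 = _mm512_cvtepi32_ps(v32);
  // 4) Convert f32 -> bf16 with hardware round-to-nearest-even.
  __m256i bf16 = (__m256i)_mm512_cvtneps_pbh(f32);
  // 5) Store the 16 bf16 values as raw 16-bit words.
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst), bf16);
}
```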

Test plan
Already covered by UT.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

@pytorch-bot

pytorch-bot bot commented Aug 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161148

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please review it.

✅ No Failures

As of commit 746766f with merge base fa76256:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@sanchitintel
Collaborator

sanchitintel commented Aug 26, 2025

Hi,

> Converting int8 values to bf16 with intrinsics instead of at::vec::convert as the latter does not have optimized implementation for this case.

Please advise if it's somehow possible to optimize at::vec::convert instead.

Thank you!

@Xia-Weiwen
Collaborator Author

> Hi,
>
> Converting int8 values to bf16 with intrinsics instead of at::vec::convert as the latter does not have optimized implementation for this case.
>
> Please advise if it's possible to somehow optimize at::vec::convert instead.
>
> Thank you!

There is no int8->bf16 specialization of `at::vec::convert` right now. We just need to add one.

@Xia-Weiwen
Collaborator Author

Hi @CaoE @mingfeima Could you please review? Thanks.

// 4) Convert to f32
__m512 f32 = _mm512_cvtepi32_ps(v32);
// 5) Convert f32 -> bf16 (round-to-nearest-even)
__m256i bf16 = (__m256i)_mm512_cvtneps_pbh(f32);
Collaborator

@CaoE CaoE Aug 28, 2025


Since intrinsics are used, it is better to check whether the compiler supports them, e.g., _mm512_cvtneps_pbh.
If the compiler does not support them, it will choose the ATen linear and lose the opportunity of using the AMX microgemm.
Maybe we can do something like #147368.
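For illustration, one possible shape of such a guard is a preprocessor check on the compiler's feature macros, with a fallback when AVX512-BF16 intrinsics are unavailable. This is only a sketch of the general idea; the actual fix may instead follow the compile-test approach referenced above (#147368).

```cpp
#include <immintrin.h>

// Sketch of a feature guard (illustrative, not the PR's actual code):
// only emit the AVX512-BF16 intrinsic when the compiler advertises support.
#if defined(__AVX512F__) && defined(__AVX512BF16__)
static inline __m256i cvt_fp32_to_bf16(__m512 f32) {
  return (__m256i)_mm512_cvtneps_pbh(f32);  // hardware round-to-nearest-even
}
#else
// Without AVX512-BF16 support, a scalar/emulated conversion would go here,
// or this codegen path would be skipped so the ATen linear fallback is used.
#endif
```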

Collaborator

BTW, which versions of the compiler support these instructions?

Collaborator Author

Updated. Thanks.

@Xia-Weiwen Xia-Weiwen requested a review from CaoE August 28, 2025 07:46
@dataclasses.dataclass
class VecAMX(VecAVX512):
_arch_flags = VecAVX512._arch_flags + " -mamx-tile -mamx-bf16 -mamx-int8"
_arch_flags = VecAVX512().build_arch_flags() + " -mamx-tile -mamx-bf16 -mamx-int8"
Collaborator

Could you please double check whether VecAVX512().build_arch_flags() will do self.check_build(VecAMX._avx512_bf16_code) ?

Collaborator Author

Unfortunately no. I have updated this part to ensure we get the correct flags. Thanks.
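For context, a check such as check_build(VecAMX._avx512_bf16_code) is expected to compile a small probe program with the candidate arch flags and treat a successful build as evidence of support. A hypothetical probe of that kind might look like the following (illustrative only; the real probe code may differ):

```cpp
// Hypothetical probe translation unit: if this builds with the chosen arch
// flags (e.g. AVX512-BF16 plus the AMX flags), the intrinsics needed by the
// A16W8 template can be assumed to be available.
#include <immintrin.h>

int main() {
  __m512 x = _mm512_set1_ps(1.0f);
  __m256bh y = _mm512_cvtneps_pbh(x);  // requires AVX512-BF16
  (void)y;
  return 0;
}
```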

@Xia-Weiwen Xia-Weiwen requested a review from CaoE August 28, 2025 08:52
@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review August 29, 2025 01:32
@CaoE CaoE added the ciflow/trunk label Aug 29, 2025
@Xia-Weiwen Xia-Weiwen requested a review from jansel August 29, 2025 05:24
@Xia-Weiwen
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025 (Pull Request resolved: pytorch#161148; approved by https://github.com/CaoE and https://github.com/jansel)

mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
@zou3519
Contributor

zou3519 commented Oct 8, 2025

@Xia-Weiwen @CaoE @jansel Does this PR improve CPU-only llama3 performance, or does it also affect llama3 running on CUDA? We're seeing something weird where this PR appears to affect llama4 performance on CUDA (maybe there are some CPU pieces in there, I'm not sure).

atalman added a commit to atalman/pytorch that referenced this pull request Oct 8, 2025
atalman added a commit to atalman/pytorch that referenced this pull request Oct 8, 2025
@Xia-Weiwen
Collaborator Author

@zou3519 It should not affect CUDA. It's for CPU only. It has no effect unless you run A16W8 (bf16-int8) GEMMs on CPU with AMX.


Labels

ciflow/inductor, ciflow/trunk, intel, Merged, module: inductor, open source, topic: not user facing
