Simplify & rectify dequantized B buffer loading for AMX GEMM micro-kernel for WoQ int8 case by sanchitintel · Pull Request #140258 · pytorch/pytorch · GitHub

Conversation

@sanchitintel
Collaborator

@sanchitintel sanchitintel commented Nov 11, 2024

As suggested by @leslie-fang-intel in leslie-fang-intel@4c83e4e#diff-139642bd981df977f70f4c18c1c34bd1a85c1d6b9ffa06aaa98426ed83942a31R537: all elements of a B tile (a tile at the granularity of the micro-kernel, not an AMX tile) are contiguous because the B matrix is pre-packed, so the dequantized-buffer loading logic can be simplified. While the previous approach kept the elements to be loaded into a B AMX tile contiguous, the new approach incurs no performance penalty either: that data is already in L1D, so loading AMX tiles from non-contiguous dequantized B elements does not hurt performance.

Also rectified the size of the dequantized B buffer.

Fixes #140208.

A subsequent PR will factor out caching of dequantized int8 weights into a separate codegen function

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

@pytorch-bot

pytorch-bot bot commented Nov 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140258

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (1 Unrelated Failure)

As of commit 709adfe with merge base fa63276:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@sanchitintel sanchitintel added the topic: bug fixes topic category label Nov 11, 2024
@sanchitintel sanchitintel changed the title Simplify B tile loading logic for AMX GEMM micro-kernel for WoQ int8 case Simplify & rectify B tile loading logic for AMX GEMM micro-kernel for WoQ int8 case Nov 11, 2024
@sanchitintel sanchitintel added the topic: not user facing topic category label Nov 11, 2024
@sanchitintel sanchitintel changed the title Simplify & rectify B tile loading logic for AMX GEMM micro-kernel for WoQ int8 case Simplify & rectify dequantized B buffer loading for AMX GEMM micro-kernel for WoQ int8 case Nov 11, 2024
@leslie-fang-intel
Collaborator

BTW: I think horizontal traverse doesn't work well with this cache optimization cc @jgong5 @chunyuan-w

@sanchitintel
Collaborator Author

sanchitintel commented Nov 13, 2024

BTW: I think horizontal traverse doesn't work well with this cache optimization cc @jgong5 @chunyuan-w

Hi, would the horizontal traverse strategy complement the existing AMX GEMM micro-kernel template (by conditionally using it), or would it replace it? Thanks!

@leslie-fang-intel
Collaborator

Hi, would the horizontal traverse strategy complement the existing AMX GEMM micro-kernel template (by conditionally using it), or would it replace it? Thanks!

I think we will use it conditionally.

@ezyang ezyang added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Nov 14, 2024
@sanchitintel sanchitintel requested a review from jgong5 November 18, 2024 09:15
Collaborator

@jgong5 jgong5 left a comment


As we discussed offline, please do not assume the B is contiguous.

@sanchitintel sanchitintel requested a review from jgong5 November 19, 2024 21:54
@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

As suggested by @leslie-fang-intel in https://github.com/leslie-fang-intel/pytorch/commit/4c83e4e75138e8fa6e0d58438f75b7718dc8a0cc#diff-139642bd981df977f70f4c18c1c34bd1a85c1d6b9ffa06aaa98426ed83942a31R537
This case cannot be covered by the current UTs, since it hasn't been implemented
This case can't be tested, though, as the N != block_n case has not been implemented.
Don't assume weight-packing at GEMM template level
Its value would also be known at runtime, so it wouldn't affect performance
@pytorchmergebot
Collaborator

Successfully rebased sanchitj/simplify_amx_tile_load onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout sanchitj/simplify_amx_tile_load && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the sanchitj/simplify_amx_tile_load branch from 3dffe41 to 709adfe Compare November 21, 2024 21:26
@sanchitintel
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 21, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
…rnel for WoQ int8 case (pytorch#140258)

Pull Request resolved: pytorch#140258
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
@github-actions github-actions bot deleted the sanchitj/simplify_amx_tile_load branch December 22, 2024 02:11

Labels

ciflow/inductor
ciflow/trunk (trigger trunk jobs on your pull request)
Merged
module: inductor
open source
topic: bug fixes (topic category)
topic: not user facing (topic category)
triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Inductor][CPP] CPP GEMM Template WOQ int8 correctness failure

6 participants