Add fp16 support for gemm on CPU by CaoE · Pull Request #99498 · pytorch/pytorch

Conversation

@CaoE (Collaborator) commented Apr 19, 2023

Stack from ghstack (oldest at bottom):

### Testing

Native matmul vs. mkldnn matmul on SPR (Sapphire Rapids, with avx512_fp16 support)

single core:

Input | Naïve impl / ms | oneDNN / ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 2010.387 | 64.700 | 31.072
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 4027.116 | 107.780 | 37.364
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 28685868.488 | 90663.008 | 316.401

56 cores:

Input | Naïve impl / ms | oneDNN / ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 5.091 | 0.24 | 211.30
M: 128, N: 128, K: 128, trans_a: False, trans_b: True | 5.224 | 0.23 | 220.09
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 10.006 | 0.30 | 330.31
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 29435.372 | 1.770 | 1662.80
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True | 31464.961 | 1.728 | 18204.76
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 115035.849 | 7.990 | 14396.90
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 122981.023 | 7.725 | 15918.34
Batch: 768, M: 128, N: 64, K: 128 | 2032.523 | 0.705 | 2882.23
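A minimal sketch of this kind of comparison (not the script used for the tables above; it assumes `torch.backends.mkldnn.enabled` gates the oneDNN path, consistent with the `userEnabledMkldnn()` check discussed later in this thread, and uses only the two small single-core shapes to keep runtime sane):

```python
import torch
from torch.utils import benchmark

def time_fp16_matmul(M, N, K, use_mkldnn=True):
    # Toggle the oneDNN path; the fp16 gemm dispatch consults
    # at::globalContext().userEnabledMkldnn() before taking it.
    torch.backends.mkldnn.enabled = use_mkldnn
    a = torch.randn(M, K, dtype=torch.half)
    b = torch.randn(K, N, dtype=torch.half)
    t = benchmark.Timer(stmt="torch.matmul(a, b)",
                        globals={"torch": torch, "a": a, "b": b})
    return t.blocked_autorange().median * 1e3  # milliseconds

torch.set_num_threads(1)  # mimic the single-core runs
for M, N, K in [(128, 128, 128), (128, 256, 128)]:
    native = time_fp16_matmul(M, N, K, use_mkldnn=False)
    onednn = time_fp16_matmul(M, N, K, use_mkldnn=True)
    print(f"M={M} N={N} K={K}: native {native:.3f} ms, "
          f"oneDNN {onednn:.3f} ms, speedup {native / onednn:.1f}x")
```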

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @ngimel @desertfire

@pytorch-bot bot commented Apr 19, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99498

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 3ae0be7 with merge base 5dcee01:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: linalg_frontend release notes category label Apr 19, 2023
@github-actions github-actions bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Apr 19, 2023
CaoE added a commit that referenced this pull request Apr 19, 2023
ghstack-source-id: a911370
Pull Request resolved: #99498
@CaoE CaoE marked this pull request as draft April 19, 2023 01:36
@mingfeima (Collaborator) commented:
So once the failures are fixed, we shall provide some basic benchmark numbers.

Not complete yet. WIP.

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

CaoE added a commit that referenced this pull request Apr 24, 2023
ghstack-source-id: 8df672a
Pull Request resolved: #99498
CaoE added a commit that referenced this pull request May 4, 2023
ghstack-source-id: af13ff2
Pull Request resolved: #99498
CaoE added a commit that referenced this pull request May 4, 2023
ghstack-source-id: 763ac89
Pull Request resolved: #99498
CaoE added a commit that referenced this pull request May 5, 2023
ghstack-source-id: eed6a6c
Pull Request resolved: #99498
CaoE added a commit that referenced this pull request May 6, 2023
ghstack-source-id: bca5db3
Pull Request resolved: #99498
CaoE added a commit that referenced this pull request Sep 24, 2023
ghstack-source-id: 63c9606
Pull Request resolved: #99498


CaoE added a commit that referenced this pull request Sep 25, 2023
ghstack-source-id: 2b527d1
Pull Request resolved: #99498
@CaoE CaoE requested a review from cpuhrsch September 25, 2023 07:05
@CaoE (Collaborator, Author) commented Sep 25, 2023

@cpuhrsch Could you please review this PR? Thank you.

"mkldnn_linear: bf16 path needs the cpu support avx_ne_convert or avx512bw, avx512vl and avx512dq");
} else if (self.scalar_type() == ScalarType::Half) {
TORCH_CHECK(mkldnn_fp16_device_check(),
"mkldnn_linear: fp16 path needs the cpu support avx_ne_convert or avx512_fp16");
Contributor commented:

Hmm, is this a correct statement? ARM CPUs support half-precision operations. Doesn't mkldnn support those?

@CaoE (Collaborator, Author) replied Sep 26, 2023:

Currently we are focused on x64; I'm not sure how complete the support is on ARM. It may be done in later PRs.
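For anyone checking whether their machine advertises the ISAs named in these error messages, a small probe is sketched below. Assumptions: Linux, and the /proc/cpuinfo flag spellings "avx512_fp16" and "avx_ne_convert" (older kernels may not report them); this is a heuristic, not the check PyTorch itself performs.

```python
import torch

def cpu_flags():
    # Parse the first "flags" line from /proc/cpuinfo (Linux only).
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("avx512_fp16    :", "avx512_fp16" in flags)
print("avx_ne_convert :", "avx_ne_convert" in flags)
print("mkldnn available:", torch.backends.mkldnn.is_available())
```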

Comment on lines 82 to 84
```cpp
return (
    at::globalContext().userEnabledMkldnn() &&
    mkldnn_fp16_device_check());
```
Contributor commented:

Please remove extraneous brackets (not sure about the rest of the formatting).

Suggested change:

```diff
-return (
-    at::globalContext().userEnabledMkldnn() &&
-    mkldnn_fp16_device_check());
+return at::globalContext().userEnabledMkldnn() &&
+    mkldnn_fp16_device_check();
```

@CaoE (Collaborator, Author) replied:

Revised as suggested.

```cpp
  TORCH_CHECK(mkldnn_bf16_device_check(),
      "mkldnn_matmul: mkldnn_matmul bf16 path needs the cpu support avx_ne_convert or avx512bw, avx512vl and avx512dq, or AWS Graviton3");
} else {
  TORCH_CHECK(mkldnn_fp16_device_check(),
```
Contributor commented:

Would be nice to check that it's called only for those two dtypes...

Suggested change:

```diff
-  TORCH_CHECK(mkldnn_fp16_device_check(),
+  TORCH_DEBUG_ASSERT(mat1.scalar_type() == at::kHalf);
+  TORCH_CHECK(mkldnn_fp16_device_check(),
```

@CaoE (Collaborator, Author) replied:

Revised as suggested.
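A quick numerical sanity check for the fp16 gemm path, as a sketch (not part of the PR; the shape is one from the benchmark tables, and the tolerances are a rough guess for fp16 accumulation at K = 128, so they may need loosening):

```python
import torch

# Compare fp16 CPU matmul against an fp32 reference.
a = torch.randn(128, 128)
b = torch.randn(128, 128)
ref = a @ b                            # fp32 reference
out = (a.half() @ b.half()).float()    # fp16 gemm, upcast for comparison
print("max abs diff:", (out - ref).abs().max().item())
torch.testing.assert_close(out, ref, atol=1e-1, rtol=1e-2)
```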

```diff
 if (self.scalar_type() == ScalarType::BFloat16) {
   TORCH_CHECK(mkldnn_bf16_device_check(),
-      "mkldnn_linear: bf16 path needs the cpu support avx512bw, avx512vl and avx512dq");
+      "mkldnn_linear: bf16 path needs the cpu support avx_ne_convert or avx512bw, avx512vl and avx512dq");
```
Contributor commented:

Looks like mkldnn_matmul is supported on ARM devices. Are you sure about linear being the exception?

@CaoE (Collaborator, Author) replied:

I split the bf16 check into separate checks for x64 and ARM. I kept the previous message (adding the new ISA avx_ne_convert), which also does not mention ARM. Do you think we should include ARM info in all such messages (including matmul, conv and deconv)?
https://github.com/pytorch/pytorch/pull/99498/files#diff-dd15cec62ddb24d690b58e1902d8822347003437c9c4c4ae51f9c02e281fa33eR88

CaoE added a commit to CaoE/pytorch that referenced this pull request Sep 26, 2023
ghstack-source-id: 2b527d1
Pull Request resolved: pytorch#99498


CaoE added a commit that referenced this pull request Sep 26, 2023
ghstack-source-id: dd8ff80
Pull Request resolved: #99498
@CaoE (Collaborator, Author) commented Sep 28, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@facebook-github-bot facebook-github-bot deleted the gh/CaoE/19/head branch October 1, 2023 14:23
@lezcano lezcano changed the title add fp16 support for gemm Add fp16 support for gemm on CPU Jan 14, 2024

Labels

ciflow/inductor, ciflow/mps, ciflow/periodic, ciflow/slow, ciflow/trunk, Merged, module: cpu, module: half, module: inductor, open source, release notes: linalg_frontend
