[cpu] add sdpa choice and UT #105131
Conversation
🔗 See artifacts and rendered test results at hud.pytorch.org/pr/105131. ✅ No failures as of commit 7be93a6 with merge base 600f9ef.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
`scaled_dot_product_attention` used to be decomposed in pre-autograd: it calls `_scaled_dot_product_attention_math`, and `_scaled_dot_product_attention_math` only has a `CompositeImplicitAutograd` kernel, so it was decomposed into finer-grained ops. However, recent PRs (#103826, #105131) added new logic to `scaled_dot_product_attention`, which now calls `_scaled_dot_product_flash_attention`, an op that has a CPU kernel. As a result, `_scaled_dot_product_flash_attention` shows up in `torch.export()`. This PR adds a decomposition that ensures `scaled_dot_product_attention` is still decomposed the same way as before, i.e., through `_scaled_dot_product_attention_math`. Note that this decomposition rule should be excluded by Inductor.

Differential Revision: [D48762000](https://our.internmc.facebook.com/intern/diff/D48762000/)

Pull Request resolved: #108180. Approved by: https://github.com/SherlockNoMad
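For context, a small sketch of how the change in behavior can be observed; the module, shapes, and the use of `torch.export.export` below are illustrative assumptions, not code from #108180:

```python
# Illustrative sketch only: inspect which SDPA op an exported graph contains.
import torch
import torch.nn.functional as F

class Attn(torch.nn.Module):
    def forward(self, q, k, v):
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)

q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))  # 4-D CPU inputs
ep = torch.export.export(Attn(), (q, k, v))

# Depending on the PyTorch version and which decompositions are applied, the
# graph contains either aten._scaled_dot_product_flash_attention (the fused
# CPU path added by #105131) or the finer-grained ops produced by decomposing
# through _scaled_dot_product_attention_math.
print(ep.graph)
```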
Stack from ghstack (oldest at bottom):
Feature RFC: pytorch/rfcs#56.
Add an SDPA selection function for CPU that automatically chooses one SDPA implementation among the available ones. Two CPU implementations can be chosen: the unfused SDPA and flash attention. In general, flash attention has a higher priority than the unfused SDPA. When flash attention is not applicable, for example because it has been manually disabled or the inputs are not 4-dimensional, the unfused SDPA is chosen instead.
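As an illustration (not code from this PR), here is a minimal sketch of exercising the selection from the Python side. The shapes are arbitrary, and the use of `torch.backends.cuda.sdp_kernel` to toggle the global flash-attention flag is an assumption about the backend-control API available in this PyTorch version:

```python
# Illustrative sketch only: shapes and the backend-toggling context manager
# are assumptions, not part of this PR.
import torch
import torch.nn.functional as F

# 4-D CPU inputs: (batch, num_heads, seq_len, head_dim)
q = torch.randn(1, 25, 1024, 64)
k = torch.randn(1, 25, 1024, 64)
v = torch.randn(1, 25, 1024, 64)

# With 4-D inputs and flash attention enabled, the CPU selector can pick
# the fused flash-attention kernel.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Disabling flash attention should make the selector fall back to the
# unfused (math) implementation.
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False):
    out_unfused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(out, out_unfused, atol=1e-5))  # results should match numerically
```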
Performance of the stack
NanoGPT's SDPA kernel
Using the benchmark [repo](https://github.com/mingfeima/bench_sdpa/blob/main/README.md), run with one socket.
Shape: Batch size 1, Sequence length 1024, Head number 25, Head size 64.
Machine: SPR.

| Dtype | Causal | Mode | SDPA | Time (ms per iter) | Speedup |
| -------- | -------- | ------- | ------- | ------- | ------- |
| float32 | FALSE | Inference | Unfused | 3.081 | |
| | | | Flash attention | 1.665 | **1.85045** |
| float32 | TRUE | Inference | Unfused | 3.463 | |
| | | | Flash attention | 1.662 | **2.083634** |
| bfloat16 | FALSE | Inference | Unfused | 1.203 | |
| | | | Flash attention | 1.154 | **1.042461** |
| bfloat16 | TRUE | Inference | Unfused | 1.543 | |
| | | | Flash attention | 1.154 | **1.337088** |
| float32 | FALSE | Training | Unfused | 54.938 | |
| | | | Flash attention | 23.029 | **2.385601** |
| float32 | TRUE | Training | Unfused | 58.266 | |
| | | | Flash attention | 17.835 | **3.266947** |
| bfloat16 | FALSE | Training | Unfused | 18.924 | |
| | | | Flash attention | 18.886 | **1.002012** |
| bfloat16 | TRUE | Training | Unfused | 21.08 | |
| | | | Flash attention | 14.172 | **1.48744** |
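For reference, a rough timing sketch for this shape (an illustrative stand-in, not the harness from the benchmark repo; warm-up and iteration counts are assumptions, and single-socket pinning is left to the environment, e.g. `numactl`):

```python
# Rough per-iteration timing for the NanoGPT SDPA shape above.
# The numbers in the table come from the linked benchmark repo, not this snippet.
import time
import torch
import torch.nn.functional as F

def bench(dtype=torch.float32, is_causal=False, iters=50, warmup=5):
    q, k, v = (torch.randn(1, 25, 1024, 64, dtype=dtype) for _ in range(3))
    for _ in range(warmup):
        F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
    start = time.perf_counter()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
    return (time.perf_counter() - start) / iters * 1e3  # ms per iteration

print(f"float32, causal=False: {bench():.3f} ms/iter")
print(f"bfloat16, causal=True: {bench(dtype=torch.bfloat16, is_causal=True):.3f} ms/iter")
```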
Stable Diffusion
Following the model's [BKM](https://github.com/intel-innersource/frameworks.ai.models.intel-models/blob/develop/quickstart/diffusion/pytorch/stable_diffusion/inference/cpu/README.md).
Mode: Inference; Machine: SPR.

| Dtype | SDPA | Throughput (fps) | Speedup SDPA | Total Time (ms) | Speedup |
| -------- | -------- | ------- | ------- | ------- | ------- |
| float32 | Unfused | 1.63 | | 1139 | |
| | Flash attention | 1.983 | 1.216564 | 547.488 | **2.080411** |
| bfloat16 | Flash attention in IPEX | 4.784 | | 429.051 | |
| | Flash attention | 4.857 | 1.015259 | 408.823 | **1.049479** |
LLM models of Torchbench
Dtype: float32; Mode: Inference, single socket; Machine: CPX.

| Model name | SDPA | Inductor_new | Inductor_old | Inductor Ratio (old/new) |
| -- | -- | -- | -- | -- |
| hf_Albert | Unfused -> Flash attention | 0.048629309 | 0.05591545 | **1.14983024** |
| hf_Bert | Unfused -> Flash attention | 0.053156243 | 0.060732115 | **1.142520841** |
| hf_Bert_large | Unfused -> Flash attention | 0.141089502 | 0.155190077 | **1.099940636** |
| llama | Unfused -> Flash attention | 0.033250106 | 0.033720745 | **1.01415451** |
Dtype: bfloat16; Mode: Inference, single socket; Machine: SPR.

| Model name | SDPA | Inductor_new | Inductor_old | Inductor Ratio (old/new) |
| -- | -- | -- | -- | -- |
| hf_Albert | Unfused -> Flash attention | 0.020681298 | 0.020718282 | **1.001788324** |
| hf_Bert | Unfused -> Flash attention | 0.019932816 | 0.019935424 | **1.000130842** |
| hf_Bert_large | Unfused -> Flash attention | 0.047949174 | 0.048312502 | **1.007577355** |
| llama | Unfused -> Flash attention | 0.018528057 | 0.01861126 | **1.0044907** |
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov