ln + fp8 quant benchmark by ipiszy · Pull Request #109765 · pytorch/pytorch · GitHub

Conversation


@ipiszy ipiszy commented Sep 21, 2023

Benchmark results:

| float8_dtype | shape | enable_split_reductions | Inductor (ms) | Eager (ms) |
| --- | --- | --- | --- | --- |
| torch.float8_e4m3fn | (4, 2048, 4096) | False | 12.7677 | 0.8525 |
| torch.float8_e4m3fn | (4, 2048, 4096) | True | 0.1135 | 0.8525 |
| torch.float8_e5m2 | (4, 2048, 4096) | False | 12.8593 | 0.8536 |
| torch.float8_e5m2 | (4, 2048, 4096) | True | 0.1131 | 0.8532 |
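
For reference, here is a minimal sketch of the kind of LayerNorm + FP8 quantization pattern these numbers benchmark. The function body, dtypes, and the `split_reductions` toggle below are assumptions for illustration, not the exact benchmark script added in this PR:

```python
import torch
import torch._inductor.config as inductor_config

def ln_fp8(x, weight, bias, float8_dtype=torch.float8_e4m3fn):
    # LayerNorm over the last dim, then amax-based scaling and a cast to fp8.
    y = torch.nn.functional.layer_norm(x, x.shape[-1:], weight, bias)
    amax = y.abs().amax().to(torch.float32)          # per-tensor reduction
    scale = torch.finfo(float8_dtype).max / torch.clamp(amax, min=1e-12)
    return (y * scale).to(float8_dtype), scale

x = torch.randn(4, 2048, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.ones(4096, device="cuda", dtype=torch.bfloat16)
b = torch.zeros(4096, device="cuda", dtype=torch.bfloat16)

compiled = torch.compile(ln_fp8)
with inductor_config.patch(split_reductions=True):   # or False, per config above
    out_fp8, scale = compiled(x, w, b)
```

The per-tensor amax reduction in this pattern is the reduction that the `split_reductions` setting affects; timing the compiled vs. eager call for each (dtype, split_reductions) combination yields a table like the one above.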

When "split_reductions" is disabled, I expect that there is only one fused kernel. However now there are still two kernels (for LN and amax calculation). This is the generated code: https://gist.github.com/ipiszy/6ab2c86ba211240d606edfab8b14e7bd. Kernel 1: block size: {XBLOCK: 1, RBLOCK: 4096}. Kernel 2: block size: {XBLOCK: 1, RBLOCK: 2048}

I think ideally, there should be 2 kernels: the first kernel does block-level amax calculation, and the second kernel aggregates amax results from the first kernel.
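
For intuition, here is an eager-mode illustration of that two-step amax structure. The block size is arbitrary and this is not Inductor's actual split-reduction codegen:

```python
import torch

x = torch.randn(4, 2048, 4096)

# Single-pass reference: one reduction over the whole tensor.
amax_ref = x.abs().amax()

# Split reduction: "kernel 1" computes one partial max per block,
# "kernel 2" aggregates the partial results.
BLOCK = 2048
partial = x.abs().reshape(-1, BLOCK).amax(dim=1)  # one partial max per block
amax_split = partial.amax()

assert torch.equal(amax_ref, amax_split)  # max is exact, so results match
```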

Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @ngimel

[ghstack-poisoned]

pytorch-bot bot commented Sep 21, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109765

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit b78a428 with merge base 59592ce:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ipiszy added a commit that referenced this pull request Sep 21, 2023
ghstack-source-id: 9d51839
Pull Request resolved: #109765
@ipiszy ipiszy mentioned this pull request Oct 10, 2023
@ipiszy ipiszy closed this Oct 20, 2023
@facebook-github-bot facebook-github-bot deleted the gh/ipiszy@gmail.com/9/head branch November 19, 2023 15:27