ln + fp8 quant benchmark by ipiszy · Pull Request #109765 · pytorch/pytorch · GitHub

Conversation


@ipiszy ipiszy commented Sep 21, 2023

Benchmark results:

| float8_dtype | shape | enable_split_reductions | Inductor (ms) | Eager (ms) |
| --- | --- | --- | --- | --- |
| torch.float8_e4m3fn | (4, 2048, 4096) | False | 12.7677 | 0.8525 |
| torch.float8_e4m3fn | (4, 2048, 4096) | True | 0.1135 | 0.8525 |
| torch.float8_e5m2 | (4, 2048, 4096) | False | 12.8593 | 0.8536 |
| torch.float8_e5m2 | (4, 2048, 4096) | True | 0.1131 | 0.8532 |
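
For reference, here is a minimal sketch of the kind of LayerNorm + FP8 quantization pattern these numbers benchmark. The function body, dtypes, and the `split_reductions` toggle below are assumptions for illustration, not the exact benchmark script added in this PR:

```python
import torch
import torch._inductor.config as inductor_config

def ln_fp8(x, weight, bias, float8_dtype=torch.float8_e4m3fn):
    # LayerNorm over the last dim, then amax-based scaling and a cast to fp8.
    y = torch.nn.functional.layer_norm(x, x.shape[-1:], weight, bias)
    amax = y.abs().amax().to(torch.float32)          # per-tensor reduction
    scale = torch.finfo(float8_dtype).max / torch.clamp(amax, min=1e-12)
    return (y * scale).to(float8_dtype), scale

x = torch.randn(4, 2048, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.ones(4096, device="cuda", dtype=torch.bfloat16)
b = torch.zeros(4096, device="cuda", dtype=torch.bfloat16)

compiled = torch.compile(ln_fp8)
with inductor_config.patch(split_reductions=True):   # or False, per config above
    out_fp8, scale = compiled(x, w, b)
```

The per-tensor amax reduction in this pattern is the reduction that the `split_reductions` setting affects; timing the compiled vs. eager call for each (dtype, split_reductions) combination yields a table like the one above.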

When "split_reductions" is disabled, I expect that there is only one fused kernel. However now there are still two kernels (for LN and amax calculation). This is the generated code: https://gist.github.com/ipiszy/6ab2c86ba211240d606edfab8b14e7bd. Kernel 1: block size: {XBLOCK: 1, RBLOCK: 4096}. Kernel 2: block size: {XBLOCK: 1, RBLOCK: 2048}

I think ideally, there should be 2 kernels: the first kernel does block-level amax calculation, and the second kernel aggregates amax results from the first kernel.
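
For intuition, here is an eager-mode illustration of that two-step amax structure. The block size is arbitrary and this is not Inductor's actual split-reduction codegen:

```python
import torch

x = torch.randn(4, 2048, 4096)

# Single-pass reference: one reduction over the whole tensor.
amax_ref = x.abs().amax()

# Split reduction: "kernel 1" computes one partial max per block,
# "kernel 2" aggregates the partial results.
BLOCK = 2048
partial = x.abs().reshape(-1, BLOCK).amax(dim=1)  # one partial max per block
amax_split = partial.amax()

assert torch.equal(amax_ref, amax_split)  # max is exact, so results match
```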

Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @ngimel

[ghstack-poisoned]

pytorch-bot bot commented Sep 21, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109765

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit b78a428 with merge base 59592ce:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ipiszy added a commit that referenced this pull request Sep 21, 2023
ghstack-source-id: 9d51839
Pull Request resolved: #109765
@ipiszy ipiszy mentioned this pull request Oct 10, 2023
@ipiszy ipiszy closed this Oct 20, 2023
@facebook-github-bot facebook-github-bot deleted the gh/ipiszy@gmail.com/9/head branch November 19, 2023 15:27