Basic fp8 support in Inductor #109168
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/109168.
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 76b4cf8 with merge base 6b7b9c7.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Would you mind separating out the formatting PRs? It makes this more difficult to review. If you need help configuring VSCode or something else, let me know.
torch/_inductor/codegen/triton.py (Outdated)

```python
def _get_min_elements_per_thread(
    src_dtype: torch.dtype, dst_dtype: torch.dtype
) -> int:
    # fp8 data type conversions has min_elements_per_thread requirements.
```
Is there an explanation for why this is the case? 🤔
Check the intrinsics in this file (also documented in the comment below): https://github.com/openai/triton/blob/10f59d8ce04052521c1bc0cb3a3f8b98918fc7e3/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp#L10. It uses b32 and e4m3x2.
Err, I saw the link; I'm just trying to get a better understanding of why dtype conversions would have min_elements_per_thread requirements.
I think it's because PTX cvt only provides fp8x2 intrinsics, which are what Triton uses: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cvt.
However, I also don't understand why the Triton implementation for fp8_e5m2 requires 4 elements per thread. I asked in the Slack channel but didn't get an answer ^^.
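To make the requirement concrete, here is a minimal sketch of how a per-conversion minimum could be derived. The exact values and dtype coverage are assumptions based on the discussion above, not the PR's actual logic:

```python
import torch

FP8_DTYPES = {torch.float8_e4m3fn, torch.float8_e5m2}

def sketch_min_elements_per_thread(src_dtype: torch.dtype,
                                    dst_dtype: torch.dtype) -> int:
    # PTX cvt only exposes packed fp8 conversions (e.g. e4m3x2), so fp8
    # conversions need each thread to process at least a pair of elements;
    # Triton's e5m2 path is written against 4 elements per thread.
    if src_dtype not in FP8_DTYPES and dst_dtype not in FP8_DTYPES:
        return 0  # non-fp8 conversions have no special requirement
    if torch.float8_e5m2 in (src_dtype, dst_dtype):
        return 4
    return 2

print(sketch_min_elements_per_thread(torch.float32, torch.float8_e5m2))  # 4
```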
torch/_inductor/codegen/triton.py (Outdated)

```diff
 @staticmethod
-def to_dtype(x, dtype: torch.dtype):
+def to_dtype(x, dtype: torch.dtype, src_dtype: torch.dtype = None):
```
Can we not get src_dtype at this stage? Why do we need to plumb src_dtype through the entire lowering?
This is because only fp8 conversions have a special requirement on min_elements_per_thread. Any suggestions on a better way to implement this logic?
Yeah, I was just a bit surprised that we don't have this information elsewhere during this lowering. I was wondering whether we could just get the dtype directly from x, but I think, in general, we don't actually keep dtype information around at this stage of the lowering.
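For illustration, a minimal sketch of what the plumbed signature could look like, assuming the handler only needs src_dtype for the fp8 case; `record_min_elements_per_thread` is a hypothetical hook, not an existing Inductor function:

```python
import torch

FP8_DTYPES = {torch.float8_e4m3fn, torch.float8_e5m2}

def record_min_elements_per_thread(src_dtype, dst_dtype):
    # Hypothetical hook: in real codegen this would raise the requirement
    # on the Triton kernel currently being generated.
    print(f"fp8 conversion {src_dtype} -> {dst_dtype}: bump min elems/thread")

def to_dtype(x: str, dtype: torch.dtype, src_dtype: torch.dtype = None) -> str:
    # x stands in for the Triton source string built by the op handler; the
    # extra src_dtype defaults to None and is only consulted for fp8.
    if src_dtype in FP8_DTYPES or dtype in FP8_DTYPES:
        record_min_elements_per_thread(src_dtype, dtype)
    return f"{x}.to({dtype})"  # illustrative; real codegen emits a Triton cast
```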
Yeah, will do. I was testing on another H100 host, so maybe I did something wrong in the linter configuration...
I think this looks good modulo linting, but I think Horace + Elias would be better reviewers for this.
Do we need to update the reduction heuristics as well? What about Triton templates?
torch/_inductor/triton_heuristics.py (Outdated)

```python
triton_config(size_hints, bs, 1),
triton_config(size_hints, 1, bs),
triton_config(
    size_hints, 32, 32, min_elements_per_thread=min_elements_per_thread
```
nit: doesn't really matter, but maybe slightly less verbose as min_elem_per_thread
```python
if len(size_hints) == 1:
    if disable_pointwise_autotuning() and not (
        config.max_autotune or config.max_autotune_pointwise
    ):
```
For fewer changes, you could consider `triton_config = functools.partial(triton_config, min_elements_per_thread=min_elements_per_thread)`.
I have to rename `triton_config` to avoid lint errors.
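For reference, a runnable sketch of the functools.partial suggestion with the partial bound to a new name so it does not shadow the module-level helper (the lint issue mentioned above); the name `triton_config_with_settings` and the stub below are illustrative, not the merged code:

```python
import functools

def triton_config(size_hints, x, y, min_elements_per_thread=0):
    # Stand-in for the real helper in torch/_inductor/triton_heuristics.py.
    return {"size_hints": size_hints, "x": x, "y": y,
            "min_elements_per_thread": min_elements_per_thread}

min_elements_per_thread = 2  # e.g. required by an fp8 conversion in the kernel

# Bind the requirement once instead of passing it at every call site.
triton_config_with_settings = functools.partial(
    triton_config, min_elements_per_thread=min_elements_per_thread
)

configs = [
    triton_config_with_settings([1024], 32, 1),
    triton_config_with_settings([1024], 1, 32),
]
```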
torch/_prims_common/__init__.py (Outdated)

```diff
 _integer_dtypes = (torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64)
 _low_precision_dtypes = (torch.float16, torch.bfloat16, torch.complex32)
-_float_dtypes = (torch.float16, torch.bfloat16, torch.float32, torch.float64)
+_float_dtypes = (
```
Can we replace this with calling dtype.is_floating_point?
Good point!
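For reference, a quick check of the suggestion; assuming a PyTorch build that exposes the fp8 dtypes, `dtype.is_floating_point` already covers them, so a hand-maintained `_float_dtypes` tuple is not needed for this check:

```python
import torch

assert torch.float8_e4m3fn.is_floating_point
assert torch.float8_e5m2.is_floating_point
assert torch.float32.is_floating_point
assert not torch.int8.is_floating_point
```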
test/inductor/test_fp8.py (Outdated)

```python
x_shape = (16, 16, 16)

with self.assertRaises(Exception):
```
Would it make sense to be more specific here re. the Exception type (and maybe text)? And below.
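A sketch of what a tighter assertion could look like; the exception type, the message fragment, and `unsupported_fp8_fn` are assumptions for illustration, not what the test actually raises:

```python
# A possible tightening of the bare assertRaises(Exception) above; the
# exception type (RuntimeError) and message fragment ("fp8") are guesses,
# and unsupported_fp8_fn stands in for whatever op the test exercises.
with self.assertRaisesRegex(RuntimeError, "fp8"):
    torch.compile(unsupported_fp8_fn)(x)
```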
Thanks @eellison @aakhundov!
@eellison Changes for the reduction heuristics will be in the next PR. I'm still trying to figure out how to do it correctly for the different reduction types (persistent_reduction vs. normal reduction, real reductions like max vs. fused reduction + pointwise like layer_norm).
Triton template changes will come last. For now we'll just rely on cuBLAS for GEMMs since Triton's H100 perf is not ideal.
I'll merge this PR first to unblock fp8-related testing. Meanwhile, I'm working on adding scalar fp8 conversion support in the Triton repo, and will revisit the Pointwise TritonHeuristics change after that fix.
@pytorchbot merge
Merge failed. Reason: This PR needs a `release notes:` label (or, if it is not user facing, the `topic: not user facing` label). To add a label, you can comment to pytorchbot, for example `@pytorchbot label "topic: not user facing"`. (Details for Dev Infra team: raised by workflow job.)
@pytorchbot label "topic: not user facing"
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Add basic fp8 support in Inductor, including:
- Fix fp8 Triton codegen issues;
- Add a min_elements_per_thread requirement for fp8-related dtype conversions. More details on the Triton implementation can be found at https://github.com/openai/triton/blob/10f59d8ce04052521c1bc0cb3a3f8b98918fc7e3/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp#L10.

Note that the current implementation only works for Pointwise. Will create follow-up PRs for Reduction. A minimal usage sketch follows at the end of this description.
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov
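For context, a minimal example of the kind of pointwise fp8 program this PR is meant to handle; it assumes an fp8-capable GPU (e.g. H100) and a PyTorch build that exposes the fp8 dtypes, and is an illustrative sketch rather than the PR's own test code:

```python
import torch

def fp8_cast(x: torch.Tensor) -> torch.Tensor:
    # Pointwise round-trip through fp8 e4m3; pointwise ops are the only
    # case covered by this PR (reductions come in follow-up PRs).
    return x.to(torch.float8_e4m3fn).to(torch.float32)

compiled = torch.compile(fp8_cast)
x = torch.randn(16, 16, 16, device="cuda")
y = compiled(x)
print(y.shape)
```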