Optimize reduction + amax fusion #111122
torch/_inductor/dependencies.py
# Input node has already been realized. Return its size and reduction_size.
return input_node.get_size(), input_node.get_reduction_size()

# This is one issue: what if there are permutations between the input node and its dependent realized nodes?
@jansel, do you have any suggestions for this?
In addition to permutations there are views which change the ndimension.
Is it ok if this function is approximate? Or are there correctness issues if it is wrong?
Using the reduction sizes from dependent nodes gives a better chance of fusing these nodes.
For example, the current case is:
x1 = layer_norm(x0)
x2 = amax(x1)
x3 = to_fp8(x1)
Inductor generates these nodes:
n0=WelfordReduction()
n1=WelfordReduction()
n2=WelfordReduction()
n3=Pointwise()
n4=Reduction()
n5=Pointwise()
Currently n0, n1, n2, n3, n5 are fused together, and n3, n4 are fused together.
I'd like to make the first-level reduction ranges of n4 the same as those of n0 / n1 / n2, so that n0, n1, n2, n3, the first level of n4, and n5 can be fused together.
So it seems to me that we cannot use approximate values for the n4 reduction sizes here.
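To make the idea above concrete, here is a minimal pure-Python sketch (not Inductor code) of why matching reduction ranges helps. It assumes a 2-D input where layer norm reduces over the last (hidden) dimension; the function names are hypothetical and the affine weight/bias of layer norm is left out. The first-level amax iterates over exactly the same range as the layer norm reduction, so both can live in one loop, and a small second-level reduction combines the per-row partial results.

```python
import math


def layernorm_rows(x, eps=1e-5):
    """Reference: normalize each row (reduction over the hidden dimension)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.append([(v - mean) * inv_std for v in row])
    return out


def amax_single_level(y):
    """amax over every element, written as one big reduction."""
    return max(abs(v) for row in y for v in row)


def layernorm_with_row_amax(x, eps=1e-5):
    """One loop per row: layer norm plus a first-level amax over the same hidden range."""
    normalized, partial_amax = [], []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        inv_std = 1.0 / math.sqrt(var + eps)
        y_row = [(v - mean) * inv_std for v in row]
        normalized.append(y_row)
        # First-level reduction: same range (the hidden dimension) as the layer norm.
        partial_amax.append(max(abs(v) for v in y_row))
    return normalized, partial_amax


def amax_second_level(partial_amax):
    """Second-level reduction over the per-row partial results."""
    return max(partial_amax)


x = [[1.0, 2.0, 3.0, 4.0], [-8.0, 0.5, 0.25, 7.0], [0.0, 0.0, 1.0, -1.0]]
y, partials = layernorm_with_row_amax(x)
fused_amax = amax_second_level(partials)
```

The two-level result matches the single-level amax because max is associative; the point of the sketch is only that the first level shares its loop bounds with the layer norm reduction.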
test/inductor/test_fp8.py
batch_size, sequence_length, hidden_size = shape

def amax_fp8(x: Tensor, scale: Tensor):
    y = torch.max(torch.abs(x))
Should this use `torch.amax` instead of the older `torch.max`? If `max` is not intentional, I think using `amax` to mean "return the values without indices" is clearer.
This PR optimizes cases like layer_norm + fp8 quant (which includes amax and fp8 quant) fusion when amax is split into multiple reduction kernels.

Benchmark:
```
python test/inductor/test_fp8.py -k test_layernorm_fp8_quant_benchmark

Before this PR:
Config: float8_dtype=torch.float8_e5m2, shape=(4, 2048, 4096).
Benchmark results: Inductor: 0.13262102689486555ms, Eager: 0.8211962616822429ms, LN only Inductor: 0.09606276150627614ms.

After this PR:
Config: float8_dtype=torch.float8_e5m2, shape=(4, 2048, 4096).
Benchmark results: Inductor: 0.08281274131274131ms, Eager: 0.8217452830188678ms, LN only Inductor: 0.09586902286902287ms.
```

LN + fp8 quant is even faster than LN itself. The reason could be that LN + fp8 outputs fp8 while LN outputs fp16.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler
from .ir import ComputedBuffer, Loops

if not isinstance(input_node.data.data, Loops):
I think we need some checks to ensure `.data` and `.data.data` exist. There are some cases like views that result in different nesting.
Yeah sure. I added some checks in the callsite, let me also add checks here for safety.
if hasattr(input_node, "get_size") and hasattr(
    input_node, "get_reduction_size"
):
Adding a method would be cleaner than these hasattr checks.
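As a rough illustration of this suggestion, the sketch below replaces the `hasattr` probing with a single method that every node type answers. All class and method names here (`Node`, `ReductionNode`, `ViewNode`, `get_reduction_info`) are hypothetical stand-ins, not the actual Inductor IR classes, and the sizes are illustrative only.

```python
from typing import List, Optional, Tuple


class Node:
    """Hypothetical stand-in for an IR node base class."""

    def get_reduction_info(self) -> Optional[Tuple[List[int], List[int]]]:
        # Default: this node has no reduction, so there is nothing to report.
        return None


class ReductionNode(Node):
    """Hypothetical node that carries a size and a reduction size."""

    def __init__(self, size, reduction_size):
        self._size = list(size)
        self._reduction_size = list(reduction_size)

    def get_reduction_info(self):
        return self._size, self._reduction_size


class ViewNode(Node):
    """Hypothetical node with different nesting; it inherits the default answer."""


# Callers ask one question instead of probing attributes with hasattr.
info_reduction = ReductionNode([8192], [4096]).get_reduction_info()
info_view = ViewNode().get_reduction_info()
```

With this shape, a caller only needs an `is None` check, and node types with unusual nesting (such as views) cannot accidentally pass a partial attribute probe.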
Thanks @jansel !
@pytorchbot merge
Merge failed. Reason: this PR is missing a required label.
@pytorchbot label "topic: not user facing"
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
In #111122, an optimization was introduced for reduction + pointwise + multi-level reduction fusion. In this case, we make the first-level reduction ranges of the multi-level reduction the same as the previous reduction's ranges, so that Inductor has a better chance of fusing the first reduction with the first-level reduction of the multi-level reduction kernel.

There is a corner case where the multi-level reduction kernel has `keepdim=True`. In this case, the ranges of the multi-level reduction kernel are not empty, and the dim info needs to be used to create the inner loader of the first-level reduction kernel. To keep the logic simple, for now we simply disable the optimization when `keepdim=True`.

Differential Revision: [D50544876](https://our.internmc.facebook.com/intern/diff/D50544876)
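The `keepdim=True` corner case can be seen from the output shape alone. The helper below is a hypothetical illustration (not Inductor code) that computes the non-reduced output ranges of a reduction: a full reduction leaves no ranges without `keepdim`, but leaves size-1 dimensions with it.

```python
def output_ranges(shape, reduced_dims, keepdim):
    """Output (non-reduced) ranges of a reduction over `reduced_dims` of `shape`."""
    reduced = {d % len(shape) for d in reduced_dims}
    if keepdim:
        # Reduced dimensions are kept with size 1.
        return [1 if i in reduced else s for i, s in enumerate(shape)]
    # Reduced dimensions are dropped.
    return [s for i, s in enumerate(shape) if i not in reduced]


shape = (4, 2048, 4096)
full_no_keepdim = output_ranges(shape, (0, 1, 2), keepdim=False)
full_keepdim = output_ranges(shape, (0, 1, 2), keepdim=True)
```

Here `full_no_keepdim` is empty while `full_keepdim` is not, which is the situation the description says needs extra dim info for the inner loader.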
@ipiszy This PR caused a significant regression in TIMM dm_nfnet_f0 repro command:
Can you take a look? cc @eellison
In #111122, an optimization was introduced for reduction + pointwise + multi-level reduction fusion. The main idea is to have the first-level reduction of the multi-level reduction reuse the reduction sizes of the first reduction kernel, so that there is a better chance that the first reduction kernel and the first-level reduction of the multi-level reduction kernel can be fused.

However, it introduced a bug for the pattern pointwise + multi-level reduction, where the first-level reduction kernel wrongly reuses the reduction ranges (which are []) from the previous pointwise kernel. This PR fixes this issue.

Test plan: `python timm_models.py --training --amp --performance --only=dm_nfnet_f0 --inductor`
Results before this PR: 0.869x
Results after this PR: 1.232x

Benchmark results: [screenshot from 2023-10-30]

Pull Request resolved: #112297
Approved by: https://github.com/jansel
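The bug described above can be sketched as a missing guard. The function below is a hypothetical illustration only; the name `first_level_reduction_ranges` and the `default_split` parameter are assumptions, and the actual change in #112297 may be structured differently. It reuses the producer's reduction ranges only when the producer actually reduces, and otherwise falls back to whatever split would have been chosen anyway.

```python
def first_level_reduction_ranges(producer_reduction_ranges, default_split):
    """Pick reduction ranges for the first level of a multi-level reduction.

    producer_reduction_ranges: reduction ranges of the upstream kernel
        ([] for a pointwise kernel, which has no reduction).
    default_split: ranges that would be used without the optimization.
    """
    if not producer_reduction_ranges:
        # Pointwise producer: there is nothing meaningful to reuse.
        return list(default_split)
    # Reduction producer: reuse its ranges to improve the chance of fusion.
    return list(producer_reduction_ranges)


from_pointwise = first_level_reduction_ranges([], [256])
from_reduction = first_level_reduction_ranges([4096], [256])
```

Without the emptiness check, the pointwise case would hand `[]` to the first-level reduction, which is the wrong reuse the description refers to.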
FYI this is fixed by #112297.
This PR optimizes cases like layer_norm + fp8 quant (which includes amax and fp8 quant) fusion when amax is split into multiple reduction kernels.
Benchmark:
LN + fp8 quant is even faster than LN itself. The reason could be that LN + fp8 outputs fp8 while LN outputs fp16.
From Inductor nightly benchmark test:
There are perf differences in the cuda_graph / cuda_graph_dynamic / default runs, but no difference in inductor_max_autotune, so it seems to me that the perf differences are most likely fluctuations.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler