[Quant][PT2E][X86] annotate and convert for linear_dynamic_fp16 #141480
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141480
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 1c827ef with merge base 2398e75.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
torch/ao/quantization/fx/convert.py (outdated)

    graph.erase_node(node)
    elif dtype == torch.float16:
        raise NotImplementedError("decomposed to float16 op not implemented yet")
    quantize_op = torch.ops.quantized_decomposed.quantize_per_tensor.default
It seems that for dtype == torch.float16, torch.ops.quantized_decomposed.quantize_per_tensor.default has the same semantics as to(dtype=torch.float16). Then why not just use to(dtype=torch.float16)?
Because to will be constant-folded and the pattern will be hard to match.
I also feel using to might be better; that way we won't have multiple ops doing the same thing. Wondering what is needed to use to here.
Unfortunately, to will be folded by Inductor. Here is the implementation with to:
https://github.com/pytorch/pytorch/blob/76ad4bb890b66098672e1f0349e83e28f2d6d85e/torch/ao/quantization/fx/convert.py#L345C1-L357C1
The pattern I got in Inductor is x/w -> linear; no to is seen.
If we insert quant/dequant instead, we do see the dequant op in Inductor, with the pattern x / (w -> dequant) -> linear.
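For illustration only, a minimal standalone sketch of the two weight paths being discussed here (not code from this PR): semantically the fp16 weight path is just two casts, which is exactly why Inductor constant-folds a plain `to` into the weight, whereas spelling the cast as an explicit op keeps a matchable node in the graph.

```python
import torch

w = torch.randn(8, 16)

# Written with `to`, Inductor constant-folds both casts into the weight,
# so the pattern matcher only ever sees  x / w -> linear.
w_cast = w.to(torch.float16).to(torch.float32)

# Written as explicit ops (prims.convert_element_type here; this stack adds a
# dedicated quantized_decomposed.convert_element_type.no_fuse for the same
# purpose), a dequant-like node stays visible:  x / (w -> dequant) -> linear.
w_explicit = torch.ops.prims.convert_element_type.default(
    torch.ops.prims.convert_element_type.default(w, torch.float16),
    torch.float32,
)

# Both paths produce numerically identical results.
assert torch.equal(w_cast, w_explicit)
```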
I'm wondering if we can create a variant of torch.ops.prims.convert_element_type.default, let's say torch.ops.prims.convert_element_type.no_fuse, and add the op to this list:

pytorch/torch/_inductor/constant_folding.py, line 132 in fd553b9:

    torch.ops.quantized_decomposed.dequantize_per_channel.default,
@jerryzh168 Thanks for the suggestion. Do you know how to add a new variant?
Should be the same as adding a new custom op:

    @impl(quantized_decomposed_lib, "quantize_per_tensor", "CompositeExplicitAutograd")
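For reference, a minimal sketch of the custom-op registration mechanism being pointed to here. It uses a hypothetical demo namespace (`my_quant_demo`) so it does not collide with the real `quantized_decomposed` library; the PR itself adds the op to the existing library defined in torch/ao/quantization/fx/_decomposed.py.

```python
import torch
from torch.library import Library, impl

# Hypothetical namespace purely for illustration.
demo_lib = Library("my_quant_demo", "DEF")

# Same schema as the op added in this PR.
demo_lib.define(
    "convert_element_type.no_fuse(Tensor input, ScalarType dtype) -> Tensor"
)


@impl(demo_lib, "convert_element_type.no_fuse", "CompositeExplicitAutograd")
def convert_element_type_no_fuse(input: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # Same semantics as `input.to(dtype)`, but exposed as a distinct op so the
    # Inductor constant folder can be told to leave it in the graph.
    return input.to(dtype)


# Usage: the op is now reachable through torch.ops.
x = torch.randn(4)
y = torch.ops.my_quant_demo.convert_element_type.no_fuse(x, torch.float16)
```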
I see. So we are actually adding a new op in the quantized_decomposed namespace?
Yeah, you can do that. I think ideally it would go in torch.ops.prims, but I'm not sure about the process there; I can check.
Sure. I have updated the PR. Please take a look. Thanks.
    warnings.warn(
        "Mixed dynamic and static quantization config is not supported."
    )
    need_skip = True
Why do we need to remove this code?
Because we always set is_dynamic=False for linear_dynamic_fp16, but it needs to work with dynamic quantization of other ops. Besides, I didn't see an issue when users mix static and dynamic quantization for different ops.
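For context, a hedged sketch of what a weight spec for linear_dynamic_fp16 could look like under this reasoning (not the PR's actual annotator code; the field choices are assumptions): an fp16 dtype observed by a placeholder, with is_dynamic left at False.

```python
import torch
from torch.ao.quantization.observer import PlaceholderObserver
from torch.ao.quantization.quantizer import QuantizationSpec

# Assumed spec for the fp16 weight of linear_dynamic_fp16: no scale/zero point,
# just a dtype cast observed by a placeholder observer, and is_dynamic=False
# even when other ops in the model use dynamic activation quantization.
weight_fp16_spec = QuantizationSpec(
    dtype=torch.float16,
    observer_or_fake_quant_ctr=PlaceholderObserver.with_args(dtype=torch.float16),
    is_dynamic=False,
)
print(weight_fp16_spec)
```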
Then I feel this flag:

    dynamic_state: Optional[bool]
Thanks. I have reverted this change because this is not an issue for now.
I have the same comment as @leslie-fang-intel. It seems simpler to just insert to(dtype=torch.half) instead of "quantize" in the graph.
Unfortunately, to will be folded by Inductor, as explained in the reply above.
    quantized_decomposed_lib.define(
        "convert_element_type.no_fuse(Tensor input, ScalarType dtype) -> Tensor"
Do you need to add this op to this list as well?

pytorch/torch/_inductor/constant_folding.py, line 132 in fd553b9:

    torch.ops.quantized_decomposed.dequantize_per_channel.default,
I have added this in the next PR. Thanks.
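As a rough sketch of what that follow-up change amounts to (the surrounding structure of torch/_inductor/constant_folding.py is assumed here, so names may differ from the real file): the constant folder keeps a set of dequant-like ops it treats as impure, and the new op needs to join that set so the fp16 weight cast is not folded away.

```python
import torch
import torch.ao.quantization.fx._decomposed  # noqa: F401  # registers quantized_decomposed ops

# Illustrative block-list; the authoritative one lives in
# torch/_inductor/constant_folding.py.
DONT_FOLD = {
    torch.ops.quantized_decomposed.dequantize_per_channel.default,
    torch.ops.quantized_decomposed.dequantize_per_tensor.default,
    # The follow-up PR would append the new op here:
    # torch.ops.quantized_decomposed.convert_element_type.no_fuse,
}


def is_impure(node: torch.fx.Node) -> bool:
    # Targets in DONT_FOLD are treated as impure, so constant folding skips
    # them and the weight cast stays visible to Inductor's pattern matcher.
    return node.target in DONT_FOLD
```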
LG, thanks!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
[Quant][PT2E][X86] annotate and convert for linear_dynamic_fp16 (pytorch#141480)

Annotate linear node for `linear_dynamic_fp16` with `X86InductorQuantizer`.

After `convert_pt2e`, the pattern will be

```
x
|
linear <- to_fp32 <- to_fp16 <- w
```

**Test plan**

```
pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_linear_dynamic_fp16
```

Pull Request resolved: pytorch#141480
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack-source-id: fbdfaa7
Pull Request resolved: pytorch/pytorch#141480
Stack from ghstack (oldest at bottom):

Annotate linear node for `linear_dynamic_fp16` with `X86InductorQuantizer`.

After `convert_pt2e`, the pattern will be:

    x
    |
    linear <- to_fp32 <- to_fp16 <- w

**Test plan**

    pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_linear_dynamic_fp16

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @ezyang @SherlockNoMad @EikanWang @wenzhe-nrv
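For readers new to this flow, a rough end-to-end sketch of the PT2E path the test exercises. The prepare/convert API and `X86InductorQuantizer` are the standard PT2E pieces; the config helper for linear_dynamic_fp16 and the export entry point are assumptions and may not match the test's exact code.

```python
import torch
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer


class M(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(16, 8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)


m = M().eval()
example_inputs = (torch.randn(2, 16),)
# Assumes a recent torch.export entry point; older PT2E flows used
# capture_pre_autograd_graph instead.
exported = torch.export.export_for_training(m, example_inputs).module()

quantizer = X86InductorQuantizer()
# Hypothetical helper name for the linear_dynamic_fp16 config; check
# x86_inductor_quantizer.py / test_x86inductor_quantizer.py for the real one.
quantizer.set_global(xiq.get_x86_inductor_linear_dynamic_fp16_config())

prepared = prepare_pt2e(exported, quantizer)
converted = convert_pt2e(prepared)
# Expected weight path after convert_pt2e:  w -> to_fp16 -> to_fp32 -> linear
```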