[Quant][Inductor][X86] add fusion pass for linear_dynamic_fp16 by Xia-Weiwen · Pull Request #141549 · pytorch/pytorch · GitHub

Conversation

@Xia-Weiwen
Collaborator

@Xia-Weiwen Xia-Weiwen commented Nov 26, 2024

Stack from ghstack (oldest at bottom):

Description
For linear_dynamic_fp16, we insert quantize (to_fp16) and dequantize (to_fp32) ops between the weight and linear, giving the following pattern:

  x
  |
linear <- to_fp32 <- to_fp16 <- w

In Inductor, the pattern we finally see will be

fp32 activation
  |
(reshape)
  |
mm/addmm <- t <- to_fp32 <- to_fp16 <- weight
  |
(reshape)

Or

fp32 activation
  |
expand
  |
 bmm <- expand <- t <- to_fp32 <- to_fp16 <- weight
  |
(add)

The second pattern applies when x.ndim > 2 and x is not contiguous; the first pattern covers all other cases.
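
For reference, here is a minimal sketch of what the pre-fusion pattern computes with plain torch ops; the shapes are illustrative only, not taken from this PR:

```python
import torch

x = torch.randn(4, 8)              # fp32 activation
w = torch.randn(16, 8)             # fp32 weight (out_features, in_features)

w_fp16 = w.to(torch.float16)       # to_fp16: weight stored in half precision
w_deq = w_fp16.to(torch.float32)   # to_fp32: dequantize back before compute
y = torch.mm(x, w_deq.t())         # mm <- t <- to_fp32 <- to_fp16, as in the first diagram
```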

After fusing the pattern with weight prepack, we get

fp32 activation
  |
onednn.linear_dynamic_fp16 <- onednn.linear_prepack_fp16 <- weight

After freezing, the prepack op is gone.
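
As a hedged usage sketch: assuming a model that has already been quantized to linear_dynamic_fp16 upstream (the module and shapes below are illustrative), the fusion is exercised by compiling in eval mode with Inductor freezing enabled, which lets the prepack op constant-fold away:

```python
import torch

# illustrative stand-in for a model whose Linear has been quantized upstream
mod = torch.nn.Linear(8, 16).eval()
x = torch.randn(2, 8)

with torch.no_grad():
    torch._inductor.config.freezing = True  # enables constant-folding of the prepack op
    compiled = torch.compile(mod)
    out = compiled(x)
```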

Test plan

python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_dynamic_fp16

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

Differential Revision: D66802159

@pytorch-bot

pytorch-bot bot commented Nov 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141549

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 67037d6 with merge base 795f28a:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review December 2, 2024 10:56
@Xia-Weiwen
Collaborator Author

Hi @jerryzh168, could you please review? Thanks.

def linear_dynamic_fp16_weight_prepack(match: Match, *args, **kwargs):
    """
    Match the pattern:
Contributor

This is complicated. Is there any way we can get the pattern by tracing a higher-level pattern?

Collaborator Author

Thanks for the suggestion. Do you have any examples for reference? Thanks.

Contributor

For example, this is how we get the pattern for QAT:

match_pattern = _get_aten_graph_module_for_pattern(

Collaborator Author

Thanks. However, here we use the pattern matcher in Inductor, where the pattern is described with things like CallFunction, and we currently use hand-written patterns as defined in _generate_linear_dynamic_fp16_pattern above. So it looks like we cannot generate the pattern from tracing.
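
For context, a hedged sketch of what such a hand-written Inductor pattern looks like; the structure loosely mirrors the weight branch in the diagrams above, and the exact arguments are illustrative rather than this PR's actual code:

```python
import torch
from torch._inductor.pattern_matcher import CallFunction, KeywordArg

aten = torch.ops.aten
prims = torch.ops.prims

# weight -> to_fp16 -> to_fp32, feeding t and then mm (first diagram above)
dequantized_weight = CallFunction(
    prims.convert_element_type.default,
    CallFunction(prims.convert_element_type.default, KeywordArg("w"), torch.float16),
    torch.float32,
)
pattern = CallFunction(
    aten.mm.default,
    KeywordArg("x"),
    CallFunction(aten.t.default, dequantized_weight),
)
```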

Collaborator Author

Hi @jerryzh168, do you have more comments on this? Thanks.

Collaborator Author

@sanchitintel Thanks for the pointer. Is it done by tracing or written by hand?

Collaborator

@sanchitintel sanchitintel Dec 5, 2024

Individual patterns defined in fuse_attention.py with the torch API are traced to produce serialized patterns of the kind you were alluding to (a CallFunction that may be nested). This approach avoids manually writing nested CallFunctions.

Collaborator Author

How is the tracing done? Thanks.

Collaborator

@sanchitintel sanchitintel Dec 5, 2024

Please refer to this code-flow as an example -

def _sfdp_init():
    for key, register_replacement_kwargs in _get_sfdp_patterns():
        gen_register_replacement(key, **register_replacement_kwargs)

You could try playing around with this code; it may help you narrow down which underlying Inductor API you could use in your use case to trace nested CallFunctions corresponding to high-level patterns written with the torch API.

I'm guessing it's probably this method, but I haven't verified -

pattern = gen_pattern(search_fn, example_inputs, trace_fn, scalar_workaround)

Thanks!
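
For what it's worth, here is a minimal self-contained sketch of that flow using Inductor's register_replacement and fwd_only helpers; the pattern and replacement functions below are illustrative stand-ins, not the SDPA code:

```python
import torch
from torch._inductor.pattern_matcher import (
    PatternMatcherPass,
    fwd_only,
    register_replacement,
)

def search_fn(x, w):
    # high-level pattern written with the torch API: the fp16 round-trip
    # on the weight, as in linear_dynamic_fp16
    return torch.mm(x, w.to(torch.float16).to(torch.float32).t())

def replace_fn(x, w):
    # stand-in replacement; a real pass would call the fused onednn op
    return torch.nn.functional.linear(x, w)

patterns = PatternMatcherPass()
# tracing search_fn with the example inputs generates the nested pattern
register_replacement(
    search_fn,
    replace_fn,
    [torch.randn(4, 8), torch.randn(16, 8)],  # example inputs used for tracing
    fwd_only,
    patterns,
)
```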

Collaborator Author

It looks quite complicated. I will try to understand it later. Thanks.

@jerryzh168
Contributor

Let me import this to check internal CI.

@jerryzh168
Contributor

@jerryzh168 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 5, 2024
@jerryzh168
Contributor

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status

pytorchmergebot pushed a commit that referenced this pull request Dec 9, 2024
…elu (#141556)

**Description**
Fuse and prepack the weight for `linear_dynamic_fp16` with post-op relu. In Inductor, the pattern we see is
```
fp32 activation
  |
(reshape)
  |
mm/addmm <- t <- to_fp32 <- to_fp16 <- weight
  |
(reshape) <- relu
```
Or
```
fp32 activation
  |
expand
  |
 bmm <- expand <- t <- to_fp32 <- to_fp16 <- weight
  |
(add) <- relu
```
The second pattern applies when x.ndim > 2 and x is not contiguous; the first pattern covers all other cases.

After fusing the pattern with weight prepack, we get
```
fp32 activation
  |
onednn.linear_relu_dynamic_fp16 <- onednn.linear_prepack_fp16 <- weight
```
After freezing, the prepack op is gone.

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_relu_dynamic_fp16
```

Pull Request resolved: #141556
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #141549
pytorch-bot bot pushed a commit that referenced this pull request Dec 9, 2024
Differential Revision: [D66802159](https://our.internmc.facebook.com/intern/diff/D66802159)
Pull Request resolved: #141549
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
AmdSampsa pushed a commit to AmdSampsa/pytorch that referenced this pull request Dec 9, 2024
…elu (pytorch#141556)

@github-actions github-actions bot deleted the gh/Xia-Weiwen/21/head branch January 7, 2025 02:05

Labels

ciflow/inductor
ciflow/trunk (Trigger trunk jobs on your pull request)
intel (This tag is for PR from Intel)
Merged
module: cpu (CPU specific problem (e.g., perf, algorithm))
module: inductor
open source
release notes: quantization (release notes category)

7 participants