[Intel GPU] qconv at XPU backend #133080
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133080
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 3 Unrelated Failures
As of commit 83849fa with merge base 3614d13:
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@ZhiweiYan-96, please reach out to me to review this series of PRs once they are ready.
Successfully rebased
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
Successfully rebased
@pytorchbot merge -i
The failed cases are unrelated to this PR. We have filed a GitHub issue to track them - #141466.
Merge started. Your change will be merged while ignoring the following 5 checks: xpu / linux-jammy-xpu-2025_0-py3.9 / test (default, 1, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025_0-py3.9 / test (default, 3, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025_0-py3.9 / test (default, 4, 4, linux.idc.xpu), inductor / cuda12.4-py3.10-gcc9-sm86 / test (inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu), inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
# Motivation

This PR adds `XPUInductorQuantizer`, which defines the int8 quantization recipe for the XPU backend.

# Details

`XPUInductorQuantizer` is a class derived from `X86InductorQuantizer`, as both quantizers take advantage of the highly optimized operators in the oneDNN library (qconv, qlinear, qconv/qlinear fusion). We share the same recipe as `X86InductorQuantizer`, so we have the same `annotate_xxxx` methods. Ideally, `XPUInductorQuantizer` would have an empty class body, since all of the implementation can be inherited from the base class.

In this PR, we override the `annotate_xxx` methods for operators that have NOT yet been implemented. Any operator the XPU backend does not implement falls back to its fp32 implementation, because the corresponding node in the graph remains a `dq-op-q` pattern. This provides good out-of-the-box usability for the XPU backend. The implemented operators, on the other hand, use the `annotate_op` methods implemented in the base class and can be lowered successfully.

Pull Request resolved: #139578
Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/CuiYifeng, https://github.com/jerryzh168
ghstack dependencies: #133080
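To make the inheritance/fallback idea above concrete, here is a minimal illustrative sketch (not the class body that actually landed in #139578); the overridden method name is an assumption about the base-class annotate hooks, while the import path for `X86InductorQuantizer` is the public one:

```python
# Illustrative sketch only, not the actual implementation of #139578.
# Assumption: `_annotate_maxpool2d` is one of the base-class annotate hooks;
# declining to annotate an op keeps it on the fp32 path for XPU.
from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
    X86InductorQuantizer,
)


class XPUInductorQuantizer(X86InductorQuantizer):
    """Reuses the X86 int8 recipe; overrides annotation only where XPU
    has no int8 kernel yet, so those ops stay in fp32."""

    def _annotate_maxpool2d(self, node, quantization_config):
        # Skip int8 annotation for this op so it is not lowered to a
        # quantized kernel on XPU.
        return
```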
# Motivation

This PR is a precursor to pytorch#133080. It extracts the common logic of convolution and quantized convolution into `Utils.cpp`. With this change, the two operators can share code such as input format querying and op layout querying.

Pull Request resolved: pytorch#139580
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/malfet
ghstack dependencies: pytorch#139721
# Motivation

This PR enables XPU quantized convolution. The operators it registers are `onednn::qconv_prepack`, `onednn::qconv1d_pointwise`, `onednn::qconv2d_pointwise`, and `onednn::qconv3d_pointwise`. We share the same operator schemas as the Intel CPU backend, as both call kernels implemented in the oneDNN library.

# Details

The implemented operators will be further integrated into the pt2e quantization flow. In this PR, we validate the kernel functionality via the UTs in `test/inductor/test_mkldnn_pattern_matcher.py`, where the CPU backend defines a series of UTs for quantized convolution. We also extend the device support for the inductor lowering pass and the inductor IR defined in `torch/_inductor/fx_passes/quantization.py` and `torch/_inductor/mkldnn_ir.py`. The overall picture is that the CPU and GPU backends share the general optimization passes (op fusion) and the quantization inductor IR; after lowering, the final kernel is dispatched to different implementations in the oneDNN library.

In this PR, we share the same int8 quantizer as the CPU, namely `X86InductorQuantizer`. In the next PR, pytorch#139578, we will add an `XPUInductorQuantizer` that customizes the pt2e behavior for the XPU backend. The capability of `XPUInductorQuantizer` will grow gradually along with the development of quantized operators on XPU.

# Validation

* UT testing

```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
  -k test_qconv2d_xpu \
  -k test_qconv2d_silu_xpu \
  -k test_qconv2d_relu6_xpu \
  -k test_qconv2d_hardtanh_xpu \
  -k test_qconv2d_hardswish_xpu
```

* Runtime exemplification

```bash
#qconv2d
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:binary_add:f32:2+eltwise_linear:1,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0668945
#qconv2d_silu
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_u8::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_swish:1+binary_add:f32:2+eltwise_linear:0.0124779:22,alg:convolution_direct,mb1_ic3oc128_ih8oh6kh3sh1dh0ph0_iw8ow6kw3sw1dw0pw0,0.0881348
```

Pull Request resolved: pytorch#133080
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman
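For orientation, the following is a minimal, hypothetical sketch of the pt2e flow these kernels plug into, using the shared `X86InductorQuantizer` as described above; the toy model, shapes, and the export call are illustrative and may need adjusting to the installed PyTorch version and XPU build:

```python
# Minimal sketch of the pt2e int8 flow on XPU (assumes an XPU-enabled build).
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq


class ConvReLU(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 128, kernel_size=3)

    def forward(self, x):
        return torch.nn.functional.relu(self.conv(x))


model = ConvReLU().eval().to("xpu")
example_inputs = (torch.randn(1, 3, 8, 8, device="xpu"),)

# Capture the model; the exact capture API depends on the PyTorch version.
exported = torch.export.export_for_training(model, example_inputs).module()

# This PR reuses the CPU recipe; pytorch#139578 later adds XPUInductorQuantizer.
quantizer = xiq.X86InductorQuantizer()
quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())

prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)  # calibration
converted = convert_pt2e(prepared)

# Inductor lowers the dq-conv-q patterns to the onednn::qconv*_pointwise kernels.
with torch.no_grad():
    compiled = torch.compile(converted)
    out = compiled(*example_inputs)
```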
```cpp
    c10::optional<at::Tensor> accum,
    double accum_scale,
    int64_t accum_zero_point,
    c10::optional<c10::ScalarType> output_dtype,
```
Do not use c10::optional. Use std::optional instead.
@ZhiweiYan-96 - Note my review comment: use `std::optional` instead of `c10::optional`.
Motivation

This PR enables XPU quantized convolution. The operators it registers are `onednn::qconv_prepack`, `onednn::qconv1d_pointwise`, `onednn::qconv2d_pointwise`, and `onednn::qconv3d_pointwise`. We share the same operator schemas as the Intel CPU backend, as both call kernels implemented in the oneDNN library.

Details

The implemented operators will be further integrated into the pt2e quantization flow. In this PR, we validate the kernel functionality via the UTs in `test/inductor/test_mkldnn_pattern_matcher.py`, where the CPU backend defines a series of UTs for quantized convolution. We also extend the device support for the inductor lowering pass and the inductor IR defined in `torch/_inductor/fx_passes/quantization.py` and `torch/_inductor/mkldnn_ir.py`. The overall picture is that the CPU and GPU backends share the general optimization passes (op fusion) and the quantization inductor IR; after lowering, the final kernel is dispatched to different implementations in the oneDNN library.

In this PR, we share the same int8 quantizer as the CPU, namely `X86InductorQuantizer`. In the next PR, #139578, we will add an `XPUInductorQuantizer` that customizes the pt2e behavior for the XPU backend. The capability of `XPUInductorQuantizer` will grow gradually along with the development of quantized operators on XPU.

Validation
Stack from ghstack (oldest at bottom):
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @gujinghui @fengyuan14 @guangyey