[Intel GPU] qconv at XPU backend by ZhiweiYan-96 · Pull Request #133080 · pytorch/pytorch · GitHub

Conversation

@ZhiweiYan-96 (Collaborator) commented Aug 9, 2024

Motivation

This PR enables XPU quantized convolution. The operators it registers are `onednn::qconv_prepack`, `onednn::qconv1d_pointwise`, `onednn::qconv2d_pointwise`, and `onednn::qconv3d_pointwise`. We share the same operator schemas as the Intel CPU backend, since both call kernels implemented in the oneDNN library.

Details

The implemented operators will be further integrated into the PT2E quantization flow. In this PR, we validate the kernel functionality via the unit tests in `test/inductor/test_mkldnn_pattern_matcher.py`, where the CPU backend defines a series of UTs for quantized convolution. We also extend device support in the Inductor lowering passes and Inductor IR defined in `torch/_inductor/fx_passes/quantization.py` and `torch/_inductor/mkldnn_ir.py`. The overall picture is that the CPU and GPU backends share the general optimization passes (op fusion) and the quantization Inductor IR; after lowering, the final kernel is dispatched to the appropriate implementation in the oneDNN library.

In this PR, we share the same int8 quantizer as the CPU backend, namely `X86InductorQuantizer`. In the next PR, #139578, we will add an `XPUInductorQuantizer` that customizes the PT2E behavior for the XPU backend. The capability of `XPUInductorQuantizer` will grow gradually along with the development of quantized operators on XPU.
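For orientation, the snippet below is a minimal sketch — not code from this PR — of the PT2E flow these operators plug into, assuming an XPU-enabled build. The graph-capture API has varied across releases (`capture_pre_autograd_graph` in older ones, `torch.export.export_for_training` in newer ones), so treat the capture step as illustrative.

```python
import torch
import torch._inductor.config as inductor_config
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
    X86InductorQuantizer,
    get_default_x86_inductor_quantization_config,
)

inductor_config.freezing = True  # constant-fold weights so qconv prepacking can apply

model = torch.nn.Sequential(torch.nn.Conv2d(3, 128, 3), torch.nn.ReLU()).eval().to("xpu")
example_inputs = (torch.randn(1, 3, 8, 8, device="xpu"),)

# Capture an FX graph, annotate it with the shared int8 recipe, calibrate,
# then convert; conversion inserts the dq -> conv -> q patterns into the graph.
exported = torch.export.export_for_training(model, example_inputs).module()
quantizer = X86InductorQuantizer().set_global(
    get_default_x86_inductor_quantization_config()
)
prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)  # one calibration pass
converted = convert_pt2e(prepared)

# Inductor's fusion passes match the dq/q patterns and lower them to the
# onednn::qconv2d_pointwise kernels registered by this PR.
with torch.no_grad():
    compiled = torch.compile(converted)
    compiled(*example_inputs)
```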

Validation

  • UT testing

```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
   -k test_qconv2d_xpu \
   -k test_qconv2d_silu_xpu \
   -k test_qconv2d_relu6_xpu \
   -k test_qconv2d_hardtanh_xpu \
   -k test_qconv2d_hardswish_xpu
```
  • Runtime exemplification

```bash
#qconv2d
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:binary_add:f32:2+eltwise_linear:1,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0668945

#qconv2d_silu
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_u8::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_swish:1+binary_add:f32:2+eltwise_linear:0.0124779:22,alg:convolution_direct,mb1_ic3oc128_ih8oh6kh3sh1dh0ph0_iw8ow6kw3sw1dw0pw0,0.0881348
```
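The UTs above reuse the CPU pattern-matcher tests, which compile small convolution modules and check that the expected quantization fusion patterns are matched; the `_xpu` variants run them against the XPU backend. The runtime traces come from oneDNN's verbose mode, enabled by setting the `ONEDNN_VERBOSE=1` environment variable (`DNNL_VERBOSE=1` in older releases). Reading the `#qconv2d` line: the primitive executes on `gpu:0` with a `u8` activation and `s8` weight in channels-last (`acdb`) layout, a per-tensor source scale and zero-point plus per-channel weight scales, and fused `binary_add` + `eltwise_linear` post-ops, taking roughly 0.067 ms for the `mb1_ic128oc128` problem shape.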

Stack from ghstack (oldest at bottom):

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @gujinghui @fengyuan14 @guangyey

@pytorch-bot bot commented Aug 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133080

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 3 Unrelated Failures

As of commit 83849fa with merge base 3614d13:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the `module: cpu` label Aug 9, 2024
ZhiweiYan-96 added a commit that referenced this pull request Aug 9, 2024
ghstack-source-id: 6e5bfd1
Pull Request resolved: #133080
@ZhiweiYan-96 ZhiweiYan-96 marked this pull request as draft August 9, 2024 08:01
@EikanWang (Collaborator)

@ZhiweiYan-96, please reach out to me to review this series of PRs once they are ready.

@ZhiweiYan-96 added the `topic: not user facing` label Oct 15, 2024
@ZhiweiYan-96 added the `module: xpu` label and removed the `ciflow/inductor` label Oct 18, 2024
@ZhiweiYan-96 added the `ciflow/xpu` and `ciflow/trunk` labels and removed the `module: inductor` and `ciflow/inductor` labels Oct 18, 2024
@pytorchmergebot (Collaborator)

Successfully rebased gh/ZhiweiYan-96/22/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/133080)

@EikanWang (Collaborator)

@pytorchbot rebase -b main

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased gh/ZhiweiYan-96/22/orig onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/133080)

pytorchmergebot pushed a commit that referenced this pull request Nov 24, 2024
ghstack-source-id: 5ac26c2
Pull Request resolved: #133080
@EikanWang (Collaborator)

@pytorchbot merge -i

@EikanWang (Collaborator)

The failing cases are unrelated to this PR. We have filed a GitHub issue to track them: #141466.


pytorchmergebot pushed a commit that referenced this pull request Nov 26, 2024
ghstack-source-id: 79ec13c
Pull Request resolved: #133080
pytorchmergebot pushed a commit that referenced this pull request Nov 26, 2024
…139578)

# Motivation
This PR adds `XPUInductorQuantizer`, which defines the int8 quantization recipe for the XPU backend.

# Details
`XPUInductorQuantizer` is a class derived from `X86InductorQuantizer`, as both quantizers take advantage of the highly optimized operators in the oneDNN library (qconv, qlinear, qconv/qlinear fusion).

We share the same recipe as `X86InductorQuantizer`, so we have the same `annotate_xxxx` methods. Ideally, `XPUInductorQuantizer` would have an empty class body, since the entire implementation can be inherited from the base class.

In this PR, we override the `annotate_xxx` methods for operators that have NOT yet been implemented. Any operator the XPU backend does not implement falls back to the fp32 implementation, since its node in the graph remains a `dq-op-q` pair. This helps provide good out-of-the-box usability for the XPU backend. The implemented operators, on the other hand, use the `annotate_op` methods from the base class and can be lowered successfully. (A sketch of this pattern follows this commit message.)

Pull Request resolved: #139578
Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/CuiYifeng, https://github.com/jerryzh168
ghstack dependencies: #133080
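For illustration, here is a minimal sketch of the inheritance pattern that commit message describes, assuming the same method naming as `X86InductorQuantizer`; it is not the exact upstream code from #139578.

```python
from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer


class XPUInductorQuantizer(X86InductorQuantizer):
    """Reuses the X86 int8 recipe wholesale and overrides annotation only
    for patterns the XPU backend has not implemented yet."""

    # Hypothetical override: returning without annotating leaves no q/dq
    # nodes around the op, so Inductor keeps the fp32 kernel for it.
    def _annotate_maxpool2d(self, node, quantization_config):
        return
```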
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
# Motivation
This PR is a precursor to pytorch#133080. It extracts the logic common to convolution and quantized convolution into `Utils.cpp`. With this modification, the two operators can share code such as input-format querying and op-layout querying.

Pull Request resolved: pytorch#139580
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/malfet
ghstack dependencies: pytorch#139721
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024

Pull Request resolved: pytorch#133080
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman
c10::optional<at::Tensor> accum,
double accum_scale,
int64_t accum_zero_point,
c10::optional<c10::ScalarType> output_dtype,
@r-barnes (Contributor) commented on these lines:

Do not use c10::optional. Use std::optional instead.
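(At the time of this PR, `c10::optional` had already become an alias of `std::optional` in PyTorch, so the requested fix is a mechanical spelling change, e.g. `std::optional<at::Tensor> accum`.)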

@r-barnes (Contributor)

@ZhiweiYan-96 - Note my review comment: use std::optional, not c10::optional.

@ZhiweiYan-96 (Collaborator, Author)

@r-barnes Thanks for the suggestion! I will make this change in PR #135189.

@github-actions github-actions bot deleted the gh/ZhiweiYan-96/22/head branch January 12, 2025 02:10

Labels

ciflow/inductor · ciflow/trunk (Trigger trunk jobs on your pull request) · ciflow/xpu (Run XPU CI tasks) · Merged · module: cpu (CPU specific problem, e.g., perf, algorithm) · module: inductor · module: xpu (Intel XPU related issues) · open source · topic: not user facing (topic category)

Projects

Status: Done


7 participants