[quant][gpu] Adding quantized conv operator in cudnn #70622
Conversation
Summary: This is the initial PR to add eager mode quantized GPU operator support. We'll start with convolution, following the cudnn fp32 Conv code and the example cudnn frontend code: #51390, https://github.com/NVIDIA/cudnn-frontend/blob/main/samples/fusion_sample.cpp#L557

Test Plan: python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn

Reviewers: Subscribers: Tasks: Tags: [ghstack-poisoned]
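For context, the following minimal Python sketch shows the affine-quantization arithmetic that a quantized conv performs, emulated with dequantize/requantize around an fp32 conv. The scales and zero points are illustrative values, not taken from this PR, and this is not the cudnn implementation itself:

```python
import torch
import torch.nn.functional as F

# Illustrative quantization parameters; in practice these come from observers.
x = torch.randn(1, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
qx = torch.quantize_per_tensor(x, scale=0.05, zero_point=0, dtype=torch.qint8)
qw = torch.quantize_per_tensor(w, scale=0.02, zero_point=0, dtype=torch.qint8)

# An int8 conv kernel accumulates int8 * int8 into int32 and then requantizes:
#   y_q = round(acc * (s_x * s_w / s_y)) + zp_y
# (zero-point corrections are omitted here since zp_x = zp_w = 0).
# The same result can be emulated by convolving the dequantized tensors in fp32
# and quantizing the output with the output scale/zero point.
y_fp32 = F.conv2d(qx.dequantize(), qw.dequantize(), stride=1, padding=1)
qy = torch.quantize_per_tensor(y_fp32, scale=0.1, zero_point=0, dtype=torch.qint8)
print(qy.int_repr().shape)  # int8 values of the requantized output
```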
@jerryzh168 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: This is the initial PR to add eager mode quantized GPU operator support. We'll start with convolution, following the cudnn fp32 Conv code and the example cudnn frontend code: #51390, https://github.com/NVIDIA/cudnn-frontend/blob/main/samples/fusion_sample.cpp#L557

Test Plan:
```
> USE_EXPERIMENTAL_CUDNN_V8_API=1 python setup.py install
> python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn
```

Reviewers: Subscribers: Tasks: Tags:

Differential Revision: [D33409155](https://our.internmc.facebook.com/intern/diff/D33409155)

[ghstack-poisoned]
Hi @jerryzh168, what is your experience with cuDNN on int8 like? I recently tried to support cuDNN int8 with the NHWC layout, but got the impression that cuDNN does not support int32 output, which is what I need. I also noticed that you are doing ...
Hi @masahi, I just started with cuDNN on int8; we are planning to support a native quantized GPU backend through cuDNN this half. About the layout support: I'm using the cudnn v8 APIs, and it looks like the data layout is not yet exposed in the API (https://github.com/NVIDIA/cudnn-frontend/blob/main/include/cudnn_frontend_Tensor.h#L37), so my guess is that it's using the default layout (might be NCHW). I haven't been able to get this working yet; currently it can't find any engines, and I'm still debugging to find the problem. Please let me know if you have NCHW working.
Hi @masahi, this PR is working now. We do need the NHWC layout for the input, weight, and output.
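For readers less familiar with how NHWC is expressed in PyTorch: it is the channels_last memory format. A minimal sketch on plain fp32 tensors (how the quantized op consumes them is defined by this PR and is not shown here):

```python
import torch

x = torch.randn(1, 3, 224, 224)  # logical NCHW shape
x_nhwc = x.contiguous(memory_format=torch.channels_last)

# The shape stays (N, C, H, W); only the strides change so channels are innermost.
print(x_nhwc.is_contiguous(memory_format=torch.channels_last))  # True
print(x_nhwc.shape, x_nhwc.stride())
```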
Summary: This is the initial PR to add eager mode quantized GPU operator support. We'll start with convolution, following the cudnn fp32 Conv code and the example cudnn frontend code: #51390, https://github.com/NVIDIA/cudnn-frontend/blob/main/samples/fusion_sample.cpp#L557

TODO:
1. Support bias, relu, and more parameter flexibility
2. Use the packed_params API

Test Plan:
```
> USE_EXPERIMENTAL_CUDNN_V8_API=1 python setup.py install
> python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn
```

Reviewers: Subscribers: Tasks: Tags:

Differential Revision: [D33409155](https://our.internmc.facebook.com/intern/diff/D33409155)

[ghstack-poisoned]
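For reference on the packed_params TODO above, this is roughly how the existing CPU quantized conv ops are driven through prepacked parameters. The sketch assumes a CPU quantized backend (e.g. fbgemm) is available; the cudnn path in this PR may end up with a different packing interface:

```python
import torch

x = torch.randn(1, 3, 8, 8)
w = torch.randn(4, 3, 3, 3)
bias = torch.zeros(4)

# CPU quantized conv expects quint8 activations and qint8 weights.
qx = torch.quantize_per_tensor(x, 0.05, 64, torch.quint8)
qw = torch.quantize_per_tensor(w, 0.02, 0, torch.qint8)

# Pack weight + bias together with the conv hyperparameters
# (stride, padding, dilation, groups), then run the packed conv.
packed = torch.ops.quantized.conv2d_prepack(qw, bias, [1, 1], [1, 1], [1, 1], 1)
qy = torch.ops.quantized.conv2d(qx, packed, 0.1, 64)
print(qy.shape)
```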
cc @eqy |
@jerryzh168 Would it be possible to reuse the v8 methods that are planned to be introduced with the V8 API convolutions? e.g., in ...
…operator in cudnn [PR currently incomplete]"

Summary: This PR is similar to #70622, but for the linear operator. Unlike PR 70622, this implementation directly uses packed parameters (rather than via a refactor, as was done for the conv operator) and also directly implements bias & relu. Currently, int8 matrix multiplication is not supported in cudnn; the ETA for this support is the first half of April 2022. As a temporary workaround, we cast our int8 tensors to fp32 prior to the matmul.

Test plan:
```
python test/test_quantization.py TestQuantizedLinear.test_qlinear_cudnn
```

Differential Revision: [D34824251](https://our.internmc.facebook.com/intern/diff/D34824251)

[ghstack-poisoned]
Summary: Pull Request resolved: #73959. This PR is similar to #70622, but for the linear operator. Unlike PR 70622, this implementation directly uses packed parameters (rather than via a refactor, as was done for the conv operator) and also directly implements bias & relu. Currently, int8 matrix multiplication is not supported in cudnn; the ETA for this support is the first half of April 2022. As a temporary workaround, we cast our int8 tensors to fp32 prior to the matmul.

Test Plan:
```
python test/test_quantization.py TestQuantizedLinear.test_qlinear_cudnn
```

Imported from OSS

Differential Revision: D34824251

Reviewed By: jerryzh168

Pulled By: dzdang

fbshipit-source-id: 47139796782ade8d030ba2f9968a9abdd3a91d2f (cherry picked from commit eade369)
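A conceptual Python sketch of the fp32-matmul workaround described above, phrased in terms of dequantize/requantize; the actual implementation builds this inside the cudnn graph, so the details of how the raw int8 values are cast may differ:

```python
import torch
import torch.nn.functional as F

def qlinear_via_fp32_matmul(qx, qw, bias, out_scale, out_zero_point):
    # Workaround while cudnn lacks int8 matmul: move to fp32, do the matmul,
    # then requantize to int8 with the output quantization parameters.
    y = F.linear(qx.dequantize(), qw.dequantize(), bias)
    return torch.quantize_per_tensor(y, out_scale, out_zero_point, torch.qint8)

qx = torch.quantize_per_tensor(torch.randn(2, 8), 0.1, 0, torch.qint8)
qw = torch.quantize_per_tensor(torch.randn(4, 8), 0.05, 0, torch.qint8)
out = qlinear_via_fp32_matmul(qx, qw, torch.zeros(4), 0.2, 0)
print(out.int_repr())
```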
Stack from ghstack:
Summary:
This is the initial PR to add eager mode quantized GPU operator support. We'll start
with convolution, following the cudnn fp32 Conv code and the example cudnn frontend code:
#51390
https://github.com/NVIDIA/cudnn-frontend/blob/main/samples/fusion_sample.cpp#L557
TODO:
1. Support bias, relu, and more parameter flexibility
2. Use the packed_params API
Test Plan:
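```
> USE_EXPERIMENTAL_CUDNN_V8_API=1 python setup.py install
> python test/test_quantization.py TestQuantizedConv.test_qconv2d_cudnn
```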
debug command:
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: D33409155