[Quant Tool] Introduce get_qdq_config() helper to get QDQ configurations #22677
Merged
adrianlizarraga
commented
Oct 31, 2024
onnxruntime/test/python/quantization/test_get_int_qdq_config.py
fajin-corp
reviewed
Nov 5, 2024
fajin-corp
reviewed
Nov 5, 2024
fajin-corp
previously approved these changes
Nov 5, 2024
fajin-corp
approved these changes
Nov 6, 2024
adrianlizarraga
added a commit
that referenced
this pull request
Nov 6, 2024
…ons (#22677)

### Description
Introduces the `get_qdq_config()` function to get a quantization configuration for a full integer QDQ model. This function provides an easier way of specifying commonly used options and sets convenient defaults. Specifically:

- Instead of requiring the user to pass a dictionary of `extra_options`, the new interface adds function parameters for common settings:
  - All calibrator settings
  - Whether activations/weights are symmetric
  - Whether to keep or fuse relu/clip into Q
  - Minimum real range for quantization
  - Dictionary of tensor quantization overrides.
- Automatically scans the input floating-point model and fills out the operator types to quantize. Otherwise, only a limited number of operator types would be quantized by default.
- Detects if the input model uses external data. If so, ensures that the generated QDQ model also uses external data.
- Detects if the model will use newly introduced quantization types (int4/int16) with an older opset. If so, forces the use of the `com.microsoft` domain for Q/DQ ops, which support all types.
- Automatically enables the "extra option" called `ForceQuantizeNoInputCheck` to ensure data movement operators (e.g., Transpose) are always quantized.
- The user can pass a function to indicate which nodes to exclude from quantization.
- The user can still pass their own `extra_options` to override any of the above if necessary.

```python
from onnxruntime.quantization import CalibrationMethod, QuantType, get_qdq_config, quantize

# float_model, data_reader, float_model_path and qdq_model_path are placeholders
# that the caller supplies.

# Get QDQ configuration
qdq_config = get_qdq_config(
    float_model,
    data_reader,
    calibrate_method=CalibrationMethod.Percentile,
    calibrate_args={"percentile": 99.98},  # Converted to extra_options
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
    nodes_to_exclude=["Mul"],  # Could also be a function. Ex: `lambda model, node: node.op_type == "Softmax"`
    # Other options converted to extra_options:
    min_real_range=0.0001,
    keep_removable_activations=True,
    activation_symmetric=True,
    weight_symmetric=True,
)

# Quantize model
quantize(float_model_path, qdq_model_path, qdq_config)
```

### Motivation and Context
Need a version of `get_qnn_qdq_config()` that is not EP-specific.
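To make the defaulting behaviour listed in the description more concrete, here is a minimal pure-Python sketch of that decision logic. It is not the onnxruntime implementation: `Node`, `Model`, and `sketch_qdq_defaults` are hypothetical stand-ins, quantization types are plain strings, and the opset-21 threshold, the skipped operator types, and the `UseQDQContribOps` key are assumptions made for illustration. Only `ForceQuantizeNoInputCheck` and the `com.microsoft` fallback are named in the description itself.

```python
from dataclasses import dataclass

# Assumption: opset 21 is where the standard Q/DQ ops accept int4/int16.
OPSET_WITH_INT4_INT16 = 21
NEW_QUANT_TYPES = {"QInt4", "QUInt4", "QInt16", "QUInt16"}
# Assumption: op types that this sketch never treats as quantization candidates.
OP_TYPES_TO_SKIP = {"Cast", "QuantizeLinear", "DequantizeLinear"}


@dataclass
class Node:
    """Hypothetical stand-in for an ONNX graph node."""
    name: str
    op_type: str


@dataclass
class Model:
    """Hypothetical stand-in for a loaded floating-point model."""
    nodes: list
    opset: int
    uses_external_data: bool = False


def sketch_qdq_defaults(model, activation_type, weight_type,
                        nodes_to_exclude=None, extra_options=None):
    """Derive settings that mirror the behaviours listed in the description."""
    # Scan the model so every operator type it contains becomes a candidate.
    op_types = sorted({n.op_type for n in model.nodes} - OP_TYPES_TO_SKIP)

    # nodes_to_exclude is either a list of node names or a
    # predicate taking (model, node) and returning a bool.
    if callable(nodes_to_exclude):
        excluded = [n.name for n in model.nodes if nodes_to_exclude(model, n)]
    else:
        excluded = list(nodes_to_exclude or [])

    # Data movement ops (e.g., Transpose) should always be quantized.
    options = {"ForceQuantizeNoInputCheck": True}

    # Newer quantization types on an older opset need the com.microsoft Q/DQ ops.
    if {activation_type, weight_type} & NEW_QUANT_TYPES and model.opset < OPSET_WITH_INT4_INT16:
        options["UseQDQContribOps"] = True

    # User-supplied extra_options override the derived defaults.
    options.update(extra_options or {})

    return {
        "op_types_to_quantize": op_types,
        "nodes_to_exclude": excluded,
        "use_external_data_format": model.uses_external_data,
        "extra_options": options,
    }


model = Model(
    nodes=[Node("conv0", "Conv"), Node("sm0", "Softmax"), Node("cast0", "Cast")],
    opset=19,
)
cfg = sketch_qdq_defaults(
    model, "QUInt16", "QInt8",
    nodes_to_exclude=lambda m, n: n.op_type == "Softmax",
)
print(cfg)
```

Applying the user's `extra_options` last is what lets callers override any derived default, which matches the final bullet of the description.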
yf711
pushed a commit
that referenced
this pull request
Nov 11, 2024
…ons (#22677)
guschmue
pushed a commit
that referenced
this pull request
Dec 2, 2024
…ons (#22677)
ankitm3k
pushed a commit
to intel/onnxruntime
that referenced
this pull request
Dec 11, 2024
…ons (microsoft#22677)
ankitm3k
pushed a commit
to intel/onnxruntime
that referenced
this pull request
Dec 11, 2024
…ons (microsoft#22677)
ankitm3k
pushed a commit
to intel/onnxruntime
that referenced
this pull request
Dec 11, 2024
…ons (microsoft#22677)
alex-spacemit
pushed a commit
to spacemit-com/onnxruntime
that referenced
this pull request
Jun 22, 2025
[CUDA/ROCm/Migraphx] consolidate gpu data transfer (microsoft#22609)

Consolidate the GPU data transfer in the CUDA, ROCm and MIGraphX EPs.
(1) Remove some redundant stream synchronization on the default stream, according to the spec of cudaMemcpy.
(2) Consolidate CUDA, ROCm and MIGraphX to use the same logic where possible.

This is a follow-up on reviewing microsoft#22589.

https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior

* For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, **but the DMA to final destination may not have completed**.
* For transfers from pinned host memory to device memory, the function is synchronous with respect to the host.
* For transfers from device to either pageable or pinned host memory, the function returns only once the copy has completed.
* For transfers from device memory to device memory, **no host-side synchronization is performed**.
* For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.
* For transfers between device memory and pageable host memory, the function might be synchronous with respect to host.
* For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.
* If pageable memory must first be staged to pinned memory, the driver may synchronize with the stream and stage the copy into pinned memory.
* For all other transfers, the function should be fully asynchronous.

https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/group___memory.html

If host or dest are not pinned, the memory copy will be performed synchronously. For best performance, use hipHostMalloc to allocate host memory that is transferred asynchronously. On HCC, hipMemcpyAsync does not support overlapped H2D and D2H copies.
For hipMemcpy, the copy is always performed by the device associated with the specified stream. For hipMemcpy, the copy is always performed by the current device (set by hipSetDevice).

https://github.com/ROCm/ROCm/blob/roc-5.7.x/tools/autotag/templates/rocm_changes/5.6.1.md

ROCm 5.6.1 release note: hipMemcpy device-to-device (intra device) is now asynchronous with respect to the host

[DML EP] Cast to bool correctly, adding explicit clip after cast (microsoft#22665)

The CastNonStringTester test in CastOpTest was failing due to bitwise mismatches when casting other types to bool. This was caused by bool being represented as uint8 in DML. Added a clipping step in DmlOperatorCast to ensure correct bitwise matching after casting to bool.

ref: https://dev.azure.com/microsoft/OS/_workitems/edit/44572678

Fixed a minor bug in layout transformation for Resize (microsoft#21954)

Since opset 18, the 'scales' and 'sizes' constant inputs can be 2D tensors. Transposing 2D tensors is not supported by the current implementation, so the fix only allows 4D constant inputs.

Build CUDA and DML together (microsoft#22602)

Now, we need to build CUDA and DML in one package, but the CUDA EP and the DML EP can't run in one process. It will throw the exception `the GPU device instance has been suspended`. So the issue is that the CUDA EP and the DML EP coexist at compile time but can't coexist at run time.

This PR splits the CUDA EP tests and the DML EP tests in all unit tests. The solution is to use 2 environment variables, NO_CUDA_TEST and NO_DML_TEST, in CI. For example, if NO_CUDA_TEST is set, the DefaultCudaExecutionProvider will be nullptr, and the test will not run with the CUDA EP. In debugging, the CUDAExecutionProvider will not be called. I think, as long as CUDA functions, like cudaSetDevice, are not called, DML EP tests can pass.

Disabled java test of testDIrectML because it doesn't work now even without the CUDA EP.

Add concurrency setting to codeql workflow (microsoft#22678)

1. Add concurrency setting to codeql workflow
2. Modify lint workflow's PATH setting.

To save machine resources.

Revert "[WebNN] Fallback the node when its output doesn't have shape info" (microsoft#22669)

Reverts microsoft#22556 since it causes incorrect fallback.

[CoreML] ML Program more ops (2/N) (microsoft#22480)

- cast
- argmax
- gelu
- cast
- LayerNorm
- GroupNorm
- InstanceNorm

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

[js/webgpu] Optimize MatMul with M = 1 (microsoft#22577)

BUG microsoft#22031

In the demucs model, there are lots of MatMul ops with shapes like below:
`input[0]: [3448,1,512] | float32, input[1]: [512,1536] | float32, output[0]: [3448,1,1536] | float32`

We can see that for this kind of shape, the batch size is a big value, but M = 1. Our current algorithm partitions tiles based on [M, N], which is not efficient for such shapes. This PR reshapes the inputs to improve the matmul performance.

Before: [3448,1,512] x [512,1536] = [3448,1,1536]
After: [1, 3448, 512] x [512, 1536] = [1, 3448, 1536], then the output can be reshaped to [3448, 1, 1536]

The overall MatMul time in the demucs model drops from 4418.17 ms to 1778.45 ms on my iGPUs.

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>

Rework the native library usage so that a pre-built ORT native package can be easily used (microsoft#22345)

The local build of the native library was being included by almost every project, but is only needed to run tests. Due to the multiple inclusions, attempting to use a pre-built package was clashing with any local builds that were available.
Create a helper file to include either a local build or a pre-built package, and include that in the two test projects.

Clean up various miscellaneous things.

Create setup to simplify running on-device tests with the nuget packages.

Use suggest-changes@v2 (microsoft#22667)

Use suggest-changes@v2 (parkerbxyz/suggest-changes#36 (comment)) to post suggested changes as comments instead of requested changes to streamline the review process.

- Also updated the script to `set +e` to ignore the exit code only for the linter run, so that if there are errors in dependency installation we can still get signals.

[WebNN] Remove some useless verbose logs (microsoft#22690)

These logs are not very useful and create a lot of noise during debugging, especially when working with large models.

[CI] Set up proper permissions for linting workflow (microsoft#22696)

Allow writing security events to post lint messages on PRs.

Fix too strict assert in onnx_quantizer.py (microsoft#22283)

Partial answer to issue microsoft#19997. The example succeeds after this change.

Fix crash when running ORT under low integrity process like Edge where ETW registration can fail (microsoft#22699)

Make ETW provider registration non-fatal and not throw an exception.

Needs to work under builds with exceptions enabled and with --disable_exceptions.

ORT should not crash.

Addresses microsoft#22475. Privately tested by the filer of that issue.

DMMHA: add unit tests; fix CPU, CUDA kernel (microsoft#22567)

Fixes:
(1) cpu kernel: applying scale before bias and mask like other MHA ops
(2) cpu kernel: correct offset during appending past to present.
(3) cuda kernel: apply mask if provided; fix output_qk offset.

Add DMMHA unit tests

[Doc] Add I/O binding example using onnx data type in python API summary (microsoft#22695)

Add I/O binding example using onnx data type in python API summary. The API is available since the 1.20 release.

Follow up of microsoft#22306 to add some documentation.
Adjust max chunk size to fix error limit check from DX12 for large resources that are CPU accessible. (microsoft#22680)

Adjust max chunk size to fix error limit check from DX12 for large resources that are CPU accessible.

Current Agility SDK restricts CPU visible buffers to 0xFFFF0000 bytes, or slightly smaller than 4GiB.

Verified restriction is still in latest Agility SDK 1.614.1.

Allocation of resources 4GiB or larger fails in the DX12 verification layer.

---------

Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>

[WebNN EP] Support Sign and CumSum operators (microsoft#22616)

This PR supports Sign and CumSum operators for WebNN EP. @Honry @fdwr PTAL, thanks.

[WebNN] Don't skip scalar tensor registration (microsoft#22688)

ORT will optimize identical scalar initializers into one, so we should not skip registering such a scalar as a WebNN Constant.

Nuget Windows AI Pipeline, Disable SDL Submodules. (microsoft#22711)

Set SDL's git submodule to false.

* Previous job's SDL logs: It has 'git submodule sync' command, which means 'git submodule sync synchronizes all submodules while git submodule sync'
* After setting SDL git submodules to false, the logs don't have the 'git submodule sync' command.

[WebNN EP] Align QDQ ops with latest Chromium implementation (microsoft#22180)

- Pass inputs to WebNN directly, WebNN will handle the broadcasting
- If `zero_point` is not provided, make a WebNN Constant with 0 values and same shape as `scale` input

[WebNN] Support SimplifiedLayerNormalization op (microsoft#22674)

WebNN doesn't provide a dedicated op for SimplifiedLayerNormalization, so use a couple of WebNN ops to emulate it in WebNN EP.
X --> Pow --> ReduceMean --> Add --> Sqrt --> Div -> Mul

support WebGPU EP in Node.js binding (microsoft#22660)

This change enhances the Node.js binding with the following features:

- support WebGPU EP
- lazy initialization of `OrtEnv`
- being able to initialize ORT with default log level setting from `ort.env.logLevel`.
- session options:
  - `enableProfiling` and `profileFilePrefix`: support profiling.
  - `externalData`: explicit external data (optional in Node.js binding)
  - `optimizedModelFilePath`: allow dumping optimized model for diagnosis purpose
  - `preferredOutputLocation`: support IO binding.

======================================================

`Tensor.download()` is not implemented in this PR.
Build pipeline update is not included in this PR.

[js/webgpu] Optimize Gemm (microsoft#22706)

BUG microsoft#22031

The total Gemm time in the demucs model drops from over 1000 ms to 181.14 ms on my iGPUs.

Refactor the cmake code that is related to delay loading (microsoft#22646)

Refactor the cmake code that is related to delay loading. Provide a cmake option to control if delay loading should be enabled or not. Disabling the option when python is enabled, due to a known issue. ONNX Runtime's python package depends on DirectML.dll, but supposedly the DLL should be delay loaded. This PR only refactors the code. It doesn't change the behavior.

bugfix add add more ops

Change-Id: I3c89e4c74cfe9136c88c7632d68274c8ca5e5fa3

Remove webgpu ep in mobile packaging stages (microsoft#22725)

The nuget-zip-java packaging pipeline has been failing for 4 days, since it was introduced in microsoft#22591

Update DNNL CI python to 310 (microsoft#22691)

Revert to err logging instead of LOGS_DEFAULT macro (microsoft#22720)

Revert to err logging instead of LOGS_DEFAULT macro due to an issue seen during testing: "onnxruntime::logging::LoggingManager::DefaultLogger Attempt to use DefaultLogger but none has been registered."

Revert part of a PR suggestion to prevent a crash for the scenario seen in the previous PR microsoft#22699. It was suggested to use LOGS_DEFAULT(), but that does not work during early init. Safer to use std::cerr instead, like the original PR had it.

Replace gsl::narrow with narrow in WebNN code (microsoft#22733)

Replace use of `gsl::narrow` with `narrow` to build for WebNN @snnn

Building for WebNN with exceptions disabled cannot use `gsl::narrow`. Replace with `narrow`.

Address issue microsoft#22712

[js/webgpu] Increase workgroupSize if only one workgroup is dispatched (microsoft#22709)

For reduce related ops, we should increase workgroupSize to improve parallelism if only one workgroup is dispatched.

The total ReduceMean time drops from 77.79 ms to 8.98 ms on my iGPUs.

Fix GRU tests (microsoft#22716)

Many GRU tests were being skipped due to an error in MLOperatorAuthorImpl.cpp. The issue was caused by activation function names not being capitalized (e.g., ‘sigmoid’), while the AttrValue was using mixed case (e.g., ‘Sigmoid’, ‘LeakyRelu’), which resulted in an ‘unsupported activation function’ error in DMLOperatorRecurrentNeuralNetwork.cpp.

This PR fixes the issue by making the DML EP activation function name case-insensitive, and capitalizing the activation function names in the tests.

ref PR: microsoft#15914
ref bug: https://dev.azure.com/microsoft/OS/_workitems/edit/44571772

---------

Co-authored-by: nums11 <numsmt2@gmail.com>

support Qnn 2 28 (microsoft#22724)

support Qnn 2.28
update default qnn version to 2.28 in build pipeline

[webgpu] change default validation mode (microsoft#22730)

Change default validation mode in Release build from "wgpuOnly" to "basic"

Enable CUDA Python Test (microsoft#22717)

[C# MauiModelTester] Fix icon name in Info.plist (microsoft#21666)

Fix icon name in Info.plist. It now matches the icon at `csharp/tools/MauiModelTester/Resources/AppIcon/onnxruntime_icon.png`.

[js/webgpu] Destroy staging buffers aggressively during weights uploading (microsoft#22726)

In the current implementation, all the staging buffers for weights uploading are destroyed after the first batch of kernel execution. This requires a lot of memory, as none of the staging buffers can be reused. It also hurts the startup time (weights uploading only happens in session creation), as weights uploading is delayed to a very late time.

This PR submits the queue and destroys staging buffers very aggressively, so that the related GPU memory can be reused as much as possible, though the real situation depends on the WebGPU and driver implementation. The aggressive queue submission also moves GPU operations to a very early time, which helps the startup time.

Some buffer uploading benchmarks were written to compare multiple solutions with regard to memory and time consumption. Benchmarks can be found at https://github.com/webatintel/webbench/blob/master/webgpu/buffer-upload.html, while detailed test data can be found at https://docs.google.com/document/d/1KgygOkb9ZNzkgzQ_tWOGlEI9ScmMBHDjDojjPFLmVXU/edit.

I also tested phi3.5 on 2 machines: first inference time improved from 5141ms to 3579ms and from 4327ms to 2947ms, respectively.
[WebNN EP] Fix issues with MLTensor caching (microsoft#22701)

This PR fixes a bug that occurs when searching for a compatible `MLTensor` in the cache. We were not checking the number of dimensions in the shape. This meant that a cached buffer of shape `[1]` could match `[1, 1, 256, 256]`.

This PR also adds better handling when attempting to force an `MLTensor` to a different shape.

[CUDA] Fix NumericLimits (microsoft#22738)

* Fix `NumericLimits<float>` that used infinity as max, which is not consistent with `std::numeric_limits<float>::max()`. In Windows, (float)(1e+300) is used for INFINITY, which causes a compiler error in Visual Studio 2022 v17.12 Preview 5.
* Rename `NumericLimits<T>::Min` to Lowest to be consistent with std::numeric_limits
* Fix topk implementation: use `NumericLimits<CudaT>` instead of `NumericLimits<T>` in kernel. That avoids a confusing definition of `NumericLimits<MLFloat16>` that returns half instead of MLFloat16.
* Use CUDART_MAX_NORMAL_FP16 if possible. It sets the bits value directly, which is faster than converting float to half.

Note that NumericLimits does not support __nv_bfloat16, __nv_fp8_e4m3 and __nv_fp8_e5m2 right now.

microsoft#22728

[CUDA/ROCm] Conditionally support ArgMax and ArgMin for opset 12 and above (microsoft#22713)

Based on microsoft#9700, and extend it to ArgMin as well.

This pull request introduces several enhancements and fixes related to the `ArgMax` and `ArgMin` operators in the CUDA execution provider. The changes ensure proper handling of these operators across different versions and improve kernel registration and fallback mechanisms.

Key changes include:

* Added new kernel class registrations for `ArgMax` and `ArgMin` for different data types and versions in `onnxruntime/core/providers/cuda/cuda_execution_provider.cc`.
* Introduced `ArgMaxOrArgMinNeedFallbackToCPU` function to handle fallback to CPU when the `select_last_index` attribute is set to 1, as CUDA does not support this attribute.
* Replaced `REGISTER_KERNEL_UNTIL_VERSIONED_TYPED` with `REGISTER_KERNEL_VERSIONED_RANGE_TYPED` and `REGISTER_KERNEL_VERSIONED_SINCE_TYPED` macros for better version handling.
* Updated kernel registration for `ArgMax` and `ArgMin` to use the new macros, ensuring proper version handling and support for different data types.
* Added safety checks in the `ArgMax` and `ArgMin` classes to ensure `select_last_index` is not set to 1, as it is not supported on CUDA.
* Added new tests for `ArgMax` and `ArgMin` operators to verify behavior when `select_last_index` is set to 0, ensuring compatibility with both CPU and CUDA execution providers.

Improve CUDA kernel coverage for stable diffusion model and hence improve its performance on CUDA

[CUDA] Build nhwc ops by default (microsoft#22648)

* Build cuda nhwc ops by default.
* Deprecate `--enable_cuda_nhwc_ops` in build.py and add `--disable_cuda_nhwc_ops` option

Note that it requires cuDNN 9.x. If you build with cuDNN 8, NHWC ops will be disabled automatically.

In general, NHWC is faster than NCHW for convolution in Nvidia GPUs with Tensor Cores, and this could improve performance for vision models.

This is the first step to prefer NHWC for CUDA in the 1.21 release. The next step is to do some tests on popular vision models. If it helps in most models and devices, set `prefer_nhwc=1` as the default cuda provider option.

[Quant Tool] Introduce get_qdq_config() helper to get QDQ configurations (microsoft#22677)

Introduces the `get_qdq_config()` function to get a quantization configuration for a full integer QDQ model. This function provides an easier way of specifying commonly used options and sets convenient defaults. Specifically:

- Instead of requiring the user to pass a dictionary of `extra_options`, the new interface adds function parameters for common settings:
  - All calibrator settings
  - Whether activations/weights are symmetric
  - Whether to keep or fuse relu/clip into Q
  - Minimum real range for quantization
  - Dictionary of tensor quantization overrides.
- Automatically scans the input floating-point model and fills out the operator types to quantize. Otherwise, only a limited number of operator types would be quantized by default.
- Detects if the input model uses external data. If so, ensures that the generated QDQ model also uses external data.
- Detects if the model will use newly introduced quantization types (int4/int16) with an older opset. If so, forces the use of the `com.microsoft` domain for Q/DQ ops, which support all types.
- Automatically enables the "extra option" called `ForceQuantizeNoInputCheck` to ensure data movement operators (e.g., Transpose) are always quantized.
- User can pass a function to indicate which nodes to exclude from quantization.
- The user can still pass their own `extra_options` to override any of the above if necessary.

```python
from onnxruntime.quantization import CalibrationMethod, QuantType, get_qdq_config, quantize  # , ...

# float_model, data_reader, float_model_path and qdq_model_path are supplied by the caller.
qdq_config = get_qdq_config(
    float_model,
    data_reader,
    calibrate_method=CalibrationMethod.Percentile,
    calibrate_args={"percentile": 99.98},  # Converted to extra_options
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
    nodes_to_exclude=["Mul"],  # Could also be a function. Ex: `lambda model, node: node.op_type == "Softmax"`

    # Other options converted to extra_options:
    min_real_range=0.0001,
    keep_removable_activations=True,
    activation_symmetric=True,
    weight_symmetric=True,
)

quantize(float_model_path, qdq_model_path, qdq_config)
```

Need a version of `get_qnn_qdq_config()` that is not EP-specific.

[Quant Tool] Prevent int32 quantized bias from clipping by adjusting the weight's scale (microsoft#22020)

Fixes a scenario in which a bias input quantized to int32 has a scale that is too small. A bias with a scale that is smaller than a certain threshold will overflow the range of an `int32` when quantized, which significantly decreases accuracy.

Credit to @yihonglyu for finding out about this issue and the fix.
Consider the following Convolution with very small weights and a constant bias input of `[5, -4.5]`.

The QDQ quantizer first computes the following quantization scale for `input_0` and `weight`:

- `input_0`: scale=0.5
- `weight`: scale=7.843e-11 **[really small]**

The QDQ quantizer then computes the bias input's scale as follows:

```
bias_scale = input_0_scale * weight_0_scale = 0.5 * 7.843e-11 = 3.9215686274509805e-11
```

This `bias_scale` is too small. Before this PR, the QDQ quantizer would quantize the f32 bias with this `bias_scale`:

```
bias_quant = round(bias_f32 / bias_scale) = round([5.0/bias_scale, -4.5/bias_scale]) = [127500000000, -114750000000]
```

These quantized bias values exceed the range of int32, and so are clipped to [int32.min(), int32.max()], which is very inaccurate.

This PR increases the `weight_0_scale` by the necessary amount to ensure that `bias_scale` (which equals `weight_0_scale * input_0_scale`) is appropriate for the int32 quantization type.

The smallest valid bias scale is given by the normal scale formula:

`bias_smallest_valid_scale = (bias_f32_max - bias_f32_min) / (int32_max - int32_min)`

Then, we compute the candidate bias scale:

`bias_scale_candidate = input_0_scale * weight_0_scale`

If the candidate scale is smaller than the smallest valid scale, we increase the `weight_0_scale` by the necessary ratio:

```python
if bias_scale_candidate < bias_smallest_valid_scale:
    ratio = bias_smallest_valid_scale / bias_scale_candidate
    weight_0_scale = ratio * weight_0_scale
```

Then, we recompute the final bias scale:

```python
bias_scale = input_0_scale * weight_0_scale
```

Here's the above model's quantized output compared to the f32 (ground-truth) output.
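The scale arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch that follows the formulas quoted in this description literally; the function names are hypothetical and it is not the quantizer's actual code. It starts from the `bias_scale` value quoted above and derives the weight scale from it, and it only checks the scales themselves, since zero-point and rounding details of the real quantizer are not spelled out here.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def smallest_valid_bias_scale(bias_f32):
    """Scale formula quoted in the description (hypothetical helper name)."""
    return (max(bias_f32) - min(bias_f32)) / (INT32_MAX - INT32_MIN)


def adjust_weight_scale(input_scale, weight_scale, bias_f32):
    """Grow weight_scale until input_scale * weight_scale reaches the smallest valid bias scale."""
    smallest = smallest_valid_bias_scale(bias_f32)
    candidate = input_scale * weight_scale
    if candidate < smallest:
        ratio = smallest / candidate
        weight_scale = ratio * weight_scale
    # Recompute the final bias scale from the (possibly adjusted) weight scale.
    return weight_scale, input_scale * weight_scale


bias = [5.0, -4.5]
input_scale = 0.5
old_bias_scale = 3.9215686274509805e-11  # value quoted in the description
weight_scale = old_bias_scale / input_scale

# Before the fix: the unclipped quantized bias does not fit in int32.
unclipped = [round(b / old_bias_scale) for b in bias]
overflows = [not (INT32_MIN <= q <= INT32_MAX) for q in unclipped]

# After the fix: the weight scale is enlarged and the bias scale follows it.
new_weight_scale, new_bias_scale = adjust_weight_scale(input_scale, weight_scale, bias)

print(overflows)
print(new_weight_scale / weight_scale, new_bias_scale)
```

The printed ratio shows how much the weight scale had to grow for this example. The measured before/after outputs of the model are listed next.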
- Before PR:
  - f32 model output[0]: **5.0f**
  - qdq model output[0]: **0.075**
  - SNR: 0.1369 (higher is better)
- After PR:
  - f32 model output[0]: **5.0f**
  - qdq model output[0]: **4.992**
  - SNR: 55.656 (higher is better)

[Mobile] Add E2E BrowserStack tests for iOS tests (microsoft#22610)

- Changes running the E2E iOS tests from running in App Center to running in BrowserStack
- Steps for running locally can be found in the OneNote
- Follow-up of microsoft#22117
- App Center (the previous platform for running E2E mobile tests) is getting deprecated in 2025

Additional build steps were required to get the necessary testing artifacts for BrowserStack. App Center consumed an entire folder, while BrowserStack requests the following:

1. a ZIP file of all the tests
2. an IPA file of the test app

Here is a rough outline of what is happening in the pipeline:

1. The build_and_assemble_apple_pods.py script builds the relevant frameworks (currently, this means packages for iOS and Mac)
4. The test_apple_packages.py script installs the necessary cocoapods for later steps
5. XCode task to build for testing builds the iOS target for the test app
6. Now that the test app and the tests have been built, we can zip them, creating the tests .zip file
7. To create the IPA file, we need to create a .plist XML file which is generated by the generate_plist.py script.
   - Attempts to use the Xcode@5 task to automatically generate the plist file failed.
   - Also, building for testing generates some plist files -- these cannot be used to export an IPA file.
8. We run the Xcode task to build an .xcarchive file, which is required for creating an IPA file.
9. We use xcodebuild in a script step to build an IPA file with the xcarchive and plist files from the last two steps.
10. Finally, we can run the tests using the BrowserStack script.
---------

Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

[Quant Tool] Update QDQ Pad, Slice, Softmax (microsoft#22676)

Updates python quantization tool:

- Ensures QDQ Pad has equal quantization parameters across input and output for certain Pad configurations.
- Ensures QDQ Slice always has equal quantization parameters across input and output.
- Fixes bug when Softmax is _excluded_ from quantization.

QDQ Pad and Slice have lower latency on QNN EP when their quantization parameters are equal.

[TensorRT EP] support TensorRT 10.6-GA (microsoft#22644)

* Update CI with TRT 10.6
* Update oss parser to [10.6-GA-ORT-DDS](https://github.com/onnx/onnx-tensorrt/tree/10.6-GA-ORT-DDS) and update dependency version
* Update Py-cuda11 CI to use trt10.6

(There will be 3rd PR to further reduce trt_version hardcoding)

[JS/WebGPU] Creating devices with subgroup features enabled if possible (microsoft#21833)

This CL makes the WebGPU backend support subgroup features and thus allows using subgroup optimizations in the future.

With this CL, WebGPU backends will create devices with the subgroups and subgroups-f16 features (both are under origin trial in Chrome) or the chromium-experimental-subgroups feature enabled whenever available.

This CL would allow WebGPU operator shaders to use subgroup optimizations in the future, and might get some significant speedup from these optimizations.

[webgpu] fix indices type when it's 4D (microsoft#22758)

Fix indices type from `array<u32, 4>` to `vec4<u32>` when the variable is 4D.

[DML EP] Prefer MatMulInteger over MatMulIntegerToFloat in case of (microsoft#22469)

Skip `MatMulIntegerToFloat` fusion in case of DML EP for cases where model uses Quantization before `MatMulInteger`.
This is mainly done to be resource efficient, and we have better `MatMulInteger` Metacommand coverage, which computes in int data type.

[AIX] Fix for AIX build break (microsoft#22745)

With recent changes, the build error below is found under AIX.

```
ld: 0706-012 The -p flag is not recognized.
ld: 0706-012 The -a flag is not recognized.
ld: 0706-012 The -t flag is not recognized.
ld: 0706-012 The -h flag is not recognized.
ld: 0706-012 The -= flag is not recognized.
ld: 0706-012 The -$ flag is not recognized.
ld: 0706-012 The -$ flag is not recognized.
ld: 0706-012 The -O flag is not recognized.
ld: 0706-027 The -R IGIN flag is ignored.
collect2: error: ld returned 255 exit status
```

The AIX linker doesn't support the -rpath option, so this option is blocked under AIX.

Replace reference to python 3.8 with python 3.10 (microsoft#22692)

This PR sets the default python to 3.10, except in tools/ci_build/github/azure-pipelines/bigmodels-ci-pipeline.yml. This is needed because we are no longer using python 3.8.

This PR excludes changes for Big Models CI, because it will require additional changes, which will be tracked in USER STORY 52729.

Fix build with GCC 11 (microsoft#22770)

Fix a build error seen with GCC 11 when building at Homebrew on our Linux x86_64 Ubuntu 22.04 CI (GitHub action runner).
When building latest v1.20.0 at Homebrew (Homebrew/homebrew-core#196547), we hit a build failure with GCC 11: ``` [ 65%] Building CXX object CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o /home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/shims/linux/super/g++-11 -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_MPL2_ONLY -DEIGEN_USE_THREADS -DENABLE_CPU_FP16_TRAINING_OPS -DHAS_STRING_VIEW=1 -DNSYNC_ATOMIC_CPP11 -DONLY_C_LOCALE=0 -DONNX_ML=1 -DONNX_NAMESPACE=onnx -DORT_ENABLE_STREAM -DORT_NO_RTTI -DPLATFORM_POSIX -DPROTOBUF_USE_DLLS -D_GNU_SOURCE -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/utf8_range-src -I/tmp/onnxruntime-20241103-6403-lh3bwj/include/onnxruntime -I/tmp/onnxruntime-20241103-6403-lh3bwj/include/onnxruntime/core/session -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/pytorch_cpuinfo-src/include -I/tmp/onnxruntime-20241103-6403-lh3bwj/build -I/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/onnx-src -I/tmp/onnxruntime-20241103-6403-lh3bwj/build/_deps/onnx-build -ffunction-sections -fdata-sections -Wno-restrict -DCPUINFO_SUPPORTED -O3 -DNDEBUG -fPIC -fno-rtti -Wall -Wextra -Wno-deprecated-copy -Wno-tautological-pointer-compare -Wno-nonnull-compare -Wno-ambiguous-reversed-operator -Wno-deprecated-anon-enum-enum-conversion -Wno-undefined-var-template -Wno-deprecated-builtins -Wshorten-64-to-32 -Werror -MD -MT CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o -MF CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o.d -o CMakeFiles/onnxruntime_optimizer.dir/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc.o -c 
/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc
/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc: In function ‘void onnx_transpose_optimization::Permute1DConstant(onnx_transpose_optimization::api::GraphRef&, onnx_transpose_optimization::api::NodeRef&, onnx_transpose_optimization::api::TensorRef&, size_t, std::string_view, const std::vector<long int>&)’:
/tmp/onnxruntime-20241103-6403-lh3bwj/onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc:1114:10: error: ‘memcpy’ is not a member of ‘std’; did you mean ‘wmemcpy’?
 1114 |     std::memcpy(dst, src, bytes_per_val);
      |          ^~~~~~
      |          wmemcpy
```

It is possible this error may not occur on different GCC versions if `cstring` has been indirectly included by another header.

WebGPU JSEP: Make shader code not depend on input broadcasting patterns (microsoft#22536)

This PR makes MatMul shaders independent of the inputs' broadcasting pattern, so that they depend only on the input ranks and on the shapes provided in uniforms. This change fixes the issue that shader code currently differs between broadcasting patterns but has an identical cache key, which results in wrong cache hits.

[js/webgpu] support GridSample operator (microsoft#22652)

[Quant Tool] Add reduce_range option to get_qdq_config() (microsoft#22782)

Adds `reduce_range` option to `get_qdq_config()`

Make it easier to set this option when calling get_qdq_config(). Otherwise, the user has to set the option manually.

Ignore all whitespace lint messages for cpplint (microsoft#22781)

Ignore all whitespace lint messages for cpplint. Remove redundant configs in dml/.

They are handled automatically by clang-format and create too much noise in the PR files tab.
Add XNNPack build on Linux ARM64 and improve Linux CPU (microsoft#22773)

1. Add XNNPack build on Linux ARM64
2. Build only one python wheel for PR request. [AB#49763](https://aiinfra.visualstudio.com/6a833879-cd9b-44a4-a9de-adc2d818f13c/_workitems/edit/49763)

I added the XNNPack build on Linux ARM64 rather than Windows ARM64 because KleidiAI doesn't support Windows:

```
IF(XNNPACK_TARGET_PROCESSOR STREQUAL "arm64" AND XNNPACK_ENABLE_ARM_I8MM AND NOT CMAKE_C_COMPILER_ID STREQUAL "MSVC")
  IF (XNNPACK_ENABLE_KLEIDIAI)
    MESSAGE(STATUS "Enabling KleidiAI for Arm64")
  ENDIF()
ELSE()
  SET(XNNPACK_ENABLE_KLEIDIAI OFF)
ENDIF()
```

---------

[VitisAI] Cache node subgraph when necessary (microsoft#22073)

[VitisAI] Cache node subgraph when necessary

---------

Co-authored-by: Zhenze Wang <zhenzew@xilinx.com>
Co-authored-by: zhenzew <zhenzew@amd.com>

[WebNN] QDQ's axis should be used for broadcasting (microsoft#22721)

For per-axis quantization/dequantization, WebNN requires the scale and zero_point inputs to be broadcastable. The axis should be used to reshape these two inputs.

[WebNN] Support steps >= 1 for slice operator (microsoft#22708)

Co-authored-by: Wanming Lin <wanming.lin@intel.com>

OVEP Dynamic WorkloadType support (microsoft#22779)

Support setting EP dynamic options in OVEP.

Related to microsoft#22282

---------

Co-authored-by: Javier E. Martinez <javier.e.martinez@intel.com>

Add Android QNN Browserstack test (microsoft#22434)

Add Android QNN Browserstack test

Real device test in CI

Revert "enable serialize prepacked weights into data file (microsoft#22256)" (microsoft#22788)

This reverts commit c5b6be0.
Revert. This needs a simpler and more robust approach.

Fix MatMulBnFusion to exclude cases when tensors are not 2D tensors (microsoft#22762)

Fixes microsoft#22512. MatMul and Add can be fused into a single Gemm even if the tensors' dimensions are > 2. This PR excludes those cases.

ORT crashes on valid models due to that unexpected fusion.

Fix warning - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format (microsoft#22800)

This PR fixes the warning `LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format` in all Dockerfiles.

Fix build for linux python wheel (microsoft#22801)

Fixes the command for building Linux python packages by preventing an empty `-p` command-line option from being passed to a subsequent build script: https://github.com/microsoft/onnxruntime/blob/1f3b675453e8412e5c084bfb95997967d0c2eec2/tools/ci_build/github/linux/run_python_dockerbuild.sh#L37

A recent PR (microsoft#22773) added a new optional command-line option (`-p`) to pass custom python exe paths. We need to check if the option is empty before forwarding the option to a separate build script.

Reland "[WebNN] Fallback the node when its output doesn't have shape info" (microsoft#22685)

The previous PR was reverted because it caused the whole model to fall back when output shape info was missing. This PR fixes the issue by removing redundant fallbacks.

[Quant Tool] Flaky test due to Pad reflect bug (microsoft#22798)

Fixes a unit test that would fail intermittently due to an existing bug with Pad (reflect mode). When the number of padded values is >= the inner dimension size, the ORT Pad implementation accesses invalid memory. This PR makes the number of padding values less than the inner dimension size to avoid triggering the bug.
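For reference, the configuration that triggers the bug (pad count equal to the inner dimension size) has a well-defined result. Below is a minimal pure-Python sketch of reflect padding along the innermost axis, assuming the reflection simply keeps bouncing between the two ends of the row when the pad count reaches the row length. The helper names are hypothetical; this is not the ORT kernel. For a 3x2 input padded with two values before the inner axis, it reproduces the expected output given in the ONNX Pad example for that case.

```python
def reflect_index(i, n):
    """Map an out-of-range position i onto a row of length n by reflecting at both ends."""
    if n == 1:
        return 0
    period = 2 * (n - 1)      # one full bounce: 0..n-1 and back to 1
    i %= period               # Python's % is non-negative for a positive modulus
    return i if i < n else period - i


def reflect_pad_row(row, pad_before, pad_after):
    """Reflect-pad a single row (the innermost axis)."""
    n = len(row)
    return [row[reflect_index(i, n)] for i in range(-pad_before, n + pad_after)]


data = [
    [1.0, 1.2],
    [2.3, 3.4],
    [4.5, 5.7],
]

# Two padded values before the inner axis, i.e. as many as the inner dimension size.
padded = [reflect_pad_row(row, 2, 0) for row in data]
for row in padded:
    print(row)
```

Every index is mapped back into the valid range before the read, so the sketch never touches memory outside the row.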
See related issues:
microsoft#8265
microsoft#11828
microsoft#20801

Here's a valgrind trace obtained on a Linux machine (with `sess_options.enable_cpu_mem_arena = False`)

```
==864228== Invalid read of size 4
==864228==    at 0x2716272A: void onnxruntime::PadInnermostAxis<unsigned int>(unsigned int*, unsigned int*, long, unsigned long) (pad.cc:370)
==864228==    by 0x2715D213: onnxruntime::common::Status onnxruntime::PadImpl<unsigned int>(onnxruntime::OpKernelContext*, absl::lts_20240722::InlinedVector<long, 10ul, std::allocator<long> > const&, absl::lts_20240722::InlinedVector<long, 10ul, std::allocator<long> > const&, onnxruntime::Mode const&, unsigned int) (pad.cc:551)
==864228==    by 0x2715B2BB: onnxruntime::Pad::Compute(onnxruntime::OpKernelContext*) const (pad.cc:725)
==864228==    by 0x276FF6A7: onnxruntime::ExecuteKernel(onnxruntime::StreamExecutionContext&, unsigned long, unsigned long, bool const&, onnxruntime::SessionScope&) (sequential_executor.cc:484)
==864228==    by 0x276F4A04: onnxruntime::LaunchKernelStep::Execute(onnxruntime::StreamExecutionContext&, unsigned long, onnxruntime::SessionScope&, bool const&, bool&) (execution_steps.cc:73)
...
```

The above is obtained with the basic Pad(reflect) example on the [ONNX Pad operator spec page](https://onnx.ai/onnx/operators/onnx__Pad.html#summary):

```python
inf = float("inf")  # stands for the invalid values ORT returned

data = [
    [1.0, 1.2],
    [2.3, 3.4],
    [4.5, 5.7],
]

pads = [0, 2, 0, 0]
mode = 'reflect'

expected_output = [
    [1.0, 1.2, 1.0, 1.2],
    [2.3, 3.4, 2.3, 3.4],
    [4.5, 5.7, 4.5, 5.7],
]

ort_output = [
    [inf, 1.2, 1.0, 1.2],
    [inf, 3.4, 2.3, 3.4],
    [inf, 5.7, 4.5, 5.7],
]
```

[WebNN] Fixed WebNN Module undefined issue (microsoft#22795)

`Module.jsepRegisterMLConstant` will be shortened by Closure Compiler in the official release, which would cause an undefined error.

Fix it by using `Module['jsepRegisterMLConstant']`.

Update skip layer norm (microsoft#22719)

Update the `SkipLayerNorm` implementation to address issues.
Change-Id: I7c6f8f860417b72cbadf49e9d66a0532317767ad

register Identity and QLinearMatmul for opset21 (microsoft#22804)

This PR registers the following opset 21 operators:

Identity-21
QLinearMatMul-21

[MIGraphX EP] Add support for Gelu, BiasGelu, FastGelu operators (microsoft#22808)

Adds support for the different flavours of gelu already supported in MIGraphX

Update all JDK version to 17 (microsoft#22786)

Fix LARCH64 compile error (microsoft#22759)

Currently loongarch has not implemented AIsSigned qgemm, so I added a bypass for it

Change-Id: I793c4cbf9ed6d4d48628adc5b1333e5a86abf11e

[WebNN EP] Support LRN operator (microsoft#22775)

WebNN doesn't provide a dedicated op for LRN, so the WebNN EP uses a sequence of WebNN ops to emulate it:

pow -> transpose -> pad -> averagePool -> transpose -> mul -> add -> pow -> div

@Honry @fdwr PTAL, thanks!

[js/webgpu] Optimize ConvTranspose (microsoft#22774)

BUG microsoft#22031

The overall time of ConvTranspose in the Demucs model drops to 517.41 ms from 1415.65 ms on my iGPUs.

[js/webgpu] Optimize Expand (microsoft#22752)

Use components = 4 if possible. llama3.2-1B goes to 20 tokens/s from 18 tokens/s on my iGPUs.

Change-Id: I7c6f8f860417b72cbadf49e9d66a0532317767ad
This PR has been cherry-picked into the |
Labels
cherry-picked
Cherry-picked for a cherrypicks branch
triage:approved
Approved for cherrypicks for release
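### Description

Introduces the `get_qdq_config()` function to get a quantization configuration for a full integer QDQ model. This function provides an easier way of specifying commonly used options and sets convenient defaults. Specifically:

- Instead of requiring the user to pass a dictionary of `extra_options`, the new interface adds function parameters for common settings:
  - All calibrator settings
  - Whether activations/weights are symmetric
  - Whether to keep or fuse relu/clip into Q
  - Minimum real range for quantization
  - Dictionary of tensor quantization overrides.
- Automatically scans the input floating-point model and fills out the operator types to quantize. Otherwise, only a limited number of operator types would be quantized by default.
- Detects if the input model uses external data. If so, ensures that the generated QDQ model also uses external data.
- Detects if the model will use newly introduced quantization types (int4/int16) with an older opset. If so, forces the use of the `com.microsoft` domain for Q/DQ ops, which support all types.
- Automatically enables the "extra option" called `ForceQuantizeNoInputCheck` to ensure data movement operators (e.g., Transpose) are always quantized.
- User can pass a function to indicate which nodes to exclude from quantization.
- The user can still pass their own `extra_options` to override any of the above if necessary.

### Motivation and Context

Need a version of `get_qnn_qdq_config()` that is not EP-specific.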
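Two of the defaults this PR describes are simple decision rules, and a plain-Python sketch may make them concrete. Everything below is hypothetical and for illustration only: the helper names, the `Node` stand-in, and the use of strings for quantization types are assumptions, and this is not the code in onnxruntime. It also assumes that opset 21 is the first ONNX opset whose Q/DQ ops accept int4/int16, and that list entries passed as `nodes_to_exclude` are node names.

```python
from dataclasses import dataclass

# Assumption: these types need ONNX opset >= 21 for the standard Q/DQ ops.
NEWER_QUANT_TYPES = {"QInt4", "QUInt4", "QInt16", "QUInt16"}


@dataclass
class Node:
    """Minimal stand-in for a graph node (hypothetical)."""
    name: str
    op_type: str


def needs_ms_domain_qdq(opset_version, activation_type, weight_type, override_types=()):
    """True if int4/int16 is requested with an older opset, so `com.microsoft` Q/DQ ops are needed."""
    if opset_version >= 21:
        return False
    used = {activation_type, weight_type, *override_types}
    return bool(used & NEWER_QUANT_TYPES)


def resolve_nodes_to_exclude(model, nodes, nodes_to_exclude):
    """Accept either a list of node names or a predicate called as f(model, node)."""
    if nodes_to_exclude is None:
        return []
    if callable(nodes_to_exclude):
        return [node.name for node in nodes if nodes_to_exclude(model, node)]
    return list(nodes_to_exclude)


nodes = [Node("conv_0", "Conv"), Node("softmax_0", "Softmax"), Node("mul_0", "Mul")]

use_ms_domain = needs_ms_domain_qdq(19, "QUInt16", "QInt8")
excluded = resolve_nodes_to_exclude(None, nodes, lambda model, node: node.op_type == "Softmax")

print(use_ms_domain)
print(excluded)
```

Accepting either a list or a predicate keeps the simple case short while letting callers exclude nodes by operator type without knowing node names in advance.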