Releases: uxlfoundation/oneDNN
v3.8.2
This is a patch release containing the following changes to v3.8.1:
- Fixed performance regression for `f32` convolution primitive on processors with Intel AVX-512 instruction set support (5f3af68)
- Introduced support for `f16` destination in `int8` matmul and `int8` inner product on x64 CPUs (53fd12a, 22e252c, f5b2d7f, e4e2f1c)
- Improved RNN primitive performance on processors with Intel AVX2 instruction set support (71e5d81, eb27db2, dd4e627, ff134e0, 5a86c1f, e9395ae)
- Improved `fp32` matmul performance on processors with Intel AVX-512 instruction set support (1119339)
- Fixed segmentation fault in `f32` binary primitive with broadcast on x64 processors (2082e98)
- Fixed correctness issue in `f64` convolution weight gradient with bias on Intel Arc GPUs (a00bfab)
- Updated `spdlog` component to version 1.15.3 (dbb3629)
- Fixed potential undefined behavior in convolution on Intel GPUs (5ac3e31)
- Fixed segmentation fault in convolution implementation with trivial filter on Intel CPUs (908c5fc, f0a0eee)
- Fixed segmentation fault in `f16` convolution with odd dimensions on processors with Intel AVX10.1 instruction set support (78d6835)
- Improved convolution primitive descriptor creation time on x64 processors (e9c5366, fd9dc58, f1d038e)
- Fixed performance regression in `f16` matmul with `int4` weights on Intel Arc Graphics B-series (38d761b)
- Improved `bf16` matmul performance on processors with Intel AMX instruction set support (0887aec)
- Fixed correctness issue in `f32` RNN primitive on processors with Intel AMX instruction set support (460a014)
v3.10-rc
Performance Optimizations
Intel Architecture Processors
- Improved performance on future Intel Xeon processors with Intel AVX 10.2 and Intel AMX instruction sets support. This functionality is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2` (see the sketch after this list).
- Improved performance on future Intel Core processors with Intel AVX 10.2 instruction set support. This functionality is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
- Improved performance of matmul primitive on processors with Intel AMX support.
- Improved performance of `f32` matmul primitive for GEMV cases on processors with Intel AVX2 instruction set support.
- Improved matmul performance with `int4` and `int8` compressed weights and per-channel zero-points.
- Improved `f32` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX2 and Intel AVX-512 instruction set support.
- Improved `bf16` matmul performance with `int4` and `int8` compressed weights on processors with Intel AVX-512, Intel DL Boost, and bfloat16 instruction set support.
- Improved performance of `int8` convolution primitive when using zero points.
- Improved performance of `int8` matmul and inner product primitives with `fp16` destination.
- Improved performance of `f32` and `bf16` convolution primitive with `int8` destination.
- Improved performance of RNN primitive on processors with Intel AVX2 instruction set support when using OpenMP runtime.
- Improved performance of subgraphs containing a sequence of multiple binary ops with Graph API.
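For the AVX 10.2 opt-in referenced in the first bullet, here is a minimal sketch. It assumes a POSIX environment for `setenv()`; the `ONEDNN_MAX_CPU_ISA` variable is read on first use of the library, so it must be set before any oneDNN call (or in the shell before launching the process):

```cpp
// Minimal sketch: opting in to AVX 10.2 + AMX dispatch (not enabled by
// default). Assumes POSIX setenv(); setting the variable in the shell
// before launching the process works equally well.
#include <cstdlib>
#include "oneapi/dnnl/dnnl.hpp"

int main() {
    // Must run before the first oneDNN call: the variable is read once.
    setenv("ONEDNN_MAX_CPU_ISA", "AVX10_2_512_AMX_2", /*overwrite=*/1);

    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    // ... primitives created from here on may dispatch AVX 10.2 / AMX
    // kernels on capable hardware.
    return 0;
}
```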
Intel Graphics Products
- Improved GEMM performance for small batch size on Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
- Improved matmul performance for Qwen2-7B shapes on Intel Arc graphics (formerly DG2) and Intel Arc Graphics for Intel Core Ultra processors (formerly Arrow Lake-H).
- Improved `int8` matmul performance with `int4` weights and per-tensor zero-points.
- Improved `bf16` matmul performance with `fp8` weights.
- Graph API optimizations:
  - Improved Scaled Dot Product Attention (SDPA) subgraph performance for inference when relaxed accumulation mode is enabled on Intel Core Ultra processors (formerly Meteor Lake).
  - Improved SDPA and Grouped Query Attention (GQA) subgraph performance when using host-side scalars.
  - Improved performance of GQA subgraph for 2nd token scenarios.
  - Improved performance of subgraphs containing a sequence of multiple binary ops.
  - Improved performance of GQA subgraphs for training forward and backward propagation.
AArch64-based Processors
- Improved performance of reorder primitive.
- Improved performance of `bf16` convolutions.
- Improved performance of convolutions on 128-bit SVE platforms.
- Improved performance of eltwise on Arm® Neoverse™ N1.
Functionality
Functional API
- Introduced host-side scalar memory objects. This functionality allows passing host-side scalars instead of device memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported in matmul and convolution primitives on Intel GPUs.
- Introduced support for pre-computed reductions in matmul primitive. This functionality is intended to improve performance in case of `int8` activations and `int8` weights with zero-point.
Graph API
- Introduced `host_scalar` property for logical tensors. This functionality allows passing host-side scalars instead of device memory objects when using oneDNN with OpenCL or SYCL runtimes. Host-side scalars are currently supported to define attention scale, sequence length, and the negative infinity value in SDPA/GQA subgraphs.
- Introduced accumulation mode attribute support in `Matmul` op. This attribute allows relaxing `fp32` accumulation requirements to achieve performance benefits on some platforms (see the sketch below).
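A minimal sketch of relaxing accumulation on a Graph API matmul op. The attribute name `op::attr::accumulation_mode` and its string value `"relaxed"` are assumptions based on this note, and `make_relaxed_matmul` is a hypothetical helper; check `dnnl_graph.hpp` for the exact identifiers:

```cpp
// Sketch only: the attribute name and value below are assumptions from
// this release note, not confirmed API.
#include "oneapi/dnnl/dnnl_graph.hpp"
using namespace dnnl::graph;

op make_relaxed_matmul(size_t id, const logical_tensor &src,
        const logical_tensor &wei, const logical_tensor &dst) {
    op mm(id, op::kind::MatMul, {src, wei}, {dst}, "matmul_relaxed");
    // Relax f32 accumulation where the platform benefits (assumed attr).
    mm.set_attr<std::string>(op::attr::accumulation_mode, "relaxed");
    return mm;
}
```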
Intel Graphics Products
- Introduced support for `fp4` weights in matmul primitive.
- Introduced support for grouped quantization with group size 16 in matmul with `int8` compressed weights.
- Introduced support for group size 16 in `int8` weight decompression with regular weights decompression.
Intel Architecture Processors
- Introduced `fp4` weights support for `f32` matmul and convolution on future Intel Xeon processors with Intel AVX10.2 instruction set support.
Usability
- Extended diagnostics available in verbose mode for primitive descriptor creation issues.
- Extended dispatch diagnostics in verbose mode output for primitive implementations on Intel GPUs.
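Dispatch diagnostics are controlled through the verbose knob; a minimal sketch, assuming a POSIX environment for `setenv()` and the documented `ONEDNN_VERBOSE=dispatch` value:

```cpp
// Minimal sketch: turn on dispatch diagnostics to see why a given
// implementation was skipped during primitive creation. Assumes POSIX
// setenv(); the variable can also be set in the shell.
#include <cstdlib>
#include "oneapi/dnnl/dnnl.hpp"

int main() {
    // Read before the first oneDNN call; diagnostics go to stdout.
    setenv("ONEDNN_VERBOSE", "dispatch", /*overwrite=*/1);

    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    // ... creating primitives now prints per-implementation dispatch notes.
    return 0;
}
```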
AArch64-based Processors
- Fixed crashes in backward-pass convolutions
- Fixed numerical errors in 4D matmul primitives
- Fixed numerical errors in low-precision convolutions
- Fixed numerical errors in reorders with compensation
- Fixed illegal-instruction crashes on Arm® Neoverse™ N1
- Fixed crashes in binary primitive in Debug builds
- Fixed segmentation fault in `eltwise_log` post-ops for large kernels
Deprecated Functionality
- BLAS-like API including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive; a migration sketch follows.
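A minimal migration sketch, assuming row-major `f32` matrices and plain `C = A * B`; the helper name `gemm_via_matmul` is hypothetical, and `alpha`/`beta` handling from `dnnl::sgemm` is omitted (it would need primitive attributes or post-ops):

```cpp
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

// Replaces dnnl::sgemm(...) for the common alpha = 1, beta = 0 case.
void gemm_via_matmul(const float *A, const float *B, float *C,
        memory::dim M, memory::dim N, memory::dim K) {
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    // Row-major layouts expressed through explicit strides.
    memory::desc a_md({M, K}, memory::data_type::f32, memory::dims{K, 1});
    memory::desc b_md({K, N}, memory::data_type::f32, memory::dims{N, 1});
    memory::desc c_md({M, N}, memory::data_type::f32, memory::dims{N, 1});

    memory a_mem(a_md, eng, const_cast<float *>(A));
    memory b_mem(b_md, eng, const_cast<float *>(B));
    memory c_mem(c_md, eng, C);

    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul(pd).execute(strm,
            {{DNNL_ARG_SRC, a_mem}, {DNNL_ARG_WEIGHTS, b_mem},
                    {DNNL_ARG_DST, c_mem}});
    strm.wait();
}
```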
Breaking Changes
AArch64-based Processors
- Bumped the minimum required Arm® Compute Library version to 52.4.0.
Thanks to our Contributors
This release contains contributions from the project core team as well as Andrei Hutu @Anndrey24,
Anna Sztukowska @asztukow, Arseniy Obolenskiy @aobolensk, Avanish Tiwari @Tiwari-Avanish, Daniel Kuts @apach301, Daniel Whittaker @danwhittaker-arm, Deeksha Kasture @kasturedeeksha, George Nash @georgen117, Henry Gardiner @henry-gar, Keanu Czirjak @keanucz, Krishna Sai @krishnasai-mcw, Marek Michalowski @michalowski-arm, Sheldon Robinson @sheldonrobinson, @Shreyas-fuj, Viktoriia Gvozdeva @vgvozdeva, Xiang1 Guo, Yejing Lai @Yejing-Lai, Yonghao Gu, Yusuf Butt @UseTheForce007, Zhibo Li @zhili03, @almayne, @co63oc, @focusunsink, @gassan-arm, @jstachowintel, @pmanczak, @puneetmatharu, @raistefintel, @vishwascm, @vyevtyus, @zhangfeiv0, @zhangjian29, and @xiazhuozhao.
v3.9.2
This is a patch release containing the following changes to v3.9.1:
- Fixed correctness issue in `int8` convolution on processors with Intel AVX2 and Intel DL Boost instruction set support (a7c4079, 78e781f)
- Fixed performance regression for `f32` convolution primitive on processors with Intel AVX-512 instruction set support (74f23b4)
- Fixed performance regression for RNN primitive with LBR GRU cell type on Intel Arc GPUs (ae2844e)
- Fixed performance regression for `int8` convolution primitive when using zero points (dbb8484)
- Fixed segmentation fault in matmul primitive when using `ONEDNN_VERBOSE=all` (7310aa2)
- Fixed correctness issue in multi-dimensional matmul primitive on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids) (642d18b)
- Reduced problem size in `test_sdpa_decomp` test (9bff06e)
- Restricted `test_sdpa_decomp` and `test_mqa_decomp` tests to `OMP` or `THREADPOOL` CPU runtimes (3cd9170)
- Fixed illegal instruction issue in pooling primitive on processors with Intel SSE4.1 support (d907c47)
- Fixed segmentation fault issue in `f16` backward convolution primitive on processors with Intel AVX2 and Intel DL Boost with float16 and bfloat16 support (50cc228, fcc7e5e)
- Restored support for `int8` matmul with `per_oc` scales and zero points on Intel Arc GPUs (1a5a454, 04c22c9)
v3.9.1
This is a patch release containing the following changes to v3.9:
- Reduced sizes in Graph API SDPA examples (257d689)
- Fixed correctness issue in `bf16` depthwise convolution with `bf16` bias on AArch64 CPUs (218b41d)
- Changed Intel GPU data alignment check from error to warning (5c5008a)
- Improved `bf16` matmul performance on processors with Intel AMX instruction set support (54b6354, 30c4d8d)
- Fixed PowerPC64 build by adding `-mcpu=power10` and `-mmma` flags (02ca915)
- Introduced support for `f16` destination in `int8` matmul and `int8` inner product on x64 CPUs (a62ed6b, 53c0a66, 0750043, 4f0f068)
- Introduced support for `per_tensor` zero-points in `int8` matmul on Intel GPUs (db8e8ff, f783164, 4d458df, 80453a0, 7f90d50, a2200e2)
- Fixed correctness issue in `int8` reorder for cases with compensation on x64 CPUs (771ca54)
v3.9
Performance Optimizations
Intel Architecture Processors
- Introduced initial support for future Intel Xeon processors with Intel AVX 10.2 and Intel AMX instruction sets support. This functionality is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2`.
- Introduced initial support for future Intel Core processors with Intel AVX 10.2 instruction set support. This functionality is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
- Improved initialization time for convolution primitive when a large number of threads is used by introducing a new thread partition estimation and adjusting several blocking parameters.
- Improved performance of `fp8` convolution primitive with scales and `bf16` output.
- Improved performance of matmul primitive with post-ops on processors with Intel AMX support.
- Improved performance of RNN primitive for LBR_GRU and VANILLA_LSTM cell types on processors with Intel AVX2 instruction set support.
- Improved performance of the following subgraphs with Graph API:
  - Scaled Dot Product Attention (SDPA) with implicit causal mask.
  - Grouped Query Attention (GQA) flavor specific for GEMMA models.
Intel Graphics Products
- Improved performance on Intel GPUs based on Xe3 architecture.
- Improved matmul performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
- Improved RNN primitive performance with LBR_GRU cell type.
- Improved `int8` convolution performance with plain weights and trivial filter.
- Improved convolution performance with `NCHW` activations with 1x1 filter and unit strides.
- Improved `fp32` softmax performance.
- Improved performance of reorder when used with USM host memory.
- Improved performance of the following subgraphs with Graph API:
  - `fp32` SDPA with implicit causal mask.
  - `fp16` SDPA on Intel GPUs without Intel XMX cores.
AArch64-based Processors
- Improved `int8` convolution performance.
- Improved `bf16` depthwise convolution performance.
- Improved `f16` matmul performance with Arm Compute Library (ACL).
Functionality
Functional API
- Introduced Root Mean Square Normalization (RMSNorm) mode for layer normalization primitive. This functionality is optimized for Intel CPUs and Intel GPUs.
- Sparse memory objects and sparse matmul are promoted to production status.
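A minimal sketch of the RMSNorm mode introduced above, expressed through the layer normalization primitive. The flag name `normalization_flags::rms_norm` and the helper `rmsnorm_f32` are assumptions based on this note; check `dnnl.hpp` for the exact identifier:

```cpp
// Sketch only: the rms_norm flag name is assumed from this release note.
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

void rmsnorm_f32(memory &src, memory &scale, memory &dst, engine &eng,
        stream &strm, float epsilon = 1e-6f) {
    // RMSNorm = layer normalization without mean subtraction; here with a
    // learnable scale, no shift.
    auto pd = layer_normalization_forward::primitive_desc(eng,
            prop_kind::forward_inference, src.get_desc(), dst.get_desc(),
            epsilon,
            normalization_flags::rms_norm | normalization_flags::use_scale);
    layer_normalization_forward(pd).execute(strm,
            {{DNNL_ARG_SRC, src}, {DNNL_ARG_SCALE, scale},
                    {DNNL_ARG_DST, dst}});
}
```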
Graph API
- Introduced support for tanh approximation in `GELU` operation.
- Extended Graph API `Softmax` operation to support optional `stats` output.
- Introduced fusion support for SDPA training forward and backward propagation.
- Introduced fusion support for SDPA with bottom-right implicit causal mask.
- Introduced `make_scalar_tensor()` API for engine-agnostic scalar tensor creation.
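A minimal sketch of creating a host-scalar tensor, e.g. for an SDPA scale input. The exact `make_scalar_tensor()` signature below is an assumption from this release note; consult `dnnl_graph.hpp` for the authoritative form:

```cpp
// Sketch only: the signature of make_scalar_tensor() is assumed.
#include "oneapi/dnnl/dnnl_graph.hpp"
using namespace dnnl::graph;

tensor make_scale_tensor(const logical_tensor &scale_lt, float &scale) {
    // No engine argument: the scalar lives on the host, so the tensor is
    // engine-agnostic (assumed signature). The caller owns the storage.
    return tensor::make_scalar_tensor(scale_lt, &scale);
}
```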
Microkernel API
- Introduced support for `fp8` data type.
Intel Architecture Processors
- Introduced support for select algorithm in binary post-op.
- Introduced source, destination, and weight scales support in `fp8` convolution and deconvolution primitives.
Intel Graphics Products
- Introduced support for select algorithm in binary primitive.
Generic GPU Vendor
- Introduced support for RNN Vanilla backward propagation.
Usability
- Enabled build with `-Wundef` compiler flag.
- [Experimental] Introduced support for kernel compilation with SYCL kernel compiler extension.
Validation
- Improved benchdnn performance by optimizing input data filling and test results comparison steps.
- Improved benchdnn graph driver performance mode by adding a CPU memory pool for the allocator.
Known Limitations
- The group normalization primitive with `normalization_flags::use_scale` specified produces incorrect results for backward propagation kind in oneDNN v3.9 and earlier.
- Binary primitive with certain shapes and Graph API SDPA with bottom-right causal mask may hang with SYCL debug runtime on Windows.
- `fp8` matmul primitive may sporadically produce incorrect results on Intel Arc B-series graphics.
- `int8` inner product primitive with tensors exceeding 4 GB in size may produce incorrect results on Intel Data Center GPU Max Series.
- `bf16` pooling with tensors exceeding 4 GB in size may produce incorrect results on Intel Data Center GPU Max Series.
- `bf16`/`fp16` matmul with large inner dimension has a performance regression on Intel Data Center GPU Max Series.
- `bf16`/`fp16` convolution with `NCHW` activations has a performance regression on Intel Data Center GPU Max Series.
- Softmax with non-trivial strides and blocked format may produce incorrect results.
- `bf16` layer normalization backpropagation may produce incorrect results on Intel Data Center GPU Max Series.
Deprecated Functionality
- BLAS-like API including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive.
Thanks to our Contributors
This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, @Anallear, Anna Sztukowska @asztukow, Avanish Tiwari @Tiwari-Avanish, Dmitriy Ovchinnikov @inteldimitrius, Kasture Deeksha, Krishna Sai @krishnasai-mcw, Manaal @manaalmj, Marek Michalowski @michalowski-arm, Orel Yehuda @yehudaorel, Ruqiu Cao @rcao8, Tsao Zhong @CaoZhongZ, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, Ye Tao @taoye9, Yuanyuan Chen @cyyever, @gausah-arm, @karmeh01, @pmanczak, and @zhangfeiv0. We would also like to thank everyone who asked questions and reported issues.
v3.9-rc
Performance Optimizations
Intel Architecture Processors
- Introduced initial support for future Intel Xeon processors with Intel AVX 10.2 and Intel AMX instruction sets support. This functionality is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512_AMX_2`.
- Introduced initial support for future Intel Core processors with Intel AVX 10.2 instruction set support. This functionality is not dispatched by default and requires opt-in with the environment variable `ONEDNN_MAX_CPU_ISA=AVX10_2_512`.
- Improved initialization time for convolution primitive when a large number of threads is used by introducing a new thread partition estimation and adjusting several blocking parameters.
- Improved performance of `fp8` convolution primitive with scales and `bf16` output.
- Improved performance of matmul primitive with post-ops on processors with Intel AMX support.
- Improved performance of RNN primitive for LBR_GRU and VANILLA_LSTM cell types on processors with Intel AVX2 instruction set support.
- Improved performance of the following subgraphs with Graph API:
  - Scaled Dot Product Attention (SDPA) with implicit causal mask.
  - Grouped Query Attention (GQA) flavor specific for GEMMA models.
Intel Graphics Products
- Improved performance on Intel GPUs based on Xe3 architecture.
- Improved matmul performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
- Improved RNN primitive performance with LBR_GRU cell type.
- Improved `int8` convolution performance with plain weights and trivial filter.
- Improved convolution performance with `NCHW` activations with 1x1 filter and unit strides.
- Improved `fp32` softmax performance.
- Improved performance of reorder when used with USM host memory.
- Improved performance of the following subgraphs with Graph API:
  - SDPA with implicit causal mask.
  - SDPA with bottom-right implicit causal mask.
  - `fp32` SDPA.
  - `fp16` SDPA on Intel GPUs without Intel XMX cores.
AArch64-based Processors
- Improved `int8` convolution performance.
- Improved `bf16` depthwise convolution performance.
- Improved `f16` matmul performance with Arm Compute Library (ACL).
Functionality
Functional API
- Introduced Root Mean Square Normalization (RMSNorm) mode for layer normalization primitive. This functionality is optimized for Intel CPUs and Intel GPUs.
- Sparse memory objects and sparse matmul are promoted to production status.
Graph API
- Introduced support for tanh approximation in `GELU` operation.
- Extended Graph API `Softmax` operation to support optional `stats` output.
- Introduced support for SDPA training forward propagation and backpropagation.
Microkernel API
- Introduced support for `fp8` data type.
Intel Architecture Processors
- Introduced support for select algorithm in binary post-op.
- Introduced source, destination, and weight scales support in `fp8` convolution and deconvolution primitives.
Intel Graphics Products
- Introduced support for select algorithm in binary primitive.
Generic GPU Vendor
- Introduced support for RNN Vanilla backward propagation.
Usability
- Enabled build with `-Wundef` compiler flag.
- [Experimental] Introduced support for kernel compilation with SYCL kernel compiler extension.
Validation
- Improved benchdnn performance by optimizing input data filling and test results comparison steps.
Known Limitations
Deprecated Functionality
- BLAS-like API including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive.
Thanks to our Contributors
This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, @Anallear, Anna Sztukowska @asztukow, Avanish Tiwari @Tiwari-Avanish, Dmitriy Ovchinnikov @inteldimitrius, Kasture Deeksha, Krishna Sai @krishnasai-mcw, Manaal @manaalmj, Marek Michalowski @michalowski-arm, Orel Yehuda @yehudaorel, Ruqiu Cao @rcao8, Tsao Zhong @CaoZhongZ, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, Ye Tao @taoye9, Yuanyuan Chen @cyyever, @gausah-arm, @karmeh01, @pmanczak, and @zhangfeiv0. We would also like to thank everyone who asked questions and reported issues.
v3.8.1
This is a patch release containing the following changes to v3.8:
- Fixed correctness issue in reorder primitive with non-trivial strides on Intel CPUs (a762d32)
- Fixed runtime error in convolution weight gradient on Xe2 architecture-based Intel GPUs (a8fac73, c409ef9)
- Fixed performance regression in `bf16` convolution on Intel Data Center GPU Max Series (98170d0, c6bae4a, c5edd53, bb1a591)
- Improved performance of `fp16` matmul with `fp8` compressed weights on Intel GPUs (58f3ec1, abff176, ffd7dd3, 3b1e855, 2e140de, 3429f79)
- Fixed runtime error in `fp16` pooling primitive on Xe2 architecture-based Intel GPUs (c0f6b6d)
- Improved performance of `fp16` matmul with `int4` weights and `32 < m <= 64` on Intel GPUs (2fa7072)
- Fixed correctness issues in `bf16` matmul with 3 or more dimensional tensors on processors with Intel AMX support (dd20965, ea1b4a1)
- Fixed performance regression in `fp16` or `bf16` matmul with transposed source and weight tensors on Intel Data Center GPU Max Series (e45e1aa)
- Improved performance of `bf16` matmul with `int4` weights on Intel GPUs (7a15c23)
- Fixed runtime error in `fp16` SDPA subgraph with head size `512` on Intel Core Ultra (Series 2) processor integrated GPU (bde6985)
v3.8
Performance Optimizations
Intel Architecture Processors
- Improved matmul and inner product primitives performance on processors with Intel AMX instruction set support.
- Improved performance of convolution and inner product primitives on processors with Intel AVX2 instruction set support.
- Improved performance of `int8` convolution with zero points.
- Improved `fp32` convolution performance with `fp16` and `bf16` compressed weights on processors with Intel AVX2 or Intel AVX-512 instruction set support.
- Improved `fp16`/`bf16` depthwise convolution performance with `fp32` bias, `sum` post-ops, or dilation.
- Improved `bf16` pooling backpropagation performance.
- Improved binary post-ops performance with `per_w` broadcast.
Intel Graphics Products
- Improved performance on Intel Arc graphics for future Intel Core Ultra processors (code name Panther Lake).
- Improved convolution performance on:
- Intel Arc Graphics for Intel Core Ultra processor series 2 (formerly Lunar Lake).
- Intel Arc B-series discrete graphics (formerly Battlemage).
- Improved `int8` matmul performance with zero-points support for source and weight tensors.
- Improved `f4_e2m1` and `f4_e3m0` matmul and reorder performance.
- Improved performance of the following subgraphs with Graph API:
  - Scaled Dot Product Attention (SDPA) with `int4` and `int8` compressed key and value.
  - `fp16`/`bf16` SDPA with `fp32` intermediate data types. Using `fp32` intermediate data types is recommended.
  - SDPA with head size 512 and 576.
  - Grouped Query Attention (GQA) with 5D input tensors.
AArch64-based Processors
- Improved `fp16` reorder performance.
- Improved `int8` matmul performance.
- Improved `bf16` inner product forward propagation performance with Arm Compute Library (ACL).
- Improved `bf16` eltwise performance.
- Improved convolution performance on processors with SVE support with ACL.
Functionality
Common
- Extended Graph API `Softmax` operation to support `inf_as_zero` mode. This functionality enables an SDPA subgraph compliant with PyTorch Safe Softmax semantics.
Intel Architecture Processors
- Introduced support for `f32` convolution with `fp16` compressed weights.
- Enabled `int8`/`int4` compressed weights support in matmul primitive.
Intel Graphics Products
- Introduced select algorithm support in binary primitive.
- Introduced support for `f4_e2m1` and `f4_e3m0` data types in convolution primitive.
- Introduced support for the GenIndex operation in Graph API.
Generic GPU Vendor
- Introduced support for:
- Vanilla RNN forward propagation.
- Inner product backpropagation.
- Group normalization.
- Improved accuracy of inner product primitive with sum post-ops for large shapes.
NVIDIA GPUs
- Introduced Graph API support.
Usability
- Added support for group normalization primitive with `ONEDNN_ENABLE_PRIMITIVE` build option.
- Enabled support for ROCm 6 on AMD GPUs.
- Improved CMake integration for oneDNN installation with Nvidia backend enabled.
- Reduced memory footprint for matmul primitive when using ACL.
Validation
- Added benchdnn option `--execution-mode` to test oneDNN functionality with SYCL Graph record/execute mode.
- Extended benchdnn option `--cold-cache` with support for cold TLB mode.
- Added benchdnn option `--bia-dt` to control bias data type for matmul, inner product, convolution, and deconvolution primitives.
- Extended syntax of benchdnn `--dt` option in Graph API driver to manage data types of individual tensors in a pattern.
Deprecated Functionality
- BLAS-like API including `dnnl::sgemm`, `dnnl::gemm_u8s8s32`, and `dnnl::gemm_s8s8s32` functions is deprecated and will be removed in future releases. If you are using this API, consider switching to the matmul primitive.
Breaking Changes
- Removed the experimental Graph Compiler backend for Graph API.
Thanks to our Contributors
This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, Denis @redradist, Dmitriy Ovchinnikov @inteldimitrius, Eliezer Weissmann @eliezerweissmann, Hubert Maciak @hmaciak, Ilya Lavrenov @ilya-lavrenov, James McGregor @Jmc18134, @jstachowintel, Marek Michalowski @michalowski-arm, Maria Zhukova @mzhukova, Orel Yehuda @yehudaorel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, @Shreyas-fuj, Shu Chen @shu1chen, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, and @zhangfeiv0.
v3.7.3
This is a patch release containing the following changes to v3.7.2:
- Fixed correctness issue in matmul with non-trivial strides for the first tensor on processors with Intel AMX instruction set support (e18c622)
- Removed spurious warning messages for SDPA subgraph on Intel GPUs (05541bb, 9e9a3a6)
- Fixed segfault in `fp32` matmul with `bf16` math mode on processors with Intel AVX2 instruction set support (7d495ae)
- Fixed performance regression in `bf16` 3D convolution backpropagation on processors with Intel AVX-512 and Intel DL Boost instruction set support (c38e02c, 67afc74)
- Worked around GCC 12.3 bug causing accuracy issues in `fp8` functionality on Intel GPUs (69b38d7)
- Removed `-fcf-protection` build option for GCC 7 and earlier versions (813725d)
v3.8-rc
Performance Optimizations
Intel Architecture Processors
- Improved matmul and inner product primitives performance on processors with Intel AMX instruction set support.
- Improved performance of convolution and inner product primitives on processors with Intel AVX2 instruction set support.
- Improved performance of `int8` convolution with zero points.
- Improved `fp32` convolution performance with `fp16` and `bf16` compressed weights on processors with Intel AVX2 or Intel AVX-512 instruction set support.
- Improved `fp16`/`bf16` depthwise convolution performance with `fp32` bias, `sum` post-ops, or dilation.
- Improved `bf16` pooling backpropagation performance.
- Improved binary post-ops performance with `per_w` broadcast.
Intel Graphics Products
- Improved performance on Intel GPUs based on Xe3 architecture.
- Improved convolution performance on:
- Intel Arc Graphics for Intel Core Ultra (Series 2, formerly Lunar Lake).
- Intel Arc B-series discrete graphics (formerly Battlemage).
- Improved `int8` matmul performance with zero-points support for source and weight tensors.
- Improved `f4_e2m1` and `f4_e3m0` matmul and reorder performance.
- Improved performance of the following subgraphs with Graph API:
  - Scaled Dot Product Attention (SDPA) with `int4` and `int8` compressed key and value.
  - `fp16`/`bf16` SDPA with `fp32` intermediate data types. Using `fp32` intermediate data types is recommended.
  - SDPA with head size 512 and 576.
  - Grouped Query Attention (GQA) with 5D input tensors.
AArch64-based Processors
- Improved `fp16` reorder performance.
- Improved `int8` matmul performance.
- Improved `bf16` inner product forward propagation performance with Arm Compute Library (ACL).
- Improved convolution performance on processors with SVE support with ACL.
Functionality
Common
- Extended Graph API `Softmax` operation to support `inf_as_zero` mode. This functionality enables an SDPA subgraph compliant with PyTorch Safe Softmax semantics.
Intel Architecture Processors
- Introduced support for `f32` convolution with `fp16` compressed weights.
- Enabled `int8`/`int4` compressed weights support in matmul primitive.
Intel Graphics Products
- Introduced select algorithm support in binary primitive.
- Introduced support for `f4_e2m1` and `f4_e3m0` data types in convolution.
- Introduced support for the GenIndex operation in Graph API.
Generic GPU Vendor
- Introduced support for:
- Vanilla RNN forward propagation
- Inner product backpropagation
- Group normalization
- Improved accuracy of inner product primitive with sum post-ops for large shapes.
NVIDIA GPUs
- Introduced Graph API support.
Usability
- Added support for group normalization primitive with `ONEDNN_ENABLE_PRIMITIVE` build option.
- Enabled support for ROCm 6 on AMD GPUs.
- Improved CMake integration for oneDNN installation with Nvidia backend enabled.
- Reduced memory footprint for matmul primitive when using ACL.
Validation
- Added benchdnn option `--execution-mode` to test oneDNN functionality with SYCL Graph record/execute mode.
- Extended benchdnn option `--cold-cache` with support for cold TLB mode.
- Added benchdnn option `--bia-dt` to control bias data type for matmul, inner product, convolution, and deconvolution.
- Extended syntax of benchdnn `--dt` option in Graph API driver to manage data types of individual tensors in a pattern.
Breaking Changes
- Removed the experimental Graph Compiler backend for Graph API.
Thanks to our Contributors
This release contains contributions from the project core team as well as Aditya Tewari @aditew01, Alexander Simonov @asimonov1, Denis @redradist, Dmitriy Ovchinnikov @inteldimitrius, Eliezer Weissmann @eliezerweissmann, Hubert Maciak @hmaciak, Ilya Lavrenov @ilya-lavrenov, James McGregor @Jmc18134, @jstachowintel, Marek Michalowski @michalowski-arm, Maria Zhukova @mzhukova, Orel Yehuda @yehudaorel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, @Shreyas-fuj, Shu Chen @shu1chen, Viktoriia Gvozdeva @vgvozdeva, Yair Obodovsky @yair-obodovsky, and @zhangfeiv0.