Intel® Data Streaming Accelerator (Intel® DSA)
User Guide
Ref#: 353216-002US
January 2023
Notices & Disclaimers
Ref#:353216.002US ii
REVISION HISTORY

Date           Revision  Description
November 2022  001       Initial release of document.
                         Reference to GitHub added (Section 3.2.3).
January 2023   002       Replaced section of incorrect code (Appendix B Example 2).
TABLE OF CONTENTS
CHAPTER 1
INTRODUCTION
1.1 AUDIENCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
1.2 GLOSSARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
1.3 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
1.4 DOCUMENT ORGANIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
CHAPTER 2
PLATFORM CONFIGURATION
2.1 BIOS CONFIGURATION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-1
2.2 LINUX KERNEL CONFIGURATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-1
2.2.1 Intel® IOMMU Driver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-1
2.2.2 Intel® DSA Driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-1
CHAPTER 3
INTEL® DSA CONFIGURATION
3.1 INTEL® DSA DEVICE ENUMERATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
3.1.1 PCI Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-1
3.1.2 Sysfs Directories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-5
3.2 DEVICE CONFIGURATION AND CONTROL INTERFACES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
3.2.1 Intel® DSA WQs/Engines/Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-5
3.2.2 Linux Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-6
3.2.3 accel-config . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-6
3.2.4 WQ Device File Permissions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-7
CHAPTER 4
INTEL® DSA PROGRAMMING
4.1 SAMPLE LINUX APPLICATION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
4.1.1 Descriptor Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-1
4.1.2 Descriptor Submission Portal Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-2
4.1.3 Descriptor Submission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-2
4.1.4 Completion Polling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-3
4.1.5 Partial Completion Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-4
4.2 PROGRAMMING CONSIDERATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-5
4.2.1 Ordering/Fencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-5
4.2.2 Destination Address in Persistent Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-5
4.3 LIBRARY SUPPORT FOR INTEL® DSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-5
CHAPTER 5
INTEL® DSA PERFORMANCE MICROS (DSA_PERF_MICROS)
5.1 DEFINITION AND REFERENCES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1
CHAPTER 6
INTEL® DSA PERFORMANCE COUNTERS
6.1 PERFORMANCE COUNTER REFERENCES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1
APPENDIX A
ACCEL-CONFIG EXAMPLES
A.1 STEPS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-1
APPENDIX B
C FUNCTIONS FOR GCC VERSIONS WITHOUT
MOVDIR64B/ENQCMD/UMWAIT/UMONITOR SUPPORT
B.1 ABOUT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1
APPENDIX C
SAMPLE C PROGRAM
C.1 STEPS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1
APPENDIX D
ACTIONS FOR CONTINUATION AFTER PAGE FAULT
D.1 DESCRIPTION AND TABLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-1
APPENDIX E
DEDICATED AND SHARED WQ COMPARISON
E.1 DESCRIPTION AND TABLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E-1
APPENDIX F
DEBUG AIDS FOR CONFIGURATION ERRORS
F.1 LIST OF DEBUG AIDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F-1
TABLES
Revision History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .i-iii
Table 1-1. Acronym Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
Table 1-2. References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
Table 2-1. Linux Operating System Vendor Intel DSA driver support . . . . . . . . . . . . . . . . . . . . . . 2-2
Table 4-1. Libraries with Support for Intel® DSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-5
Table 5-1. dsa_perf-micros Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1
Table D-1. SW Actions for Continuation After Page Fault . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-1
Table E-1. Dedicated and Shared WQ Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E-1
FIGURES
Figure 2-1. Linux Kernel Configuration Options for Intel® IOMMU driver . . . . . . . . . . . . . . . . . . . . . 2-1
Figure 2-2. Linux Kernel Configuration Options for Intel® DSA driver . . . . . . . . . . . . . . . . . . . . . . . . 2-1
Figure 2-3. IDXD driver initialization messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-2
Figure 3-1. Intel® DSA Logical Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
Figure 3-2. Listing all Intel® DSA devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
Figure 3-3. lspci output for an Intel® DSA device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
Figure 3-4. SVM Capabilities and Status . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
Figure 3-5. sysfs SVM Capability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
Figure 3-6. Intel® DSA sysfs Directories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
Figure 3-7. Intel® DSA Device/Group/Engine/WQ configuration and control sysfs entries . . . . . 3-6
Figure 3-8. Profiles included in accel-config . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
Figure 3-9. Accel-config command line with WQ configuration file . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Figure 3-10. Using accel-config to verify device configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Figure 3-11. WQ device files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Figure 4-1. Descriptor Processing Sequence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
Figure F-1. Linux kernel ACPI subsystem messages when VT-d is enabled . . . . . . . . . . . . . . . . . . F-1
EXAMPLES
Example 4-1. Descriptor Initialization 4-2
Example 4-2. Descriptor Submission 4-3
Example 4-3. Descriptor Completion Check 4-3
Example 4-4. Descriptor Completion Check with Pause 4-3
Example 4-5. UMONITOR/UMWAIT sequence to reduce power consumption while polling 4-4
Example B-1. MOVDIR64B B-1
Example B-2. ENQCMD B-1
Example B-3. UMWAIT B-1
Example B-4. UMONITOR B-2
Example C-1. Intel® DSA Shared WQ Sample Application C-1
CHAPTER 1
INTRODUCTION
1.1 AUDIENCE
Intel® DSA is a high-performance data copy and transformation accelerator integrated into Intel®
processors starting with 4th Generation Intel® Xeon® processors. It is targeted for optimizing streaming
data movement and transformation operations common with applications for high-performance storage,
networking, persistent memory, and various data processing applications.
This document’s intended audience includes system administrators who may need to configure Intel DSA
devices and developers who want to enable Intel DSA support in applications and use libraries that
provide interfaces to Intel DSA. It should be read in conjunction with the Intel® DSA Architecture
Specification and the documentation for SW utilities and libraries that support Intel DSA, such as
accel-config/libaccel-config, Libfabric, and Intel® MPI.
1.2 GLOSSARY
1.3 REFERENCES
Table 1-2. References

• Intel® DSA Architecture Specification:
  https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
• Intel® Architecture Instruction Set Extensions Programming Reference:
  https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
• Intel® MPI:
  https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html
• accel-config (01.org):
  https://01.org/blogs/2020/pedal-metal-accelerator-configuration-and-control-open-source
• Libfabric (Intel DSA support will be in the Libfabric SHM provider in libfabric version 1.17.0,
  targeted for release in Nov 2022):
  https://github.com/ofiwg/libfabric/blob/main/man/fi_shm.7.md
CHAPTER 2
PLATFORM CONFIGURATION
CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_SVM=y
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON=y
Figure 2-1. Linux Kernel Configuration Options for Intel® IOMMU driver
CONFIG_INTEL_IDXD=m
CONFIG_INTEL_IDXD_SVM=y
CONFIG_INTEL_IDXD_PERFMON=y
Figure 2-2. Linux Kernel Configuration Options for Intel® DSA driver
Work queues (WQs) are on-device storage that holds descriptors submitted to the device. A WQ can be
configured to run in either of two modes: Dedicated (DWQ) or Shared (SWQ). An SWQ allows multiple
clients to submit descriptors simultaneously without the software synchronization overhead needed to
track WQ occupancy. SWQ is the preferred WQ mode, since it offers better device utilization than hard
partitioning with DWQs, which can leave the device underutilized. The Intel® DSA driver (IDXD) with
DWQ support was introduced in kernel version 5.6. The IDXD driver with SWQ support is available in
upstream Linux kernel versions 5.18 and later.
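To make the contrast concrete, below is a minimal sketch of the occupancy bookkeeping a DWQ submitter must do in software. The names (WQ_DEPTH, dwq_try_reserve) are illustrative, not part of any Intel API; with an SWQ, none of this is needed, because the ENQCMD result itself reports whether the descriptor was accepted.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed configured DWQ depth; illustrative only. */
#define WQ_DEPTH 16

static atomic_int dwq_outstanding;  /* descriptors submitted, not yet completed */

/* Reserve a DWQ slot before issuing MOVDIR64B; returns false when full. */
static bool dwq_try_reserve(void)
{
    int cur = atomic_load(&dwq_outstanding);
    while (cur < WQ_DEPTH) {
        if (atomic_compare_exchange_weak(&dwq_outstanding, &cur, cur + 1))
            return true;
    }
    return false;
}

/* Release a slot after observing a completion record. */
static void dwq_complete_one(void)
{
    atomic_fetch_sub(&dwq_outstanding, 1);
}
```

Multiple submitting threads can call dwq_try_reserve() safely, but this shared counter is exactly the synchronization cost that ENQCMD on an SWQ eliminates.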
IDXD driver initialization can be checked using the dmesg command to print the kernel message buffer,
as shown in Figure 2-3.
$ dmesg | grep "idxd "
idxd 0000:6a:01.0: enabling device (0144 -> 0146)
idxd 0000:6a:01.0: Intel(R) Accelerator Device (v100)
idxd 0000:6a:02.0: enabling device (0140 -> 0142)
idxd 0000:6a:02.0: Intel(R) Accelerator Device (v100)
idxd 0000:6f:01.0: enabling device (0144 -> 0146)
idxd 0000:6f:01.0: Intel(R) Accelerator Device (v100)
idxd 0000:6f:02.0: enabling device (0140 -> 0142)
idxd 0000:6f:02.0: Intel(R) Accelerator Device (v100)
idxd 0000:74:01.0: enabling device (0144 -> 0146)
idxd 0000:74:01.0: Intel(R) Accelerator Device (v100)
idxd 0000:74:02.0: enabling device (0140 -> 0142)
idxd 0000:74:02.0: Intel(R) Accelerator Device (v100)
idxd 0000:79:01.0: enabling device (0144 -> 0146)
idxd 0000:79:01.0: Intel(R) Accelerator Device (v100)
idxd 0000:79:02.0: enabling device (0140 -> 0142)
idxd 0000:79:02.0: Intel(R) Accelerator Device (v100)
idxd 0000:e7:01.0: enabling device (0144 -> 0146)
idxd 0000:e7:01.0: Intel(R) Accelerator Device (v100)
idxd 0000:e7:02.0: enabling device (0140 -> 0142)
idxd 0000:e7:02.0: Intel(R) Accelerator Device (v100)
idxd 0000:ec:01.0: enabling device (0144 -> 0146)
idxd 0000:ec:01.0: Intel(R) Accelerator Device (v100)
idxd 0000:ec:02.0: enabling device (0140 -> 0142)
idxd 0000:ec:02.0: Intel(R) Accelerator Device (v100)
idxd 0000:f1:01.0: enabling device (0144 -> 0146)
idxd 0000:f1:01.0: Intel(R) Accelerator Device (v100)
idxd 0000:f1:02.0: enabling device (0140 -> 0142)
idxd 0000:f1:02.0: Intel(R) Accelerator Device (v100)
idxd 0000:f6:01.0: enabling device (0144 -> 0146)
idxd 0000:f6:01.0: Intel(R) Accelerator Device (v100)
idxd 0000:f6:02.0: enabling device (0140 -> 0142)
idxd 0000:f6:02.0: Intel(R) Accelerator Device (v100)
Figure 2-3. IDXD driver initialization messages
Distribution kernel versions with complete IDXD driver support are shown in Table 2-1. Please refer to
vendor documentation for the latest information.
Table 2-1. Linux Operating System Vendor Intel DSA driver support
CHAPTER 3
INTEL® DSA CONFIGURATION
This section describes how Intel® DSA devices and WQs can be configured and enabled by a superuser
before running an application that uses Intel DSA. Before describing the configuration process, Linux OS
structures for Intel DSA are described to help debug configuration issues.
The Intel DSA PCI device ID is 0x0b25. The command in Figure 3-2 lists the Intel DSA devices on the
system. The complete lspci output for an Intel DSA device can be obtained as shown in Figure 3-3. If
the “Kernel driver in use” field in the lspci output is blank, use the modprobe idxd command to load
the driver.
Capabilities: [80] MSI-X: Enable+ Count=9 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [90] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC-
UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC-
UnsupReq+ ACSViol-
UESvrt: DLP- SDES- TLP+ FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC-
UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [150 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [160 v1] Transaction Processing Hints
Device specific mode supported
Steering table in TPH capability structure
Capabilities: [170 v1] Virtual Channel
Caps: LPEVC=1 RefClk=100ns PATEntryBits=1
Arb: Fixed+ WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=fd
Status: NegoPending- InProgress-
VC1: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=1 ArbSelect=Fixed TC/VC=02
Shared Virtual Memory (SVM) is a usage where a device operates in the CPU virtual address space of the
application accessing the device. Devices supporting SVM do not require pages that are accessed by the
device to be pinned. Instead, they use the PCI Express Address Translation Services (ATS) and Page
Request Services (PRS) capabilities to implement recoverable device page faults. Devices supporting
SVM use PASIDs to distinguish different application virtual address spaces.
The PCIe capabilities and status related to SVM (ATSCtl, PASIDCtl, and PRICtl) are shown enabled in
Figure 3-4. Refer to the Address Translation section of the Intel® DSA Architecture Specification for
further details on how Intel DSA uses the PASID, PCIe ATS, and PRS capabilities to support SVM.
$ cat /sys/bus/dsa/devices/dsa0/pasid_enabled
1
$ ls -df /sys/bus/dsa/devices/dsa*
/sys/bus/dsa/devices/dsa0 /sys/bus/dsa/devices/dsa2
/sys/bus/dsa/devices/dsa10 /sys/bus/dsa/devices/dsa4
/sys/bus/dsa/devices/dsa12 /sys/bus/dsa/devices/dsa6
/sys/bus/dsa/devices/dsa14 /sys/bus/dsa/devices/dsa8
$ ls /sys/bus/dsa/devices/dsa0
Figure 3-7. Intel® DSA Device/Group/Engine/WQ configuration and control sysfs entries
3.2.3 accel-config
accel-config is a Linux application that provides a command-line interface for Intel DSA configuration.
The accel-config application and library can be installed from https://github.com/intel/idxd-config or
via your distribution’s package manager. accel-config links to a shared library (libaccel-config.so)
that applications can use to query and modify the Intel DSA configuration. A detailed description is
available on the 01.org website:
https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator.
accel-config can be used with text-based configuration files. Recommended configurations for a few use
cases are included in the accel-config installation.
app_profile.conf is a configuration intended for user-space applications; it provides two groups, each
with one SWQ and one engine. The WQs are configured so that applications that use Intel DSA for
operations with a relatively small memory footprint can submit descriptors to the WQ configured with
the smaller maximum transfer size. This avoids head-of-line blocking, i.e., it prevents these
operations from queuing behind larger transfers. Figure 3-9 shows how to configure and enable WQs
using app_profile.conf. A superuser must execute this command, since only a superuser can modify
sysfs entries.
A command line example for enabling an Intel DSA WQ with a custom configuration and saving the
configuration to a file is shown in Appendix A.
accel-config can show the current configuration using the list command, as shown in Figure 3-10.
$ accel-config list
The superuser must grant read-write permissions on the WQ device file to the user or group under which
the process runs.
CHAPTER 4
INTEL® DSA PROGRAMMING
A user can start an application that uses Intel® DSA once the superuser has configured an Intel DSA
device with at least one associated WQ and enabled the user’s access to the WQ character device file
(as described in Section 3.2). The commands used to configure the device and a shared WQ are provided
in Appendix A.
In this section, we walk through C program snippets to illustrate the steps needed to use Intel DSA.
Complete source code listing for a C program that uses Intel DSA is provided in Appendix C.
Figure 4-1 shows the steps from descriptor preparation to descriptor completion. Each step is discussed
in further detail within respective sub-sections.
If a descriptor incurs a page fault on either the source or destination address, the operation status
code indicates that the operation completed with a page fault. The number of bytes transferred for the
memmove operation is provided in the completion record. Please refer to Section 4.1.5 for details on
the Block on Fault flag.
desc.opcode = DSA_OPCODE_MEMMOVE;
/*
* Request a completion – since we poll on status, this flag
* must be 1 for status to be updated on successful
* completion
*/
desc.flags = IDXD_OP_FLAG_RCR;
desc.xfer_size = BLEN;
desc.src_addr = (uintptr_t)src;
desc.dst_addr = (uintptr_t)dst;
comp.status = 0;
desc.completion_addr = (uintptr_t)&comp;
Since MOVDIR64B and ENQCMD are not ordered relative to older stores to WB or WC memory, SW must
ensure appropriate ordering (when required) by executing a fencing instruction such as SFENCE,
preferably using a single fence for multiple updates to reduce the fencing overhead.
#include <x86gprintrin.h>
_mm_sfence();
if (dedicated)
_movdir64b(wq_portal, &desc);
else {
retry = 0;
while (_enqcmd(wq_portal, &desc) && retry++ < ENQ_RETRY_MAX);
}
retry = 0;
while (comp.status == 0 && retry++ < COMP_RETRY_MAX);
if (comp.status == DSA_COMP_SUCCESS) {
/* Successful completion */
} else {
/* Descriptor failed or timed out
* See the “Error Codes” section of the Intel® DSA Architecture Specification for
* error code descriptions
*/
}
A pause instruction should be added to the spin loop to reduce the power consumed by a processor.
#include <x86gprintrin.h>
retry = 0;
while (comp.status == 0 && retry++ < COMP_RETRY_MAX)
_mm_pause();
Further power reduction can be achieved using the UMONITOR/UMWAIT instruction sequence. UMONITOR arms
the address-monitoring hardware with an address, informing the processor that the currently running
application is interested in any writes to a range of memory (the range that the monitoring hardware
checks for store operations can be determined using the CPUID monitor leaf, EAX=05H).
UMWAIT instructs the processor to enter an implementation-dependent optimized state while monitoring
the range of addresses. The optimized state may be either a light-weight power/performance-optimized
state (C0.1) or an improved power/performance-optimized state (C0.2). The selection between the two
states is governed by bit[0] of the explicit input register source operand.
#include <x86gprintrin.h>
/*
* C0.2 Improves performance of the other SMT thread(s)
* on the same core, and has larger power savings
* but has a longer wakeup time.
*/
#define UMWAIT_STATE_C0_2 0
#define UMWAIT_STATE_C0_1 1
retry = 0;
while (comp.status == 0 && retry++ < COMP_RETRY_MAX) {
_umonitor(&comp);
if (comp.status == 0) {
uint64_t delay = __rdtsc() + UMWAIT_DELAY;
_umwait(UMWAIT_STATE_C0_1, delay);
}
}
When the Block On Fault flag is disabled, a descriptor that incurs a page fault may partially
complete. The completion record reports the faulting address and the number of bytes processed
completely. The application can choose between completing the operation in software and resubmitting
the operation to Intel DSA after modifying the descriptor as necessary. For example, for a memmove
descriptor, SW can touch the faulting address reported in the completion record and resubmit the
operation after updating the source address, destination address, and transfer size fields in the
descriptor. Please refer to Appendix D for further information on resubmitting descriptors for other
operations.
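The resubmission arithmetic described above can be sketched as follows. struct mm_desc is a simplified, hypothetical stand-in for the memmove descriptor fields, not the idxd UAPI type; for overlapping moves, the Result-field rules in Appendix D apply.

```c
#include <stdint.h>

/* Simplified stand-in for the memmove descriptor fields (illustrative). */
struct mm_desc {
    uint64_t src_addr;
    uint64_t dst_addr;
    uint32_t xfer_size;
};

/* Advance the descriptor past the bytes the device already moved, using
 * bytes_completed from the completion record, before resubmitting. */
static void resume_after_fault(struct mm_desc *d, uint32_t bytes_completed)
{
    d->src_addr  += bytes_completed;
    d->dst_addr  += bytes_completed;
    d->xfer_size -= bytes_completed;
}
```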
To maximize device utilization, provide equitable bandwidth allocation when the device is configured
as shared, and achieve comparatively more predictable execution, it is recommended to configure the
WQ with Block On Fault disabled.
4.2.1 Ordering/Fencing
Applications may need to guarantee ordering in descriptor execution. Please refer to the Ordering and
Fencing section within the Intel® DSA Architecture Specification for details on conditions under which
ordering is guaranteed and the utility of the fence flag in descriptors within a batch.
4.3 LIBRARY SUPPORT FOR INTEL® DSA

Table 4-1. Libraries with Support for Intel® DSA

• Intel® DML — Intel has developed an open-source library, the Intel Data Movement Library (DML),
  providing both a low-level C and a high-level C++ API for data processing using Intel DSA, with a
  software path in case Intel DSA is not available. DML also includes sample applications that can
  help quickly enable support for Intel DSA in applications.
  https://intel.github.io/DML/
• Libfabric — Libfabric will include support for Intel DSA within its shared-memory provider in
  libfabric version 1.17.0, targeted for release in Nov 2022; this enables Intel DSA usage in HPC
  applications that use the Intel MPI, MPICH, OpenMPI, and MVAPICH libraries.
• Intel® MPI — Intel MPI includes support for Intel DSA since version 2021.7; instructions for
  enabling Intel DSA for the shm transport used for intra-node communication are available in the
  Intel MPI documentation.
• SPDK — The Storage Performance Development Kit (SPDK) provides a set of tools and libraries for
  writing high-performance, scalable, user-mode storage applications. SPDK support for Intel DSA is
  described at https://spdk.io/doc/idxd.html.
• DPDK — The Data Plane Development Kit (DPDK) provides libraries to accelerate packet-processing
  workloads running on various CPU architectures. DPDK support for Intel DSA is described at
  http://doc.dpdk.org/guides/dmadevs/idxd.html.
CHAPTER 5
INTEL® DSA PERFORMANCE MICROS (dsa_perf_micros)
CHAPTER 6
INTEL® DSA PERFORMANCE COUNTERS
Parameters that can be specified for Intel DSA are listed in sysfs.
$ ls /sys/bus/event_source/devices/dsa0/format
event event_category filter_eng filter_pgsz filter_sz filter_tc filter_wq
A single event can be read at 1-second intervals with the -I flag using the command syntax below.
Multiple events can be configured for a counter, and each event can be constrained by a set of
filters; example filters are WQ, engine, traffic class, and transfer size. Below is a command line
with multiple events configured for a single counter and filtered by 4KB ≤ transfer size < 16KB.
APPENDIX A
ACCEL-CONFIG EXAMPLES
A.1 STEPS
1. Configure Device
APPENDIX B
C FUNCTIONS FOR GCC VERSIONS WITHOUT
MOVDIR64B/ENQCMD/UMWAIT/UMONITOR SUPPORT
B.1 ABOUT
GCC has supported the ENQCMD and MOVDIR64B intrinsics since the GCC 10 release, with the -menqcmd and
-mmovdir64b switches, respectively. The UMONITOR and UMWAIT instructions have been supported since
the GCC 9 release with the -mwaitpkg switch.
APPENDIX C
SAMPLE C PROGRAM
C.1 STEPS
• Install the accel-config library from https://github.com/intel/idxd-config or your distribution’s
package manager.
• Configure a shared WQ using the example in Appendix A.
Assuming the source file is named intel_dsa_sample.c, use the command below to compile.
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <linux/idxd.h>
#include <accel-config/libaccel_config.h>
#include <x86intrin.h>
static uint8_t
op_status(uint8_t status)
{
return status & DSA_COMP_STATUS_MASK;
}
static void *
map_wq(void)
{
void *wq_portal;
struct accfg_ctx *ctx;
struct accfg_wq *wq;
struct accfg_device *device;
char path[PATH_MAX];
int fd;
int wq_found = 0;
accfg_new(&ctx);
accfg_device_foreach(ctx, device) {
/*
 * Iterate this device's WQs with accfg_wq_foreach(), select a
 * user-mode shared WQ, and record its device file path in path[]
 * (the loop body is truncated in this listing).
 */
if (wq_found)
break;
}
accfg_unref(ctx);
if (!wq_found)
return MAP_FAILED;
fd = open(path, O_RDWR);
if (fd < 0)
return MAP_FAILED;
/*
 * Map the WQ submission portal (see Section 4.1.2); this mmap call is
 * reconstructed here, as the original listing is truncated at this point.
 */
wq_portal = mmap(NULL, WQ_PORTAL_SIZE, PROT_WRITE,
MAP_SHARED | MAP_POPULATE, fd, 0);
close(fd);
return wq_portal;
}
wq_portal = map_wq();
if (wq_portal == MAP_FAILED)
return EXIT_FAILURE;
desc.opcode = DSA_OPCODE_MEMMOVE;
/*
* Request a completion – since we poll on status, this flag
* must be 1 for status to be updated on successful
* completion
*/
desc.flags = IDXD_OP_FLAG_RCR;
desc.xfer_size = BLEN;
desc.src_addr = (uintptr_t)src;
desc.dst_addr = (uintptr_t)dst;
desc.completion_addr = (uintptr_t)&comp;
retry:
comp.status = 0;
enq_retry = 0;
while (enqcmd(wq_portal, &desc) && enq_retry++ < ENQ_RETRY_MAX) ;
if (enq_retry == ENQ_RETRY_MAX) {
printf("ENQCMD retry limit exceeded\n");
rc = EXIT_FAILURE;
goto done;
}
poll_retry = 0;
while (comp.status == 0 && poll_retry++ < POLL_RETRY_MAX)
_mm_pause();
if (poll_retry == POLL_RETRY_MAX) {
printf("Completion status poll retry limit exceeded\n");
rc = EXIT_FAILURE;
goto done;
}
if (comp.status != DSA_COMP_SUCCESS) {
if (op_status(comp.status) == DSA_COMP_PAGE_FAULT_NOBOF) {
int wr = comp.status & DSA_COMP_STATUS_WRITE;
volatile char *t;
/* Touch the faulting page (write access if the fault was on a write)
 * to fault it in, then resubmit the remainder of the operation */
t = (char *)comp.fault_addr;
wr ? *t = *t : *t;
desc.src_addr += comp.bytes_completed;
desc.dst_addr += comp.bytes_completed;
desc.xfer_size -= comp.bytes_completed;
goto retry;
} else {
printf("desc failed status %u\n", comp.status);
rc = EXIT_FAILURE;
}
} else {
printf("desc successful\n");
rc = memcmp(src, dst, BLEN);
}
done:
munmap(wq_portal, WQ_PORTAL_SIZE);
return rc;
}
APPENDIX D
ACTIONS FOR CONTINUATION AFTER PAGE FAULT
Table D-1. SW Actions for Continuation After Page Fault

• Batch — Increase the Descriptor List Address by Descriptors Completed × 64 and decrease the
  Descriptor Count by Descriptors Completed. If any operation before the fault completed with
  status ≠ success, and any descriptor after the fault has the Fence flag set, decrease the
  Descriptor Count so as not to execute the descriptor with the Fence. If the Descriptor Count is 1,
  submit the descriptor as a single descriptor rather than a batch.
• Copy — If Result = 0, increase the source and destination addresses and decrease the transfer
  size. If Result = 1, decrease the transfer size (no change to the source and destination
  addresses).
• Dualcast — Increase the source and destination addresses and decrease the transfer size.
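The Batch adjustment above can be sketched in C. struct batch_desc is an illustrative stand-in for the batch descriptor fields, not the idxd UAPI type; the Fence and single-descriptor cases still need the separate handling described above.

```c
#include <stdint.h>

/* Illustrative stand-in for the batch descriptor fields. */
struct batch_desc {
    uint64_t desc_list_addr;  /* address of the 64-byte descriptor array */
    uint32_t desc_count;
};

/* Skip past the descriptors the device completed before the fault. */
static void resume_batch(struct batch_desc *b, uint32_t descs_completed)
{
    b->desc_list_addr += (uint64_t)descs_completed * 64;
    b->desc_count     -= descs_completed;
}
```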
APPENDIX E
DEDICATED AND SHARED WQ COMPARISON
Table E-1. Dedicated and Shared WQ Comparison

WQ Sharing
  Dedicated WQ: A single client per DWQ; SW tracks the number of outstanding submissions to ensure
  the queue does not overflow.
  Shared WQ: SW does not need to keep track of outstanding submissions and can use the ENQCMD ISA
  result to identify successful vs. unsuccessful submissions.

Submission Rate
  Dedicated WQ: Software can stream descriptors to the device at a very high rate using the
  low-latency MOVDIR64B ISA.
  Shared WQ: The rate of descriptor submission is limited by the ENQCMD round-trip latency
  (approx. 200-250 ns on 4th Generation Intel® Xeon® processors).
APPENDIX F
DEBUG AIDS FOR CONFIGURATION ERRORS
Figure F-1. Linux kernel ACPI subsystem messages when VT-d is enabled
• Verify that the Linux Kernel configuration options mentioned in Section 2.2 are enabled.
• Use lspci to ensure the expected DSA devices exist and that the lspci output shows “Kernel driver
in use: idxd”.
• Run dmesg | grep -i dmar to ensure DMAR (DMA remapping) devices are enumerated by the kernel. If
VT-d is enabled in the BIOS but no DMAR devices are reported, the IOMMU driver may not be enabled by
default; reboot with “intel_iommu=on,sm_on” added to the kernel command line to enable VT-d scalable
mode.
• Run dmesg | grep -i idxd. If you see “Unable to turn on SVA feature”, VT-d scalable mode may not be
enabled by default; reboot with “intel_iommu=on,sm_on” added to the kernel command line to enable it.
• On certain platforms, the VT-d 5-level paging capability is disabled by the BIOS; in this case you
will see “SVM disabled, incompatible paging mode” in the dmesg output. Pass no5lvl on the kernel
command line; this boot-time parameter disables 5-level paging and forces the kernel to use 4-level
paging.