DPDK Programmer's Guide
Release 17.11.10
CONTENTS

1 Introduction
1.1 Documentation Roadmap
1.2 Related Publications
2 Overview
2.1 Development Environment
2.2 Environment Abstraction Layer
2.3 Core Components
2.3.1 Ring Manager (librte_ring)
2.3.2 Memory Pool Manager (librte_mempool)
2.3.3 Network Packet Buffer Management (librte_mbuf)
2.3.4 Timer Manager (librte_timer)
2.4 Ethernet* Poll Mode Driver Architecture
2.5 Packet Forwarding Algorithm Support
2.6 librte_net
3.4.4 Internal Implementation
4 Service Cores
4.1 Service Core Initialization
4.2 Enabling Services on Cores
4.3 Service Core Statistics
5 Ring Library
5.1 References for Ring Implementation in FreeBSD*
5.2 Lockless Ring Buffer in Linux*
5.3 Additional Features
5.3.1 Name
5.4 Use Cases
5.5 Anatomy of a Ring Buffer
5.5.1 Single Producer Enqueue
5.5.2 Single Consumer Dequeue
5.5.3 Multiple Producers Enqueue
5.5.4 Modulo 32-bit Indexes
5.6 References
6 Mempool Library
6.1 Cookies
6.2 Stats
6.3 Memory Alignment Constraints
6.4 Local Cache
6.5 Mempool Handlers
6.6 Use Cases
7 Mbuf Library
7.1 Design of Packet Buffers
7.2 Buffers Stored in Memory Pools
7.3 Constructors
7.4 Allocating and Freeing mbufs
7.5 Manipulating mbufs
7.6 Meta Information
7.7 Direct and Indirect Buffers
7.8 Debug
7.9 Use Cases
8.5.2 Generic Packet Representation
8.5.3 Ethernet Device API
8.5.4 Extended Statistics API
8.5.5 NIC Reset API
11 Traffic Management API
11.1 Overview
11.2 Capability API
11.3 Scheduling Algorithms
11.4 Traffic Shaping
11.5 Congestion Management
11.6 Packet Marking
11.7 Steps to Setup the Hierarchy
11.7.1 Initial Hierarchy Specification
11.7.2 Hierarchy Commit
11.7.3 Run-Time Hierarchy Updates
14.2.1 Link Status Change Interrupts / Polling
14.2.2 Requirements / Limitations
14.2.3 Configuration
14.3 Using Link Bonding Devices
14.3.1 Using the Poll Mode Driver from an Application
14.3.2 Using Link Bonding Devices from the EAL Command Line
19.2 Implementation Details
19.2.1 Addition
19.2.2 Lookup
19.2.3 Limitations in the Number of Rules
19.2.4 Use Case: IPv4 Forwarding
19.2.5 References
26.4 Supported GSO Packet Types
26.4.1 TCP/IPv4 GSO
26.4.2 VxLAN GSO
26.4.3 GRE GSO
26.5 How to Segment a Packet
32 Event Ethernet Rx Adapter Library
32.1 API Walk-through
32.1.1 Creating an Adapter Instance
32.1.2 Adding Rx Queues to the Adapter Instance
32.1.3 Querying Adapter Capabilities
32.1.4 Configuring the Service Function
32.1.5 Starting the Adapter Instance
32.1.6 Getting Adapter Statistics
36.4.1 Table Types
36.4.2 Table Interface
36.4.3 Hash Table Design
36.5 Pipeline Library Design
36.5.1 Connectivity of Ports and Tables
36.5.2 Port Actions
36.5.3 Table Actions
36.6 Multicore Scaling
36.6.1 Shared Data Structures
36.7 Interfacing with Accelerators
41.3.6 Variables that can be Set/Overridden by the User on the Command Line Only
41.3.7 Variables that Can be Set/Overridden by the User in a Makefile or Command Line
47.5.2 Branch Prediction
47.6 Setting the Target CPU Type
49 Glossary
CHAPTER ONE: INTRODUCTION
The following documents provide information that is relevant to the development of applications
using the DPDK:
• Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Pro-
gramming Guide
Part 1: Architecture Overview
CHAPTER TWO: OVERVIEW
This section gives a global overview of the architecture of Data Plane Development Kit (DPDK).
The main goal of the DPDK is to provide a simple, complete framework for fast packet process-
ing in data plane applications. Users may use the code to understand some of the techniques
employed, to build upon for prototyping or to add their own protocol stacks. Alternative ecosys-
tem options that use the DPDK are available.
The framework creates a set of libraries for specific environments through the creation of an
Environment Abstraction Layer (EAL), which may be specific to a mode of the Intel® architec-
ture (32-bit or 64-bit), Linux* user space compilers or a specific platform. These environments
are created through the use of make files and configuration files. Once the EAL library is cre-
ated, the user may link with the library to create their own applications. Other libraries, outside
of EAL, including the Hash, Longest Prefix Match (LPM) and rings libraries are also provided.
Sample applications are provided to help show the user how to use various features of the
DPDK.
The DPDK implements a run-to-completion model for packet processing, where all resources
must be allocated prior to calling data plane applications, which run as execution units on
logical processing cores. The model does not support a scheduler and all devices are accessed by
polling. The primary reason for not using interrupts is the performance overhead imposed by
interrupt processing.
In addition to the run-to-completion model, a pipeline model may also be used by passing
packets or messages between cores via the rings. This allows work to be performed in stages
and may allow more efficient use of code on cores.
The DPDK project installation requires Linux and the associated toolchain, such as one or more
compilers, assembler, make utility, editor and various libraries to create the DPDK components
and libraries.
Once these libraries are created for the specific environment and architecture, they may then
be used to create the user’s data plane application.
When creating applications for the Linux user space, the glibc library is used. For DPDK
applications, two environment variables (RTE_SDK and RTE_TARGET) must be configured
before compiling the applications. The following are examples of how the variables can be set:
export RTE_SDK=/home/user/DPDK
export RTE_TARGET=x86_64-native-linuxapp-gcc
See the DPDK Getting Started Guide for information on setting up the development environ-
ment.
The Environment Abstraction Layer (EAL) provides a generic interface that hides the environ-
ment specifics from the applications and libraries. The services provided by the EAL are:
• DPDK loading and launching
• Support for multi-process and multi-thread execution types
• Core affinity/assignment procedures
• System memory allocation/de-allocation
• Atomic/lock operations
• Time reference
• PCI bus access
• Trace and debug functions
• CPU feature identification
• Interrupt handling
• Alarm operations
• Memory management (malloc)
The EAL is fully described in Environment Abstraction Layer .
The core components are a set of libraries that provide all the elements needed for high-
performance packet processing applications.
The ring structure provides a lockless multi-producer, multi-consumer FIFO API in a finite-size
table. It has some advantages over lockless queues: it is easier to implement, adapted to bulk
operations and faster. A ring is used by the Memory Pool Manager (librte_mempool) and
may be used as a general communication mechanism between cores and/or execution blocks
connected together on a logical core.
This ring buffer and its usage are fully described in Ring Library .
The Memory Pool Manager is responsible for allocating pools of objects in memory. A pool
is identified by name and uses a ring to store free objects. It provides some other optional
services, such as a per-core object cache and an alignment helper to ensure that objects are
padded to spread them equally on all RAM channels.

[Figure: Core Components Architecture. rte_mbuf (manipulation of packet buffers carrying
network data) uses rte_mempool (handles a pool of objects, using a ring to store them; allows
bulk enqueue/dequeue and a per-CPU cache), which uses rte_ring (a fixed-size lockless FIFO
for storing objects in a table). rte_timer (timer facilities, based on the HPET interface provided
by EAL) and rte_malloc build on rte_eal + libc.]
This memory pool allocator is described in Mempool Library .
The mbuf library provides the facility to create and destroy buffers that may be used by the
DPDK application to store message buffers. The message buffers are created at startup time
and stored in a mempool, using the DPDK mempool library.
This library provides an API to allocate/free mbufs, manipulate control message buffers (ctrlm-
buf) which are generic message buffers, and packet buffers (pktmbuf) which are used to carry
network packets.
Network Packet Buffer Management is described in Mbuf Library .
This library provides a timer service to DPDK execution units, providing the ability to execute
a function asynchronously. Calls can be periodic or one-shot. It uses the timer interface
provided by the Environment Abstraction Layer (EAL) to get a precise time reference and can
be initiated on a per-core basis as required.
The library documentation is available in Timer Library .
The DPDK includes Poll Mode Drivers (PMDs) for 1 GbE, 10 GbE and 40 GbE controllers, as well
as paravirtualized virtio Ethernet controllers, which are designed to work without asynchronous,
interrupt-based signaling mechanisms.
See Poll Mode Driver .
The DPDK includes Hash (librte_hash) and Longest Prefix Match (LPM, librte_lpm) libraries to
support the corresponding packet forwarding algorithms.
See Hash Library and LPM Library for more information.
2.6 librte_net
CHAPTER THREE: ENVIRONMENT ABSTRACTION LAYER
The Environment Abstraction Layer (EAL) is responsible for gaining access to low-level re-
sources such as hardware and memory space. It provides a generic interface that hides the
environment specifics from the applications and libraries. It is the responsibility of the initial-
ization routine to decide how to allocate these resources (that is, memory space, PCI devices,
timers, consoles, and so on).
Typical services expected from the EAL are:
• DPDK Loading and Launching: The DPDK and its application are linked as a single
application and must be loaded by some means.
• Core Affinity/Assignment Procedures: The EAL provides mechanisms for assigning exe-
cution units to specific cores as well as creating execution instances.
• System Memory Reservation: The EAL facilitates the reservation of different memory
zones, for example, physical memory areas for device interactions.
• PCI Address Abstraction: The EAL provides an interface to access PCI address space.
• Trace and Debug Functions: Logs, dump_stack, panic and so on.
• Utility Functions: Spinlocks and atomic counters that are not provided in libc.
• CPU Feature Identification: Determine at runtime if a particular feature, for example,
Intel® AVX is supported. Determine if the current CPU supports the feature set that the
binary was compiled for.
• Interrupt Handling: Interfaces to register/unregister callbacks to specific interrupt
sources.
• Alarm Functions: Interfaces to set/remove callbacks to be run at a specific time.
In a Linux user space environment, the DPDK application runs as a user-space application
using the pthread library. PCI information about devices and address space is discovered
through the /sys kernel interface and through kernel modules such as uio_pci_generic, or
igb_uio. Refer to the UIO: User-space drivers documentation in the Linux kernel. This memory
is mmap’d in the application.
The EAL performs physical memory allocation using mmap() in hugetlbfs (using huge page
sizes to increase performance). This memory is exposed to DPDK service layers such as the
Mempool Library .
At this point, the DPDK services layer will be initialized, then through pthread setaffinity calls,
each execution unit will be assigned to a specific logical core to run as a user-level thread.
The time reference is provided by the CPU Time-Stamp Counter (TSC) or by the HPET kernel
API through a mmap() call.
Part of the initialization is done by the start function of glibc. A check is also performed at
initialization time to ensure that the micro architecture type chosen in the config file is supported
by the CPU. Then, the main() function is called. The core initialization and launch is done
in rte_eal_init() (see the API documentation). It consists of calls to the pthread library (more
specifically, pthread_self(), pthread_create(), and pthread_setaffinity_np()).
Note: Initialization of objects, such as memory zones, rings, memory pools, lpm tables and
hash tables, should be done as part of the overall application initialization on the master lcore.
The creation and initialization functions for these objects are not multi-thread safe. However,
once initialized, the objects themselves can safely be used in multiple threads simultaneously.
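A minimal sketch of this initialization and launch flow (not taken from the guide; it assumes the
standard rte_eal.h, rte_launch.h and rte_lcore.h APIs and an arbitrary worker function) is shown
below:

#include <stdio.h>
#include <rte_common.h>
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_launch.h>
#include <rte_debug.h>

/* Work executed by every lcore; here it only prints its lcore ID. */
static int
lcore_main(__rte_unused void *arg)
{
    printf("hello from lcore %u\n", rte_lcore_id());
    return 0;
}

int
main(int argc, char **argv)
{
    /* Parse EAL arguments and set up memory, PCI and threads. */
    if (rte_eal_init(argc, argv) < 0)
        rte_panic("cannot init EAL\n");

    /* Launch lcore_main() on all slave lcores, run it on the master
     * lcore as well, then wait for every lcore to finish. */
    rte_eal_mp_remote_launch(lcore_main, NULL, SKIP_MASTER);
    lcore_main(NULL);
    rte_eal_mp_wait_lcore();
    return 0;
}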
The allocation of large contiguous physical memory is done using the hugetlbfs kernel filesys-
tem. The EAL provides an API to reserve named memory zones in this contiguous memory.
The physical address of the reserved memory for that memory zone is also returned to the
user by the memory zone reservation API.
Note: Memory reservations done using the APIs provided by rte_malloc are also backed by
pages from the hugetlbfs filesystem.
The EAL uses the /sys/bus/pci utilities provided by the kernel to scan the content on the PCI
bus. To access PCI memory, a kernel module called uio_pci_generic provides a /dev/uioX
device file and resource files in /sys that can be mmap’d to obtain access to PCI address
space from the application. The DPDK-specific igb_uio module can also be used for this. Both
drivers use the uio kernel feature (userland driver).
[Figure: EAL initialization in a Linux application environment. main() calls rte_eal_init(), which
runs rte_eal_memory_init(), rte_eal_logs_init(), rte_eal_pci_init() and so on, then uses
rte_eal_remote_launch() to start per_lcore_app_init() on each lcore and waits with
rte_eal_mp_wait_lcore() before the application itself is launched on every lcore via
rte_eal_remote_launch(app).]
Note: lcore refers to a logical execution unit of the processor, sometimes called a hardware
thread.
Shared variables are the default behavior. Per-lcore variables are implemented using Thread
Local Storage (TLS) to provide per-thread local storage.
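A per-lcore variable can be declared with the rte_per_lcore.h macros; the short sketch below
(the counter name is arbitrary) gives each lcore its own instance:

#include <stdint.h>
#include <rte_per_lcore.h>

/* One independent counter per lcore, kept in thread-local storage. */
RTE_DEFINE_PER_LCORE(uint64_t, rx_pkts);

static inline void
count_rx(unsigned int n)
{
    RTE_PER_LCORE(rx_pkts) += n;
}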
3.1.6 Logs
A logging API is provided by EAL. By default, in a Linux application, logs are sent to syslog and
also to the console. However, the log function can be overridden by the user to use a different
logging mechanism.
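For illustration (the log type and message are arbitrary), a message can be emitted with the
standard RTE_LOG macro:

#include <rte_log.h>

static void
report_link_up(unsigned int portid)
{
    /* Sent to syslog and/or the console, or to whatever stream the
     * application has registered with the EAL logging API. */
    RTE_LOG(INFO, USER1, "port %u link up\n", portid);
}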
There are some debug functions to dump the stack in glibc. The rte_panic() function can
voluntarily provoke a SIGABRT, which can trigger the generation of a core file, readable by
gdb.
The EAL can query the CPU at runtime (using the rte_cpu_get_features() function) to deter-
mine which CPU features are available.
Note: In DPDK PMD, the only interrupts handled by the dedicated host thread are those for
link status change (link up and link down notification) and for sudden device removal.
• RX Interrupt Event
The receive and transmit routines provided by each PMD are not restricted to executing in a
polling thread. To avoid wasting cycles on idle polling when throughput is tiny, it is useful to
pause the polling and wait until a wake-up event happens. The RX interrupt is the first choice
for such a wake-up event, but it will probably not be the only one.
EAL provides the event APIs for this event-driven thread mode. Taking linuxapp as an example,
the implementation relies on epoll. Each thread can monitor an epoll instance in which all the
wake-up events’ file descriptors are added. The event file descriptors are created and mapped
to the interrupt vectors according to the UIO/VFIO spec. From bsdapp’s perspective, kqueue
is the alternative way, but not implemented yet.
EAL initializes the mapping between event file descriptors and interrupt vectors, while each
device initializes the mapping between interrupt vectors and queues. In this way, EAL actually
is unaware of the interrupt cause on the specific vector. The eth_dev driver takes responsibility
to program the latter mapping.
Note: Per-queue RX interrupt events are only allowed in VFIO, which supports multiple MSI-X
vectors. In UIO, the RX interrupt shares the same vector with other interrupt causes. In this
case, when the RX interrupt and the LSC (link status change) interrupt are both enabled
(intr_conf.lsc == 1 && intr_conf.rxq == 1), only the former is handled.
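The sketch below (linuxapp only; it assumes the queue was configured with intr_conf.rxq = 1
and that its event file descriptor was mapped to an epoll instance, for example with
rte_eth_dev_rx_intr_ctl_q()) shows the general pattern of pausing a polling loop until the RX
interrupt fires:

#include <rte_ethdev.h>
#include <rte_interrupts.h>

/* Block the calling thread until an RX interrupt arrives on (port, queue),
 * then resume polling. */
static void
sleep_until_rx(uint16_t port, uint16_t queue)
{
    struct rte_epoll_event event;

    rte_eth_dev_rx_intr_enable(port, queue);

    /* Wait on this thread's default epoll instance; -1 means no timeout. */
    rte_epoll_wait(RTE_EPOLL_PER_THREAD, &event, 1, -1);

    rte_eth_dev_rx_intr_disable(port, queue);
}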
3.1.9 Blacklisting
The EAL PCI device blacklist functionality can be used to mark certain NIC ports as blacklisted,
so they are ignored by the DPDK. The ports to be blacklisted are identified using the PCIe*
description (Domain:Bus:Device.Function).
The mapping of physical memory is provided by this feature in the EAL. As physical memory
can have gaps, the memory is described in a table of descriptors, and each descriptor (called
rte_memseg ) describes a contiguous portion of memory.
On top of this, the memzone allocator’s role is to reserve contiguous portions of physical mem-
ory. These zones are identified by a unique name when the memory is reserved.
The rte_memzone descriptors are also located in the configuration structure. This structure is
accessed using rte_eal_get_configuration(). The lookup (by name) of a memory zone returns
a descriptor containing the physical address of the memory zone.
Memory zones can be reserved with specific start address alignment by supplying the align
parameter (by default, they are aligned to cache line size). The alignment value should be a
power of two and not less than the cache line size (64 bytes). Memory zones can also be
reserved from either 2 MB or 1 GB hugepages, provided that both are available on the system.
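A short reservation/lookup sketch (the zone name and size are placeholders) illustrates the API:

#include <rte_memzone.h>
#include <rte_lcore.h>

static const struct rte_memzone *
get_state_zone(void)
{
    /* Reuse the zone if it was already reserved under this name, e.g. by
     * the primary process. */
    const struct rte_memzone *mz = rte_memzone_lookup("app_state");

    if (mz == NULL)
        mz = rte_memzone_reserve_aligned("app_state", 1 << 20,
                rte_socket_id(), 0 /* flags */, RTE_CACHE_LINE_SIZE);

    /* mz->addr is the virtual address; the physical address of the zone
     * is also available from the returned descriptor. */
    return mz;
}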
DPDK usually pins one pthread per core to avoid the overhead of task switching. This allows
for significant performance gains, but lacks flexibility and is not always efficient.
Power management helps to improve CPU efficiency by limiting the CPU runtime frequency.
Alternatively, it is possible to utilize the idle cycles that become available and so take advantage
of the full capability of the CPU.
By taking advantage of cgroup, the CPU utilization quota can be simply assigned. This gives
another way to improve the CPU efficiency, however, there is a prerequisite; DPDK must handle
the context switching between multiple pthreads per core.
For further flexibility, it is useful to set pthread affinity not only to a CPU but to a CPU set.
The term “lcore” refers to an EAL thread, which is really a Linux/FreeBSD pthread. “EAL
pthreads” are created and managed by EAL and execute the tasks issued by remote_launch.
In each EAL pthread, there is a TLS (Thread Local Storage) called _lcore_id for unique identi-
fication. As EAL pthreads usually bind 1:1 to the physical CPU, the _lcore_id is typically equal
to the CPU ID.
When using multiple pthreads, however, the binding is no longer always 1:1 between an EAL
pthread and a specified physical CPU. The EAL pthread may have affinity to a CPU set, and
as such the _lcore_id will not be the same as the CPU ID. For this reason, there is an EAL
long option ‘–lcores’ defined to assign the CPU affinity of lcores. For a specified lcore ID or ID
group, the option allows setting the CPU set for that EAL pthread.
The format pattern: --lcores='<lcore_set>[@cpu_set][,<lcore_set>[@cpu_set],...]'
‘lcore_set’ and ‘cpu_set’ can be a single number, range or a group.
A number is a “digit([0-9]+)”; a range is “<number>-<number>”; a group is “(<num-
ber|range>[,<number|range>,...])”.
If a ‘@cpu_set’ value is not supplied, the value of ‘cpu_set’ will default to the value of ‘lcore_set’.
For example, "--lcores='1,2@(5-7),(3-5)@(0,2),(0,6),7-8'" starts 9 EAL threads:
lcore 0 runs on cpuset 0x41 (cpu 0,6);
lcore 1 runs on cpuset 0x2 (cpu 1);
lcore 2 runs on cpuset 0xe0 (cpu 5,6,7);
lcore 3,4,5 runs on cpuset 0x5 (cpu 0,2);
lcore 6 runs on cpuset 0x41 (cpu 0,6);
lcore 7 runs on cpuset 0x80 (cpu 7);
lcore 8 runs on cpuset 0x100 (cpu 8).
Using this option, for each given lcore ID, the associated CPUs can be assigned. It’s also
compatible with the pattern of corelist(‘-l’) option.
It is possible to use the DPDK execution context with any user pthread (aka. Non-EAL
pthreads). In a non-EAL pthread, the _lcore_id is always LCORE_ID_ANY which identifies
that it is not an EAL thread with a valid, unique, _lcore_id. Some libraries will use an alter-
native unique ID (e.g. TID), some will not be impacted at all, and some will work but with
limitations (e.g. timer and mempool libraries).
All these impacts are mentioned in Known Issues section.
• rte_mempool
The rte_mempool uses a per-lcore cache inside the mempool. For non-EAL pthreads,
rte_lcore_id() will not return a valid number. So for now, when rte_mempool is used
with non-EAL pthreads, the put/get operations will bypass the default mempool cache and
there is a performance penalty because of this bypass. Only user-owned external caches
can be used in a non-EAL context in conjunction with rte_mempool_generic_put()
and rte_mempool_generic_get() that accept an explicit cache parameter.
• rte_ring
rte_ring supports multi-producer enqueue and multi-consumer dequeue. However, it is
non-preemptive, which has the knock-on effect of making rte_mempool non-preemptable.
This does not mean it cannot be used; it simply means that care is needed when it is used by
multiple pthreads on the same core.
The following is a simple example of cgroup control usage. There are two pthreads (t0 and t1)
doing packet I/O on the same core ($CPU), and we expect only 50% of the CPU to be spent on
packet I/O:
mkdir /sys/fs/cgroup/cpu/pkt_io
mkdir /sys/fs/cgroup/cpuset/pkt_io
cd /sys/fs/cgroup/cpu/pkt_io
echo 100000 > cpu.cfs_period_us   # 100 ms period
echo 50000 > cpu.cfs_quota_us     # 50 ms quota = 50% of one CPU
3.4 Malloc
3.4.1 Cookies
The rte_malloc() takes an align argument that can be used to request a memory area that is
aligned on a multiple of this value (which must be a power of two).
On systems with NUMA support, a call to the rte_malloc() function will return memory that has
been allocated on the NUMA socket of the core which made the call. A set of APIs is also
provided to allow memory to be explicitly allocated on a NUMA socket directly, or to be allocated
on the NUMA socket where another core is located, in the case where the memory is to be
used by a logical core other than the one doing the memory allocation.
This API is meant to be used by an application that requires malloc-like functions at initialization
time.
For allocating/freeing data at runtime, in the fast-path of an application, the memory pool library
should be used instead.
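For illustration (the type strings, sizes and worker lcore are placeholders), the plain and
socket-aware allocation calls look like this:

#include <stdint.h>
#include <rte_malloc.h>
#include <rte_lcore.h>

static void
malloc_examples(unsigned int worker_lcore_id)
{
    /* 1 KB, aligned to 64 bytes, on the calling core's NUMA socket. */
    uint8_t *local = rte_malloc("scratch", 1024, 64);

    /* The same allocation, but placed on the NUMA socket of the lcore
     * that will actually use the memory. */
    uint8_t *remote = rte_malloc_socket("scratch", 1024, 64,
            rte_lcore_to_socket_id(worker_lcore_id));

    rte_free(local);    /* rte_free(NULL) is a no-op */
    rte_free(remote);
}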
Data Structures
There are two data structure types used internally in the malloc library:
• struct malloc_heap - used to track free space on a per-socket basis
• struct malloc_elem - the basic element of allocation and free-space tracking inside the
library.
Structure: malloc_heap
The malloc_heap structure is used to manage free space on a per-socket basis. Internally,
there is one heap structure per NUMA node, which allows us to allocate memory to a thread
based on the NUMA node on which this thread runs. While this does not guarantee that the
memory will be used on that NUMA node, it is no worse than a scheme where the memory is
always allocated on a fixed or random node.
The key fields of the heap structure and their function are described below (see also the
diagram in Fig. 3.2):
• lock - the lock field is needed to synchronize access to the heap. Given that the free
space in the heap is tracked using a linked list, we need a lock to prevent two threads
manipulating the list at the same time.
• free_head - this points to the first element in the list of free nodes for this malloc heap.
Note: The malloc_heap structure does not keep track of in-use blocks of memory, since these
are never touched except when they are to be freed again - at which point the pointer to the
block is an input to the free() function.
[Fig. 3.2: Example of a malloc heap and malloc elements within the malloc library. The
struct malloc_heap free_head points into a free-list of struct malloc_elem headers spread
across the memsegs. The legend distinguishes free element headers (state = FREE), used
element headers (state = BUSY), pad element headers (state = PAD, pad = padsize) and
dummy end-of-memseg elements (size = 0, state = BUSY).]
Structure: malloc_elem
The malloc_elem structure is used as a generic header structure for various blocks of memory.
It is used in three different ways - all shown in the diagram above:
1. As a header on a block of free or allocated memory - normal case
2. As a padding header inside a block of memory
3. As an end-of-memseg marker
The most important fields in the structure and how they are used are described below.
Note: If the usage of a particular field in one of the above three usages is not described, the
field can be assumed to have an undefined value in that situation, for example, for padding
headers only the “state” and “pad” fields have valid values.
• heap - this pointer is a reference back to the heap structure from which this block was
allocated. It is used for normal memory blocks when they are being freed, to add the
newly-freed block to the heap’s free-list.
• prev - this pointer points to the header element/block in the memseg immediately behind
the current one. When freeing a block, this pointer is used to reference the previous block
to check if that block is also free. If so, then the two free blocks are merged to form a
single larger block.
• next_free - this pointer is used to chain the free-list of unallocated memory blocks to-
gether. It is only used in normal memory blocks; on malloc() to find a suitable free
block to allocate and on free() to add the newly freed element to the free-list.
• state - This field can have one of three values: FREE, BUSY or PAD. The former two are
to indicate the allocation state of a normal memory block and the latter is to indicate that
the element structure is a dummy structure at the end of the start-of-block padding, i.e.
where the start of the data within a block is not at the start of the block itself, due to
alignment constraints. In that case, the pad header is used to locate the actual malloc
element header for the block. For the end-of-memseg structure, this is always a BUSY
value, which ensures that no element, on being freed, searches beyond the end of the
memseg for other blocks to merge with into a larger free area.
• pad - this holds the length of the padding present at the start of the block. In the case
of a normal block header, it is added to the address of the end of the header to give the
address of the start of the data area, i.e. the value passed back to the application on
a malloc. Within a dummy header inside the padding, this same value is stored, and is
subtracted from the address of the dummy header to yield the address of the actual block
header.
• size - the size of the data block, including the header itself. For end-of-memseg structures,
this size is given as zero, though it is never actually checked. For normal blocks which are
being freed, this size value is used in place of a “next” pointer to identify the location of the
next block of memory; if that next block is FREE, the two free blocks are merged into a
single larger block.
Memory Allocation
On EAL initialization, all memsegs are set up as part of the malloc heap. This setup involves
placing a dummy structure with BUSY state at the end, which may contain a sentinel value if
CONFIG_RTE_MALLOC_DEBUG is enabled, and a proper element header with FREE state at the
start of each memseg. The FREE element is then added to the free_list for the malloc heap.
When an application makes a call to a malloc-like function, the malloc function will first index the
lcore_config structure for the calling thread, and determine the NUMA node of that thread.
The NUMA node is used to index the array of malloc_heap structures which is passed as a
parameter to the heap_alloc() function, along with the requested size, type, alignment and
boundary parameters.
The heap_alloc() function will scan the free_list of the heap, and attempt to find a free block
suitable for storing data of the requested size, with the requested alignment and boundary
constraints.
When a suitable free element has been identified, the pointer to be returned to the user is
calculated. The cache-line of memory immediately preceding this pointer is filled with a struct
malloc_elem header. Because of alignment and boundary constraints, there could be free
space at the start and/or end of the element, resulting in the following behavior:
1. Check for trailing space. If the trailing space is big enough, i.e. > 128 bytes, then the free
element is split. If it is not, then we just ignore it (wasted space).
2. Check for space at the start of the element. If the space at the start is small, i.e. <=128
bytes, then a pad header is used, and the remaining space is wasted. If, however, the
remaining space is greater, then the free element is split.
The advantage of allocating the memory from the end of the existing element is that no ad-
justment of the free list needs to take place - the existing element on the free list just has its
size pointer adjusted, and the following element has its “prev” pointer redirected to the newly
created element.
Freeing Memory
To free an area of memory, the pointer to the start of the data area is passed to the free
function. The size of the malloc_elem structure is subtracted from this pointer to get the
element header for the block. If this header is of type PAD then the pad length is further
subtracted from the pointer to get the proper element header for the entire block.
From this element header, we get pointers to the heap from which the block was allocated and
to where it must be freed, as well as the pointer to the previous element, and via the size field,
we can calculate the pointer to the next element. These next and previous elements are then
checked to see if they are also FREE, and if so, they are merged with the current element. This
means that we can never have two FREE memory blocks adjacent to one another, as they are
always merged into a single block.
CHAPTER FOUR: SERVICE CORES
DPDK has a concept known as service cores, which enables a dynamic way of performing
work on DPDK lcores. Service core support is built into the EAL, and an API is provided to
optionally allow applications to control how the service cores are used at runtime.
The service cores concept is built up out of services (components of DPDK that require CPU
cycles to operate) and service cores (DPDK lcores, tasked with running services). The power
of the service core concept is that the mapping between service cores and services can be
configured to abstract away the difference between platforms and environments.
For example, the Eventdev has hardware and software PMDs. Of these the software PMD
requires an lcore to perform the scheduling operations, while the hardware PMD does not.
With service cores, the application would not directly notice that the scheduling is done in
software.
For detailed information about the service core API, please refer to the docs.
There are two methods to having service cores in a DPDK application: either using the
service coremask, or dynamically adding cores using the API. The simpler of the two is to
pass the -s coremask argument to EAL, which will take any cores available in the main DPDK
coremask, and if the bits are also set in the service coremask those cores become service cores
instead of DPDK application lcores.
Each registered service can be individually mapped to a service core, or set of service cores.
Enabling a service on a particular core means that the lcore in question will run the service.
Disabling the service on that core stops the lcore in question from running the service.
Using this method, it is possible to assign specific workloads to each service core, and map N
workloads to M number of service cores. Each service lcore loops over the services that are
enabled for that core, and invokes the function to run the service.
The service core library is capable of collecting runtime statistics like number of calls to a
specific service, and number of cycles used by the service. The cycle count collection is
dynamically configurable, allowing any application to profile the services running on the system
at any time.
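A hedged sketch of the runtime API described in this chapter follows; the service name,
lcore ID and the statistics call are assumptions against the rte_service.h API of this release:

#include <stdio.h>
#include <rte_service.h>

static int
run_service_on_lcore(const char *service_name, uint32_t lcore_id)
{
    uint32_t id;

    if (rte_service_get_by_name(service_name, &id) != 0) {
        printf("service %s not registered\n", service_name);
        return -1;
    }

    /* Turn the lcore into a service core and start its service loop. */
    rte_service_lcore_add(lcore_id);
    rte_service_lcore_start(lcore_id);

    /* Map the service to that core and allow it to run. */
    rte_service_map_lcore_set(id, lcore_id, 1);
    rte_service_runstate_set(id, 1);

    /* Optionally enable per-service call and cycle statistics. */
    rte_service_set_stats_enable(id, 1);
    return 0;
}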
CHAPTER FIVE: RING LIBRARY
The ring allows the management of queues. Instead of having a linked list of infinite size, the
rte_ring has the following properties:
• FIFO
• Maximum size is fixed, the pointers are stored in a table
• Lockless implementation
• Multi-consumer or single-consumer dequeue
• Multi-producer or single-producer enqueue
• Bulk dequeue - Dequeues the specified count of objects if successful; otherwise fails
• Bulk enqueue - Enqueues the specified count of objects if successful; otherwise fails
• Burst dequeue - Dequeue the maximum available objects if the specified count cannot
be fulfilled
• Burst enqueue - Enqueue the maximum available objects if the specified count cannot
be fulfilled
The advantages of this data structure over a linked list queue are as follows:
• Faster; only requires a single Compare-And-Swap instruction of sizeof(void *) instead of
several double-Compare-And-Swap instructions.
• Simpler than a full lockless queue.
• Adapted to bulk enqueue/dequeue operations. As pointers are stored in a table, a de-
queue of several objects will not produce as many cache misses as in a linked queue.
Also, a bulk dequeue of many objects does not cost more than a dequeue of a simple
object.
The disadvantages:
• Size is fixed
• Having many rings costs more in terms of memory than a linked list queue. An empty
ring contains at least N pointers.
A simplified representation of a ring is shown below, with consumer and producer head and tail
pointers to objects stored in the data structure.
[Figure: Ring structure — cons_head/cons_tail and prod_head/prod_tail indexes pointing into
the object table.]
The following code was added in FreeBSD 8.0, and is used in some network device drivers (at
least in Intel drivers):
• bufring.h in FreeBSD
• bufring.c in FreeBSD
The following is a link describing the Linux Lockless Ring Buffer Design.
5.3.1 Name
A ring is identified by a unique name. It is not possible to create two rings with the same name
(rte_ring_create() returns NULL if this is attempted).
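A small usage sketch (the ring name, size and flags are illustrative; the count must be a power
of two):

#include <stdio.h>
#include <rte_ring.h>
#include <rte_lcore.h>

static int
ring_example(void *obj)
{
    void *out;
    struct rte_ring *r;

    /* Single-producer/single-consumer ring of 1024 pointer slots. */
    r = rte_ring_create("example_ring", 1024, rte_socket_id(),
            RING_F_SP_ENQ | RING_F_SC_DEQ);
    if (r == NULL)          /* e.g. duplicate name or no memory left */
        return -1;

    if (rte_ring_enqueue(r, obj) != 0)
        printf("ring full\n");

    if (rte_ring_dequeue(r, &out) != 0)
        printf("ring empty\n");

    rte_ring_free(r);
    return 0;
}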
This section explains how a ring buffer operates. The ring structure is composed of two pairs of
head and tail indexes; one pair is used by producers and the other by consumers. The figures
in the following sections refer to them as prod_head, prod_tail, cons_head and cons_tail.
Each figure represents a simplified state of the ring, which is a circular buffer. The content
of the function local variables is represented on the top of the figure, and the content of ring
structure is represented on the bottom of the figure.
This section explains what occurs when a producer adds an object to the ring. In this example,
only the producer head and tail (prod_head and prod_tail) are modified, and there is only one
producer.
The initial state is to have a prod_head and prod_tail pointing at the same location.
First, ring->prod_head and ring->cons_tail are copied in local variables. The prod_next lo-
cal variable points to the next element of the table, or several elements after in case of bulk
enqueue.
If there is not enough room in the ring (this is detected by checking cons_tail), it returns an
error.
[Figure: Enqueue first step — local variables vs. ring structure state (cons_head, cons_tail,
prod_head, prod_tail).]
The second step is to modify ring->prod_head in ring structure to point to the same location
as prod_next.
A pointer to the added object is copied in the ring (obj4).
[Figure: Enqueue second step — local variables vs. ring structure state.]
Once the object is added in the ring, ring->prod_tail in the ring structure is modified to point to
the same location as ring->prod_head. The enqueue operation is finished.
This section explains what occurs when a consumer dequeues an object from the ring. In this
example, only the consumer head and tail (cons_head and cons_tail) are modified and there
is only one consumer.
The initial state is to have a cons_head and cons_tail pointing at the same location.
First, ring->cons_head and ring->prod_tail are copied in local variables. The cons_next local
variable points to the next element of the table, or several elements after in the case of bulk
dequeue.

[Figure: Dequeue first step — local variables vs. ring structure state (cons_head, cons_tail,
prod_head, prod_tail).]
If there are not enough objects in the ring (this is detected by checking prod_tail), it returns an
error.
The second step is to modify ring->cons_head in the ring structure to point to the same location
as cons_next.
The pointer to the dequeued object (obj1) is copied in the pointer given by the user.
Finally, ring->cons_tail in the ring structure is modified to point to the same location as ring-
>cons_head. The dequeue operation is finished.
This section explains what occurs when two producers concurrently add an object to the ring.
In this example, only the producer head and tail (prod_head and prod_tail) are modified.
The initial state is to have a prod_head and prod_tail pointing at the same location.
[Figures: Dequeue second and last steps (Fig. 5.6: Dequeue second step), and multiple
producer enqueue first step — local variables vs. ring structure state.]
On both cores, ring->prod_head and ring->cons_tail are copied in local variables. The
prod_next local variable points to the next element of the table, or several elements after in
the case of bulk enqueue.
If there is not enough room in the ring (this is detected by checking cons_tail), it returns an
error.
The second step is to modify ring->prod_head in the ring structure to point to the same location
as prod_next. This operation is done using a Compare And Swap (CAS) instruction, which
does the following operations atomically:
• If ring->prod_head is different from the local variable prod_head, the CAS operation fails,
and the code restarts at the first step.
• Otherwise, ring->prod_head is set to the local prod_next, the CAS operation is successful,
and processing continues.
In the figure, the operation succeeded on core 1, and step one restarted on core 2.
[Figure: Multiple producer enqueue second step — ring structure state after the CAS on
ring->prod_head succeeds on core 1 and step one restarts on core 2.]
Core 1 updates one element of the ring (obj4), and core 2 updates another one (obj5).
[Figure: Multiple producer enqueue third step — the compare-and-swap succeeds on core 2;
core 2's local variables (cons_tail, prod_head, prod_next) and the ring structure state.]
Each core now wants to update ring->prod_tail. A core can only update it if ring->prod_tail is
equal to the prod_head local variable. This is only true on core 1. The operation is finished on
core 1.
Once ring->prod_tail is updated by core 1, core 2 is allowed to update it too. The operation is
also finished on core 2.
In the preceding figures, the prod_head, prod_tail, cons_head and cons_tail indexes are repre-
sented by arrows. In the actual implementation, these values are not between 0 and size(ring)-
1 as would be assumed. The indexes are between 0 and 2^32 -1, and we mask their value
when we access the pointer table (the ring itself). 32-bit modulo also implies that operations
on indexes (such as, add/subtract) will automatically do 2^32 modulo if the result overflows the
32-bit number range.
The following are two examples that help to explain how indexes are used in a ring.
[Figures: ring structure state (cons_head/cons_tail, prod_head/prod_tail) for the final multiple
producer enqueue steps.]
Note: To simplify the explanation, operations with modulo 16-bit are used instead of modulo
32-bit. In addition, the four indexes are defined as unsigned 16-bit integers, as opposed to
unsigned 32-bit integers in the more realistic case.
Example 1: size = 16384, mask = 16383, prod_head = prod_tail = 14000,
cons_head = cons_tail = 3000 (no index has wrapped).
used_entries = (prod_tail - cons_head) % 65536 = 11000
free_entries = (mask + cons_tail - prod_head) % 65536 = 5383

Example 2: size = 16384, mask = 16383, prod_head = prod_tail = 6000,
cons_head = cons_tail = 59000 (the producer index has wrapped past 65535).
used_entries = (prod_tail - cons_head) % 65536 = 12536
free_entries = (mask + cons_tail - prod_head) % 65536 = 3847
Note: For ease of understanding, we use explicit modulo 65536 operations in the above
examples. In real execution, writing the modulo explicitly would be redundant and would cost
cycles; the wrap-around happens automatically when the result overflows the unsigned integer
range.
The code always maintains a distance between producer and consumer between 0 and
size(ring)-1. Thanks to this property, we can do subtractions between 2 index values in a
modulo-32bit base: that’s why the overflow of the indexes is not a problem.
At any time, entries and free_entries are between 0 and size(ring)-1, even if only the first term
of subtraction has overflowed:
uint32_t entries = (prod_tail - cons_head);
uint32_t free_entries = (mask + cons_tail - prod_head);
5.6 References
CHAPTER SIX: MEMPOOL LIBRARY
A memory pool is an allocator of fixed-size objects. In the DPDK, it is identified by name and
uses a mempool handler to store free objects. The default mempool handler is ring based. It
provides some other optional services such as a per-core object cache and an alignment helper
to ensure that objects are padded to spread them equally across all DRAM or DDR3 channels.
This library is used by the Mbuf Library .
6.1 Cookies
6.2 Stats
Note: The command line must always have the number of memory channels specified for the
processor.
Examples of alignment for different DIMM architectures are shown in Fig. 6.1 and Fig. 6.2.
[Fig. 6.1: Example of object (packet) layout across memory channels — memory addresses
64 bytes wide, with padding inserted between packet 1 and packet 2.]
In this case, the assumption is that a packet is 16 blocks of 64 bytes, which is not true.
The Intel® 5520 chipset has three channels, so in most cases, no padding is required between
objects (except for objects whose size are n x 3 x 64 bytes blocks).
[Fig. 6.2: Example layout for the three-channel case — memory addresses 64 bytes wide.]
When creating a new pool, the user can specify to use this feature or not.
In terms of CPU usage, the cost of multiple cores accessing a memory pool's ring of free
buffers may be high since each access requires a compare-and-set (CAS) operation. To avoid
having too many access requests to the memory pool's ring, the memory pool allocator can
maintain a per-core cache and do bulk requests to the memory pool's ring, via the cache, with
many fewer locks on the actual memory pool structure. In this way, each core has full access
to its own cache of free objects, and only when the cache fills does the core need to shuffle
some of the free objects back to the pool's ring, or obtain more objects when the cache
is empty.
While this may mean a number of buffers may sit idle on some core’s cache, the speed at
which a core can access its own cache for a specific memory pool without locks provides
performance gains.
The cache is composed of a small, per-core table of pointers and its length (used as a stack).
This internal cache can be enabled or disabled at creation of the pool.
The maximum size of the cache is static and is defined at compilation time
(CONFIG_RTE_MEMPOOL_CACHE_MAX_SIZE).
Fig. 6.3 shows a cache in operation.
[Fig. 6.3: A local cache in operation — per-core tables of object pointers (obj 2 ... obj n) in front
of the mempool ring.]
Alternatively to the internal default per-lcore local cache, an application can cre-
ate and manage external caches through the rte_mempool_cache_create(),
rte_mempool_cache_free() and rte_mempool_cache_flush() calls. These
user-owned caches can be explicitly passed to rte_mempool_generic_put() and
rte_mempool_generic_get(). The rte_mempool_default_cache() call returns the
default internal cache if any. In contrast to the default caches, user-owned caches can be
used by non-EAL threads too.
This allows external memory subsystems, such as external hardware memory management
systems and software based memory allocators, to be used with DPDK.
There are two aspects to a mempool handler.
• Adding the code for your new mempool operations (ops). This is achieved by adding a
new mempool ops code, and using the MEMPOOL_REGISTER_OPS macro.
• Using the new API to call rte_mempool_create_empty() and
rte_mempool_set_ops_byname() to create a new mempool and specifying
which ops to use.
Several different mempool handlers may be used in the same application. A new mem-
pool can be created by using the rte_mempool_create_empty() function, then using
rte_mempool_set_ops_byname() to point the mempool to the relevant mempool handler
callback (ops) structure.
Legacy applications may continue to use the old rte_mempool_create() API call, which
uses a ring based mempool handler by default. These applications will need to be modified if
they are to use a different mempool handler.
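A hedged sketch of that sequence (the pool name, object sizes and the "stack" handler are
examples only and assume that handler is compiled in):

#include <rte_mempool.h>
#include <rte_lcore.h>

static struct rte_mempool *
create_pool_with_handler(void)
{
    struct rte_mempool *mp;

    /* 8192 objects of 2 KB each, with a 256-object per-lcore cache. */
    mp = rte_mempool_create_empty("example_pool", 8192, 2048,
            256, 0 /* private data size */, rte_socket_id(), 0 /* flags */);
    if (mp == NULL)
        return NULL;

    /* Attach the chosen handler before populating the pool. */
    if (rte_mempool_set_ops_byname(mp, "stack", NULL) < 0 ||
        rte_mempool_populate_default(mp) < 0) {
        rte_mempool_free(mp);
        return NULL;
    }
    return mp;
}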
All allocations that require a high level of performance should use a pool-based memory allo-
cator. Below are some examples:
• Mbuf Library
• Environment Abstraction Layer , for logging service
• Any application that needs to allocate fixed-sized objects in the data plane and that will
be continuously utilized by the system.
CHAPTER SEVEN: MBUF LIBRARY
The mbuf library provides the ability to allocate and free buffers (mbufs) that may be used by
the DPDK application to store message buffers. The message buffers are stored in a mempool,
using the Mempool Library .
A rte_mbuf struct can carry network packet buffers or generic control buffers (indicated by the
CTRL_MBUF_FLAG). This can be extended to other types. The rte_mbuf header structure is
kept as small as possible and currently uses just two cache lines, with the most frequently used
fields being on the first of the two cache lines.
For the storage of the packet data (including protocol headers), two approaches were consid-
ered:
1. Embed metadata within a single memory buffer: the structure followed by a fixed size area
for the packet data.
2. Use separate memory buffers for the metadata structure and for the packet data.
The advantage of the first method is that it only needs one operation to allocate/free the whole
memory representation of a packet. On the other hand, the second method is more flexible
and allows the complete separation of the allocation of metadata structures from the allocation
of packet data buffers.
The first method was chosen for the DPDK. The metadata contains control information such as
message type, length, offset to the start of the data and a pointer for additional mbuf structures
allowing buffer chaining.
Message buffers that are used to carry network packets can handle buffer chaining where
multiple buffers are required to hold the complete packet. This is the case for jumbo frames
that are composed of many mbufs linked together through their next field.
For a newly allocated mbuf, the area at which the data begins in the message buffer is
RTE_PKTMBUF_HEADROOM bytes after the beginning of the buffer, which is cache aligned.
Message buffers may be used to carry control information, packets, events, and so on between
different entities in the system. Message buffers may also use their buffer pointers to point to
other message buffer data sections or other structures.
Fig. 7.1 and Fig. 7.2 show some of these scenarios.
The Buffer Manager implements a fairly standard set of buffer access functions to manipulate
network packets.
[Fig. 7.1: An mbuf with one segment — the struct rte_mbuf header at m->buf_addr (m->buf_iova
is the corresponding physical address) is followed by headroom, the data area returned by
rte_pktmbuf_mtod(m), and tailroom.]
[Fig. 7.2: A multi-segmented rte_mbuf (m, mseg2, mseg3), where rte_pktmbuf_pktlen(m) =
rte_pktmbuf_datalen(m) + rte_pktmbuf_datalen(mseg2) + rte_pktmbuf_datalen(mseg3).]
The Buffer Manager uses the Mempool Library to allocate buffers. Therefore, it ensures
that the packet header is interleaved optimally across the channels and ranks for L3 pro-
cessing. An mbuf contains a field indicating the pool that it originated from. When calling
rte_ctrlmbuf_free(m) or rte_pktmbuf_free(m), the mbuf returns to its original pool.
7.3 Constructors
Packet and control mbuf constructors are provided by the API. The rte_pktmbuf_init() and
rte_ctrlmbuf_init() functions initialize some fields in the mbuf structure that are not modified by
the user once created (mbuf type, origin pool, buffer start address, and so on). This function is
given as a callback function to the rte_mempool_create() function at pool creation time.
Allocating a new mbuf requires the user to specify the mempool from which the mbuf
should be taken. A newly allocated mbuf contains one segment, with a length of 0. The
offset to data is initialized to leave some bytes of headroom in the buffer
(RTE_PKTMBUF_HEADROOM).
Freeing a mbuf means returning it into its original mempool. The content of an mbuf is not
modified when it is stored in a pool (as a free mbuf). Fields initialized by the constructor do not
need to be re-initialized at mbuf allocation.
When freeing a packet mbuf that contains several segments, all of them are freed and returned
to their original mempool.
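A brief sketch of the allocate/use/free cycle (the pool parameters are illustrative):

#include <string.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>

static void
mbuf_example(void)
{
    struct rte_mempool *pool;
    struct rte_mbuf *m;
    char *data;

    pool = rte_pktmbuf_pool_create("mbuf_pool", 8191, 256, 0,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (pool == NULL)
        return;

    m = rte_pktmbuf_alloc(pool);    /* one segment, data length 0 */
    if (m == NULL)
        return;

    /* Reserve 64 bytes of data after the headroom and fill them. */
    data = rte_pktmbuf_append(m, 64);
    if (data != NULL)
        memset(data, 0, 64);

    rte_pktmbuf_free(m);            /* returns to its original mempool */
}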
This library provides some functions for manipulating the data in a packet mbuf. For instance:
• Get data length
• Get a pointer to the start of data
• Prepend data before the existing data
• Append data after the existing data
• Remove data at the beginning of the buffer (rte_pktmbuf_adj())
• Remove data at the end of the buffer (rte_pktmbuf_trim())
Refer to the DPDK API Reference for details.
Some information is retrieved by the network driver and stored in an mbuf to make process-
ing easier. For instance, the VLAN, the RSS hash result (see Poll Mode Driver ) and a flag
indicating that the checksum was computed by hardware.
An mbuf also contains the input port (where it comes from), and the number of segment mbufs
in the chain.
For chained buffers, only the first mbuf of the chain stores this meta information.
For instance, this is the case on RX side for the IEEE1588 packet timestamp mechanism, the
VLAN tagging and the IP checksum computation.
On the TX side, it is also possible for an application to delegate some processing to the hardware
if it supports it. For instance, the PKT_TX_IP_CKSUM flag allows offloading the computation
of the IPv4 checksum.
The following examples explain how to configure different TX offloads on a vxlan-encapsulated
tcp packet: out_eth/out_ip/out_udp/vxlan/in_eth/in_ip/in_tcp/payload
• calculate checksum of out_ip:
mb->l2_len = len(out_eth)
mb->l3_len = len(out_ip)
mb->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM
set out_ip checksum to 0 in the packet
This is similar to case 1), but l2_len is different. It is supported on hardware advertising
DEV_TX_OFFLOAD_IPV4_CKSUM. Note that it can only work if outer L4 checksum is
0.
• calculate checksum of in_ip and in_tcp:
mb->l2_len = len(out_eth + out_ip + out_udp + vxlan + in_eth)
mb->l3_len = len(in_ip)
mb->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM
set in_ip checksum to 0 in the packet
set in_tcp checksum to pseudo header using rte_ipv4_phdr_cksum()
This is similar to case 2), but l2_len is different. It is supported on hardware advertising
DEV_TX_OFFLOAD_IPV4_CKSUM and DEV_TX_OFFLOAD_TCP_CKSUM. Note that
it can only work if outer L4 checksum is 0.
• segment inner TCP:
mb->l2_len = len(out_eth + out_ip + out_udp + vxlan + in_eth)
mb->l3_len = len(in_ip)
mb->l4_len = len(in_tcp)
mb->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM |
PKT_TX_TCP_SEG;
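A minimal C sketch of a plain (non-encapsulated) IPv4 header checksum offload, analogous to the out_ip case above; the function name is illustrative and the packet is assumed to start with an Ethernet header followed by IPv4:

#include <rte_mbuf.h>
#include <rte_ether.h>
#include <rte_ip.h>

static void
request_ipv4_cksum_offload(struct rte_mbuf *m)
{
        struct ipv4_hdr *ip_hdr;

        m->l2_len = sizeof(struct ether_hdr);
        m->l3_len = sizeof(struct ipv4_hdr);
        m->ol_flags |= PKT_TX_IPV4 | PKT_TX_IP_CKSUM;

        /* The checksum field must be zeroed before handing the packet to
         * hardware advertising DEV_TX_OFFLOAD_IPV4_CKSUM. */
        ip_hdr = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *, m->l2_len);
        ip_hdr->hdr_checksum = 0;
}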
A direct buffer is a buffer that is completely separate and self-contained. An indirect buffer
behaves like a direct buffer but for the fact that the buffer pointer and data offset in it refer to
data in another direct buffer. This is useful in situations where packets need to be duplicated
or fragmented, since indirect buffers provide the means to reuse the same packet data across
multiple buffers.
A buffer becomes indirect when it is “attached” to a direct buffer using the rte_pktmbuf_attach()
function. Each buffer has a reference counter field and whenever an indirect buffer is attached
to the direct buffer, the reference counter on the direct buffer is incremented. Similarly, when-
ever the indirect buffer is detached, the reference counter on the direct buffer is decremented.
If the resulting reference counter is equal to 0, the direct buffer is freed since it is no longer in
use.
There are a few things to remember when dealing with indirect buffers. First of all, an indirect
buffer is never attached to another indirect buffer. Attempting to attach buffer A to indirect buffer
B that is attached to C, makes rte_pktmbuf_attach() automatically attach A to C, effectively
cloning B. Secondly, for a buffer to become indirect, its reference counter must be equal to 1,
that is, it must not be already referenced by another indirect buffer. Finally, it is not possible to
reattach an indirect buffer to the direct buffer (unless it is detached first).
While the attach/detach operations can be invoked directly using the recommended
rte_pktmbuf_attach() and rte_pktmbuf_detach() functions, it is suggested to use the higher-
level rte_pktmbuf_clone() function, which takes care of the correct initialization of an indirect
buffer and can clone buffers with multiple segments.
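As a sketch, assuming md is a direct mbuf and clone_pool is a mempool created to hold indirect mbufs (see the note on memory consumption below):

#include <rte_mbuf.h>

static struct rte_mbuf *
duplicate_packet(struct rte_mbuf *md, struct rte_mempool *clone_pool)
{
        struct rte_mbuf *mc;

        /* Each segment of the clone is an indirect mbuf attached to the
         * corresponding direct segment, whose reference counter is
         * incremented by the attach operation. */
        mc = rte_pktmbuf_clone(md, clone_pool);
        if (mc == NULL)
                return NULL;

        /* Freeing the clone later (rte_pktmbuf_free(mc)) detaches it and
         * decrements the reference counters of the direct segments; the
         * direct buffers are freed only when their counters reach zero. */
        return mc;
}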
Since indirect buffers are not supposed to actually hold any data, the memory pool for indirect
buffers should be configured to indicate the reduced memory consumption. Examples of the
initialization of a memory pool for indirect buffers (as well as use case examples for indirect
buffers) can be found in several of the sample applications, for example, the IPv4 Multicast
sample application.
7.8 Debug
In debug mode (CONFIG_RTE_MBUF_DEBUG is enabled), the functions of the mbuf library perform sanity checks before any operation (such as buffer corruption, bad type, and so on).
CHAPTER
EIGHT
POLL MODE DRIVER
The DPDK includes 1 Gigabit, 10 Gigabit, 40 Gigabit and paravirtualized virtio Poll Mode Drivers.
A Poll Mode Driver (PMD) consists of APIs, provided through the BSD driver running in user
space, to configure the devices and their respective queues. In addition, a PMD accesses the
RX and TX descriptors directly without any interrupts (with the exception of Link Status Change
interrupts) to quickly receive, process and deliver packets in the user’s application. This section
describes the requirements of the PMDs, their global design principles and proposes a high-
level architecture and a generic external API for the Ethernet PMDs.
The DPDK environment for packet processing applications allows for two models, run-to-
completion and pipe-line:
• In the run-to-completion model, a specific port’s RX descriptor ring is polled for packets
through an API. Packets are then processed on the same core and placed on a port’s TX
descriptor ring through an API for transmission.
• In the pipe-line model, one core polls one or more port’s RX descriptor ring through
an API. Packets are received and passed to another core via a ring. The other core
continues to process the packet which then may be placed on a port’s TX descriptor ring
through an API for transmission.
In a synchronous run-to-completion model, each logical core assigned to the DPDK executes
a packet processing loop that includes the following steps:
• Retrieve input packets through the PMD receive API
• Process each received packet one at a time, up to its forwarding
• Send pending output packets through the PMD transmit API
Conversely, in an asynchronous pipe-line model, some logical cores may be dedicated to the
retrieval of received packets and other logical cores to the processing of previously received
packets. Received packets are exchanged between logical cores through rings. The loop for
packet retrieval includes the following steps:
• Retrieve input packets through the PMD receive API
• Provide received packets to processing lcores through packet queues
The loop for packet processing includes the following steps:
• Retrieve the received packet from the packet queue
• Process the received packet, up to its retransmission if forwarded
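As a rough sketch of the synchronous run-to-completion loop described above (the port/queue identifiers, the burst size and the absence of a real processing step are illustrative):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32 /* illustrative burst size */

static void
lcore_main_loop(uint16_t port_id, uint16_t queue_id)
{
        struct rte_mbuf *bufs[BURST_SIZE];
        uint16_t nb_rx, nb_tx, i;

        for (;;) {
                /* Retrieve input packets through the PMD receive API */
                nb_rx = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);
                if (nb_rx == 0)
                        continue;

                /* ... process each received packet here (forwarding decision) ... */

                /* Send pending output packets through the PMD transmit API */
                nb_tx = rte_eth_tx_burst(port_id, queue_id, bufs, nb_rx);

                /* Free any packets the TX descriptor ring could not accept */
                for (i = nb_tx; i < nb_rx; i++)
                        rte_pktmbuf_free(bufs[i]);
        }
}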
The API and architecture of the Ethernet* PMDs are designed with the following guidelines in
mind.
PMDs must help global policy-oriented decisions to be enforced at the upper application level.
Conversely, NIC PMD functions should not impede the benefits expected by upper-level global
policies, or worse prevent such policies from being applied.
For instance, both the receive and transmit functions of a PMD have a maximum number of
packets/descriptors to poll. This allows a run-to-completion processing stack to statically fix or
to dynamically adapt its overall behavior through different global loop policies, such as:
• Receive, process immediately and transmit packets one at a time in a piecemeal fashion.
• Receive as many packets as possible, then process all received packets, transmitting
them immediately.
• Receive a given maximum number of packets, process the received packets, accumulate
them and finally send all accumulated packets to transmit.
To achieve optimal performance, overall software design choices and pure software optimiza-
tion techniques must be considered and balanced against available low-level hardware-based
optimization features (CPU cache properties, bus speed, NIC PCI bandwidth, and so on). The
case of packet transmission is an example of this software/hardware tradeoff issue when opti-
mizing burst-oriented network packet processing engines. In the initial case, the PMD could ex-
port only an rte_eth_tx_one function to transmit one packet at a time on a given queue. On top
of that, one can easily build an rte_eth_tx_burst function that loops invoking the rte_eth_tx_one
function to transmit several packets at a time. However, an rte_eth_tx_burst function is effec-
tively implemented by the PMD to minimize the driver-level transmit cost per packet through
the following optimizations:
• Share among multiple packets the un-amortized cost of invoking the rte_eth_tx_one func-
tion.
The DPDK supports NUMA allowing for better performance when a processor’s logical cores
and interfaces utilize its local memory. Therefore, mbuf allocation associated with local PCIe*
interfaces should be allocated from memory pools created in the local memory. The buffers
should, if possible, remain on the local processor to obtain the best performance results and RX
and TX buffer descriptors should be populated with mbufs allocated from a mempool allocated
from local memory.
The run-to-completion model also performs better if packet or data manipulation is in local memory instead of a remote processor's memory. This is also true for the pipe-line model provided all logical cores used are located on the same processor.
Multiple logical cores should never share receive or transmit queues for interfaces since this
would require global locks and hinder performance.
If the PMD is DEV_TX_OFFLOAD_MT_LOCKFREE capable, multiple threads can invoke rte_eth_tx_burst() concurrently on the same Tx queue without an SW lock. This PMD feature, found in some NICs, is useful in the following use cases:
• Remove explicit spinlock in some applications where lcores are not mapped to Tx queues with a 1:1 relation.
• In the eventdev use case, avoid dedicating a separate TX core for transmitting and thus enable more scaling as all workers can send the packets.
See Hardware Offload for DEV_TX_OFFLOAD_MT_LOCKFREE capability probing details.
Each NIC port is uniquely designated by its (bus/bridge, device, function) PCI identifiers as-
signed by the PCI probing/enumeration function executed at DPDK initialization. Based on
their PCI identifier, NIC ports are assigned two other identifiers:
• A port index used to designate the NIC port in all functions exported by the PMD API.
• A port name used to designate the port in console messages, for administration or de-
bugging purposes. For ease of use, the port name includes the port index.
All device features that can be started or stopped “on the fly” (that is, without stopping the
device) do not require the PMD API to export dedicated functions for this purpose.
All that is required is the mapping address of the device PCI registers to implement the config-
uration of these features in specific functions outside of the drivers.
For this purpose, the PMD API exports a function that provides all the information associated
with a device that can be used to set up a given device feature outside of the driver. This
includes the PCI vendor identifier, the PCI device identifier, the mapping address of the PCI
device registers, and the name of the driver.
The main advantage of this approach is that it gives complete freedom on the choice of the
API used to configure, to start, and to stop such features.
As an example, refer to the configuration of the IEEE1588 feature for the Intel® 82576 Giga-
bit Ethernet Controller and the Intel® 82599 10 Gigabit Ethernet Controller controllers in the
testpmd application.
Other features such as the L3/L4 5-Tuple packet filtering feature of a port can be configured in
the same way. Ethernet* flow control (pause frame) can be configured on the individual port.
Refer to the testpmd source code for details. Also, L4 (UDP/TCP/ SCTP) checksum offload by
the NIC can be enabled for an individual packet as long as the packet mbuf is set up correctly.
See Hardware Offload for details.
Each transmit queue is independently configured with the following information:
• The values of the Prefetch, Host and Write-Back threshold registers of the transmit queue
• The minimum transmit packets to free threshold (tx_free_thresh). When the number of
descriptors used to transmit packets exceeds this threshold, the network adaptor should
be checked to see if it has written back descriptors. A value of 0 can be passed during
the TX queue configuration to indicate the default value should be used. The default
value for tx_free_thresh is 32. This ensures that the PMD does not search for completed
descriptors until at least 32 have been processed by the NIC for this queue.
• The minimum RS bit threshold. The minimum number of transmit descriptors to use be-
fore setting the Report Status (RS) bit in the transmit descriptor. Note that this parameter
may only be valid for Intel 10 GbE network adapters. The RS bit is set on the last de-
scriptor used to transmit a packet if the number of descriptors used since the last RS bit
setting, up to the first descriptor used to transmit the packet, exceeds the transmit RS
bit threshold (tx_rs_thresh). In short, this parameter controls which transmit descriptors
are written back to host memory by the network adapter. A value of 0 can be passed
during the TX queue configuration to indicate that the default value should be used. The
default value for tx_rs_thresh is 32. This ensures that at least 32 descriptors are used
before the network adapter writes back the most recently used descriptor. This saves
upstream PCIe* bandwidth resulting from TX descriptor write-backs. It is important to
note that the TX Write-back threshold (TX wthresh) should be set to 0 when tx_rs_thresh
is greater than 1. Refer to the Intel® 82599 10 Gigabit Ethernet Controller Datasheet for
more details.
The following constraints must be satisfied for tx_free_thresh and tx_rs_thresh:
• tx_rs_thresh must be greater than 0.
• tx_rs_thresh must be less than the size of the ring minus 2.
• tx_rs_thresh must be less than or equal to tx_free_thresh.
• tx_free_thresh must be greater than 0.
• tx_free_thresh must be less than the size of the ring minus 3.
• For optimal performance, TX wthresh should be set to 0 when tx_rs_thresh is greater
than 1.
One descriptor in the TX ring is used as a sentinel to avoid a hardware race condition, hence
the maximum threshold constraints.
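A sketch of a TX queue configuration that satisfies the constraints above; all threshold and descriptor-count values are illustrative rather than recommendations:

#include <rte_ethdev.h>

static int
setup_tx_queue(uint16_t port_id, uint16_t queue_id, unsigned int socket_id)
{
        struct rte_eth_txconf txconf = {
                .tx_thresh = {
                        .pthresh = 36, /* prefetch threshold */
                        .hthresh = 0,  /* host threshold */
                        .wthresh = 0,  /* write-back threshold: 0 because
                                        * tx_rs_thresh below is greater than 1 */
                },
                .tx_rs_thresh = 32,   /* set the RS bit every 32 descriptors */
                .tx_free_thresh = 32, /* look for completed descriptors after 32 */
        };

        /* 512 TX descriptors; one of them acts as the sentinel */
        return rte_eth_tx_queue_setup(port_id, queue_id, 512, socket_id,
                        &txconf);
}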
Note: When configuring for DCB operation, at port initialization, both the number of transmit
queues and the number of receive queues must be set to 128.
Many of the drivers do not release the mbuf back to the mempool, or local cache, immediately
after the packet has been transmitted. Instead, they leave the mbuf in their Tx ring and either
perform a bulk release when the tx_rs_thresh has been crossed or free the mbuf when a
slot in the Tx ring is needed.
An application can request the driver to release used mbufs with the
rte_eth_tx_done_cleanup() API. This API requests the driver to release mbufs that are
no longer in use, independent of whether or not the tx_rs_thresh has been crossed. There
are two scenarios when an application may want the mbuf released immediately:
• When a given packet needs to be sent to multiple destination interfaces (either for Layer 2
flooding or Layer 3 multi-cast). One option is to make a copy of the packet or a copy of the
header portion that needs to be manipulated. A second option is to transmit the packet
and then poll the rte_eth_tx_done_cleanup() API until the reference count on the
packet is decremented. Then the same packet can be transmitted to the next destination
interface. The application is still responsible for managing any packet manipulations
needed between the different destination interfaces, but a packet copy can be avoided.
This API is independent of whether the packet was transmitted or dropped, only that the
mbuf is no longer in use by the interface.
• Some applications are designed to make multiple runs, like a packet generator. For
performance reasons and consistency between runs, the application may want to reset
back to an initial state between each run, where all mbufs are returned to the mempool.
In this case, it can call the rte_eth_tx_done_cleanup() API for each destination interface it has been using, to request that it release all of its used mbufs.
To determine if a driver supports this API, check for the Free Tx mbuf on demand feature in
the Network Interface Controller Drivers document.
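A small sketch of requesting an immediate mbuf release on a queue; the port/queue values are illustrative and the interpretation of the return code follows the API description above:

#include <stdio.h>
#include <rte_ethdev.h>

static void
reclaim_tx_mbufs(uint16_t port_id, uint16_t queue_id)
{
        /* free_cnt = 0 requests the release of all free-able mbufs */
        int ret = rte_eth_tx_done_cleanup(port_id, queue_id, 0);

        if (ret < 0)
                printf("Tx mbuf cleanup unsupported or failed: %d\n", ret);
        else
                printf("%d mbufs released on port %u queue %u\n",
                                ret, port_id, queue_id);
}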
In the DPDK offload API, offloads are divided into per-port and per-queue offloads. The different offload capabilities can be queried using rte_eth_dev_info_get(). Supported offloads can be either per-port or per-queue.
Offloads are enabled using the existing DEV_TX_OFFLOAD_* or DEV_RX_OFFLOAD_* flags.
Per-port offload configuration is set using rte_eth_dev_configure. Per-queue offload con-
figuration is set using rte_eth_rx_queue_setup and rte_eth_tx_queue_setup. To en-
able per-port offload, the offload should be set on both device configuration and queue setup.
In case of a mixed configuration the queue setup shall return with an error. To enable per-
queue offload, the offload can be set only on the queue setup. Offloads which are not enabled
are disabled by default.
For an application to use the Tx offloads API it should set the ETH_TXQ_FLAGS_IGNORE flag
in the txq_flags field located in rte_eth_txconf struct. In such cases it is not required
to set other flags in txq_flags. For an application to use the Rx offloads API it should set
the ignore_offload_bitfield bit in the rte_eth_rxmode struct. In such cases it is not
required to set other bitfield offloads in the rxmode struct.
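A sketch of using this offloads API; the chosen offload flags, descriptor count and single-queue setup are illustrative, and RX queue setup plus most error handling are omitted:

#include <string.h>
#include <rte_ethdev.h>

static int
configure_port_offloads(uint16_t port_id)
{
        struct rte_eth_conf port_conf;
        struct rte_eth_dev_info dev_info;
        struct rte_eth_txconf txconf;
        int ret;

        rte_eth_dev_info_get(port_id, &dev_info);

        /* Use the Rx offload bitfield instead of the legacy rxmode flags */
        memset(&port_conf, 0, sizeof(port_conf));
        port_conf.rxmode.ignore_offload_bitfield = 1;
        port_conf.rxmode.offloads = DEV_RX_OFFLOAD_CHECKSUM;

        ret = rte_eth_dev_configure(port_id, 1, 1, &port_conf);
        if (ret < 0)
                return ret;

        /* Tx offloads API: ignore txq_flags and set offloads directly */
        txconf = dev_info.default_txconf;
        txconf.txq_flags = ETH_TXQ_FLAGS_IGNORE;
        txconf.offloads = DEV_TX_OFFLOAD_IPV4_CKSUM;

        return rte_eth_tx_queue_setup(port_id, 0, 512,
                        rte_eth_dev_socket_id(port_id), &txconf);
}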
8.5.1 Generalities
By default, all functions exported by a PMD are lock-free functions that are assumed not to be
invoked in parallel on different logical cores to work on the same target object. For instance,
a PMD receive function cannot be invoked in parallel on two logical cores to poll the same RX
queue of the same port. Of course, this function can be invoked in parallel by different logical
cores on different RX queues. It is the responsibility of the upper-level application to enforce
this rule.
If needed, parallel accesses by multiple logical cores to shared queues can be explicitly pro-
tected by dedicated inline lock-aware functions built on top of their corresponding lock-free
functions of the PMD API.
The Ethernet device API exported by the Ethernet PMDs is described in the DPDK API Refer-
ence.
The extended statistics API allows a PMD to expose all statistics that are available to it, including statistics that are unique to the device. Each statistic has three properties: name, id and value:
• name: A human readable string formatted by the scheme detailed below.
• id: An integer that represents only that statistic.
• value: An unsigned 64-bit integer that is the value of the statistic.
Note that extended statistic identifiers are driver-specific, and hence might not be the same for
different ports. The API consists of various rte_eth_xstats_*() functions, and allows an
application to be flexible in how it retrieves statistics.
A naming scheme exists for the strings exposed to clients of the API. This is to allow scraping of
the API for statistics of interest. The naming scheme uses strings split by a single underscore
_. The scheme is as follows:
• direction
• detail 1
• detail 2
• detail n
• unit
Examples of common statistics xstats strings, formatted to comply to the scheme proposed
above:
• rx_bytes
• rx_crc_errors
• tx_multicast_packets
The scheme, although quite simple, allows flexibility in presenting and reading information from the statistic strings. The following example illustrates the naming scheme: rx_packets.
In this example, the string is split into two components. The first component rx indicates that
the statistic is associated with the receive side of the NIC. The second component packets
indicates that the unit of measure is packets.
A more complicated example: tx_size_128_to_255_packets. In this example, tx indi-
cates transmission, size is the first detail, 128 etc are more details, and packets indicates
that this is a packet counter.
Some additions in the metadata scheme are as follows:
• If the first part does not match rx or tx, the statistic does not have an affinity with either receive or transmit.
• If the first letter of the second part is q and this q is followed by a number, this statistic is
part of a specific queue.
An example where queue numbers are used is as follows: tx_q7_bytes which indicates this
statistic applies to queue number 7, and represents the number of transmitted bytes on that
queue.
API Design
The xstats API uses the name, id, and value properties to allow performant lookup of specific statistics. Performant lookup means two things:
• No string comparisons with the name of the statistic in fast-path
• Allow requesting of only the statistics of interest
The API ensures these requirements are met by mapping the name of the statistic to a unique
id, which is used as a key for lookup in the fast-path. The API allows applications to request an
array of id values, so that the PMD only performs the required calculations. Expected usage
is that the application scans the name of each statistic, and caches the id if it has an interest
in that statistic. On the fast-path, the integer can be used to retrieve the actual value of the
statistic that the id represents.
API Functions
The API is built out of a small number of functions, which can be used to retrieve the number
of statistics and the names, IDs and values of those statistics.
• rte_eth_xstats_get_names_by_id(): returns the names of the statistics. When
given a NULL parameter the function returns the number of statistics that are available.
• rte_eth_xstats_get_id_by_name(): Searches for the statistic ID that matches
xstat_name. If found, the id integer is set.
• rte_eth_xstats_get_by_id(): Fills in an array of uint64_t values with matching
the provided ids array. If the ids array is NULL, it returns all statistics that are available.
Application Usage
Imagine an application that wants to view the dropped packet count. If no packets are dropped,
the application does not read any other metrics for performance reasons. If packets are
dropped, the application has a particular set of statistics that it requests. This “set” of statistics
allows the app to decide what next steps to perform. The following code-snippets show how
the xstats API can be used to achieve this goal.
First step is to get all statistics names and list them:
struct rte_eth_xstat_name *xstats_names;
uint64_t *values;
int len, i;

/* Get the number of statistics available */
len = rte_eth_xstats_get_names_by_id(port_id, NULL, NULL, 0);
if (len < 0)
        goto err;

/* Allocate storage for the statistic names */
xstats_names = malloc(sizeof(struct rte_eth_xstat_name) * len);
if (xstats_names == NULL)
        goto err;

/* Retrieve xstats names, passing NULL for IDs to return all statistics */
if (len != rte_eth_xstats_get_names_by_id(port_id, xstats_names, NULL, len)) {
        printf("Cannot get xstat names\n");
        goto err;
}
The application has access to the names of all of the statistics that the PMD exposes. The ap-
plication can decide which statistics are of interest, cache the ids of those statistics by looking
up the name as follows:
uint64_t id;
uint64_t value;
const char *xstat_name = "rx_errors";

/* Cache the id of "rx_errors"; a return value of 0 means the name was found */
if (rte_eth_xstats_get_id_by_name(port_id, xstat_name, &id) != 0)
        goto err;
The API provides flexibility to the application so that it can look up multiple statistics using an
array containing multiple id numbers. This reduces the function call overhead of retrieving
statistics, and makes lookup of multiple statistics simpler for the application.
#define APP_NUM_STATS 4
/* application cached these ids previously; see above */
uint64_t ids_array[APP_NUM_STATS] = {3,4,7,21};
uint64_t value_array[APP_NUM_STATS];
uint32_t i;

/* Retrieve the values for the cached IDs in a single call */
if (APP_NUM_STATS != rte_eth_xstats_get_by_id(port_id, ids_array,
                value_array, APP_NUM_STATS))
        goto err;

for (i = 0; i < APP_NUM_STATS; i++) {
        printf("%"PRIu64": %"PRIu64"\n", ids_array[i], value_array[i]);
}
This array lookup API for xstats allows the application to create multiple “groups” of statistics and look up the values of those IDs using a single API call. As an end result, the application is able to achieve its goal of monitoring a single statistic (“rx_errors” in this case), and if that shows packets being dropped, it can easily retrieve a “set” of statistics using the IDs array parameter to the rte_eth_xstats_get_by_id() function.
Sometimes a port has to be reset passively. For example, when a PF is reset, all its VFs should also be reset by the application to make them consistent with the PF. A DPDK application can also call the rte_eth_dev_reset() function to trigger a port reset. Normally, a DPDK application would invoke this function when an RTE_ETH_EVENT_INTR_RESET event is detected.
It is the duty of the PMD to trigger RTE_ETH_EVENT_INTR_RESET events and the application should register a callback function to handle these events. When a PMD needs to trigger a reset, it can trigger an RTE_ETH_EVENT_INTR_RESET event. On receiving an RTE_ETH_EVENT_INTR_RESET event, the application can handle it by, for example, stopping its working queues and its Rx/Tx processing and then calling rte_eth_dev_reset() to restore the port.
CHAPTER
NINE
GENERIC FLOW API (RTE_FLOW)
9.1 Overview
This API provides a generic means to configure hardware to match specific ingress or egress
traffic, alter its fate and query related counters according to any number of user-defined rules.
It is named rte_flow after the prefix used for all its symbols, and is defined in rte_flow.h.
• Matching can be performed on packet data (protocol headers, payload) and properties
(e.g. associated physical port, virtual device function ID).
• Possible operations include dropping traffic, diverting it to specific queues, to vir-
tual/physical device functions or ports, performing tunnel offloads, adding marks and
so on.
It is slightly higher-level than the legacy filtering framework which it encompasses and super-
sedes (including all functions and filter types) in order to expose a single interface with an
unambiguous behavior that is common to all poll-mode drivers (PMDs).
Several methods to migrate existing applications are described in API migration.
9.2.1 Description
A flow rule is the combination of attributes with a matching pattern and a list of actions. Flow
rules form the basis of this API.
Flow rules can have several distinct actions (such as counting, encapsulating, decapsulating
before redirecting packets to a particular queue, etc.), instead of relying on several rules to
achieve this and having applications deal with hardware implementation details regarding their
order.
Support for different priority levels on a rule basis is provided, for example in order to force a
more specific rule to come before a more generic one for packets matched by both. However
hardware support for more than a single priority level cannot be guaranteed. When supported,
the number of available priority levels is usually low, which is why they can also be implemented
in software by PMDs (e.g. missing priority levels may be emulated by reordering rules).
In order to remain as hardware-agnostic as possible, by default all rules are considered to
have the same priority, which means that the order between overlapping rules (when a packet
is matched by several filters) is undefined.
PMDs may refuse to create overlapping rules at a given priority level when they can be detected
(e.g. if a pattern matches an existing filter).
Thus predictable results for a given priority level can only be achieved with non-overlapping
rules, using perfect matching on all protocol layers.
Flow rules can also be grouped, the flow rule priority is specific to the group they belong to. All
flow rules in a given group are thus processed either before or after another group.
Support for multiple actions per rule may be implemented internally on top of non-default hard-
ware priorities, as a result both features may not be simultaneously available to applications.
Considering that allowed pattern/actions combinations cannot be known in advance and would
result in an impractically large number of capabilities to expose, a method is provided to vali-
date a given rule from the current device configuration state.
This enables applications to check if the rule types they need is supported at initialization time,
before starting their data path. This method can be used anytime, its only requirement being
that the resources needed by a rule should exist (e.g. a target RX queue should be configured
first).
Each defined rule is associated with an opaque handle managed by the PMD; applications are responsible for keeping it. These handles can be used for queries and rule management, such as retrieving counters or other data and destroying the rules.
To avoid resource leaks on the PMD side, handles must be explicitly destroyed by the applica-
tion before releasing associated resources such as queues and ports.
The following sections cover:
• Attributes (represented by struct rte_flow_attr): properties of a flow rule such
as its direction (ingress or egress) and priority.
• Pattern item (represented by struct rte_flow_item): part of a matching pattern
that either matches specific packet data or traffic properties. It can also describe proper-
ties of the pattern itself, such as inverted matching.
• Matching pattern: traffic properties to look for, a combination of any number of items.
• Actions (represented by struct rte_flow_action): operations to perform when-
ever a packet is matched by a pattern.
9.2.2 Attributes
Attribute: Group
Flow rules can be grouped by assigning them a common group number. Lower values have
higher priority. Group 0 has the highest priority.
Although optional, applications are encouraged to group similar rules as much as possible
to fully take advantage of hardware capabilities (e.g. optimized matching) and work around
limitations (e.g. a single pattern type possibly allowed in a given group).
Note that support for more than a single group is not guaranteed.
Attribute: Priority
A priority level can be assigned to a flow rule. Like groups, lower values denote higher priority,
with 0 as the maximum.
A rule with priority 0 in group 8 is always matched after a rule with priority 8 in group 0.
Group and priority levels are arbitrary and up to the application, they do not need to be con-
tiguous nor start from 0, however the maximum number varies between devices and may be
affected by existing flow rules.
If a packet is matched by several rules of a given group for a given priority level, the outcome
is undefined. It can take any path, may be duplicated or even cause unrecoverable errors.
Note that support for more than a single priority level is not guaranteed.
Most pattern items can be further specified using up to three structures of the same type: spec, last and mask.
• Setting spec and optionally last without mask causes the PMD to use the default mask defined for that item (defined as rte_flow_item_{name}_mask constants).
• Not setting any of them (assuming item type allows it) is equivalent to providing an empty
(zeroed) mask for broad (nonspecific) matching.
• mask is a simple bit-mask applied before interpreting the contents of spec and last,
which may yield unexpected results if not used carefully. For example, if for an IPv4
address field, spec provides 10.1.2.3, last provides 10.3.4.5 and mask provides
255.255.0.0, the effective range becomes 10.1.0.0 to 10.3.255.255.
Example of an item specification matching an Ethernet header:
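The table originally shown here is not reproduced; as an illustrative C sketch (the MAC address and mask values are hypothetical), such an item specification could be written as:

#include <rte_flow.h>
#include <rte_ether.h>

/* Hypothetical values: match a specific destination MAC, any source */
static struct rte_flow_item_eth eth_spec = {
        .dst.addr_bytes = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 },
        .type = 0,
};
static struct rte_flow_item_eth eth_mask = {
        .dst.addr_bytes = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },
        .type = 0,
};
static struct rte_flow_item eth_item = {
        .type = RTE_FLOW_ITEM_TYPE_ETH,
        .spec = &eth_spec,
        .last = NULL,       /* no range matching */
        .mask = &eth_mask,  /* only the destination address is significant */
};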
A pattern is formed by stacking items starting from the lowest protocol layer to match. This
stacking restriction does not apply to meta items which can be placed anywhere in the stack
without affecting the meaning of the resulting pattern.
Patterns are terminated by END items.
Examples:
Table 9.5: UDPv6 anywhere

Index  Item
0      IPv6
1      UDP
2      END
If supported by the PMD, omitting one or several protocol layers at the bottom of the stack
as in the above example (missing an Ethernet specification) enables looking up anywhere in
packets.
It is unspecified whether the payload of supported encapsulations (e.g. VXLAN payload) is
matched by such a pattern, which may apply to inner, outer or both packets.
Meta item types match meta-data or affect pattern processing instead of matching packet data directly; most of them do not need a specification structure. This particularity allows them to be specified anywhere in the stack without causing any side effect.
Item: END
End marker for item lists. Prevents further processing of items, thereby ending the pattern.
• Its numeric value is 0 for convenience.
• PMD support is mandatory.
• spec, last and mask are ignored.
Item: VOID
Used as a placeholder for convenience. It is ignored and simply discarded by PMDs.
Item: INVERT
Inverted matching, i.e. process packets that do not match the pattern.
• spec, last and mask are ignored.
Table 9.10: INVERT

Field  Value
spec   ignored
last   ignored
mask   ignored
Usage example, matching non-TCPv4 packets only:
Table 9.11: Anything but TCPv4

Index  Item
0      INVERT
1      Ethernet
2      IPv4
3      TCP
4      END
Item: PF
Table 9.12: PF
Field Value
spec unset
last unset
mask unset
Item: VF
Table 9.13: VF
Field Subfield Value
spec id destination VF ID
last id upper range value
mask id zeroed to match any VF ID
Item: PORT
Matches packets coming from the specified physical port of the underlying device.
The first PORT item overrides the physical port normally associated with the specified DPDK
input port (port_id). This item can be provided several times to match additional physical ports.
Note that physical ports are not necessarily tied to DPDK input ports (port_id) when those are
not under DPDK control. Possible values are specific to each device, they are not necessarily
indexed from zero and may not be contiguous.
As a device property, the list of allowed values as well as the value associated with a port_id
should be retrieved by other means.
• Default mask matches any port index.
Most data matching item types are basically protocol header definitions with associated bit-masks. They must be specified (stacked) from lowest to highest protocol layer to form a matching pattern.
The following list is not exhaustive; new protocols will be added in the future.
Item: ANY
Matches any protocol in place of the current layer, a single ANY may also stand for several
protocol layers.
This is usually specified as the first pattern item when looking for a protocol anywhere in a
packet.
• Default mask stands for any number of layers.
Item: RAW
Matches a byte string of a given length at a given offset.
[Diagram: example of combined RAW items matching the strings "baz", "foo" and "bar" at various offsets (>= 10 B, == 20 B, == 29 B) within a UDP payload, after the ETH, IPv4 and UDP headers.]
Note that matching subsequent pattern items would resume after “baz”, not “bar” since match-
ing is always performed after the previous item of the stack.
Item: ETH
Item: VLAN
Item: IPV4
Item: IPV6
Item: ICMP
Item: UDP
Item: TCP
Item: SCTP
Item: VXLAN
Item: E_TAG
Item: NVGRE
Item: MPLS
Item: GRE
Item: FUZZY
Item: ESP
9.2.7 Actions
Each possible action is represented by a type. Some have associated configuration structures. Several actions combined in a list can be assigned to a flow rule; that list is not ordered.
They fall in three categories:
• Terminating actions (such as QUEUE, DROP, RSS, PF, VF) that prevent processing
matched packets by subsequent flow rules, unless overridden with PASSTHRU.
• Non-terminating actions (PASSTHRU, DUP) that leave matched packets up for additional
processing by subsequent flow rules.
• Other non-terminating meta actions that do not affect the fate of packets (END, VOID,
MARK, FLAG, COUNT, SECURITY).
When several actions are combined in a flow rule, they should all have different types (e.g.
dropping a packet twice is not possible).
Only the last action of a given type is taken into account. PMDs still perform error checking on
the entire list.
Like matching patterns, action lists are terminated by END items.
Note that PASSTHRU is the only action able to override a terminating rule.
Example of action that redirects packets to queue index 10:
Table 9.21: Queue action

Field  Value
index  10
Action lists examples, their order is not significant, applications must consider all actions to be
performed simultaneously:
Common action types are described in this section. Like pattern item types, this list is not
exhaustive as new actions will be added in the future.
Action: END
End marker for action lists. Prevents further processing of actions, thereby ending the list.
• Its numeric value is 0 for convenience.
• PMD support is mandatory.
• No configurable properties.
Table 9.26: END

Field
no properties
Action: VOID
Used as a placeholder for convenience. It is ignored and simply discarded by PMDs.

Table 9.27: VOID

Field
no properties
Action: PASSTHRU
Leaves packets up for additional processing by subsequent flow rules. This is the default when
a rule does not contain a terminating action, but can be specified to force a rule to become
non-terminating.
• No configurable properties.
Table 9.28: PASSTHRU

Field
no properties
Example to copy a packet to a queue and continue processing by subsequent flow rules:
Action: MARK
Attaches an integer value to packets and sets PKT_RX_FDIR and PKT_RX_FDIR_ID mbuf
flags.
This value is arbitrary and application-defined. Maximum allowed value depends on the under-
lying implementation. It is returned in the hash.fdir.hi mbuf field.
Action: FLAG
Flags packets. Similar to Action: MARK without a specific value; only sets the PKT_RX_FDIR
mbuf flag.
• No configurable properties.
Table 9.31: FLAG

Field
no properties
Action: QUEUE
Assigns packets to a given queue index.
• Terminating by default.
Action: DROP
Drop packets.
• No configurable properties.
• Terminating by default.
• PASSTHRU overrides this action if both are specified.
Table 9.33: DROP

Field
no properties
Action: COUNT
Enables counters for this rule. These counters can be retrieved and reset through rte_flow_query().

Table 9.34: COUNT

Field
no properties
Query structure to retrieve and reset flow rule counters:
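The structure listing was lost in extraction; as defined in rte_flow.h for this release it is, to the best of recollection, the following (consult the header for the authoritative definition):

struct rte_flow_query_count {
        uint32_t reset:1;     /**< Reset counters after query [in]. */
        uint32_t hits_set:1;  /**< hits field is set [out]. */
        uint32_t bytes_set:1; /**< bytes field is set [out]. */
        uint32_t reserved:29; /**< Reserved, must be zero [in, out]. */
        uint64_t hits;        /**< Number of hits for this rule [out]. */
        uint64_t bytes;       /**< Number of bytes through this rule [out]. */
};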
Action: DUP
Duplicates packets to a given queue index.
• Non-terminating by default.
Action: RSS
Similar to QUEUE, except RSS is additionally performed on packets to spread them among
several queues according to the provided parameters.
Note: RSS hash result is stored in the hash.rss mbuf field which overlaps hash.fdir.lo.
Since Action: MARK sets the hash.fdir.hi field only, both can be requested simultane-
ously.
• Terminating by default.
Action: PF
Redirects packets to the physical function (PF) of the current device.
• Terminating by default.
Table 9.38: PF
Field
no properties
Action: VF
Redirects packets to a virtual function (VF) of the current device.
• Terminating by default.
Table 9.39: VF
Field Value
original use original VF ID if possible
vf VF ID to redirect packets to
Action: METER
Applies a stage of metering and policing. The MTR object has to be created first using rte_mtr_create(); the ID of that MTR object is the argument of this action.
Action: SECURITY
Perform the security action on flows matched by the pattern items according to the configura-
tion of the security session.
This action modifies the payload of matched flows. For INLINE_CRYPTO, the security protocol
headers and IV are fully provided by the application as specified in the flow pattern. The
payload of matching packets is encrypted on egress, and decrypted and authenticated on
ingress. For INLINE_PROTOCOL, the security protocol is fully offloaded to HW, providing full
encapsulation and decapsulation of packets in security protocols. The flow pattern specifies
both the outer security header fields and the inner packet fields. The security session specified
in the action must match the pattern parameters.
The security session specified in the action must be created on the same port as the flow
action that is being specified.
The ingress/egress flow attribute should match that specified in the security session if the
security session supports the definition of the direction.
Multiple flows can be configured to use the same security session.
• Non-terminating by default.
Other action types are planned but are not defined yet. These include the ability to alter packet
data in several ways, such as performing encapsulation/decapsulation of tunnel headers.
A rather simple API with few functions is provided to fully manage flow rules.
Each created flow rule is associated with an opaque, PMD-specific handle pointer. The appli-
cation is responsible for keeping it until the rule is destroyed.
Flow rules are represented by struct rte_flow objects.
9.3.1 Validation
Given that expressing a definite set of device capabilities is not practical, a dedicated function
is provided to check if a flow rule is supported and can be created.
int
rte_flow_validate(uint16_t port_id,
const struct rte_flow_attr *attr,
const struct rte_flow_item pattern[],
const struct rte_flow_action actions[],
struct rte_flow_error *error);
The flow rule is validated for correctness and whether it could be accepted by the device
given sufficient resources. The rule is checked against the current device mode and queue
configuration. The flow rule may also optionally be validated against existing flow rules and
device resources. This function has no effect on the target device.
The returned value is guaranteed to remain valid only as long as no successful calls to
rte_flow_create() or rte_flow_destroy() are made in the meantime and no device
parameter affecting flow rules in any way are modified, due to possible collisions or resource
limitations (although in such cases EINVAL should not be returned).
Arguments:
• port_id: port identifier of Ethernet device.
• attr: flow rule attributes.
• pattern: pattern specification (list terminated by the END pattern item).
• actions: associated actions (list terminated by the END action).
• error: perform verbose error reporting if not NULL. PMDs initialize this structure in case
of error only.
Return values:
• 0 if flow rule is valid and can be created. A negative errno value otherwise (rte_errno
is also set), the following errors are defined.
• -ENOSYS: underlying device does not support this functionality.
• -EINVAL: unknown or invalid rule specification.
• -ENOTSUP: valid but unsupported rule specification (e.g. partial bit-masks are unsup-
ported).
• EEXIST: collision with an existing rule. Only returned if device supports flow rule colli-
sion checking and there was a flow rule collision. Not receiving this return code is no
guarantee that creating the rule will not fail due to a collision.
• ENOMEM: not enough memory to execute the function, or if the device supports resource
validation, resource limitation on the device.
• -EBUSY: action cannot be performed due to busy device resources, may suc-
ceed if the affected queues or even the entire port are in a stopped state (see
rte_eth_dev_rx_queue_stop() and rte_eth_dev_stop()).
9.3.2 Creation
Creating a flow rule is similar to validating one, except the rule is actually created and a handle
returned.
struct rte_flow *
rte_flow_create(uint16_t port_id,
const struct rte_flow_attr *attr,
const struct rte_flow_item pattern[],
const struct rte_flow_action actions[],
struct rte_flow_error *error);
Arguments:
• port_id: port identifier of Ethernet device.
• attr: flow rule attributes.
• pattern: pattern specification (list terminated by the END pattern item).
• actions: associated actions (list terminated by the END action).
• error: perform verbose error reporting if not NULL. PMDs initialize this structure in case
of error only.
Return values:
A valid handle in case of success, NULL otherwise and rte_errno is set to the positive
version of one of the error codes defined for rte_flow_validate().
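A sketch putting the validation and creation calls together: the rule below sends all ingress IPv4/UDP traffic to RX queue 1 (the queue index and pattern are illustrative):

#include <rte_flow.h>

static struct rte_flow *
create_udp_to_queue_rule(uint16_t port_id, struct rte_flow_error *error)
{
        struct rte_flow_attr attr = { .ingress = 1 };
        struct rte_flow_action_queue queue = { .index = 1 };

        struct rte_flow_item pattern[] = {
                { .type = RTE_FLOW_ITEM_TYPE_ETH },
                { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
                { .type = RTE_FLOW_ITEM_TYPE_UDP },
                { .type = RTE_FLOW_ITEM_TYPE_END },
        };
        struct rte_flow_action actions[] = {
                { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
                { .type = RTE_FLOW_ACTION_TYPE_END },
        };

        /* Check the rule against the current device configuration first */
        if (rte_flow_validate(port_id, &attr, pattern, actions, error) != 0)
                return NULL;

        return rte_flow_create(port_id, &attr, pattern, actions, error);
}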
9.3.3 Destruction
Flow rules destruction is not automatic, and a queue or a port should not be released if any
are still attached to them. Applications must take care of performing this step before releasing
resources.
int
rte_flow_destroy(uint16_t port_id,
struct rte_flow *flow,
struct rte_flow_error *error);
Failure to destroy a flow rule handle may occur when other flow rules depend on it, and de-
stroying it would result in an inconsistent state.
This function is only guaranteed to succeed if handles are destroyed in reverse order of their
creation.
Arguments:
• port_id: port identifier of Ethernet device.
• flow: flow rule handle to destroy.
• error: perform verbose error reporting if not NULL. PMDs initialize this structure in case of error only.
Return values:
• 0 on success, a negative errno value otherwise and rte_errno is set.
9.3.4 Flush
Convenience function to destroy all flow rule handles associated with a port. They are released
as with successive calls to rte_flow_destroy().
int
rte_flow_flush(uint16_t port_id,
struct rte_flow_error *error);
In the unlikely event of failure, handles are still considered destroyed and no longer valid but
the port must be assumed to be in an inconsistent state.
Arguments:
• port_id: port identifier of Ethernet device.
• error: perform verbose error reporting if not NULL. PMDs initialize this structure in case
of error only.
Return values:
• 0 on success, a negative errno value otherwise and rte_errno is set.
9.3.5 Query
Query an existing flow rule to retrieve flow-specific data such as counters. Data is gathered by special actions (such as COUNT) which must be present in the flow rule definition.
Arguments:
• port_id: port identifier of Ethernet device.
• flow: flow rule handle to query.
• action: action type to query.
• data: pointer to storage for the associated query data type.
• error: perform verbose error reporting if not NULL. PMDs initialize this structure in case
of error only.
Return values:
• 0 on success, a negative errno value otherwise and rte_errno is set.
The general expectation for ingress traffic is that flow rules process it first; the remaining un-
matched or pass-through traffic usually ends up in a queue (with or without RSS, locally or in
some sub-device instance) depending on the global configuration settings of a port.
While fine from a compatibility standpoint, this approach makes drivers more complex as they
have to check for possible side effects outside of this API when creating or destroying flow
rules. It results in a more limited set of available rule types due to the way device resources
are assigned (e.g. no support for the RSS action even on capable hardware).
Given that nonspecific traffic can be handled by flow rules as well, isolated mode is a means
for applications to tell a driver that ingress on the underlying port must be injected from the
defined flow rules only; that no default traffic is expected outside those rules.
This has the following benefits:
• Applications get finer-grained control over the kind of traffic they want to receive (no traffic
by default).
• More importantly they control at what point nonspecific traffic is handled relative to other
flow rules, by adjusting priority levels.
• Drivers can assign more hardware resources to flow rules and expand the set of sup-
ported rule types.
Because toggling isolated mode may cause profound changes to the ingress processing path
of a driver, it may not be possible to leave it once entered. Likewise, existing flow rules or global
configuration settings may prevent a driver from entering isolated mode.
Applications relying on this mode are therefore encouraged to toggle it as soon as possible
after device initialization, ideally before the first call to rte_eth_dev_configure() to avoid
possible failures due to conflicting settings.
Once effective, the following functionality has no effect on the underlying port and may return
errors such as ENOTSUP (“not supported”):
• Toggling promiscuous mode.
• Toggling allmulticast mode.
• Configuring MAC addresses.
• Configuring multicast addresses.
• Configuring VLAN filters.
• Configuring Rx filters through the legacy API (e.g. FDIR).
• Configuring global RSS settings.
int
rte_flow_isolate(uint16_t port_id, int set, struct rte_flow_error *error);
Arguments:
• port_id: port identifier of Ethernet device.
• set: nonzero to enter isolated mode, attempt to leave it otherwise.
• error: perform verbose error reporting if not NULL. PMDs initialize this structure in case of error only.
Return values:
• 0 on success, a negative errno value otherwise and rte_errno is set.
The defined errno values may not be accurate enough for users or application developers
who want to investigate issues related to flow rules management. A dedicated error object is
defined for this purpose:
enum rte_flow_error_type {
RTE_FLOW_ERROR_TYPE_NONE, /**< No error. */
RTE_FLOW_ERROR_TYPE_UNSPECIFIED, /**< Cause unspecified. */
RTE_FLOW_ERROR_TYPE_HANDLE, /**< Flow rule (handle). */
RTE_FLOW_ERROR_TYPE_ATTR_GROUP, /**< Group field. */
RTE_FLOW_ERROR_TYPE_ATTR_PRIORITY, /**< Priority field. */
RTE_FLOW_ERROR_TYPE_ATTR_INGRESS, /**< Ingress field. */
RTE_FLOW_ERROR_TYPE_ATTR_EGRESS, /**< Egress field. */
RTE_FLOW_ERROR_TYPE_ATTR, /**< Attributes structure. */
RTE_FLOW_ERROR_TYPE_ITEM_NUM, /**< Pattern length. */
RTE_FLOW_ERROR_TYPE_ITEM, /**< Specific pattern item. */
RTE_FLOW_ERROR_TYPE_ACTION_NUM, /**< Number of actions. */
RTE_FLOW_ERROR_TYPE_ACTION, /**< Specific action. */
};
struct rte_flow_error {
enum rte_flow_error_type type; /**< Cause field and error types. */
const void *cause; /**< Object responsible for the error. */
const char *message; /**< Human-readable error message. */
};
Error type RTE_FLOW_ERROR_TYPE_NONE stands for no error, in which case remaining fields
can be ignored. Other error types describe the type of the object pointed by cause.
If non-NULL, cause points to the object responsible for the error. For a flow rule, this may be
a pattern item or an individual action.
If non-NULL, message provides a human-readable error message.
This object is normally allocated by applications and set by PMDs in case of error; the message points to a constant string which does not need to be freed by the application. However, its pointer can be considered valid only as long as its associated DPDK port remains configured. Closing the underlying device or unloading the PMD invalidates it.
9.6 Helpers
The rte_flow_error_set() helper function initializes error (if non-NULL) with the provided parameters and sets rte_errno to code. A negative error code is then returned.
9.7 Caveats
• DPDK does not keep track of flow rules definitions or flow rule objects automatically.
Applications may keep track of the former and must keep track of the latter. PMDs may
also do it for internal needs, however this must not be relied on by applications.
• Flow rules are not maintained between successive port initializations. An application
exiting without releasing them and restarting must re-create them from scratch.
• API operations are synchronous and blocking (EAGAIN cannot be returned).
• There is no provision for reentrancy/multi-thread safety, although nothing should prevent
different devices from being configured at the same time. PMDs may protect their control
path functions accordingly.
• Stopping the data path (TX/RX) should not be necessary when managing flow rules. If
this cannot be achieved naturally or with workarounds (such as temporarily replacing the
burst function pointers), an appropriate error code must be returned (EBUSY).
• PMDs, not applications, are responsible for maintaining flow rules configuration when
stopping and restarting a port or performing other actions which may affect them. They
can only be destroyed explicitly by applications.
For devices exposing multiple ports sharing global settings affected by flow rules:
• All ports under DPDK control must behave consistently, PMDs are responsible for making
sure that existing flow rules on a port are not affected by other ports.
• Ports not under DPDK control (unaffected or handled by other applications) are user’s
responsibility. They may affect existing flow rules and cause undefined behavior. PMDs
aware of this may prevent flow rules creation altogether in such cases.
• Public API functions do not process flow rules definitions at all before calling PMD func-
tions (no basic error checking, no validation whatsoever). They only make sure these
callbacks are non-NULL or return the ENOSYS (function not supported) error.
This interface additionally defines the following helper function:
• rte_flow_ops_get(): get generic flow operations structure from a port.
More will be added over time.
Each flow rule comes with its own, per-layer bit-masks, while hardware may support only a
single, device-wide bit-mask for a given layer type, so that two IPv4 rules cannot use different
bit-masks.
The expected behavior in this case is that PMDs automatically configure global bit-masks ac-
cording to the needs of the first flow rule created.
Subsequent rules are allowed only if their bit-masks match those; the EEXIST error code should be returned otherwise.
Many protocols can be simulated by crafting patterns with the Item: RAW type.
PMDs can rely on this capability to simulate support for protocols with headers not directly
recognized by hardware.
This pattern item stands for anything, which can be difficult to translate to something hardware
would understand, particularly if followed by more specific types.
Consider the following pattern:
• When combined with Action: QUEUE, packet counting (Action: COUNT ) and tagging
(Action: MARK or Action: FLAG) may be implemented in software as long as the target
queue is used by a single rule.
• A rule specifying both Action: DUP + Action: QUEUE may be translated to two hidden
rules combining Action: QUEUE and Action: PASSTHRU.
• When a single target queue is provided, Action: RSS can also be implemented through
Action: QUEUE.
While it would naturally make sense, flow rules cannot be assumed to be processed by hard-
ware in the same order as their creation for several reasons:
• They may be managed internally as a tree or a hash table instead of a list.
• Removing a flow rule before adding another one can either put the new rule at the end of
the list or reuse a freed entry.
• Duplication may occur when packets are matched by several rules.
For overlapping rules (particularly in order to use Action: PASSTHRU) predictable behavior is
only guaranteed by using different priority levels.
Priority levels are not necessarily implemented in hardware, or may be severely limited (e.g. a
single priority bit).
For these reasons, priority levels may be implemented purely in software by PMDs.
• For devices expecting flow rules to be added in the correct order, PMDs may destroy and
re-create existing rules after adding a new one with a higher priority.
• A configurable number of dummy or empty rules can be created at initialization time to
save high priority slots for later.
• In order to save priority levels, PMDs may evaluate whether rules are likely to collide and
adjust their priority accordingly.
• A device profile selection function which could be used to force a permanent profile in-
stead of relying on its automatic configuration based on existing flow rules.
• A method to optimize rte_flow rules with specific pattern items and action types gener-
ated on the fly by PMDs. DPDK should assign negative numbers to these in order to not
collide with the existing types. See Negative types.
• Adding specific egress pattern items and actions as described in Attribute: Traffic direc-
tion.
• Optional software fallback when PMDs are unable to handle requested flow rules so
applications do not have to implement their own.
Exhaustive list of deprecated filter types (normally prefixed with RTE_ETH_FILTER_) found in
rte_eth_ctrl.h and methods to convert them to rte_flow rules.
MACVLAN can be translated to a basic Item: ETH flow rule with a terminating Action: VF or
Action: PF .
ETHERTYPE is basically an Item: ETH flow rule with a terminating Action: QUEUE or Action:
DROP.
FLEXIBLE can be translated to one Item: RAW pattern with a terminating Action: QUEUE and
a defined priority level.
SYN is an Item: TCP rule with only the syn bit enabled and masked, and a terminating Action: QUEUE.
Priority level can be set to simulate the high priority bit.
NTUPLE is similar to specifying an empty L2, Item: IPV4 as L3 with Item: TCP or Item: UDP
as L4 and a terminating Action: QUEUE.
A priority level can be specified as well.
FDIR is more complex than any other type, there are several methods to emulate its function-
ality. It is summarized for the most part in the table below.
A few features are intentionally not supported:
• The ability to configure the matching input set and masks for the entire device, PMDs
should take care of it automatically according to the requested flow rules.
For example if a device supports only one bit-mask per protocol type, source/address
IPv4 bit-masks can be made immutable by the first created rule. Subsequent IPv4 or
TCPv4 rules can only be created if they are compatible.
Note that only protocol bit-masks affected by existing flow rules are immutable, others can
be changed later. They become mutable again after the related flow rules are destroyed.
• Returning four or eight bytes of matched data when using flex bytes filtering. Although a
specific action could implement it, it conflicts with the much more useful 32 bits tagging
on devices that support it.
• Side effects on RSS processing of the entire device. Flow rules that conflict with the
current device configuration should not be allowed. Similarly, device configuration should
not be allowed when it affects existing flow rules.
• Device modes of operation. “none” is unsupported since filtering cannot be disabled as
long as a flow rule is present.
• “MAC VLAN” or “tunnel” perfect matching modes should be automatically set according
to the created flow rules.
• Signature mode of operation is not defined but could be handled through “FUZZY” item.
9.11.8 HASH
There is no counterpart to this filter type because it translates to a global device setting instead
of a pattern item. Device settings are automatically set according to the created flow rules.
9.11.9 L2_TUNNEL
All packets are matched. This type alters incoming packets to encapsulate them in a chosen
tunnel type, optionally redirect them to a VF as well.
The destination pool for tag based forwarding can be emulated with other flow rules using
Action: DUP.
CHAPTER
TEN
TRAFFIC METERING AND POLICING API
10.1 Overview
This is the generic API for the Quality of Service (QoS) Traffic Metering and Policing (MTR) of
Ethernet devices. This API is agnostic of the underlying HW, SW or mixed HW-SW implemen-
tation.
The main features are:
• Part of DPDK rte_ethdev API
• Capability query API
• Metering algorithms: RFC 2697 Single Rate Three Color Marker (srTCM), RFC 2698 and
RFC 4115 Two Rate Three Color Marker (trTCM)
• Policer actions (per meter output color): recolor, drop
• Statistics (per policer output color)
The metering and policing stage typically sits on top of flow classification, which is why the
MTR objects are enabled through a special “meter” action.
The MTR objects are created and updated in their own name space (rte_mtr) within the
librte_ether library. Whether an MTR object is private to a flow or potentially shared by
several flows has to be specified at its creation time.
Once successfully created, an MTR object is hooked into the RX processing path of the Ether-
net device by linking it to one or several flows through the dedicated “meter” flow action. One
or several “meter” actions can be registered for the same flow. An MTR object can only be
destroyed if there are no flows using it.
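A sketch of hooking an MTR object into the RX path through the "meter" flow action, assuming RTE_FLOW_ACTION_TYPE_METER and struct rte_flow_action_meter as introduced with the MTR API; the MTR object with ID 1 is assumed to have been created beforehand with rte_mtr_create(), and the pattern and queue index are illustrative:

#include <rte_flow.h>

static struct rte_flow *
attach_meter_to_flow(uint16_t port_id, struct rte_flow_error *error)
{
        struct rte_flow_attr attr = { .ingress = 1 };
        struct rte_flow_action_meter meter = { .mtr_id = 1 }; /* hypothetical MTR ID */
        struct rte_flow_action_queue queue = { .index = 0 };

        struct rte_flow_item pattern[] = {
                { .type = RTE_FLOW_ITEM_TYPE_ETH },
                { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
                { .type = RTE_FLOW_ITEM_TYPE_END },
        };
        struct rte_flow_action actions[] = {
                { .type = RTE_FLOW_ACTION_TYPE_METER, .conf = &meter },
                { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
                { .type = RTE_FLOW_ACTION_TYPE_END },
        };

        return rte_flow_create(port_id, &attr, pattern, actions, error);
}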
Traffic metering determines the color for the current packet (green, yellow, red) based on the previous history for this flow as maintained by the MTR object. The policer can do nothing, override the color of the packet or drop the packet. Statistics counters are maintained for each MTR object, as configured.
The processing done for each input packet hitting an MTR object is:
• Traffic metering: The packet is assigned a color (the meter output color) based on the
previous traffic history reflected in the current state of the MTR object, according to the
specific traffic metering algorithm. The traffic metering algorithm can typically work in
color aware mode, in which case the input packet already has an initial color (the input
color), or in color blind mode, which is equivalent to considering all input packets initially
colored as green.
• Policing: There is a separate policer action configured for each meter output color, which
can:
– Drop the packet.
– Keep the same packet color: the policer output color matches the meter output color
(essentially a no-op action).
– Recolor the packet: the policer output color is set to a different color than the meter
output color. The policer output color is the output color of the packet, which is set
in the packet meta-data (i.e. struct rte_mbuf::sched::color).
• Statistics: The set of counters maintained for each MTR object is configurable and sub-
ject to the implementation support. This set includes the number of packets and bytes
dropped or passed for each output color.
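Putting the above together, the following sketch creates an srTCM meter profile, instantiates a
shared MTR object that drops red packets, and hooks it into the RX path through the "meter"
flow action. All identifiers, rates and the match pattern are illustrative only, and the exact
structure contents should be checked against the rte_mtr and rte_flow API References.
#include <stdint.h>
#include <rte_mtr.h>
#include <rte_flow.h>

/* Sketch only: ids, rates and the pattern below are placeholders. */
int
attach_meter_to_flow(uint16_t port_id)
{
	struct rte_mtr_error mtr_err;
	struct rte_flow_error flow_err;

	/* srTCM profile: committed rate in bytes/sec, burst sizes in bytes. */
	struct rte_mtr_meter_profile profile = {
		.alg = RTE_MTR_SRTCM_RFC2697,
		.srtcm_rfc2697 = { .cir = 1250000, .cbs = 2048, .ebs = 2048 },
	};
	if (rte_mtr_meter_profile_add(port_id, 0, &profile, &mtr_err) != 0)
		return -1;

	/* MTR object 0: color blind, keep green/yellow, drop red, count all. */
	struct rte_mtr_params params = {
		.meter_profile_id = 0,
		.use_prev_mtr_color = 0,
		.meter_enable = 1,
		.action = {
			[RTE_MTR_GREEN] = MTR_POLICER_ACTION_COLOR_GREEN,
			[RTE_MTR_YELLOW] = MTR_POLICER_ACTION_COLOR_YELLOW,
			[RTE_MTR_RED] = MTR_POLICER_ACTION_DROP,
		},
		.stats_mask = UINT64_MAX,
	};
	if (rte_mtr_create(port_id, 0, &params, 1 /* shared */, &mtr_err) != 0)
		return -1;

	/* Link the MTR object to a flow through the "meter" action. */
	struct rte_flow_attr attr = { .ingress = 1 };
	struct rte_flow_item pattern[] = {
		{ .type = RTE_FLOW_ITEM_TYPE_ETH },
		{ .type = RTE_FLOW_ITEM_TYPE_END },
	};
	struct rte_flow_action_meter meter = { .mtr_id = 0 };
	struct rte_flow_action actions[] = {
		{ .type = RTE_FLOW_ACTION_TYPE_METER, .conf = &meter },
		{ .type = RTE_FLOW_ACTION_TYPE_END },
	};
	return rte_flow_create(port_id, &attr, pattern, actions, &flow_err)
			== NULL ? -1 : 0;
}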
ELEVEN
TRAFFIC MANAGEMENT API
11.1 Overview
This is the generic API for the Quality of Service (QoS) Traffic Management of Ethernet devices,
which includes the following main features: hierarchical scheduling, traffic shaping, congestion
management, packet marking. This API is agnostic of the underlying HW, SW or mixed HW-
SW implementation.
Main features:
• Part of DPDK rte_ethdev API
• Capability query API per port, per hierarchy level and per hierarchy node
• Scheduling algorithms: Strict Priority (SP), Weighted Fair Queuing (WFQ)
• Traffic shaping: single/dual rate, private (per node) and shared (by multiple nodes)
shapers
• Congestion management for hierarchy leaf nodes: algorithms of tail drop, head drop,
WRED, private (per node) and shared (by multiple nodes) WRED contexts
• Packet marking: IEEE 802.1q (VLAN DEI), IETF RFC 3168 (IPv4/IPv6 ECN for TCP and
SCTP), IETF RFC 2597 (IPv4 / IPv6 DSCP)
The aim of these APIs is to advertise the capability information (i.e. critical parameter values)
that the TM implementation (HW/SW) is able to support for the application. The APIs support
information disclosure at the TM level, at any hierarchical level of the TM and at any node of
a specific hierarchical level. Such information helps towards a rapid understanding of whether
a specific implementation meets the needs of the user application.
At the TM level, users can get a high-level idea with the help of various parameters such as the
maximum number of nodes, maximum number of hierarchical levels, maximum number of
shapers, maximum number of private shapers, type of scheduling algorithm (Strict Priority,
Weighted Fair Queuing, etc.), etc., supported by the implementation.
Likewise, users can query the capability of the TM at the hierarchical level to have more gran-
ular knowledge about a specific level. Various parameters such as the maximum number of
nodes at the level, maximum number of leaf/non-leaf nodes at the level, type of shaper (dual
rate, single rate) supported at the level if the node is of non-leaf type, etc., are exposed as a
result of the hierarchical level capability query.
Finally, the node level capability API offers knowledge about the capability supported by the
node at any specific level. The information whether the support is available for private shaper,
dual rate shaper, maximum and minimum shaper rate, etc. is exposed by node level capability
API.
The fundamental scheduling algorithms that are supported are Strict Priority (SP) and
Weighted Fair Queuing (WFQ). The SP and WFQ algorithms are supported at the level of
each node of the scheduling hierarchy, regardless of the node level/position in the tree. The
SP algorithm is used to schedule between sibling nodes with different priority, while WFQ is
used to schedule between groups of siblings that have the same priority.
Algorithms such as Weighted Round Robin (WRR), byte-level WRR, Deficit WRR (DWRR),
etc. are considered approximations of the ideal WFQ and are therefore assimilated to WFQ,
although an associated implementation-dependent accuracy, performance and resource usage
trade-off might exist.
The TM API provides support for single rate and dual rate shapers (rate limiters) for the hierar-
chy nodes, subject to the specific implementation support being available.
Each hierarchy node has zero or one private shaper (only one node using it) and/or zero, one
or several shared shapers (multiple nodes use the same shaper instance). A private shaper
is used to perform traffic shaping for a single node, while a shared shaper is used to perform
traffic shaping for a group of nodes.
The configuration of private and shared shapers is done through the definition of shaper pro-
files. Any shaper profile (single rate or dual rate shaper) can be used by one or several shaper
instances (either private or shared).
Single rate shapers use a single token bucket. A single rate shaper is configured by setting
the rate of the committed bucket to zero, which effectively disables this bucket. The peak
bucket is used to limit the rate and the burst size for the single rate shaper. Dual rate shapers
use both the committed and the peak token buckets. The rate of the peak bucket has to be
bigger than zero, as well as greater than or equal to the rate of the committed bucket.
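As a rough illustration, a dual rate shaper profile could be added as follows; the rates, sizes
and profile id are placeholders, and the structure layout follows the rte_tm API as the author
understands it. For a single rate shaper the committed rate would simply be set to zero.
#include <rte_tm.h>

/* Illustrative only: add a dual rate shaper profile (profile id 1).
 * Rates are in bytes per second, bucket sizes in bytes. */
static int
add_dual_rate_profile(uint16_t port_id)
{
	struct rte_tm_error error;
	struct rte_tm_shaper_params profile = {
		.committed = { .rate = 1000000, .size = 4096 },
		.peak      = { .rate = 2000000, .size = 4096 },
		.pkt_length_adjust = 24, /* account for framing overhead incl. FCS */
	};
	return rte_tm_shaper_profile_add(port_id, 1, &profile, &error);
}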
Congestion management is used to control the admission of packets into a packet queue
or group of packet queues on congestion. The congestion management algorithms that are
supported are: Tail Drop, Head Drop and Weighted Random Early Detection (WRED). They
are made available for every leaf node in the hierarchy, subject to the specific implementation
supporting them. On request of writing a new packet into the current queue while the queue
is full, the Tail Drop algorithm drops the new packet while leaving the queue unmodified, as
opposed to the Head Drop* algorithm, which drops the packet at the head of the queue (the
oldest packet waiting in the queue) and admits the new packet at the tail of the queue.
The Random Early Detection (RED) algorithm works by proactively dropping more and more
input packets as the queue occupancy builds up. When the queue is full or almost full, RED
effectively works as Tail Drop. The Weighted RED (WRED) algorithm uses a separate set of
RED thresholds for each packet color.
Each hierarchy leaf node with WRED enabled as its congestion management mode has zero
or one private WRED context (only one leaf node using it) and/or zero, one or several shared
WRED contexts (multiple leaf nodes use the same WRED context). A private WRED context is
used to perform congestion management for a single leaf node, while a shared WRED context
is used to perform congestion management for a group of leaf nodes.
The configuration of WRED private and shared contexts is done through the definition of WRED
profiles. Any WRED profile can be used by one or several WRED contexts (either private or
shared).
The TM APIs have been provided to support various types of packet marking such as VLAN
DEI packet marking (IEEE 802.1Q), IPv4/IPv6 ECN marking of TCP and SCTP packets (IETF
RFC 3168) and IPv4/IPv6 DSCP packet marking (IETF RFC 2597). All VLAN frames of a given
color get their DEI bit set if marking is enabled for this color. When marking for a given color
is not enabled, the DEI bit is left as is (either set or not).
All IPv4/IPv6 packets of a given color with ECN set to 2’b01 or 2’b10 carrying TCP or SCTP
have their ECN set to 2’b11 if the marking feature is enabled for the current color, otherwise
the ECN field is left as is.
All IPv4/IPv6 packets have their color marked into DSCP bits 3 and 4 as follows: green mapped
to Low Drop Precedence (2’b01), yellow to Medium (2’b10) and red to High (2’b11). Marking
needs to be explicitly enabled for each color; when not enabled for a given color, the DSCP
field of all packets with that color is left as is.
The TM hierarchical tree consists of leaf nodes and non-leaf nodes. Each leaf node sits on top
of a scheduling queue of the current Ethernet port. Therefore, the leaf nodes have predefined
IDs in the range of 0... (N-1), where N is the number of scheduling queues of the current
Ethernet port. The non-leaf nodes have their IDs generated by the application outside of the
above range, which is reserved for leaf nodes.
Each non-leaf node has multiple inputs (its children nodes) and single output (which is input
to its parent node). It arbitrates its inputs using Strict Priority (SP) and Weighted Fair Queuing
(WFQ) algorithms to schedule input packets to its output while observing its shaping (rate
limiting) constraints.
The children nodes with different priorities are scheduled using the SP algorithm based on their
priority, with 0 as the highest priority. Children with the same priority are scheduled using the
WFQ algorithm according to their weights. The WFQ weight of a given child node is relative
to the sum of the weights of all its sibling nodes that have the same priority, with 1 as the
lowest weight. For each SP priority, the WFQ weight mode can be set as either byte-based or
packet-based.
The hierarchy is specified by incrementally adding nodes to build up the scheduling tree. The
first node that is added to the hierarchy becomes the root node and all the nodes that are
subsequently added have to be added as descendants of the root node. The parent of the root
node has to be specified as RTE_TM_NODE_ID_NULL and there can only be one node with
this parent ID (i.e. the root node). The unique ID that is assigned to each node when the node
is created is further used to update the node configuration or to connect children nodes to it.
During this phase, some limited checks on the hierarchy specification can be conducted, usu-
ally limited in scope to the current node, its parent node and its sibling nodes. At this time, since
the hierarchy is not fully defined, there is typically no real action performed by the underlying
implementation.
The hierarchy commit API is called during the port initialization phase (before the Ethernet port
is started) to freeze the start-up hierarchy. This function typically performs the following steps:
• It validates the start-up hierarchy that was previously defined for the current port through
successive node add API invocations.
• Assuming successful validation, it performs all the necessary implementation specific
operations to install the specified hierarchy on the current port, with immediate effect
once the port is started.
This function fails when the currently configured hierarchy is not supported by the Ethernet port,
in which case the user can abort or try out another hierarchy configuration (e.g. a hierarchy
with less leaf nodes), which can be built from scratch or by modifying the existing hierarchy
configuration. Note that this function can still fail due to other causes (e.g. not enough memory
available in the system, etc.), even though the specified hierarchy is supported in principle by
the current port.
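A minimal sketch of the node add and commit sequence for a trivial two-level hierarchy (one
root plus one leaf per TX queue), using default node parameters and no shapers; the exact
parameter structure should be checked against the rte_tm API Reference.
#include <string.h>
#include <rte_tm.h>

/* Sketch: build a root + leaf hierarchy and freeze it before the port
 * is started. Leaf node ids 0..nb_txq-1 map to the TX queues. */
static int
build_hierarchy(uint16_t port_id, uint16_t nb_txq)
{
	struct rte_tm_error error;
	struct rte_tm_node_params np;
	uint32_t root_id = nb_txq;  /* outside the reserved leaf id range */
	uint16_t q;

	memset(&np, 0, sizeof(np));
	np.shaper_profile_id = RTE_TM_SHAPER_PROFILE_ID_NONE;

	/* Root node: its parent must be RTE_TM_NODE_ID_NULL. */
	if (rte_tm_node_add(port_id, root_id, RTE_TM_NODE_ID_NULL,
			0 /* priority */, 1 /* weight */,
			RTE_TM_NODE_LEVEL_ID_ANY, &np, &error) != 0)
		return -1;

	/* One leaf node per TX queue, same priority and weight. */
	for (q = 0; q < nb_txq; q++)
		if (rte_tm_node_add(port_id, q, root_id, 0, 1,
				RTE_TM_NODE_LEVEL_ID_ANY, &np, &error) != 0)
			return -1;

	/* Freeze the start-up hierarchy; clear it on failure. */
	return rte_tm_hierarchy_commit(port_id, 1 /* clear_on_fail */, &error);
}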
The TM API provides support for on-the-fly changes to the scheduling hierarchy, thus op-
erations such as node add/delete, node suspend/resume, parent node update, etc., can be
invoked after the Ethernet port has been started, subject to the specific implementation sup-
porting them. The set of dynamic updates supported by the implementation is advertised
through the port capability set.
TWELVE
CRYPTOGRAPHY DEVICE LIBRARY
The cryptodev library provides a Crypto device framework for management and provisioning
of hardware and software Crypto poll mode drivers, defining generic APIs which support a
number of different Crypto operations. The framework currently only supports cipher, authen-
tication, chained cipher/authentication and AEAD symmetric Crypto operations.
The cryptodev library follows the same basic principles as those used in DPDK's Ethernet
Device framework. The Crypto framework provides a generic Crypto device framework which
supports both physical (hardware) and virtual (software) Crypto devices, as well as a generic
Crypto API which allows Crypto devices to be managed and configured and supports Crypto
operations to be provisioned on Crypto poll mode drivers.
Physical Crypto devices are discovered during the PCI probe/enumeration performed by the
EAL at DPDK initialization, based on their PCI device identifier, each unique PCI BDF
(bus/bridge, device, function). Specific physical Crypto devices, like other physical devices in
DPDK, can be white-listed or black-listed using the EAL command line options.
Virtual devices can be created by two mechanisms, either using the EAL command line options
or from within the application using an EAL API directly.
From the command line using the --vdev EAL option:
--vdev 'crypto_aesni_mb0,max_nb_queue_pairs=2,max_nb_sessions=1024,socket_id=0'
struct rte_cryptodev_qp_conf {
uint32_t nb_descriptors; /**< Number of descriptors per queue pair */
};
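To show where this structure is used, a sketch of the configuration sequence follows; the
rte_cryptodev_config contents correspond to the 17.11 API as the author understands it,
the descriptor and session counts are illustrative, and session_pool is created as shown later
in this chapter.
/* Sketch: configure a crypto device with one queue pair on its local socket. */
uint8_t cdev_id = 0;
int socket_id = rte_cryptodev_socket_id(cdev_id);

struct rte_cryptodev_config conf = {
	.socket_id = socket_id,
	.nb_queue_pairs = 1,
	.sym = { .max_nb_sessions = 1024 },
};
struct rte_cryptodev_qp_conf qp_conf = {
	.nb_descriptors = 2048,
};

if (rte_cryptodev_configure(cdev_id, &conf) < 0)
	rte_exit(EXIT_FAILURE, "Failed to configure cryptodev %u\n", cdev_id);

if (rte_cryptodev_queue_pair_setup(cdev_id, 0, &qp_conf,
		socket_id, session_pool) < 0)
	rte_exit(EXIT_FAILURE, "Failed to setup queue pair\n");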
The Crypto device library, like the Poll Mode Driver library, supports NUMA, whereby a
processor's logical cores and interfaces utilize their local memory. Therefore Crypto operations,
and in the case of symmetric Crypto operations, the session and the mbuf being operated on,
should be allocated from memory pools created in the local memory. The buffers should, if
possible, remain on the local processor to obtain the best performance results, and buffer
descriptors should be populated with mbufs allocated from a mempool allocated from local
memory.
The run-to-completion model also performs better, especially in the case of virtual Crypto
devices, if the Crypto operation, session and data buffer are in local memory instead of a
remote processor's memory. This is also true for the pipeline model, provided all logical cores
used are located on the same processor.
Multiple logical cores should never share the same queue pair for enqueuing operations or de-
queuing operations on the same Crypto device since this would require global locks and hinder
performance. It is however possible to use a different logical core to dequeue an operation on
a queue pair from the logical core on which it was enqueued. This means that the crypto burst
enqueue/dequeue APIs are a logical place to transition from one logical core to another in a
packet processing pipeline.
Crypto devices define their functionality through two mechanisms: global device features and
algorithm capabilities. Global device features identify device-wide features which are
applicable to the whole device, such as the device having hardware acceleration or supporting
symmetric Crypto operations.
The capabilities mechanism defines the individual algorithms/functions which the device sup-
ports, such as a specific symmetric Crypto cipher, authentication operation or Authenticated
Encryption with Associated Data (AEAD) operation.
Crypto capabilities which identify particular algorithm which the Crypto PMD supports are de-
fined by the operation type, the operation transform, the transform identifier and then the par-
ticulars of the transform. For the full scope of the Crypto capability see the definition of the
structure in the DPDK API Reference.
struct rte_cryptodev_capabilities;
Each Crypto poll mode driver defines its own private array of capabilities for the operations it
supports. Below is an example of the capabilities for a PMD which supports the authentication
algorithm SHA1_HMAC and the cipher algorithm AES_CBC.
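A representative capability array for such a PMD could look as follows; the key, digest and IV
sizes are illustrative, each real PMD defines its own values and terminates the array with the
end-of-list marker.
static const struct rte_cryptodev_capabilities pmd_capabilities[] = {
	{ /* SHA1 HMAC */
		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
		.sym = {
			.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
			.auth = {
				.algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
				.block_size = 64,
				.key_size = { .min = 64, .max = 64, .increment = 0 },
				.digest_size = { .min = 12, .max = 12, .increment = 0 },
				.aad_size = { 0 },
				.iv_size = { 0 }
			}
		}
	},
	{ /* AES CBC */
		.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
		.sym = {
			.xform_type = RTE_CRYPTO_SYM_XFORM_CIPHER,
			.cipher = {
				.algo = RTE_CRYPTO_CIPHER_AES_CBC,
				.block_size = 16,
				.key_size = { .min = 16, .max = 32, .increment = 8 },
				.iv_size = { .min = 16, .max = 16, .increment = 0 }
			}
		}
	},
	RTE_CRYPTODEV_END_OF_CAPABILITIES_LIST()
};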
Discovering the features and capabilities of a Crypto device poll mode driver is achieved
through the rte_cryptodev_info_get function.
void rte_cryptodev_info_get(uint8_t dev_id,
struct rte_cryptodev_info *dev_info);
This allows the user to query a specific Crypto PMD and get all the device features and ca-
pabilities. The rte_cryptodev_info structure contains all the relevant information for the
device.
struct rte_cryptodev_info {
const char *driver_name;
uint8_t driver_id;
struct rte_pci_device *pci_dev;
uint64_t feature_flags;
unsigned max_nb_queue_pairs;
struct {
unsigned max_nb_sessions;
} sym;
};
Scheduling of Crypto operations on DPDK's application data path is performed using a burst
oriented asynchronous API set. A queue pair on a Crypto device accepts a burst of Crypto
operations using the enqueue burst API. On physical Crypto devices the enqueue burst API
will place the operations to be processed on the device's hardware input queue; for virtual
devices the processing of the Crypto operations is usually completed during the enqueue call
to the Crypto device. The dequeue burst API will retrieve any processed operations available
from the queue pair on the Crypto device; for physical devices this is usually directly from the
device's processed queue, and for virtual devices from a rte_ring where processed operations
are placed after being processed on the enqueue call.
The burst enqueue API uses a Crypto device identifier and a queue pair identifier to specify the
Crypto device queue pair to schedule the processing on. The nb_ops parameter is the number
of operations to process which are supplied in the ops array of rte_crypto_op structures.
The enqueue function returns the number of operations it actually enqueued for processing, a
return value equal to nb_ops means that all packets have been enqueued.
uint16_t rte_cryptodev_enqueue_burst(uint8_t dev_id, uint16_t qp_id,
struct rte_crypto_op **ops, uint16_t nb_ops)
The dequeue API uses the same format as the enqueue API, but the nb_ops and ops
parameters are now used to specify the maximum number of processed operations the user
wishes to retrieve and the location in which to store them. The API call returns the actual
number of processed operations returned; this can never be larger than nb_ops.
uint16_t rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
struct rte_crypto_op **ops, uint16_t nb_ops)
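As a brief usage sketch of the two calls, assuming cdev_id, a populated ops array and an
application-defined BURST_SIZE:
/* Enqueue a burst of prepared operations on queue pair 0 and poll for
 * the processed results. */
uint16_t nb_enq = rte_cryptodev_enqueue_burst(cdev_id, 0, ops, BURST_SIZE);

uint16_t nb_deq = 0;
while (nb_deq < nb_enq) {
	struct rte_crypto_op *done[BURST_SIZE];
	uint16_t n = rte_cryptodev_dequeue_burst(cdev_id, 0, done, BURST_SIZE);
	/* ... check done[i]->status and hand the mbufs to the next stage ... */
	nb_deq += n;
}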
[Figure: structure of a Crypto operation - the generic rte_crypto_op header, followed by the
operation type specific data (e.g. struct rte_crypto_sym_op) and optional per-operation
private data.]
If Crypto operations are allocated from a Crypto operation mempool (see next section), there
is also the ability to allocate private memory with the operation for the application's purposes.
Application software is responsible for specifying all the operation specific fields in the
rte_crypto_op structure which are then used by the Crypto PMD to process the requested
operation.
The cryptodev library provides an API set for managing Crypto operations which utilize the
Mempool Library to allocate operation buffers. Therefore, it ensures that the crypto operation
is interleaved optimally across the channels and ranks for optimal processing. A
rte_crypto_op contains a field indicating the pool that it originated from. When calling
rte_crypto_op_free(op), the operation returns to its original pool.
extern struct rte_mempool *
rte_crypto_op_pool_create(const char *name, enum rte_crypto_op_type type,
unsigned nb_elts, unsigned cache_size, uint16_t priv_size,
int socket_id);
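A brief usage sketch follows; the pool sizes are illustrative and AES_CBC_IV_LENGTH is the
application-defined IV size also used in the sample code later in this chapter.
/* Create a pool of symmetric crypto operations with room for a per-op
 * IV in the private data area, then allocate and later free one op. */
struct rte_mempool *op_pool = rte_crypto_op_pool_create("crypto_op_pool",
		RTE_CRYPTO_OP_TYPE_SYMMETRIC,
		4096,               /* number of operations */
		128,                /* per-lcore cache size */
		AES_CBC_IV_LENGTH,  /* private data size, e.g. for the IV */
		rte_socket_id());

struct rte_crypto_op *op =
		rte_crypto_op_alloc(op_pool, RTE_CRYPTO_OP_TYPE_SYMMETRIC);
/* ... fill in op->sym and enqueue it ... */
rte_crypto_op_free(op);   /* returns the op to its original pool */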
The cryptodev library currently provides support for the following symmetric Crypto operations:
cipher, authentication and chaining of these operations, as well as AEAD operations.
Sessions are used in symmetric cryptographic processing to store the immutable data defined
in a cryptographic transform which is used in the operation processing of a packet flow.
Sessions are used to manage information such as expanded cipher keys and HMAC IPADs
and OPADs, which need to be calculated for a particular Crypto operation, but are immutable
on a packet to packet basis for a flow. Crypto sessions cache this immutable data in an
optimal way for the underlying PMD, and this allows further acceleration of the offload of
Crypto workloads.
[Figure: a symmetric session object holds an array of driver private session pointers
(void *sess_private_data[]), one entry per crypto driver.]
The Crypto device framework provides APIs to allocate and initialize sessions for crypto de-
vices, where sessions are mempool objects. It is the application's responsibility to create and
manage the session mempools. This approach allows for different scenarios such as having a
single session mempool for all crypto devices (where the mempool object size is big enough
to hold the private session of any crypto device), as well as having multiple session mempools
of different sizes for better memory usage.
An application can use rte_cryptodev_get_private_session_size() to get the pri-
vate session size of a given crypto device. This function allows an application to calculate the
maximum device session size of all crypto devices in order to create a single session mempool.
If instead an application creates multiple session mempools, the Crypto device framework also
provides rte_cryptodev_get_header_session_size to get the size of an uninitialized session.
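For instance, a single session mempool sized for the largest device could be created roughly
as follows; MAX_SESSIONS and POOL_CACHE_SIZE are application-defined constants, and the
factor of two accounts for the header and private-data objects of each session.
/* Sketch: size one session mempool for all crypto devices. */
unsigned int max_sess_size = 0, i;
uint8_t nb_devs = rte_cryptodev_count();

for (i = 0; i < nb_devs; i++) {
	unsigned int sess_size = rte_cryptodev_get_private_session_size(i);
	if (sess_size > max_sess_size)
		max_sess_size = sess_size;
}
if (rte_cryptodev_get_header_session_size() > max_sess_size)
	max_sess_size = rte_cryptodev_get_header_session_size();

struct rte_mempool *session_pool = rte_mempool_create("session_pool",
		MAX_SESSIONS * 2, max_sess_size, POOL_CACHE_SIZE,
		0, NULL, NULL, NULL, NULL, rte_socket_id(), 0);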
The API does not place a limit on the number of transforms that can be chained together but
this will be limited by the underlying Crypto device poll mode driver which is processing the
operation.
The symmetric Crypto operation structure contains all the mutable data relating to performing
symmetric cryptographic processing on a referenced mbuf data buffer. It is used for cipher,
authentication, AEAD and chained operations.
As a minimum, the symmetric operation must have a source data buffer (m_src), a valid
session (or transform chain if in session-less mode) and the minimum authentication/cipher/
AEAD parameters required depending on the type of operation specified in the session or the
transform chain.
[Figure: a chain of struct rte_crypto_sym_xform transforms, each specifying a transform type
(cipher, auth or aead), the corresponding transform parameters and a pointer to the next
transform in the chain.]
struct rte_crypto_sym_op {
	struct rte_mbuf *m_src;
	struct rte_mbuf *m_dst;

	union {
		struct rte_cryptodev_sym_session *session;
		/**< Handle for the initialised session context */
		struct rte_crypto_sym_xform *xform;
		/**< Session-less API Crypto operation parameters */
	};

	union {
		struct {
			struct {
				uint32_t offset;
				uint32_t length;
			} data; /**< Data offsets and length for AEAD */
			struct {
				uint8_t *data;
				rte_iova_t phys_addr;
			} digest; /**< Digest parameters */
			struct {
				uint8_t *data;
				rte_iova_t phys_addr;
			} aad;
			/**< Additional authentication parameters */
		} aead;

		struct {
			struct {
				struct {
					uint32_t offset;
					uint32_t length;
				} data; /**< Data offsets and length for ciphering */
			} cipher;

			struct {
				struct {
					uint32_t offset;
					uint32_t length;
				} data;
				/**< Data offsets and length for authentication */
				struct {
					uint8_t *data;
					rte_iova_t phys_addr;
				} digest; /**< Digest parameters */
			} auth;
		};
	};
};
There are various sample applications that show how to use the cryptodev library, such as the
L2fwd with Crypto sample application (L2fwd-crypto) and the IPSec Security Gateway applica-
tion (ipsec-secgw).
While these applications demonstrate how an application can be created to perform generic
crypto operation, the required complexity hides the basic steps of how to use the cryptodev
APIs.
The following sample code shows the basic steps to encrypt several buffers with AES-CBC
(although performing other crypto operations is similar), using one of the crypto PMDs available
in DPDK.
/*
* Simple example to encrypt several buffers with AES-CBC using
* the Cryptodev APIs.
*/
/* Initialize EAL. */
ret = rte_eal_init(argc, argv);
if (ret < 0)
rte_exit(EXIT_FAILURE, "Invalid EAL arguments\n");
/* Create the mbuf pool for the packet buffers. The pool name and the
 * NUM_MBUFS / POOL_CACHE_SIZE constants are placeholders for the
 * values elided above. */
struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create("mbuf_pool",
		NUM_MBUFS,
		POOL_CACHE_SIZE,
		0,
		RTE_MBUF_DEFAULT_BUF_SIZE,
		socket_id);
if (mbuf_pool == NULL)
rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n");
/*
* The IV is always placed after the crypto operation,
* so some private data is required to be reserved.
*/
unsigned int crypto_op_private_data = AES_CBC_IV_LENGTH;
/*
* Create session mempool, with two objects per session,
* one for the session header and another one for the
* private session data for the crypto device.
*/
session_pool = rte_mempool_create("session_pool",
MAX_SESSIONS * 2,
session_size,
POOL_CACHE_SIZE,
0, NULL, NULL, NULL,
NULL, socket_id,
0);
if (rte_cryptodev_queue_pair_setup(cdev_id, 0, &qp_conf,
socket_id, session_pool) < 0)
rte_exit(EXIT_FAILURE, "Failed to setup queue pair\n");
if (rte_cryptodev_start(cdev_id) < 0)
rte_exit(EXIT_FAILURE, "Failed to start device\n");
if (rte_cryptodev_sym_session_init(cdev_id, session,
&cipher_xform, session_pool) < 0)
rte_exit(EXIT_FAILURE, "Session could not be initialized "
"for the crypto device\n");
generate_random_bytes(iv_ptr, AES_CBC_IV_LENGTH);
op->sym->cipher.data.offset = 0;
op->sym->cipher.data.length = BUFFER_SIZE;
/*
* Dequeue the crypto operations until all the operations
* are processed in the crypto device.
*/
uint16_t num_dequeued_ops, total_num_dequeued_ops = 0;
do {
struct rte_crypto_op *dequeued_ops[BURST_SIZE];
num_dequeued_ops = rte_cryptodev_dequeue_burst(cdev_id, 0,
dequeued_ops, BURST_SIZE);
total_num_dequeued_ops += num_dequeued_ops;
/* num_enqueued_ops holds the count returned by the (elided) enqueue call. */
} while (total_num_dequeued_ops < num_enqueued_ops);
The cryptodev Library API is described in the DPDK API Reference document.
THIRTEEN
SECURITY LIBRARY
The security library provides a framework for management and provisioning of security protocol
operations offloaded to hardware based devices. The library defines generic APIs to create and
free security sessions which can support full protocol offload as well as inline crypto operation
with NIC or crypto devices. The framework currently only supports the IPSec protocol and
associated operations, other protocols will be added in future.
The security library provides an additional offload capability to an existing crypto device and/or
ethernet device.
+---------------+
| rte_security |
+---------------+
\ /
+-----------+ +--------------+
| NIC PMD | | CRYPTO PMD |
+-----------+ +--------------+
Note: Currently, the security library does not support the case of multi-process. It will be
updated in the future releases.
Note: The underlying device may not support crypto processing for all ingress packets match-
ing a particular flow (e.g. fragmented packets); such packets will be passed up as encrypted
packets. It is the responsibility of the application to process such encrypted packets using
another crypto driver instance.
Egress Data path - The software prepares the egress packet by adding the relevant security
protocol headers; only the data is not encrypted by the software. The driver will accordingly
configure the tx descriptors. The hardware device will encrypt the data before sending the
packet out.
Ingress Data path - The security protocol headers will be removed from the packet and the
received packet will contain the decrypted data only. The driver Rx path checks the descriptors
and, based on the crypto status, sets additional flags in the rte_mbuf.ol_flags field.
Note: The underlying device in this case is stateful. It is expected that the device shall support
crypto processing for all kinds of packets matching a given flow, including fragmented packets
(post reassembly). E.g. in case of IPSec the device may internally manage anti-replay, etc. It
will provide a configuration option for the anti-replay behavior, i.e. to drop the packets or pass
them to the driver with error flags set in the descriptor.
Egress Data path - The software will send the plain packet without any security protocol head-
ers added to the packet. The driver will configure the security index and other requirements in
the tx descriptors. The hardware device will do security processing on the packet, which
includes adding the relevant protocol headers and encrypting the data before sending the
packet out. The software should make sure that the buffer has the required head room and
tail room for any protocol header addition. The software may also do early fragmentation if the
resultant packet is expected to exceed the MTU size.
Note: The underlying device will manage state information required for egress processing.
E.g. in case of IPSec, the seq number will be added to the packet, however the device shall
provide indication when the sequence number is about to overflow. The underlying device may
support post encryption TSO.
Note: In case of IPSec the device may internally manage anti-replay etc. It will provide a
configuration option for anti-replay behavior i.e. to drop the packets or pass them to driver with
error flags set in descriptor.
Encryption: The software will submit the packet to cryptodev as usual for encryption; the hard-
ware device in this case will also add the relevant security protocol header along with encrypt-
ing the packet. The software should make sure that the buffer has the required head room
and tail room for any protocol header addition.
Note: In the case of IPSec, the seq number will be added to the packet. The device shall
provide an indication when the sequence number is about to overflow.
The device (crypto or ethernet) capabilities which support security operations, are defined
by the security action type, security protocol, protocol capabilities and corresponding crypto
capabilities for security. For the full scope of the Security capability see definition of
rte_security_capability structure in the DPDK API Reference.
struct rte_security_capability;
Each driver (crypto or ethernet) defines its own private array of capabilities for the operations it
supports. Below is an example of the capabilities for a PMD which supports the IPSec protocol.
static const struct rte_security_capability pmd_security_capabilities[] = {
{ /* IPsec Lookaside Protocol offload ESP Tunnel Egress */
.action = RTE_SECURITY_ACTION_TYPE_LOOKASIDE_PROTOCOL,
.protocol = RTE_SECURITY_PROTOCOL_IPSEC,
.ipsec = {
.proto = RTE_SECURITY_IPSEC_SA_PROTO_ESP,
.mode = RTE_SECURITY_IPSEC_SA_MODE_TUNNEL,
.direction = RTE_SECURITY_IPSEC_SA_DIR_EGRESS,
.options = { 0 }
},
.crypto_capabilities = pmd_capabilities
},
{ /* IPsec Lookaside Protocol offload ESP Tunnel Ingress */
.action = RTE_SECURITY_ACTION_TYPE_LOOKASIDE_PROTOCOL,
.protocol = RTE_SECURITY_PROTOCOL_IPSEC,
.ipsec = {
.proto = RTE_SECURITY_IPSEC_SA_PROTO_ESP,
.mode = RTE_SECURITY_IPSEC_SA_MODE_TUNNEL,
.direction = RTE_SECURITY_IPSEC_SA_DIR_INGRESS,
.options = { 0 }
},
.crypto_capabilities = pmd_capabilities
},
{
.action = RTE_SECURITY_ACTION_TYPE_NONE
}
};
static const struct rte_cryptodev_capabilities pmd_capabilities[] = {
{ /* SHA1 HMAC */
.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
.sym = {
.xform_type = RTE_CRYPTO_SYM_XFORM_AUTH,
.auth = {
.algo = RTE_CRYPTO_AUTH_SHA1_HMAC,
.block_size = 64,
.key_size = {
.min = 64,
.max = 64,
.increment = 0
},
.digest_size = {
.min = 12,
.max = 12,
.increment = 0
},
.aad_size = { 0 },
.iv_size = { 0 }
}
}
},
{ /* AES CBC */
.op = RTE_CRYPTO_OP_TYPE_SYMMETRIC,
.sym = {
.xform_type = RTE_CRYPTO_SYM_XFORM_CIPHER,
.cipher = {
.algo = RTE_CRYPTO_CIPHER_AES_CBC,
.block_size = 16,
.key_size = {
.min = 16,
.max = 32,
.increment = 8
},
.iv_size = {
.min = 16,
.max = 16,
.increment = 0
}
}
}
}
};
Discovering the features and capabilities of a driver (crypto/ethernet) is achieved through the
rte_security_capabilities_get() function.
const struct rte_security_capability *rte_security_capabilities_get(uint16_t id);
This allows the user to query a specific driver and get all device security capabilities. It returns
an array of rte_security_capability structures which contains all the capabilities for that
device.
Security Sessions are created to store the immutable fields of a particular Security Association
for a particular protocol which is defined by a security session configuration structure which
is used in the operation processing of a packet flow. Sessions are used to manage protocol
specific information as well as crypto parameters. Security sessions cache this immutable data
in a optimal way for the underlying PMD and this allows further acceleration of the offload of
Crypto workloads.
The Security framework provides APIs to create and free sessions for crypto/ethernet devices,
where sessions are mempool objects. It is the application’s responsibility to create and manage
the session mempools. The mempool object size should be able to accommodate the driver’s
private data of security session.
Once the session mempools have been created, rte_security_session_create() is
used to allocate and initialize a session for the required crypto/ethernet device.
Session APIs need a parameter rte_security_ctx to identify the crypto/ethernet security
ops. This parameter can be retrieved using the APIs rte_cryptodev_get_sec_ctx() (for
crypto device) or rte_eth_dev_get_sec_ctx (for ethernet port).
Sessions already created can be updated with rte_security_session_update().
The configuration structure reuses the rte_crypto_sym_xform struct for crypto related con-
figuration. The rte_security_session_action_type enum is used to specify whether
the session is configured for Lookaside Protocol offload, Inline Crypto or Inline Protocol Of-
fload.
enum rte_security_session_action_type {
RTE_SECURITY_ACTION_TYPE_NONE,
/**< No security actions */
RTE_SECURITY_ACTION_TYPE_INLINE_CRYPTO,
/**< Crypto processing for security protocol is processed inline
* during transmission */
RTE_SECURITY_ACTION_TYPE_INLINE_PROTOCOL,
/**< All security protocol processing is performed inline during
* transmission */
RTE_SECURITY_ACTION_TYPE_LOOKASIDE_PROTOCOL
/**< All security protocol processing including crypto is performed
* on a lookaside accelerator */
};
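As a sketch of how these pieces fit together, the following creates a lookaside IPsec session;
ipsec_xform, the crypto transform chain and session_pool are assumed to have been
prepared by the application, and the field names should be checked against the rte_security
API Reference.
/* Obtain the security context of a crypto device and create a session. */
struct rte_security_ctx *ctx =
	(struct rte_security_ctx *)rte_cryptodev_get_sec_ctx(cdev_id);

struct rte_security_session_conf conf;
memset(&conf, 0, sizeof(conf));
conf.action_type = RTE_SECURITY_ACTION_TYPE_LOOKASIDE_PROTOCOL;
conf.protocol = RTE_SECURITY_PROTOCOL_IPSEC;
conf.ipsec = ipsec_xform;               /* struct rte_security_ipsec_xform, filled by the application */
conf.crypto_xform = &cipher_auth_xform; /* chained rte_crypto_sym_xform */

struct rte_security_session *sec_sess =
	rte_security_session_create(ctx, &conf, session_pool);
if (sec_sess == NULL)
	rte_exit(EXIT_FAILURE, "Failed to create security session\n");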
Currently the library defines configuration parameters for IPSec only. For other protocols like
MACSec, structures and enums are defined as placeholders which will be updated in the
future.
The rte_security Library API is described in the DPDK API Reference document.
In the case of NIC based offloads, the security session specified in the
‘rte_flow_action_security’ must be created on the same port as the flow action that is
being specified.
The ingress/egress flow attribute should match that specified in the security session if the
security session supports the definition of the direction.
Multiple flows can be configured to use the same security session. For example if the security
session specifies an egress IPsec SA, then multiple flows can be specified to that SA. In the
case of an ingress IPsec SA then it is only valid to have a single flow to map to that security
session.
Configuration Path
|
+--------|--------+
| Add/Remove |
| IPsec SA | <------ Build security flow action of
| | | ipsec transform
|--------|--------|
|
+--------V--------+
| Flow API |
+--------|--------+
|
+--------V--------+
| |
| NIC PMD | <------ Add/Remove SA to/from hw context
| |
+--------|--------+
|
+--------|--------+
| HW ACCELERATED |
| NIC |
| |
+--------|--------+
However, the API can represent IPsec crypto offload with any encapsulation:
+-------+ +--------+ +-----+
| Eth | -> ... -> | ESP | -> | END |
+-------+ +--------+ +-----+
FOURTEEN
LINK BONDING POLL MODE DRIVER LIBRARY
In addition to Poll Mode Drivers (PMDs) for physical and virtual hardware, DPDK also includes
a pure-software library that allows physical PMDs to be bonded together to create a single
logical PMD.
[Figure: a bonded ethdev exposed to the user application by DPDK, aggregating several slave
ethdev ports underneath the standard ethdev API.]
Note: The Link Bonding PMD Library is enabled by default in the build configuration files, the
library can be disabled by setting CONFIG_RTE_LIBRTE_PMD_BOND=n and recompiling the
DPDK.
Currently the Link Bonding PMD library supports the following modes of operation:
• Round-Robin (Mode 0)
• Active Backup (Mode 1)
• Balance XOR (Mode 2)
• Broadcast (Mode 3)
• Link Aggregation (Mode 4)
• Transmit Load Balancing (Mode 5)
[Figure: example distribution of packets from the user application through the bonded ethdev
to its slave ports.]
Note: The coloring differences of the packets are used to identify the different flow classifica-
tions computed by the selected transmit policy.
Bonded devices created by librte_pmd_bond are compatible with the Ethernet device API ex-
ported by the Ethernet PMDs described in the DPDK API Reference.
The Link Bonding Library supports the creation of bonded devices at application startup time
during EAL initialization using the --vdev option as well as programmatically via the C API
rte_eth_bond_create function.
Bonded devices support the dynamic addition and removal of slave devices using the
rte_eth_bond_slave_add / rte_eth_bond_slave_remove APIs.
After a slave device is added to a bonded device, the slave is stopped using rte_eth_dev_stop
and then reconfigured using rte_eth_dev_configure; the RX and TX queues are also re-
configured using rte_eth_tx_queue_setup / rte_eth_rx_queue_setup with the pa-
rameters used to configure the bonding device. If RSS is enabled for the bonding device, this
mode is also enabled on the new slave and configured accordingly.
Setting the multi-queue mode of the bonding device to RSS makes it fully RSS-capable, so all
slaves are synchronized with its configuration. This mode is intended to make the RSS configu-
ration of the slaves transparent to the client application.
The bonding device stores its own version of the RSS settings, i.e. RETA, RSS hash function
and RSS key, used to set up its slaves. This allows the RSS configuration of the bonding
device to be defined as the desired configuration of the whole bond (as one unit), without
referring to any individual slave. It is required to ensure consistency and to make the
configuration more error-proof.
The RSS hash function set for the bonding device is the maximal set of RSS hash functions
supported by all bonded slaves. The RETA size is the GCD of all the slaves' RETA sizes, so
it can be easily used as a pattern providing expected behavior even if the slave RETA sizes
differ. If an RSS key is not set for the bonded device, it is not changed on the slaves and the
default key of each device is used.
All settings are managed through the bonding port API and are always propagated in one
direction (from bonding to slaves).
Link bonding devices support the registration of a link status change callback, using the
rte_eth_dev_callback_register API, this will be called when the status of the bond-
ing device changes. For example in the case of a bonding device which has 3 slaves, the link
status will change to up when one slave becomes active or change to down when all slaves
become inactive. There is no callback notification when a single slave changes state and the
previous conditions are not met. If a user wishes to monitor individual slaves then they must
register callbacks with that slave directly.
The link bonding library also supports devices which do not implement link status change
interrupts; this is achieved by polling the device's link status at a defined period, which is
set using the rte_eth_bond_link_monitoring_set API; the default polling interval is
10ms. When a device is added as a slave to a bonding device, the RTE_PCI_DRV_INTR_LSC
flag is used to determine whether the device supports interrupts or whether the link status
should be monitored by polling it.
The current implementation only supports devices with the same speed and duplex being
added as slaves to the same bonded device. The bonded device inherits these attributes
from the first active slave added to it, and all further slaves added to the bonded device must
support these parameters.
A bonding device must have a minimum of one slave before the bonding device itself can be
started.
To use the bonding device's dynamic RSS configuration feature effectively, it is also required
that all slaves be RSS-capable and support at least one common hash function.
Changing the RSS key is only possible when all slave devices support the same key size.
To prevent inconsistency on how slaves process packets, once a device is added to a bonding
device, RSS configuration should be managed through the bonding device API, and not directly
on the slave.
Like all other PMDs, all functions exported by the bonding PMD are lock-free functions that are
assumed not to be invoked in parallel on different logical cores operating on the same target
object.
It should also be noted that the PMD receive function should not be invoked directly on slave
devices after they have been added to a bonded device, since packets read directly from the
slave device will no longer be available to the bonded device to read.
14.2.3 Configuration
Link bonding devices are created using the rte_eth_bond_create API which requires a
unique device name, the bonding mode, and the socket Id to allocate the bonding device’s
resources on. The other configurable parameters for a bonded device are its slave devices, its
primary slave, a user defined MAC address and transmission policy to use if the device is in
balance XOR mode.
Slave Devices
Primary Slave
The primary slave is used to define the default port to use when a bonded device is in active
backup mode. A different port will only be used if, and only if, the current primary port goes
down. If the user does not specify a primary port it will default to being the first port added to
the bonded device.
MAC Address
The bonded device can be configured with a user specified MAC address; this address will be
inherited by some or all slave devices depending on the operating mode. If the device is in
active backup mode then only the primary device will have the user specified MAC; all other
slaves will retain their original MAC address. In modes 0, 2, 3 and 4, all slave devices are
configured with the bonded device's MAC address.
If a user defined MAC address is not defined then the bonded device will default to using the
primary slave's MAC address.
There are 3 supported transmission policies for a bonded device running in Balance XOR
mode: Layer 2, Layer 2+3, Layer 3+4.
• Layer 2: Ethernet MAC address based balancing is the default transmission policy for
Balance XOR bonding mode. It uses a simple XOR calculation on the source MAC
address and destination MAC address of the packet and then calculates the modulus of
this value to determine the slave device to transmit the packet on.
• Layer 2 + 3: Ethernet MAC address & IP Address based balancing uses a combination of
source/destination MAC addresses and the source/destination IP addresses of the data
packet to decide which slave port the packet will be transmitted on.
• Layer 3 + 4: IP Address & UDP Port based balancing uses a combination of
source/destination IP addresses and the source/destination UDP ports of the data packet
to decide which slave port the packet will be transmitted on.
All these policies support 802.1Q VLAN Ethernet packets, as well as IPv4, IPv6 and UDP
protocols for load balancing.
The librte_pmd_bond library supports two modes of device creation: using the library's full C
API, or using the EAL command line to statically configure link bonding devices at application
startup. Using the EAL option it is possible to use link bonding functionality transparently
without specific knowledge of the library's API; this can be used, for example, to add bonding
functionality, such as active backup, to an existing application which has no knowledge of the
link bonding C API.
Using the librte_pmd_bond library's API it is possible to dynamically create and manage
link bonding devices from within any application. Link bonding devices are created using the
rte_eth_bond_create API, which requires a unique device name, the link bonding mode
to initialize the device in and finally the socket ID on which to allocate the device's resources.
After successful creation of a bonding device it must be configured using the generic Ethernet
device configure API rte_eth_dev_configure and then the RX and TX queues which will
be used must be setup using rte_eth_tx_queue_setup / rte_eth_rx_queue_setup.
Slave devices can be dynamically added and removed from a link bonding device us-
ing the rte_eth_bond_slave_add / rte_eth_bond_slave_remove APIs but at least
one slave device must be added to the link bonding device before it can be started using
rte_eth_dev_start.
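A condensed sketch of this sequence follows; mbuf_pool, slave_port_0 and slave_port_1 are
assumed to have been created/probed by the application, and the mode and queue sizes are
illustrative.
/* Sketch: create an active-backup bonded device, attach two slaves and
 * start it. Error handling is reduced to one check per step. */
int bond_port = rte_eth_bond_create("net_bonding0",
		BONDING_MODE_ACTIVE_BACKUP, rte_socket_id());
if (bond_port < 0)
	rte_exit(EXIT_FAILURE, "Failed to create bonded device\n");

struct rte_eth_conf port_conf = { 0 };
if (rte_eth_dev_configure(bond_port, 1, 1, &port_conf) != 0 ||
    rte_eth_rx_queue_setup(bond_port, 0, 128, rte_socket_id(),
		NULL, mbuf_pool) != 0 ||
    rte_eth_tx_queue_setup(bond_port, 0, 512, rte_socket_id(), NULL) != 0)
	rte_exit(EXIT_FAILURE, "Failed to configure bonded device\n");

if (rte_eth_bond_slave_add(bond_port, slave_port_0) != 0 ||
    rte_eth_bond_slave_add(bond_port, slave_port_1) != 0)
	rte_exit(EXIT_FAILURE, "Failed to add slaves\n");

if (rte_eth_dev_start(bond_port) != 0)
	rte_exit(EXIT_FAILURE, "Failed to start bonded device\n");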
The link status of a bonded device is dictated by that of its slaves, if all slave device link status
are down or if all slaves are removed from the link bonding device then the link status of the
bonding device will go down.
It is also possible to configure / query the configuration of the control param-
eters of a bonded device using the provided APIs rte_eth_bond_mode_set/
get, rte_eth_bond_primary_set/get, rte_eth_bond_mac_set/reset and
rte_eth_bond_xmit_policy_set/get.
14.3.2 Using Link Bonding Devices from the EAL Command Line
Link bonding devices can be created at application startup time using the --vdev EAL com-
mand line option. The device name must start with the net_bonding prefix followed by numbers
or letters. The name must be unique for each device. Each device can have multiple options
arranged in a comma separated list. Multiple devices definitions can be arranged by calling the
--vdev option multiple times.
Device names and bonding options must be separated by commas as shown below:
$RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,bond_opt0=..,bond opt1=..'--vdev 'net_
Device definitions can be arranged and combined in multiple ways as long as the following
rules are respected:
• A unique device name, in the format of net_bondingX is provided, where X can be any
combination of numbers and/or letters, and the name is no greater than 32 characters
long.
• At least one slave device is provided for each bonded device definition.
• The operation mode of the bonded device being created is provided.
The different options are:
• mode: Integer value defining the bonding mode of the device. Currently supports modes
0,1,2,3,4,5 (round-robin, active backup, balance, broadcast, link aggregation, transmit
load balancing).
mode=2
• slave: Defines the PMD device which will be added as a slave to the bonded device.
This option can be selected multiple times, once for each device to be added as a slave.
Physical devices should be specified using their PCI address, in the format
domain:bus:devid.function
slave=0000:0a:00.0,slave=0000:0a:00.1
• primary: Optional parameter which defines the primary slave port; it is used in active
backup mode to select the primary slave for data TX/RX if it is available. The primary
port is also used to select the MAC address to use when it is not defined by the user.
This defaults to the first slave added to the device if it is not specified. The primary device
must be a slave of the bonded device.
primary=0000:0a:00.0
• socket_id: Optional parameter used to select which socket on a NUMA device the bonded
devices resources will be allocated on.
socket_id=0
• mac: Optional parameter to select a MAC address for link bonding device, this overrides
the value of the primary slave device.
mac=00:1e:67:1d:fd:1d
• xmit_policy: Optional parameter which defines the transmission policy when the bonded
device is in balance mode. If not user specified this defaults to l2 (layer 2) forwarding, the
other transmission policies available are l23 (layer 2+3) and l34 (layer 3+4)
xmit_policy=l23
Examples of Usage
Create a bonded device in round robin mode with two slaves specified by their PCI address:
$RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=0,slave=0000:0a:00.01,slave=0000:
Create a bonded device in round robin mode with two slaves specified by their PCI address
and an overriding MAC address:
$RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=0,slave=0000:0a:00.01,slave=0000:
Create a bonded device in active backup mode with two slaves specified, and a primary slave
specified by their PCI addresses:
$RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=1,slave=0000:0a:00.01,slave=0000:
Create a bonded device in balance mode with two slaves specified by their PCI addresses,
and a transmission policy of layer 3 + 4 forwarding:
$RTE_TARGET/app/testpmd -l 0-3 -n 4 --vdev 'net_bonding0,mode=2,slave=0000:0a:00.01,slave=0000:
FIFTEEN
TIMER LIBRARY
The Timer library provides a timer service to DPDK execution units to enable execution of
callback functions asynchronously. Features of the library are:
• Timers can be periodic (multi-shot) or single (one-shot).
• Timers can be loaded from one core and executed on another. It has to be specified in
the call to rte_timer_reset().
• Timers provide high precision (depends on the call frequency to rte_timer_manage() that
checks timer expiration for the local core).
• If not required in the application, timers can be disabled at compilation time by not calling
the rte_timer_manage() to increase performance.
The timer library uses the rte_get_timer_cycles() function that uses the High Precision Event
Timer (HPET) or the CPUs Time Stamp Counter (TSC) to provide a reliable time reference.
This library provides an interface to add, delete and restart a timer. The API is based on BSD
callout() with a few differences. Refer to the callout manual.
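A minimal sketch of this interface: initialize the timer subsystem, arm a periodic timer on the
local lcore and service it from the main loop. The callback body and the 100 ms period are
placeholders.
#include <rte_timer.h>
#include <rte_cycles.h>
#include <rte_lcore.h>

static struct rte_timer my_timer;

/* Callback executed by rte_timer_manage() on the owning lcore. */
static void
timer_cb(struct rte_timer *tim, void *arg)
{
	(void)tim;
	(void)arg;
	/* periodic work, e.g. ageing a table */
}

static void
setup_and_poll(void)
{
	rte_timer_subsystem_init();
	rte_timer_init(&my_timer);

	/* Fire every 100 ms on the calling lcore, re-armed automatically. */
	uint64_t ticks = rte_get_timer_hz() / 10;
	rte_timer_reset(&my_timer, ticks, PERIODICAL, rte_lcore_id(),
			timer_cb, NULL);

	for (;;) {
		/* ... packet processing ... */
		rte_timer_manage();   /* runs expired timers for this lcore */
	}
}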
Timers are tracked on a per-lcore basis, with all pending timers for a core being maintained
in order of timer expiry in a skiplist data structure. The skiplist used has ten levels and each
entry in the table appears in each level with probability ¼^level. This means that all entries are
present in level 0, 1 in every 4 entries is present at level 1, one in every 16 at level 2 and so on
up to level 9. This means that adding and removing entries from the timer list for a core can be
done in log(n) time, up to 4^10 entries, that is, approximately 1,000,000 timers per lcore.
A timer structure contains a special field called status, which is a union of a timer state
(stopped, pending, running, config) and an owner (lcore id). Depending on the timer state,
we know if a timer is present in a list or not:
• STOPPED: no owner, not in a list
• CONFIG: owned by a core, must not be modified by another core, maybe in a list or not,
depending on previous state
• PENDING: owned by a core, present in a list
• RUNNING: owned by a core, must not be modified by another core, present in a list
Resetting or stopping a timer while it is in a CONFIG or RUNNING state is not allowed. When
modifying the state of a timer, a Compare And Swap instruction should be used to guarantee
that the status (state+owner) is modified atomically.
Inside the rte_timer_manage() function, the skiplist is used as a regular list by iterating along
the level 0 list, which contains all timer entries, until an entry which has not yet expired has
been encountered. To improve performance in the case where there are entries in the timer
list but none of those timers have yet expired, the expiry time of the first list entry is maintained
within the per-core timer list structure itself. On 64-bit platforms, this value can be checked
without the need to take a lock on the overall structure. (Since expiry times are maintained
as 64-bit values, a check on the value cannot be done on 32-bit platforms without using either
a compare-and-swap (CAS) instruction or using a lock, so this additional check is skipped in
favor of checking as normal once the lock has been taken.) On both 64-bit and 32-bit platforms,
a call to rte_timer_manage() returns without taking a lock in the case where the timer list for
the calling core is empty.
The timer library is used for periodic calls, such as garbage collectors, or some state machines
(ARP, bridging, and so on).
15.3 References
• callout manual - The callout facility that provides timers with a mechanism to execute a
function at a given time.
• HPET - Information about the High Precision Event Timer (HPET).
SIXTEEN
HASH LIBRARY
The DPDK provides a Hash Library for creating hash tables used for fast lookup. The hash
table is a data structure optimized for searching through a set of entries that are each identified
by a unique key. For increased performance the DPDK Hash requires that all the keys have the
same number of bytes, which is set at hash creation time.
• Combination of the two options above: User can provide key, precomputed hash and
data.
Also, the API contains a method to allow the user to look up entries in bursts, achieving higher
performance than looking up individual entries, as the function prefetches next entries at the
time it is operating with the first ones, which reduces significantly the impact of the necessary
memory accesses. Notice that this method uses a pipeline of 8 entries (4 stages of 2 entries),
so it is highly recommended to use at least 8 entries per burst.
The actual data associated with each key can be either managed by the user using a separate
table that mirrors the hash in terms of number of entries and position of each entry, as shown
in the Flow Classification use case describes in the following sections, or stored in the hash
table itself.
The example hash tables in the L2/L3 Forwarding sample applications define which port to
forward a packet to based on a packet flow identified by the five-tuple lookup. However, this
table could also be used for more sophisticated features and provide many other functions and
actions that could be performed on the packets and flows.
The hash library can be used in a multi-process environment, minding that only lookups
are thread-safe. The only function that can only be used in single-process mode is
rte_hash_set_cmp_func(), which sets up a custom compare function, which is assigned to
a function pointer (therefore, it is not supported in multi-process mode).
Comparing the full input key against a key from the bucket can take significantly more time
than comparing the 4-byte signature of the input key against the signature of a key from the
bucket. Therefore, the signature comparison is done first and the full key comparison is done
only when the signatures match. The full key comparison is still necessary, as two input keys
from the same bucket can still potentially have the same 4-byte hash signature, although this
event is relatively rare for hash functions providing good uniform distributions for the set of
input keys.
Example of lookup:
First, the primary bucket is identified; the entry is likely to be stored there. If the signature is
stored there, its key is compared against the one provided, and the position where it was stored
and/or the data associated with that key is returned if there is a match. If the signature is not in
the primary bucket, the secondary bucket is looked up, where the same procedure is carried
out. If there is no match there either, the key is considered not to be in the table.
Example of addition:
Like lookup, the primary and secondary buckets are identified. If there is an empty slot in the
primary bucket, the primary and secondary signatures are stored in that slot, the key and data
(if any) are added to the second table and an index to the position in the second table is stored
in the slot of the first table. If there is no space in the primary bucket, one of the entries of that
bucket is pushed to its alternative location, and the key to be added is inserted in its position.
To know where the alternative bucket of the evicted entry is, the secondary signature is looked
up and the alternative bucket index is calculated by doing the modulo, as seen above. If there
is room in the alternative bucket, the evicted entry is stored in it. If not, the same process is
repeated (one of the entries gets pushed) until a non-full bucket is found. Notice that despite
all the entry movement in the first table, the second table is not touched, which would otherwise
greatly impact performance.
In the very unlikely event that the table enters a loop where the same entries are being evicted
indefinitely, the key is considered not able to be stored. With random keys, this method allows
the user to reach around 90% table utilization without having to drop any stored entry (as in
LRU) or allocate more memory (as with extended buckets).
As mentioned above, the Cuckoo hash implementation pushes elements out of their bucket
and into their alternative location when a new entry whose primary location coincides with their
current bucket has to be added. Therefore, as the user adds more entries to the hash table,
the distribution of the hash values in the buckets changes, with most of them in their primary
location and a few in their secondary location; the latter share increases as the table gets
busier. This information is quite useful, as performance may be lower as more entries are
evicted to their secondary location.
See the tables below showing example entry distribution as table utilization increases.
Note: The last values in the tables are the average maximum table utilization with random
keys and using the Jenkins hash function.
Flow classification is used to map each input packet to the connection/flow it belongs to. This
operation is necessary as the processing of each input packet is usually done in the context
of their connection, so the same set of operations is applied to all the packets from the same
flow.
Applications using flow classification typically have a flow table to manage, with each separate
flow having an entry associated with it in this table. The size of the flow table entry is application
specific, with typical values of 4, 16, 32 or 64 bytes.
Each application using flow classification typically has a mechanism defined to uniquely iden-
tify a flow based on a number of fields read from the input packet that make up the flow key.
One example is to use the DiffServ 5-tuple made up of the following fields of the IP and trans-
port layer packet headers: Source IP Address, Destination IP Address, Protocol, Source Port,
Destination Port.
The DPDK hash provides a generic method to implement an application specific flow classifi-
cation mechanism. Given a flow table implemented as an array, the application should create
a hash object with the same number of entries as the flow table and with the hash key size set
to the number of bytes in the selected flow key.
The flow table operations on the application side are described below, followed by a short usage sketch:
• Add flow: Add the flow key to hash. If the returned position is valid, use it to access the
flow entry in the flow table for adding a new flow or updating the information associated
with an existing flow. Otherwise, the flow addition failed, for example due to lack of free
entries for storing new flows.
• Delete flow: Delete the flow key from the hash. If the returned position is valid, use it to
access the flow entry in the flow table to invalidate the information associated with the
flow.
• Lookup flow: Lookup for the flow key in the hash. If the returned position is valid (flow
lookup hit), use the returned position to access the flow entry in the flow table. Otherwise
(flow lookup miss) there is no flow registered for the current packet.
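A minimal sketch of these operations using the rte_hash API is shown below. The flow key layout, table size, socket id and the flow_table array are illustrative assumptions, not requirements of the library.

#include <rte_hash.h>
#include <rte_jhash.h>

/* Illustrative flow key: DiffServ 5-tuple. */
struct flow_key {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  proto;
};

struct flow_entry { uint64_t pkt_count; };    /* application-specific flow entry */
static struct flow_entry flow_table[1 << 16]; /* flow table implemented as an array */

static struct rte_hash *
flow_hash_create(void)
{
    struct rte_hash_parameters params = {
        .name = "flow_hash",
        .entries = 1 << 16,                  /* same number of entries as the flow table */
        .key_len = sizeof(struct flow_key),
        .hash_func = rte_jhash,
        .hash_func_initial_val = 0,
        .socket_id = 0,                      /* assumed NUMA socket */
    };

    return rte_hash_create(&params);
}

static void
handle_packet(struct rte_hash *h, const struct flow_key *key)
{
    int pos = rte_hash_lookup(h, key);       /* lookup flow */

    if (pos < 0) {
        pos = rte_hash_add_key(h, key);      /* add flow on lookup miss */
        if (pos < 0)
            return;                          /* flow addition failed (e.g. no free entries) */
    }
    flow_table[pos].pkt_count++;             /* use returned position to index the flow table */
}

Deleting a flow would similarly call rte_hash_del_key() and, if the returned position is valid, invalidate flow_table[pos].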
16.6 References
• Donald E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching
(2nd Edition), 1998, Addison-Wesley Professional
SEVENTEEN
ELASTIC FLOW DISTRIBUTOR LIBRARY
17.1 Introduction
In Data Centers today, clustering and scheduling of distributed workloads is a very common
task. Many workloads require a deterministic partitioning of a flat key space among a cluster
of machines. When a packet enters the cluster, the ingress node will direct the packet to its
handling node. For example, data centers with disaggregated storage use storage metadata
tables to forward I/O requests to the correct back end storage cluster, stateful packet inspection
will match incoming flows against signatures in flow tables to send incoming packets to their
intended deep packet inspection (DPI) devices, and so on.
EFD is a distributor library that uses perfect hashing to determine a target/value for a given
incoming flow key. It has the following advantages: first, because it uses perfect hashing
it does not store the key itself and hence lookup performance is not dependent on the key
size. Second, the target/value can be any arbitrary value, hence the system designer and/or
operator can better optimize service rates and the locality of inter-cluster network traffic. Third,
since the storage requirement is much smaller than a hash-based flow table (i.e. better fit for
CPU cache), EFD can scale to millions of flow keys. Finally, with the current optimized library
implementation, performance is fully scalable with any number of CPU cores.
Flow distribution and/or load balancing can be simply done using a stateless computation, for
instance using round-robin or a simple computation based on the flow key as an input. For
example, a hash function can be used to direct a certain flow to a target based on the flow key
(e.g. h(key) mod n) where h(key) is the hash value of the flow key and n is the number of
possible targets.
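As a sketch of this computation-based scheme (the number of targets and the use of rte_jhash are illustrative assumptions):

#include <rte_jhash.h>

#define NUM_TARGETS 8   /* n: assumed number of back end targets */

/* Stateless distribution: target = h(key) mod n. */
static inline unsigned int
pick_target(const void *flow_key, uint32_t key_len)
{
    return rte_jhash(flow_key, key_len, 0) % NUM_TARGETS;
}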
In this scheme (Fig. 17.1), the front end server/distributor/load balancer extracts the flow key
from the input packet and applies a computation to determine where this flow should be di-
rected. Intuitively, this scheme is very simple and requires no state to be kept at the front end
node, and hence, storage requirements are minimal.
A widely used flow distributor that belongs to the same category of computation-based
schemes is consistent hashing, shown in Fig. 17.2. Target destinations (shown in red)
are hashed into the same space as the flow keys (shown in blue), and keys are mapped to the
nearest target in a clockwise fashion. Dynamically adding and removing targets with consistent
hashing requires only K/n keys to be remapped on average, where K is the number of keys and n is the number of targets.
[Fig. 17.1 / Fig. 17.2: A load balancer (LB) hashes flow keys to target values; with consistent hashing, keys and targets are hashed into the same space.]
When using a Flow-Table based scheme to handle flow distribution/load balancing, in contrast
with computation-based schemes, the system designer has the flexibility of assigning a given
flow to any given target. The flow table (e.g. DPDK RTE Hash Library) will simply store both
the flow key and the target value.
[Fig. 17.3: A hash function is used to index the flow table; the keys stored at that index are matched with the input key and the corresponding action is retrieved.]
As shown in Fig. 17.3, when doing a lookup, the flow-table is indexed with the hash of the flow
key and the keys (more than one is possible, because of hash collision) stored in this index
and corresponding values are retrieved. The retrieved key(s) is matched with the input flow key
and if there is a match the value (target id) is returned.
The drawback of using a hash table for flow distribution/load balancing is the storage require-
ment, since the flow table needs to store keys, signatures and target values. This doesn't allow
this scheme to scale to millions of flow keys. Large tables will usually not fit in the CPU cache,
and hence, the lookup performance is degraded because of the latency to access the main
memory.
EFD combines the advantages of both flow-table based and computation-based schemes.
It doesn’t require the large storage necessary for flow-table based schemes (because EFD
doesn’t store the key as explained below), and it supports any arbitrary value for any given key.
[Fig. 17.4: A family of hash functions H1(x) ... Hm(x) is searched until one maps the key to its correct target value.]
The basic idea of EFD is when a given key is to be inserted, a family of hash functions is
searched until the correct hash function that maps the input key to the correct value is found, as
shown in Fig. 17.4. However, rather than explicitly storing all keys and their associated values,
EFD stores only indices of hash functions that map keys to values, and thereby consumes
much less space than conventional flow-based tables. The lookup operation is very simple,
similar to a computational-based scheme: given an input key the lookup operation is reduced
to hashing that key with the correct hash function.
[Fig. 17.5: The whole set of input keys is divided into many small groups.]
Intuitively, finding a hash function that maps each of a large number (millions) of input keys
to the correct output value is effectively impossible. As a result, EFD, as shown in Fig. 17.5,
breaks the problem into smaller pieces (divide and conquer). EFD divides the entire input key
set into many small groups, each consisting of approximately 20-28 keys (a configurable
parameter for the library); then, for each small group, a brute force search is performed to find
a hash function that produces the correct output for every key in the group.
It should be mentioned that, since the online lookup table for EFD doesn't store the key itself,
the size of the EFD table is independent of the key size. Hence, EFD lookup performance is
almost constant irrespective of the length of the key, which is a highly desirable feature,
especially for longer keys.
In summary, EFD is a set separation data structure that supports millions of keys. It is used to
distribute a given key to an intended target. By itself, EFD is not a FIB data structure that
performs an exact match on the input flow key.
EFD can be used along the data path of many network functions and middleboxes. As previ-
ously mentioned, it can be used as an index table for <key,value> pairs, meta-data for objects, a
flow-level load balancer, etc. Fig. 17.6 shows an example of using EFD as a flow-level load
balancer, where flows are received at a front end server before being forwarded to the target
back end server for processing. The system designer would deterministically co-locate flows
together in order to minimize cross-server interaction. (For example, flows requesting certain
webpage objects are co-located together, to minimize forwarding of common objects across
servers).
[Fig. 17.6: EFD as a flow-level load balancer: a front end load balancer distributes flows to X back end servers, each supporting N flows (X*N flows in total).]
As shown in Fig. 17.6, the front end server will have an EFD table that stores for each group
what is the perfect hash index that satisfies the correct output. Because the table size is small
and fits in cache (since keys are not stored), it sustains a large number of flows (N*X, where N
is the maximum number of flows served by each back end server of the X possible targets).
With an input flow key, the group id is computed (for example, using last few bits of CRC hash)
and then the EFD table is indexed with the group id to retrieve the corresponding hash index to
use. Once the index is retrieved the key is hashed using this hash function and the result will
be the intended correct target where this flow is supposed to be processed.
It should be noted that as a result of EFD not matching the exact key but rather distributing
the flows to a target back end node based on the perfect hash index, a key that has not been
inserted before will be distributed to a valid target. Hence, a local table which stores the flows
served at each node is used and is exact matched with the input key to rule out new never
seen before flows.
The EFD library API is created with semantics very similar to those of a hash index or a flow table.
The application creates an EFD table for a given maximum number of flows, a function is called
to insert a flow key with a specific target value, and another function is used to retrieve target
values for a given individual flow key or a bulk of keys.
The function rte_efd_create() is used to create and return a pointer to an EFD table that
is sized to hold up to num_flows keys. The online version of the EFD table (the one that does
not store the keys and is used for lookups) will be allocated and created in the last level cache
(LLC) of the socket defined by the online_socket_bitmask, while the offline EFD table (the
one that stores the keys and is used for key inserts and for computing the perfect hashing) is
allocated and created in the LLC of the socket defined by offline_socket_bitmask. It should
be noted, that for highest performance the socket id should match that where the thread is
running, i.e. the online EFD lookup table should be created on the same socket as where the
lookup thread is running.
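A minimal creation sketch is shown below. It assumes the rte_efd_create() prototype from rte_efd.h in this release (name, maximum number of flows, key length, online CPU socket bitmask, offline socket); the table capacity, key length and socket choice are illustrative.

#include <rte_efd.h>
#include <rte_lcore.h>

#define EFD_MAX_FLOWS (1 << 20)          /* assumed table capacity */
#define EFD_KEY_LEN   sizeof(uint32_t)   /* assumed flow key length */

/* Create an EFD table with the online table on the caller's socket. */
static struct rte_efd_table *
create_flow_efd(void)
{
    unsigned int socket_id = rte_socket_id();

    return rte_efd_create("flow_efd", EFD_MAX_FLOWS, EFD_KEY_LEN,
                          1 << socket_id,   /* online socket bitmask */
                          socket_id);       /* socket for the offline table */
}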
The EFD function to insert a key or update a key to a new value is rte_efd_update().
This function will update an existing key to a new value (target) if the key has already been
inserted before, or will insert the <key,value> pair if this key has not been inserted before. It
will return 0 upon success. It will return EFD_UPDATE_WARN_GROUP_FULL (1) if the op-
eration is insert, and the last available space in the key’s group was just used. It will return
EFD_UPDATE_FAILED (2) when the insertion or update has failed (either it failed to find a
suitable perfect hash or the group was full). The function will return EFD_UPDATE_NO_CHANGE
(3) if there is no change to the EFD table (i.e, same value already exists).
Note: This function is not multi-thread safe and should only be called from one thread.
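A minimal insertion sketch, assuming a 4-byte flow key (matching the key length used at creation) and the single-writer constraint noted above:

#include <rte_efd.h>
#include <rte_lcore.h>

/* Insert (or update) a flow key with its target back end id.
 * Must be called from a single writer thread, as noted above. */
static int
efd_assign_flow(struct rte_efd_table *efd, uint32_t flow_key,
                efd_value_t target_id)
{
    /* Returns 0 on success; see the warning/error codes described above. */
    return rte_efd_update(efd, rte_socket_id(), &flow_key, target_id);
}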
To lookup a certain key in an EFD table, the function rte_efd_lookup() is used to return the
value associated with single key. As previously mentioned, if the key has been inserted, the cor-
rect value inserted is returned, if the key has not been inserted before, a ‘random’ value (based
on hashing of the key) is returned. For better performance and to decrease the overhead of
function calls per key, it is always recommended to use a bulk lookup function (simultaneous
lookup of multiple keys) instead of a single key lookup function. rte_efd_lookup_bulk()
is the bulk lookup function; it looks up num_keys keys (stored in key_list) simultaneously, and
the corresponding return values are returned in value_list.
Note: This function is multi-thread safe, but there should not be other threads writing in the
EFD table, unless locks are used.
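A lookup sketch follows; the 4-byte key type and the burst size are illustrative assumptions:

#include <rte_efd.h>
#include <rte_lcore.h>

#define LOOKUP_BURST 32   /* assumed maximum burst size */

/* Single-key lookup: returns the value the key maps to. */
static efd_value_t
efd_flow_target(const struct rte_efd_table *efd, uint32_t flow_key)
{
    return rte_efd_lookup(efd, rte_socket_id(), &flow_key);
}

/* Bulk lookup of a burst of keys (preferred for performance). */
static void
efd_flow_targets(const struct rte_efd_table *efd, const uint32_t *keys,
                 int num_keys, efd_value_t *targets)
{
    const void *key_list[LOOKUP_BURST];
    int i;

    for (i = 0; i < num_keys && i < LOOKUP_BURST; i++)
        key_list[i] = &keys[i];
    rte_efd_lookup_bulk(efd, rte_socket_id(), i, key_list, targets);
}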
To delete a certain key in an EFD table, the function rte_efd_delete() can be used. The
function returns zero upon success when the key has been found and deleted. Socket_id is
the parameter to use to lookup the existing value, which is ideally the caller’s socket id. The
previous value associated with this key will be returned in the prev_value argument.
Note: This function is not multi-thread safe and should only be called from one thread.
This section provides a brief high-level idea and an overview of the library internals to accompany
the RFC. The intent of this section is to explain to readers the high-level implementation
of insert, lookup and group rebalancing in the EFD library.
As previously mentioned, EFD divides the whole set of keys into groups of a manageable
size (e.g. 28 keys) and then searches for the perfect hash that satisfies the intended target
value for each key. EFD stores two versions of the <key,value> table:
• Offline Version (in memory): Only used for the insertion/update operation, which is less
frequent than the lookup operation. In the offline version the exact keys for each group are
stored. When a new key is added, the hash function is updated so that it satisfies the value
for the new key together with all the old keys already inserted in this group.
• Online Version (in cache): Used for the frequent lookup operation. In the online version,
as previously mentioned, the keys are not stored but rather only the hash index for each
group.
Fig. 17.7 depicts the group assignment for 7 flow keys as an example. Given a flow key, a hash
function (in our implementation CRC hash) is used to get the group id. As shown in the figure,
the groups can be unbalanced. (We highlight group rebalancing further below).
Focusing on one group that has four keys, Fig. 17.8 depicts the search algorithm to find the
perfect hash function. Assuming that the target value bit for the keys is as shown in the figure,
then the online EFD table will store a 16 bit hash index and 16 bit lookup table per group per
value bit.
For a given keyX, a hash function (h(keyX,seed1) + index * h(keyX,seed2)) is used
to point to a certain bit index in the 16-bit lookup_table value, as shown in Fig. 17.9. The insert
function will brute force search all possible values of the hash index until a non-conflicting
lookup_table is found.
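The sketch below illustrates the spirit of this brute-force search for a single target value bit. It is illustrative only (not the library's internal code) and assumes two precomputed 32-bit hashes per key and a 16-bit lookup table:

#include <stdint.h>

/* Brute-force search for a hash_index such that every key in the group maps
 * to a lookup_table bit equal to its target value bit. h1[] and h2[] are the
 * two per-key hashes (e.g. CRC32 with two seeds), value_bit[] the target bits. */
static int
find_hash_index(const uint32_t *h1, const uint32_t *h2, const uint8_t *value_bit,
                int num_keys, uint16_t *hash_index, uint16_t *lookup_table)
{
    uint32_t idx;
    int k;

    for (idx = 0; idx <= UINT16_MAX; idx++) {
        uint16_t table = 0, used = 0;
        int ok = 1;

        for (k = 0; k < num_keys; k++) {
            unsigned int bit = (h1[k] + idx * h2[k]) % 16;

            if ((used & (1u << bit)) &&
                ((table >> bit) & 1) != value_bit[k]) {
                ok = 0;        /* conflict: two keys need different bits here */
                break;
            }
            used |= 1u << bit;
            table |= (uint16_t)((value_bit[k] & 1) << bit);
        }
        if (ok) {
            *hash_index = (uint16_t)idx;   /* non-conflicting index found */
            *lookup_table = table;
            return 0;
        }
    }
    return -1;   /* no suitable perfect hash for this group */
}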
[Fig. 17.7: Group assignment (simplified): keys are separated into a number of groups based on some bits from their hash; each group contains a small number of keys (< 28), and the group identifier (e.g. 0x0102, 0x0103, 0x0104) is derived from the hash.]
Fig. 17.8: Perfect Hash Search - Assigned Keys & Target Value
[Figure content: for a group with Key1 (value 0), Key3 (value 1), Key4 (value 0) and Key7 (value 1), the goal is to find a valid hash_index such that (hash(key, seed1) + hash_index * hash(key, seed2)) % 16 points each key to a bit of the 16-bit lookup_table that matches its target value; the per-key hashes are 32-bit CRC32 outputs.]
For example, since both key3 and key7 have a target bit value of 1, it is okay if the hash function
of both keys point to the same bit in the lookup table. A conflict will occur if a hash index is
used that maps both Key4 and Key7 to the same index in the lookup_table, as shown in Fig.
17.10, since their target value bit are not the same. Once a hash index is found that produces
a lookup_table with no contradictions, this index is stored for this group. This procedure is
repeated for each bit of target value.
The design principle of EFD is that lookups are much more frequent than inserts, and hence,
EFD’s design optimizes for the lookups which are faster and much simpler than the slower
insert procedure (inserts are slow, because of perfect hash search as previously discussed).
[Fig. 17.9 - Fig. 17.11: Lookup example for group ID 0x0102 with hash_index = 38123 and lookup_table = 0110 1100 0101 1101: applying (hash(key,seed1) + 38123 * hash(key,seed2)) % 16 gives bit position 6 in the lookup_table, from which the stored value '0' is retrieved.]
Fig. 17.11 depicts the lookup operation for EFD. Given an input key, the group id is computed
(using CRC hash) and then the hash index for this group is retrieved from the EFD table. Using
the retrieved hash index, the hash function h(key,seed1) + index * h(key,seed2) is
used, which results in an index into the lookup_table; the bit corresponding to this index will be
the target value bit. This procedure is repeated for each bit of the target value.
When discussing EFD inserts and lookups, the discussion is simplified by assuming that a
group id is simply the result of a hash function. However, since hashing in general is not perfect
and will not always produce a uniform output, this simplified assumption will lead to unbalanced
groups, i.e., some groups will have more keys than others. Typically, and to minimize insert
time with an increasing number of keys, it is preferable that all groups have a balanced
number of keys, so the brute force search for the perfect hash terminates with a valid hash
index. In order to achieve this target, groups are rebalanced during runtime inserts, and keys
are moved from a busy group to a less crowded group as more keys are inserted.
[Fig. 17.12: Group rebalancing: the hash of the key (e.g. 0x0102ABCD) is split into a chunk id and an 8-bit bin id; each chunk has 256 bins and 64 groups, the number of chunks is a power of 2, and a bin is moved from group 1 to group 4 to balance the per-group key counts.]
Fig. 17.12 depicts the high-level idea of group rebalancing: given an input key, the hash result
is split into two parts, a chunk id and an 8-bit bin id. A chunk contains 64 different groups and 256
bins (i.e. any given bin can map to 4 distinct groups). When a key is inserted, the bin id is
computed, for example in Fig. 17.12 bin_id = 2, and since each bin can be mapped to one of four
different groups (2 bits of storage), the four possible mappings are evaluated and the one
that results in the most balanced key distribution across these four groups is selected; the
mapping result is stored in these two bits.
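The following sketch only illustrates the hash split described above (a chunk id, an 8-bit bin id and a 2-bit per-bin group choice); the candidate_groups table is an assumption for illustration and does not reflect the library's internal layout:

#include <stdint.h>

#define GROUPS_PER_CHUNK 64   /* per the description above */

/* bin_choice[] holds the 2-bit choice stored per bin, selecting one of the
 * four candidate groups listed in candidate_groups[bin][0..3]. */
static uint32_t
key_to_group(uint32_t hash, const uint8_t *bin_choice,
             const uint8_t (*candidate_groups)[4])
{
    uint32_t bin_id = hash & 0xff;        /* low 8 bits: bin id */
    uint32_t chunk_id = hash >> 8;        /* remaining bits: chunk id */
    uint32_t choice = bin_choice[bin_id] & 0x3;

    return chunk_id * GROUPS_PER_CHUNK + candidate_groups[bin_id][choice];
}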
17.6 References
1. EFD is based on collaborative research work between Intel and Carnegie Mellon University (CMU); interested readers can refer to the paper "Scaling Up Clustered Network Appliances with ScaleBricks", Dong Zhou et al., SIGCOMM 2015 (http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p241.pdf) for more information.
EIGHTEEN
MEMBERSHIP LIBRARY
18.1 Introduction
The DPDK Membership Library provides an API for DPDK applications to insert a new member,
delete an existing member, or query the existence of a member in a given set, or a group of
sets. For the case of a group of sets, the library will return not only whether the element has
been inserted before in one of the sets but also which set it belongs to. The Membership Library
is an extension and generalization of a traditional filter structure (for example Bloom Filter
[Member-bloom]) that has multiple usages in a wide variety of workloads and applications. In
general, the Membership Library is a data structure that provides a “set-summary” on whether
a member belongs to a set, and as discussed in detail later, there are two advantages of using
such a set-summary rather than operating on a “full-blown” complete list of elements: first, it
has a much smaller storage requirement than storing the whole list of elements themselves,
and secondly checking an element membership (or other operations) in this set-summary is
much faster than checking it for the original full-blown complete list of elements.
We use the term “Set-Summary” in this guide to refer to the space-efficient, probabilistic mem-
bership data structure that is provided by the library. A membership test for an element will
return the set this element belongs to or that the element is “not-found” with very high prob-
ability of accuracy. Set-summary is a fundamental data aggregation component that can be
used in many network (and other) applications. It is a crucial structure to address performance
and scalability issues of diverse network applications including overlay networks, data-centric
networks, flow table summaries, network statistics and traffic monitoring. A set-summary is
useful for applications who need to include a list of elements while a complete list requires
too much space and/or too much processing cost. In these situations, the set-summary works
as a lossy hash-based representation of a set of members. It can dramatically reduce space
requirement and significantly improve the performance of set membership queries at the cost
of introducing a very small membership test error probability.
There are various usages for a Membership Library in a very large set of applications and
workloads. Interested readers can refer to [Member-survey] for a survey of possible networking
usages. The above figure provides a small set of examples of using the Membership Library:
• Sub-figure (a) depicts a distributed web cache architecture where a collection of proxies
attempt to share their web caches (cached from a set of back-end web servers) to pro-
vide faster responses to clients, and the proxies use the Membership Library to share
summaries of what web pages/objects they are caching. With the Membership Library,
a proxy receiving an http request will inquire the set-summary to find its location and
quickly determine whether to retrieve the requested web page from a nearby proxy or
from a back-end web server.
[Figure: Example usages of the Membership Library, e.g. encoding IDs into a packet set-summary (setsum) and summarizing flow keys against packet matching-criteria lists.]
• Sub-figure (b) depicts another example for using the Membership Library to prevent rout-
ing loops which is typically done using slow TTL countdown and dropping packets when
TTL expires. As shown in Sub-figure (b), an embedded set-summary in the packet
header itself can be used to summarize the set of nodes a packet has gone through,
and each node upon receiving a packet can check whether its id is a member of the set
of visited nodes, and if it is, then a routing loop is detected.
• Sub-Figure (c) presents another usage of the Membership Library to load-balance flows
to worker threads with in-order guarantee where a set-summary is used to query if a
packet belongs to an existing flow or a new flow. Packets belonging to a new flow are
forwarded to the current least loaded worker thread, while those belonging to an existing
flow are forwarded to the pre-assigned thread to guarantee in-order processing.
• Sub-figure (d) highlights yet another usage example in the database domain, where a set-
summary is used to determine joins between sets: instead of creating a join by comparing
each element of a set against the elements of a different set, the join is done on the
summaries, since they can efficiently encode the members of a given set.
The Membership Library is a configurable library that is optimized to cover set membership function-
ality for both single-set and multi-set scenarios. Two set-summary schemes are presented
including (a) vector of Bloom Filters and (b) Hash-Table based set-summary schemes with
and without false negative probability. This guide first briefly describes these different types of
set-summaries, usage examples for each, and then it highlights the Membership Library API.
A Bloom Filter (BF) is a well-known space-efficient probabilistic data structure that answers set
membership queries with some probability of false positives and zero false negatives; a query for
an element returns either "possibly in a set" (with very high probability) or "definitely not in a set".
The BF is a method for representing a set of n elements (for example flow keys in network
applications domain) to support membership queries. The idea of BF is to allocate a bit-vector
v with m bits, which are initially all set to 0. Then it chooses k independent hash functions
h1, h2, ... hk with hash values range from 0 to m-1 to perform hashing calculations on each
element to be inserted. Every time an element X is inserted into the set, the bits
at positions h1(X), h2(X), ... hk(X) in v are set to 1 (any particular bit might be set to 1
multiple times for multiple different inserted elements). Given a query for any element Y, the
bits at positions h1(Y), h2(Y), ... hk(Y) are checked. If any of them is 0, then Y is definitely
not in the set. Otherwise there is a high probability that Y is a member of the set with certain
false positive probability. As shown in the next equation, the false positive probability can be
made arbitrarily small by changing the number of hash functions (k) and the vector length (m).
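The equation referred to above is not reproduced in this text; it is the standard Bloom filter false positive estimate. For n inserted elements, a bit-vector of m bits and k hash functions, the false positive probability is approximately p = (1 - e^(-k*n/m))^k, which is minimized when k = (m/n) * ln 2.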
Without BF, an accurate membership testing could involve a costly hash table lookup and full
element comparison. The advantage of using a BF is to simplify the membership test into a
series of hash calculations and memory accesses for a small bit-vector, which can be easily
optimized. Hence the lookup throughput (set membership test) can be significantly faster than
a normal hash table lookup with element comparison.
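As a minimal generic sketch of the insert and query operations just described (this is not the Membership Library API; the bit-vector size, the number of hash functions and the use of rte_jhash with different seeds are assumptions):

#include <stdint.h>
#include <rte_jhash.h>

#define BF_BITS (1 << 20)   /* m: length of the bit-vector (assumption) */
#define BF_K    4           /* k: number of hash functions (assumption) */

static uint8_t bf_vector[BF_BITS / 8];

/* Insert element X: set bits h1(X) ... hk(X). */
static void
bf_insert(const void *key, uint32_t key_len)
{
    uint32_t i;

    for (i = 0; i < BF_K; i++) {
        uint32_t bit = rte_jhash(key, key_len, i) % BF_BITS; /* seed i acts as hash i */
        bf_vector[bit / 8] |= 1 << (bit % 8);
    }
}

/* Query element Y: if any bit h1(Y) ... hk(Y) is 0, Y is definitely not in the set. */
static int
bf_maybe_contains(const void *key, uint32_t key_len)
{
    uint32_t i;

    for (i = 0; i < BF_K; i++) {
        uint32_t bit = rte_jhash(key, key_len, i) % BF_BITS;

        if (!(bf_vector[bit / 8] & (1 << (bit % 8))))
            return 0;   /* definitely not in the set */
    }
    return 1;           /* possibly in the set (false positives possible) */
}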
[Figure: A BF embedded in the packet header encodes the IDs of the nodes the packet has visited.]
BF is used for applications that need only one set, and the membership of elements is checked
against the BF. The example discussed in the above figure is one example of potential applica-
tions that uses only one set to capture the node IDs that have been visited so far by the packet.
Each node will then check this embedded BF in the packet header for its own id, and if the BF
indicates that the current node is definitely not in the set then a loop-free route is guaranteed.
[Figure: Vector of Bloom Filters (vBF): hashing of an element with h1, h2 ... hk happens once, and lookup/insertion is applied to the whole vector of BFs.]
Lookup/Insertion is done in the series of BFs, one by one or can be optimized to do in parallel.
To support membership tests for both multiple sets and a single set, the library implements a
Vector Bloom Filter (vBF) scheme. vBF basically composes multiple Bloom filters into a vector
of Bloom filters. The membership test is conducted on all of the Bloom filters concurrently to
determine which set(s) the element belongs to, or none of them. The basic idea of vBF is shown in
the above figure, where an element is used to address multiple Bloom filters concurrently and
the Bloom filter index(es) with a hit is returned.
[Figure: vBF used to check a flow key against the per-worker flow sets.]
As previously mentioned, there are many usages of such structures. vBF is used for appli-
cations that need to check membership against multiple sets simultaneously. The example
shown in the above figure uses a set to capture all flows being assigned for processing at
a given worker thread. Upon receiving a packet the vBF is used to quickly figure out if this
packet belongs to a new flow so as to be forwarded to the current least loaded worker thread,
or otherwise it should be queued for an existing thread to guarantee in-order processing (i.e.
the property of vBF to indicate right away that a given flow is a new one or not is critical to
minimize response time latency).
It should be noted that vBF can be implemented using a set of single bloom filters with sequen-
tial lookup of each BF. However, being able to concurrently search all set-summaries is a big
throughput advantage. In the library, certain parallelism is realized by the implementation of
checking all bloom filters together.
Hash-table based set-summary (HTSS) is another scheme in the membership library. Cuckoo
filter [Member-cfilter] is an example of HTSS. HTSS supports multi-set membership testing like
vBF does. However, while vBF is better for a small number of targets, HTSS is more suitable
and can easily outperform vBF when the number of sets is large, since HTSS uses a single
hash table for membership testing while vBF requires testing a series of Bloom Filters each
corresponding to one set. As a result, generally speaking vBF is more adequate for the case
of a small limited number of sets while HTSS should be used with a larger number of sets.
Packet Payload
HTSS
Signatures
target 1 for Match 2
Signatures
target 2 for
Match 1
As shown in the above figure, attack signature matching where each set represents a certain
signature length (for correctness of this example, an attack signature should not be a subset
of another one) in the payload is a good example for using HTSS with 0% false negative (i.e.,
when an element returns not found, it has a 100% certainty that it is not a member of any set).
The packet inspection application benefits from knowing right away that the current payload
does not match any attack signatures in the database to establish its legitimacy, otherwise a
deep inspection of the packet is needed.
HTSS employs a similar but simpler data structure to a traditional hash table, and the major
difference is that HTSS stores only the signatures but not the full keys/elements which can
significantly reduce the footprint of the table. Along with the signature, HTSS also stores a
value to indicate the target set. When looking up an element, the element is hashed and the
HTSS is addressed to retrieve the signature stored. If the signature matches then the value is
retrieved corresponding to the index of the target set which the element belongs to. Because
signatures can collide, HTSS can still have a false positive probability. Furthermore, if elements
are allowed to be overwritten or evicted when the hash table becomes full, it will also have a
false negative probability. We discuss this case in the next section.
As previously mentioned, traditional set-summaries (e.g. Bloom Filters) do not have a false
negative probability, i.e., it is 100% certain when an element returns “not to be present” for a
given set. However, the Membership Library also supports a set-summary probabilistic data
structure based on HTSS which allows for false negative probability.
In HTSS, when the hash table becomes full, keys/elements will fail to be added into the table
and the hash table has to be resized to accommodate these new elements, which can be
expensive. However, if we allow new elements to overwrite or evict existing elements (as a
cache typically does), then the resulting set-summary will begin to have false negative proba-
bility. This is because the element that was evicted from the set-summary may still be present
in the target set. For subsequent inquiries the set-summary will falsely report the element not
being in the set, hence having a false negative probability.
The major usage of HTSS with false negative is to use it as a cache for distributing elements to
different target sets. By allowing HTSS to evict old elements, the set-summary can keep track
of the most recent elements (i.e. active) as a cache typically does. Old inactive elements (in-
frequently used elements) will automatically and eventually get evicted from the set-summary.
It is worth noting that the set-summary still has a false positive probability, which means the
application either can tolerate a certain rate of false positives or has a fall-back path for when a
false positive happens.
Fig. 18.7: Using HTSS with False Negatives for Wild Card Classification
HTSS with false negatives (i.e. a cache) also has a wide set of applications. For example, wild
card flow classification (e.g. ACL rules), highlighted in the above figure, is an example of such an
application. In that case each target set represents a sub-table with rules defined by a certain
flow mask. The flow masks are non-overlapping, and for flows matching more than one rule
only the highest priority one is inserted in the corresponding sub-table (interested readers can
refer to the Open vSwitch (OvS) design of Mega Flow Cache (MFC) [Member-OvS] for further
details). Typically the rules will have a large number of distinct unique masks and hence, a
large number of target sets each corresponding to one mask. Because the active set of flows
varies widely based on the network traffic, HTSS with false negative will act as a cache for
<flowid, target ACL sub-table> pair for the current active set of flows. When a miss occurs (as
shown in red in the above figure) the sub-tables will be searched sequentially one by one for
a possible match, and when found the flow key and target sub-table will be inserted into the
set-summary (i.e. cache insertion) so subsequent packets from the same flow don’t incur the
overhead of the sequential search of sub-tables.
The design goal of the Membership Library API is to be as generic as possible to support all
the different types of set-summaries we discussed in previous sections and beyond. Funda-
mentally, the APIs need to include creation, insertion, deletion, and lookup.
vBF does not support deletion (see footnote 1). An error code of -EINVAL will be returned.
18.5 References
1. A traditional Bloom filter does not support proactive deletion. Supporting proactive deletion requires additional implementation and performance overhead.
NINETEEN
LPM LIBRARY
The DPDK LPM library component implements the Longest Prefix Match (LPM) table search
method for 32-bit keys that is typically used to find the best route match in IP forwarding appli-
cations.
The main configuration parameter for LPM component instances is the maximum number of
rules to support. An LPM prefix is represented by a pair of parameters (32-bit key, depth), with
depth in the range of 1 to 32. An LPM rule is represented by an LPM prefix and some user
data associated with the prefix. The prefix serves as the unique identifier of the LPM rule. In
this implementation, the user data is 1-byte long and is called next hop, in correlation with its
main use of storing the ID of the next hop in a routing table entry.
The main methods exported by the LPM component are listed below; a usage sketch follows the list:
• Add LPM rule: The LPM rule is provided as input. If there is no rule with the same prefix
present in the table, then the new rule is added to the LPM table. If a rule with the same
prefix is already present in the table, the next hop of the rule is updated. An error is
returned when there is no available rule space left.
• Delete LPM rule: The prefix of the LPM rule is provided as input. If a rule with the
specified prefix is present in the LPM table, then it is removed.
• Lookup LPM key: The 32-bit key is provided as input. The algorithm selects the rule that
represents the best match for the given key and returns the next hop of that rule. In the
case that there are multiple rules present in the LPM table that have the same 32-bit key,
the algorithm picks the rule with the highest depth as the best match rule, which means
that the rule has the highest number of most significant bits matching between the input
key and the rule key.
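A minimal usage sketch of these methods with the rte_lpm API follows; the table sizes, socket id, route and next hop value are illustrative assumptions:

#include <rte_lpm.h>

static void
lpm_example(void)
{
    struct rte_lpm_config config = {
        .max_rules = 1024,      /* maximum number of rules (assumption) */
        .number_tbl8s = 256,    /* number of tbl8s (assumption) */
        .flags = 0,
    };
    struct rte_lpm *lpm = rte_lpm_create("example_lpm", 0 /* socket */, &config);
    uint32_t rule_ip = ((uint32_t)192 << 24) | (168 << 16);                  /* 192.168.0.0 */
    uint32_t pkt_ip  = ((uint32_t)192 << 24) | (168 << 16) | (1 << 8) | 1;   /* 192.168.1.1 */
    uint32_t next_hop;

    if (lpm == NULL)
        return;

    /* Add LPM rule: 192.168.0.0/16 -> next hop 5. */
    rte_lpm_add(lpm, rule_ip, 16, 5);

    /* Lookup LPM key: returns 0 on a hit and fills in the next hop (here 5). */
    if (rte_lpm_lookup(lpm, pkt_ip, &next_hop) == 0) {
        /* forward using next_hop */
    }

    rte_lpm_free(lpm);
}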
The current implementation uses a variation of the DIR-24-8 algorithm that trades memory
usage for improved LPM lookup speed. The algorithm allows the lookup operation to be per-
formed with typically a single memory read access. In the statistically rare case when the best
match rule has a depth bigger than 24, the lookup operation requires two memory read
accesses. Therefore, the performance of the LPM lookup operation is greatly influenced by
whether the specific memory location is present in the processor cache or not.
finished or not respectively. The depth or length of the rule is the number of bits of the rule that
is stored in a specific entry.
An entry in a tbl8 contains the following fields:
• next hop
• valid
• valid group
• depth
Next hop and depth contain the same information as in the tbl24. The two flags show whether
the entry and the table are valid respectively.
The other main data structure is a table containing the main information about the rules (IP
and next hop). This is a higher level table, used for different things:
• Check whether a rule already exists or not, prior to addition or deletion, without having to
actually perform a lookup.
• When deleting, to check whether there is a rule containing the one that is to be deleted.
This is important, since the main data structure will have to be updated accordingly.
19.2.1 Addition
When adding a rule, there are different possibilities. If the rule’s depth is exactly 24 bits, then:
• Use the rule (IP address) as an index to the tbl24.
• If the entry is invalid (i.e. it doesn’t already contain a rule) then set its next hop to its value,
the valid flag to 1 (meaning this entry is in use), and the external entry flag to 0 (meaning
the lookup process ends at this point, since this is the longest prefix that matches).
If the rule’s depth is exactly 32 bits, then:
• Use the first 24 bits of the rule as an index to the tbl24.
• If the entry is invalid (i.e. it doesn’t already contain a rule) then look for a free tbl8, set
the index to the tbl8 to this value, the valid flag to 1 (meaning this entry is in use), and the
external entry flag to 1 (meaning the lookup process must continue since the rule hasn’t
been explored completely).
If the rule’s depth is any other value, prefix expansion must be performed. This means the rule
is copied to all the entries (as long as they are not in use) which would also cause a match.
As a simple example, let’s assume the depth is 20 bits. This means that there are 2^(24 -
20) = 16 different combinations of the first 24 bits of an IP address that would cause a match.
Hence, in this case, we copy the exact same entry to every position indexed by one of these
combinations.
By doing this we ensure that during the lookup process, if a rule matching the IP address exists,
it is found in either one or two memory accesses, depending on whether we need to move to
the next table or not. Prefix expansion is one of the keys of this algorithm, since it improves the
speed dramatically by adding redundancy.
19.2.2 Lookup
There are different things that limit the number of rules that can be added. The first one is the
maximum number of rules, which is a parameter passed through the API. Once this number is
reached, it is not possible to add any more rules to the routing table unless one or more are
removed.
The second reason is an intrinsic limitation of the algorithm. As explained before, to avoid high
memory consumption, the number of tbl8s is limited at compilation time (this value is by default
256). If we exhaust tbl8s, we won’t be able to add any more rules. How many of them are
necessary for a specific routing table is hard to determine in advance.
A tbl8 is consumed whenever we have a new rule with depth bigger than 24, and the first 24
bits of this rule are not the same as the first 24 bits of a rule previously added. If they are, then
the new rule will share the same tbl8 as the previous one, since the only difference between
the two rules is within the last byte.
With the default value of 256, we can have up to 256 rules longer than 24 bits that differ on
their first three bytes. Since routes longer than 24 bits are unlikely, this shouldn’t be a problem
in most setups. Even if it is, however, the number of tbl8s can be modified.
The LPM algorithm is used to implement Classless Inter-Domain Routing (CIDR) strategy used
by routers implementing IPv4 forwarding.
TWENTY
LPM6 LIBRARY
The LPM6 (LPM for IPv6) library component implements the Longest Prefix Match (LPM) ta-
ble search method for 128-bit keys that is typically used to find the best match route in IPv6
forwarding applications.
This is a modification of the algorithm used for IPv4 (see Implementation Details). In this case,
instead of using two levels, one with a tbl24 and a second with a tbl8, 14 levels are used.
The implementation can be seen as a multi-bit trie where the stride or number of bits inspected
on each level varies from level to level. Specifically, 24 bits are inspected on the root node, and
the remaining 104 bits are inspected in groups of 8 bits. This effectively means that the trie
has 14 levels at the most, depending on the rules that are added to the table.
The algorithm allows the lookup operation to be performed with a number of memory accesses
that directly depends on the length of the rule and whether there are other rules with bigger
depths and the same key in the data structure. It can vary from 1 to 14 memory accesses, with
5 being the average value for the lengths that are most commonly used in IPv6.
The main data structure is built using the following elements:
• A table with 2^24 entries
• A number of tables, configurable by the user through the API, with 2^8 entries
The first table, called tbl24, is indexed using the first 24 bits of the IP address to be looked up,
while the rest of the tables, called tbl8s, are indexed using the rest of the bytes of the IP
address, in chunks of 8 bits. This means that depending on the outcome of trying to match
the IP address of an incoming packet to the rule stored in the tbl24 or the subsequent tbl8s we
might need to continue the lookup process in deeper levels of the tree.
Similar to the limitation presented in the algorithm for IPv4, to store every possible IPv6 rule,
we would need a table with 2^128 entries. This is not feasible due to resource restrictions.
By splitting the process in different tables/levels and limiting the number of tbl8s, we can greatly
reduce memory consumption while maintaining a very good lookup speed (one memory ac-
cess per level).
An entry in a table contains the following fields:
• next hop / index to the tbl8
• depth of the rule (length)
• valid flag
• valid group flag
• external entry flag
The first field can either contain a number indicating the tbl8 in which the lookup process should
continue or the next hop itself if the longest prefix match has already been found. The depth
or length of the rule is the number of bits of the rule that is stored in a specific entry. The flags
are used to determine whether the entry/table is valid or not and whether the search process
has finished or not, respectively.
Both types of tables share the same structure.
The other main data structure is a table containing the main information about the rules (IP,
next hop and depth). This is a higher level table, used for different things:
• Check whether a rule already exists or not, prior to addition or deletion, without having to
actually perform a lookup.
• When deleting, to check whether there is a rule containing the one that is to be deleted. This
is important, since the main data structure will have to be updated accordingly.
20.1.2 Addition
When adding a rule, there are different possibilities. If the rule’s depth is exactly 24 bits, then:
• Use the rule (IP address) as an index to the tbl24.
• If the entry is invalid (i.e. it doesn’t already contain a rule) then set its next hop to its value,
the valid flag to 1 (meaning this entry is in use), and the external entry flag to 0 (meaning
the lookup process ends at this point, since this is the longest prefix that matches).
If the rule’s depth is bigger than 24 bits but a multiple of 8, then:
• Use the first 24 bits of the rule as an index to the tbl24.
• If the entry is invalid (i.e. it doesn’t already contain a rule) then look for a free tbl8, set
the index to the tbl8 to this value, the valid flag to 1 (meaning this entry is in use), and the
external entry flag to 1 (meaning the lookup process must continue since the rule hasn’t
been explored completely).
• Use the following 8 bits of the rule as an index to the next tbl8.
• Repeat the process until the tbl8 at the right level (depending on the depth) has been
reached and fill it with the next hop, setting the next entry flag to 0.
If the rule’s depth is any other value, prefix expansion must be performed. This means the rule
is copied to all the entries (as long as they are not in use) which would also cause a match.
As a simple example, let’s assume the depth is 20 bits. This means that there are 2^(24-20)
= 16 different combinations of the first 24 bits of an IP address that would cause a match.
Hence, in this case, we copy the exact same entry to every position indexed by one of these
combinations.
By doing this we ensure that during the lookup process, if a rule matching the IP address exists,
it is found in, at the most, 14 memory accesses, depending on how many times we need to
move to the next table. Prefix expansion is one of the keys of this algorithm, since it improves
the speed dramatically by adding redundancy.
Prefix expansion can be performed at any level. So, for example, if the depth is 34 bits, it will
be performed at the third level (second tbl8-based level).
20.1.3 Lookup
There are different things that limit the number of rules that can be added. The first one is the
maximum number of rules, which is a parameter passed through the API. Once this number is
reached, it is not possible to add any more rules to the routing table unless one or more are
removed.
The second limitation is in the number of tbl8s available. If we exhaust tbl8s, we won’t be able
to add any more rules. How to know how many of them are necessary for a specific routing
table is hard to determine in advance.
In this algorithm, the maximum number of tbl8s a single rule can consume is 13, which is the
number of levels minus one, since the first three bytes are resolved in the tbl24. However:
• Typically, on IPv6, routes are not longer than 48 bits, which means rules usually take up
to 3 tbl8s.
As explained in the LPM for IPv4 algorithm, it is possible and very likely that several rules will
share one or more tbl8s, depending on what their first bytes are. If they share the same first 24
bits, for instance, the tbl8 at the second level will be shared. This might happen again in deeper
levels, so, effectively, two 48 bit-long rules may use the same three tbl8s if the only difference
is in their last byte.
The number of tbl8s is a parameter exposed to the user through the API in this version of the
algorithm, due to its impact on memory consumption and on the number of rules that can be added
to the LPM table. One tbl8 consumes 1 kilobyte of memory.
The LPM algorithm is used to implement the Classless Inter-Domain Routing (CIDR) strategy
used by routers implementing IP forwarding.
TWENTYONE
FLOW CLASSIFICATION LIBRARY
DPDK provides a Flow Classification library that provides the ability to classify an input packet
by matching it against a set of Flow rules.
The initial implementation supports counting of IPv4 5-tuple packets which match a particular
Flow rule only.
Please refer to the Generic flow API (rte_flow) for more information.
The Flow Classification library uses the librte_table API for managing Flow rules and
matching packets against the Flow rules. The library is table agnostic and can use the following
tables: Access Control List, Hash and Longest Prefix Match (LPM). The Access
Control List table is used in the initial implementation.
Please refer to the Packet Framework for more information on librte_table.
DPDK provides an Access Control List library that provides the ability to classify an input packet
based on a set of classification rules.
Please refer to the Packet Classification and Access Control library for more information on
librte_acl.
There is also a Flow Classify sample application which demonstrates the use of the Flow
Classification Library APIs.
Please refer to the ../sample_app_ug/flow_classify for more information on the
flow_classify sample application.
21.1 Overview
/**
* Flow classifier free
*
* @param cls
* Handle to flow classifier instance
* @return
* 0 on success, error code otherwise
*/
int
rte_flow_classifier_free(struct rte_flow_classifier *cls);
/**
* Flow classify table create
*
* @param cls
* Handle to flow classifier instance
* @param params
* Parameters for flow_classify table creation
* @param table_id
* Table ID. Valid only within the scope of table IDs of the current
* classifier. Only returned after a successful invocation.
* @return
* 0 on success, error code otherwise
*/
int
rte_flow_classify_table_create(struct rte_flow_classifier *cls,
struct rte_flow_classify_table_params *params,
uint32_t *table_id);
/**
* Add a flow classify rule to the flow_classifier table.
*
* @param[in] cls
* Flow classifier handle
* @param[in] table_id
* id of table
* @param[in] attr
* Flow rule attributes
* @param[in] pattern
* Pattern specification (list terminated by the END pattern item).
* @param[in] actions
* Associated actions (list terminated by the END pattern item).
* @param[out] error
* Perform verbose error reporting if not NULL. Structure
* initialised in case of error only.
* @return
* A valid handle in case of success, NULL otherwise.
*/
struct rte_flow_classify_rule *
rte_flow_classify_table_entry_add(struct rte_flow_classifier *cls,
uint32_t table_id,
const struct rte_flow_attr *attr,
const struct rte_flow_item pattern[],
const struct rte_flow_action actions[],
struct rte_flow_error *error);
/**
* Delete a flow classify rule from the flow_classifier table.
*
* @param[in] cls
* Flow classifier handle
* @param[in] table_id
* id of table
* @param[in] rule
* Flow classify rule
* @return
*   0 on success, error code otherwise.
*/
int
rte_flow_classify_table_entry_delete(struct rte_flow_classifier *cls,
uint32_t table_id,
struct rte_flow_classify_rule *rule);
/**
* Query flow classifier for given rule.
*
* @param[in] cls
* Flow classifier handle
* @param[in] table_id
* id of table
* @param[in] pkts
* Pointer to packets to process
* @param[in] nb_pkts
* Number of packets to process
* @param[in] rule
* Flow classify rule
* @param[in] stats
* Flow classify stats
*
* @return
* 0 on success, error code otherwise.
*/
int
rte_flow_classifier_query(struct rte_flow_classifier *cls,
uint32_t table_id,
struct rte_mbuf **pkts,
const uint16_t nb_pkts,
struct rte_flow_classify_rule *rule,
struct rte_flow_classify_stats *stats);
/** CPU socket ID where memory for the flow classifier and its */
/** elements (tables) should be allocated */
int socket_id;
};
struct rte_flow_classifier {
/* Input parameters */
char name[RTE_FLOW_CLASSIFIER_MAX_NAME_SZ];
int socket_id;
enum rte_flow_classify_table_type type;
/* Internal tables */
struct rte_table tables[RTE_FLOW_CLASSIFY_TABLE_MAX];
uint32_t num_tables;
uint16_t nb_pkts;
struct rte_flow_classify_table_entry
*entries[RTE_PORT_IN_BURST_SIZE_MAX];
} __rte_cache_aligned;
To create an ACL table the rte_table_acl_params structure must be initialised and as-
signed to arg_create in the rte_flow_classify_table_params structure.
struct rte_table_acl_params {
/** Name */
const char *name;
The fields for the ACL rule must also be initialised by the application.
An ACL table can be added to the Classifier for each ACL rule, for example another table
could be added for the IPv6 5-tuple rule.
The library currently supports three IPv4 5-tuple flow patterns, for UDP, TCP and SCTP.
/* Pattern for IPv4 5-tuple UDP filter */
static enum rte_flow_item_type pattern_ntuple_1[] = {
RTE_FLOW_ITEM_TYPE_ETH,
RTE_FLOW_ITEM_TYPE_IPV4,
RTE_FLOW_ITEM_TYPE_UDP,
RTE_FLOW_ITEM_TYPE_END,
};
The internal function flow_classify_parse_flow parses the IPv4 5-tuple pattern, at-
tributes and actions and returns the 5-tuple data in the rte_eth_ntuple_filter structure.
static int
flow_classify_parse_flow(
const struct rte_flow_attr *attr,
const struct rte_flow_item pattern[],
const struct rte_flow_action actions[],
struct rte_flow_error *error)
struct classify_rules {
enum rte_flow_classify_rule_type type;
union {
struct rte_flow_classify_ipv4_5tuple ipv4_5tuple;
} u;
};
struct rte_flow_classify {
uint32_t id; /* unique ID of classify object */
It then calls the table[table_id].ops.f_add API to add the rule to the ACL table.
The rte_flow_classifier_query API is used to find packets which match a given
Flow rule in the table. This API calls the flow_classify_run internal function which calls the
table[table_id].ops.f_lookup API to see if any packets in a burst match any of the
Flow rules in the table. The meta data for the highest priority rule matched for each packet
is returned in the entries array in the rte_flow_classify object. The internal function
action_apply implements the Count action which is used to return data which matches a
particular Flow rule.
The rte_flow_classifier_query API uses the following structures to return data to the applica-
tion.
/** IPv4 5-tuple data */
struct rte_flow_classify_ipv4_5tuple {
uint32_t dst_ip; /**< Destination IP address in big endian. */
uint32_t dst_ip_mask; /**< Mask of destination IP address. */
uint32_t src_ip; /**< Source IP address in big endian. */
uint32_t src_ip_mask; /**< Mask of source IP address. */
uint16_t dst_port; /**< Destination port in big endian. */
uint16_t dst_port_mask; /**< Mask of destination port. */
uint16_t src_port; /**< Source Port in big endian. */
uint16_t src_port_mask; /**< Mask of source port. */
uint8_t proto; /**< L4 protocol. */
uint8_t proto_mask; /**< Mask of L4 protocol. */
};
/**
* Flow stats
*
* For the count action, stats can be returned by the query API.
*
* Storage for stats is provided by the application.
*
*
*/
struct rte_flow_classify_stats {
void *stats;
};
struct rte_flow_classify_5tuple_stats {
TWENTYTWO
PACKET DISTRIBUTOR LIBRARY
The DPDK Packet Distributor library is a library designed to be used for dynamic load balancing
of traffic while supporting single packet at a time operation. When using this library, the logical
cores in use are to be considered in two roles: firstly a distributor lcore, which is responsible
for load balancing or distributing packets, and a set of worker lcores which are responsible for
receiving the packets from the distributor and operating on them. The model of operation is
shown in the diagram below.
There are two modes of operation of the API in the distributor library, one which sends one
packet at a time to workers using 32-bits for flow_id, and an optimized mode which sends
bursts of up to 8 packets at a time to workers, using 15 bits of flow_id. The mode is selected
by the type field in the rte_distributor_create() function.
The distributor core does the majority of the processing for ensuring that packets are fairly
shared among workers. The operation of the distributor is as follows:
1. Packets are passed to the distributor component by having the distributor lcore thread
call the “rte_distributor_process()” API
2. The worker lcores all share a single cache line with the distributor core in order to pass
messages and packets to and from the worker. The process API call will poll all the
worker cache lines to see what workers are requesting packets.
3. As workers request packets, the distributor takes packets from the set of packets passed
in and distributes them to the workers. As it does so, it examines the “tag” – stored in the
RSS hash field in the mbuf – for each packet and records what tags are being processed
by each worker.
4. If the next packet in the input set has a tag which is already being processed by a worker,
then that packet will be queued up for processing by that worker and given to it in prefer-
ence to other packets when that worker next makes a request for work. This ensures that
no two packets with the same tag are processed in parallel, and that all packets with the
same tag are processed in input order.
5. Once all input packets passed to the process API have either been distributed to workers
or been queued up for a worker which is processing a given tag, then the process API
returns to the caller.
Other functions which are available to the distributor lcore are:
• rte_distributor_returned_pkts()
• rte_distributor_flush()
• rte_distributor_clear_returns()
Of these the most important API call is “rte_distributor_returned_pkts()” which should only be
called on the lcore which also calls the process API. It returns to the caller all packets which
have finished processing by all worker cores. Within this set of returned packets, all packets
sharing the same tag will be returned in their original order.
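A minimal distributor-lcore sketch using these two calls is shown below; the burst size and what is done with the returned packets are assumptions:

#include <rte_distributor.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32   /* assumed RX burst size */

/* Distributor lcore: hand a burst of packets to the workers and collect
 * the packets the workers have finished with. */
static void
distribute_burst(struct rte_distributor *d, struct rte_mbuf **rx_pkts,
                 unsigned int nb_rx)
{
    struct rte_mbuf *done[BURST_SIZE * 2];
    int nb_done;

    /* Returns once all packets are distributed or queued for a worker
     * already handling the same tag. */
    rte_distributor_process(d, rx_pkts, nb_rx);

    /* Collect fully processed packets; per-tag order is preserved. */
    nb_done = rte_distributor_returned_pkts(d, done, BURST_SIZE * 2);
    (void)nb_done;   /* e.g. transmit or free the returned packets here */
}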
NOTE: If worker lcores buffer up packets internally for transmission in bulk afterwards, the
packets sharing a tag will likely get out of order. Once a worker lcore requests a new packet,
the distributor assumes that it has completely finished with the previous packet and therefore
that additional packets with the same tag can safely be distributed to other workers – who may
then flush their buffered packets sooner and cause packets to get out of order.
NOTE: No packet ordering guarantees are made about packets which do not share a common
packet tag.
Using the process and returned_pkts API, the following application workflow can be used, while
allowing packet order within a packet flow – identified by a tag – to be maintained.
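A minimal sketch of the distributor lcore side of that workflow is shown below; it is illustrative only, and the dist handle, the RX port number and BURST_SIZE are assumed to have been set up by the application.
#define BURST_SIZE 32
struct rte_mbuf *bufs[BURST_SIZE];
struct rte_mbuf *done[BURST_SIZE];

while (1) {
        /* Receive a burst of packets and hand them to the workers. */
        const uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);
        if (nb_rx > 0)
                rte_distributor_process(dist, bufs, nb_rx);

        /* Collect packets that the workers have finished with; packets
         * sharing a tag come back in their original order. */
        const int nb_done = rte_distributor_returned_pkts(dist, done, BURST_SIZE);
        /* ... transmit or free the nb_done packets in done[] ... */
}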
The flush and clear_returns API calls, mentioned previously, are likely of less use than the
process and returned_pkts APIs, and are principally provided to aid in unit testing of the li-
brary. Descriptions of these functions and their use can be found in the DPDK API Reference
document.
Worker cores are the cores which do the actual manipulation of the packets distributed by the
packet distributor. Each worker calls “rte_distributor_get_pkt()” API to request a new packet
when it has finished processing the previous one. [The previous packet should be returned to
the distributor component by passing it as the final parameter to this API call.]
Since it may be desirable to vary the number of worker cores, depending on the traffic load, i.e.
to save power at times of lighter load, it is possible to have a worker stop processing packets
by calling “rte_distributor_return_pkt()” to indicate that it has finished the current packet and
does not want a new one.
TWENTYTHREE
REORDER LIBRARY
The Reorder Library provides a mechanism for reordering mbufs based on their sequence
number.
23.1 Operation
The reorder library is essentially a buffer that reorders mbufs. The user inserts out of order
mbufs into the reorder buffer and pulls in-order mbufs from it.
At a given time, the reorder buffer contains mbufs whose sequence numbers are inside the
sequence window. The sequence window is determined by the minimum sequence number
and the number of entries that the buffer was configured to hold. For example, given a reorder
buffer with 200 entries and a minimum sequence number of 350, the sequence window has
low and high limits of 350 and 550 respectively.
When inserting mbufs, the reorder library differentiates between valid, early and late mbufs
depending on the sequence number of the inserted mbuf:
• valid: the sequence number is inside the window.
• late: the sequence number is outside the window and less than the low limit.
• early: the sequence number is outside the window and greater than the high limit.
The reorder buffer directly returns late mbufs and tries to accommodate early mbufs.
The reorder library is implemented as a pair of buffers, which are referred to as the Order buffer
and the Ready buffer.
On an insert call, valid mbufs are inserted directly into the Order buffer and late mbufs are
returned to the user with an error.
In the case of early mbufs, the reorder buffer will try to move the window (incrementing the
minimum sequence number) so that the mbuf becomes a valid one. To that end, mbufs in the
Order buffer are moved into the Ready buffer. Any mbufs that have not arrived yet are ignored
and therefore will become late mbufs. This means that as long as there is room in the Ready
buffer, the window will be moved to accommodate early mbufs that would otherwise be outside
the reordering window.
For example, assuming that we have a buffer of 200 entries with a 350 minimum sequence
number, and we need to insert an early mbuf with 565 sequence number. That means that we
would need to move the windows at least 15 positions to accommodate the mbuf. The reorder
buffer would try to move mbufs from at least the next 15 slots in the Order buffer to the Ready
buffer, as long as there is room in the Ready buffer. Any gaps in the Order buffer at that point
are skipped, and those packets will be reported as late packets when they arrive. The process
of moving packets to the Ready buffer continues beyond the minimum required until a gap, i.e.
missing mbuf, in the Order buffer is encountered.
When draining mbufs, the reorder buffer would return mbufs in the Ready buffer first and then
from the Order buffer until a gap is found (mbufs that have not arrived yet).
An application using the DPDK packet distributor could make use of the reorder library to
transmit packets in the same order they were received.
A basic packet distributor use case would consist of a distributor with multiple worker cores.
The processing of packets by the workers is not guaranteed to be in order, hence a reorder
buffer can be used to order as many packets as possible.
In such a scenario, the distributor assigns a sequence number to mbufs before delivering them
to the workers. As the workers finish processing the packets, the distributor inserts those mbufs
into the reorder buffer and finally transmits the drained mbufs.
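A minimal sketch of this scenario is shown below; the buffer size, burst size, sequence counter and port/queue numbers are illustrative assumptions.
struct rte_reorder_buffer *b;
b = rte_reorder_create("PKT_RO", rte_socket_id(), 8192);

/* Distributor side: assign a sequence number to each mbuf before it is
 * handed to the workers. */
pkts[i]->seqn = seqn++;

/* As the workers hand packets back, insert them into the reorder buffer;
 * a negative return value indicates a late packet outside the window. */
if (rte_reorder_insert(b, pkts[i]) < 0)
        rte_pktmbuf_free(pkts[i]);

/* Drain in-order mbufs and transmit them. */
struct rte_mbuf *ordered[32];
unsigned int nb_out = rte_reorder_drain(b, ordered, 32);
rte_eth_tx_burst(0, 0, ordered, nb_out);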
NOTE: Currently the reorder buffer is not thread safe so the same thread is responsible for
inserting and draining mbufs.
TWENTYFOUR
IP FRAGMENTATION AND REASSEMBLY LIBRARY
The IP Fragmentation and Reassembly Library implements IPv4 and IPv6 packet fragmenta-
tion and reassembly.
Packet fragmentation routines divide an input packet into a number of fragments. Both the
rte_ipv4_fragment_packet() and rte_ipv6_fragment_packet() functions assume that the input mbuf
data points to the start of the IP header of the packet (i.e. the L2 header is already stripped out).
To avoid copying the actual packet's data, a zero-copy technique is used (rte_pktmbuf_attach()).
For each fragment two new mbufs are created:
• Direct mbuf – mbuf that will contain L3 header of the new fragment.
• Indirect mbuf – mbuf that is attached to the mbuf with the original packet. Its data field
points to the start of the original packet's data plus the fragment offset.
The L3 header is then copied from the original mbuf into the ‘direct’ mbuf and updated to reflect
the new fragmented status. Note that for IPv4, the header checksum is not recalculated and is set to
zero.
Finally, the ‘direct’ and ‘indirect’ mbufs for each fragment are linked together via the mbuf’s next field to
compose a packet for the new fragment.
The caller can explicitly specify the mempools from which ‘direct’ and ‘indirect’ mbufs should
be allocated.
For more information about direct and indirect mbufs, refer to Direct and Indirect Buffers.
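For illustration, a minimal IPv4 fragmentation call might look as follows; the MTU value, the output array size and the two mempools are application-supplied assumptions, and m must already point to the IP header.
#define FRAG_MTU 1500
#define MAX_FRAGS 4
struct rte_mbuf *frags[MAX_FRAGS];

int nb_frags = rte_ipv4_fragment_packet(m, frags, MAX_FRAGS, FRAG_MTU,
        direct_pool, indirect_pool);
if (nb_frags < 0)
        rte_pktmbuf_free(m); /* packet could not be fragmented */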
Fragment table maintains information about already received fragments of the packet.
Each IP packet is uniquely identified by triple <Source IP address>, <Destination IP address>,
<ID>.
Note that all update/lookup operations on the Fragment Table are not thread safe. So if different
execution contexts (threads/processes) will access the same table simultaneously, then some
external syncing mechanism has to be provided.
Internally Fragment table is a simple hash table. The basic idea is to use two hash functions
and <bucket_entries> * associativity. This provides 2 * <bucket_entries> possible locations
in the hash table for each key. When the collision occurs and all 2 * <bucket_entries> are
occupied, instead of reinserting existing keys into alternative locations, ip_frag_tbl_add() just
returns a failure.
Also, entries that reside in the table longer than <max_cycles> are considered invalid, and
may be removed/replaced by new ones.
Note that reassembly demands a lot of mbufs to be allocated. At any given time up to (2 *
bucket_entries * RTE_LIBRTE_IP_FRAG_MAX * <maximum number of mbufs per packet>)
can be stored inside Fragment Table waiting for remaining fragments.
TWENTYFIVE
GENERIC RECEIVE OFFLOAD LIBRARY
Generic Receive Offload (GRO) is a widely used SW-based offloading technique to reduce
per-packet processing overhead. It gains performance by reassembling small packets into
large ones. To enable more flexibility to applications, DPDK implements GRO as a standalone
library. Applications explicitly use the GRO library to merge small packets into large ones.
The GRO library assumes all input packets have correct checksums. In addition, the GRO
library doesn’t re-calculate checksums for merged packets. If input packets are IP fragmented,
the GRO library assumes they are complete packets (i.e. with L4 headers).
Currently, the GRO library implements TCP/IPv4 packet reassembly.
The GRO library provides two reassembly modes: lightweight and heavyweight mode. If ap-
plications want to merge packets in a simple way, they can use the lightweight mode API. If
applications want more fine-grained controls, they can choose the heavyweight mode API.
Before performing GRO, applications need to create a GRO context object by calling
rte_gro_ctx_create(). A GRO context object holds the reassembly tables of desired
GRO types. Note that all update/lookup operations on the context object are not thread safe.
So if different processes or threads want to access the same context object simultaneously,
some external syncing mechanisms must be used.
Once the GRO context is created, applications can then use the rte_gro_reassemble()
function to merge packets. In each invocation, rte_gro_reassemble() tries to merge input
packets with the packets in the reassembly tables. If an input packet is an unsupported GRO
type, or other errors happen (e.g. SYN bit is set), rte_gro_reassemble() returns the packet
to applications. Otherwise, the input packet is either merged or inserted into a reassembly
table.
When applications want to get GRO processed packets, they need to use
rte_gro_timeout_flush() to flush them from the tables manually.
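A minimal heavyweight-mode sketch is shown below; the table sizes, the burst variables and the flush timeout are illustrative assumptions.
struct rte_gro_param gro_param = {
        .gro_types = RTE_GRO_TCP_IPV4,
        .max_flow_num = 1024,
        .max_item_per_flow = 32,
        .socket_id = rte_socket_id(),
};
void *gro_ctx = rte_gro_ctx_create(&gro_param);

/* Try to merge each received burst into the reassembly tables;
 * unmerged or unsupported packets are returned in pkts[]. */
nb_pkts = rte_gro_reassemble(pkts, nb_pkts, gro_ctx);

/* Periodically flush TCP/IPv4 packets that have stayed in the tables
 * longer than the given number of TSC cycles. */
uint16_t nb_out = rte_gro_timeout_flush(gro_ctx, rte_get_tsc_hz() / 1000,
        RTE_GRO_TCP_IPV4, out_pkts, MAX_OUT_PKTS);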
TCP/IPv4 GRO supports merging small TCP/IPv4 packets into large ones, using a table struc-
ture called the TCP/IPv4 reassembly table.
A TCP/IPv4 reassembly table includes a “key” array and an “item” array. The key array keeps
the criteria to merge packets and the item array keeps the packet information.
Each key in the key array points to an item group, which consists of packets which have the
same criteria values but can’t be merged. A key in the key array includes two parts:
• criteria: the criteria to merge packets. If two packets can be merged, they must have
the same criteria values.
• start_index: the item array index of the first packet in the item group.
Each element in the item array keeps the information of a packet. An item in the item array
mainly includes three parts:
• firstseg: the mbuf address of the first segment of the packet.
• lastseg: the mbuf address of the last segment of the packet.
• next_pkt_index: the item array index of the next packet in the same item group.
TCP/IPv4 GRO uses next_pkt_index to chain the packets that have the same cri-
teria value but can’t be merged together.
An incoming packet is processed in the following steps:
1. Check whether the packet should be processed. Packets of an unsupported GRO type, or
with control bits such as SYN set, are returned to the application without being processed.
2. Traverse the key array to find a key which has the same criteria value as the incoming
packet. If found, go to the next step. Otherwise, insert a new key and a new item for the
packet.
3. Locate the first packet in the item group via start_index. Then traverse all packets
in the item group via next_pkt_index. If a packet is found which can be merged with
the incoming one, merge them together. If one isn’t found, insert the packet into this item
group. Note that merging two packets simply links them together via the mbuf’s next field.
When packets are flushed from the reassembly table, TCP/IPv4 GRO updates packet header
fields for the merged packets. Note that before reassembling the packet, TCP/IPv4 GRO
doesn’t check if the checksums of packets are correct. Also, TCP/IPv4 GRO doesn’t re-
calculate checksums for merged packets.
TWENTYSIX
GENERIC SEGMENTATION OFFLOAD LIBRARY
26.1 Overview
Generic Segmentation Offload (GSO) is a widely used software implementation of TCP Seg-
mentation Offload (TSO), which reduces per-packet processing overhead. Much like TSO,
GSO gains performance by enabling upper layer applications to process a smaller number of
large packets (e.g. MTU size of 64KB), instead of processing higher numbers of small packets
(e.g. MTU size of 1500B), thus reducing per-packet overhead.
For example, GSO allows guest kernel stacks to transmit over-sized TCP segments that far
exceed the kernel interface’s MTU; this eliminates the need to segment packets within the
guest, and improves the data-to-overhead ratio of both the guest-host link, and PCI bus. The
expectation of the guest network stack in this scenario is that segmentation of egress frames
will take place either in the NIC hardware or, where that hardware capability is unavailable, in
the host application or network stack.
Bearing that in mind, the GSO library enables DPDK applications to segment packets in soft-
ware. Note however, that GSO is implemented as a standalone library, and not via a ‘fallback’
mechanism (i.e. for when TSO is unsupported in the underlying hardware); that is, applica-
tions must explicitly invoke the GSO library to segment packets. The size of GSO segments
(segsz) is configurable by the application.
26.2 Limitations
1. The GSO library doesn’t check if input packets have correct checksums.
2. In addition, the GSO library doesn’t re-calculate checksums for segmented packets (that
task is left to the application).
3. IP fragments are unsupported by the GSO library.
4. The egress interface’s driver must support multi-segment packets.
5. Currently, the GSO library supports the following IPv4 packet types:
• TCP
• VxLAN
• GRE
See Supported GSO Packet Types for further details.
To reduce the number of expensive memcpy operations required when segmenting a packet,
the GSO library typically stores each segment that it creates as a two-part mbuf (technically,
this is termed a ‘two-segment’ mbuf; however, since the elements produced by the API are
also called ‘segments’, for clarity the term ‘part’ is used here instead).
The first part of each output segment is a direct mbuf and contains a copy of the original
packet’s headers, which must be prepended to each output segment. These headers are
copied from the original packet into each output segment.
The second part of each output segment represents a section of data from the original packet,
i.e. a data segment. Rather than copy the data directly from the original packet into the output
segment (which would impact performance considerably), the second part of each output seg-
ment is an indirect mbuf, which contains no actual data, but simply points to an offset within
the original packet.
The combination of the ‘header’ segment and the ‘data’ segment constitutes a single logical
output GSO segment of the original packet. This is illustrated in Fig. 26.1.
In one situation, the output segment may contain additional ‘data’ segments. This only occurs
when:
• the input packet on which GSO is to be performed is represented by a multi-segment
mbuf.
• the output segment is required to contain data that spans the boundaries between seg-
ments of the input multi-segment mbuf.
The GSO library traverses each segment of the input packet, and produces numerous output
segments; for optimal performance, the number of output segments is kept to a minimum.
Consequently, the GSO library maximizes the amount of data contained within each output
segment; i.e. each output segment contains segsz bytes of data. The only exception to this is in the
case of the very final output segment; if pkt_len is not an exact multiple of segsz, then the final
segment is smaller than the rest.
In order for an output segment to meet its MSS, it may need to include data from multiple input
segments. Due to the nature of indirect mbufs (each indirect mbuf can point to only one direct
mbuf), the solution here is to add another indirect mbuf to the output segment; this additional
segment then points to the next input segment. If necessary, this chaining process is repeated,
until the sum of all of the data ‘contained’ in the output segment reaches segsz. This ensures
that the amount of data contained within each output segment is uniform, with the possible
exception of the last segment, as previously described.
Fig. 26.2 illustrates an example of a three-part output segment. In this example, the output
segment needs to include data from the end of one input segment, and the beginning of an-
other. To achieve this, an additional indirect mbuf is chained to the second part of the output
segment, and is attached to the next input segment (i.e. it points to the data in the next input
segment).
TCP/IPv4 GSO supports segmentation of suitably large TCP/IPv4 packets, which may also
contain an optional VLAN tag.
VxLAN packets GSO supports segmentation of suitably large VxLAN packets, which contain
an outer IPv4 header, inner TCP/IPv4 headers, and optional inner and/or outer VLAN tag(s).
GRE GSO supports segmentation of suitably large GRE packets, which contain an outer IPv4
header, inner TCP/IPv4 headers, and an optional VLAN tag.
Note: An application may use the same pool for both direct and indirect buffers.
However, since indirect mbufs simply store a pointer, the application may reduce
its memory consumption by creating a separate memory pool, containing smaller
elements, for the indirect pool.
To segment an outgoing packet, an application must:
1. First create a GSO context (struct rte_gso_ctx); among other things, this contains:
• the size of each output segment, including packet headers and payload, measured
in bytes.
• the bit mask of required GSO types. The GSO library uses the
same macros as those that describe a physical device’s TX offloading
capabilities (i.e. DEV_TX_OFFLOAD_*_TSO) for gso_types. For exam-
ple, if an application wants to segment TCP/IPv4 packets, it should set
gso_types to DEV_TX_OFFLOAD_TCP_TSO. The only other values currently
supported for gso_types are DEV_TX_OFFLOAD_VXLAN_TNL_TSO and
DEV_TX_OFFLOAD_GRE_TNL_TSO; a combination of these macros is also allowed.
• a flag, that indicates whether the IPv4 headers of output segments should contain
fixed or incremental ID values.
2. Set the appropriate ol_flags in the mbuf.
• The GSO library uses the value of an mbuf’s ol_flags attribute to determine how
a packet should be segmented. It is the application’s responsibility to ensure that
these flags are set.
• For example, in order to segment TCP/IPv4 packets, the application should add the
PKT_TX_IPV4 and PKT_TX_TCP_SEG flags to the mbuf’s ol_flags.
• If checksum calculation in hardware is required, the application should also add the
PKT_TX_TCP_CKSUM and PKT_TX_IP_CKSUM flags.
3. Check if the packet should be processed. Packets with one of the following properties
are not processed and are returned immediately:
• Packet length is less than segsz (i.e. GSO is not required).
• Packet type is not supported by GSO library (see Supported GSO Packet Types).
• Application has not enabled GSO support for the packet type.
• Packet’s ol_flags have been incorrectly set.
4. Allocate space in which to store the output GSO segments. If the amount of space
allocated by the application is insufficient, segmentation will fail.
5. Invoke the GSO segmentation API, rte_gso_segment().
6. If required, update the L3 and L4 checksums of the newly-created segments. For tun-
neled packets, the outer IPv4 headers’ checksums should also be updated. Alternatively,
the application may offload checksum calculation to HW.
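The sequence above is sketched below. The context field names (direct_pool, indirect_pool, gso_types, gso_size, flag) and the exact form of the rte_gso_segment() call should be checked against rte_gso.h; the pool handles, segment size and array size are illustrative assumptions.
struct rte_gso_ctx gso_ctx = {
        .direct_pool = direct_pool,     /* headers are copied into these mbufs */
        .indirect_pool = indirect_pool, /* payload parts attach to the original */
        .gso_types = DEV_TX_OFFLOAD_TCP_TSO,
        .gso_size = 1400,               /* segsz, in bytes */
        .flag = 0,                      /* incremental IPv4 ID values */
};

/* Mark the packet so the library knows how to segment it. */
pkt->ol_flags |= PKT_TX_IPV4 | PKT_TX_TCP_SEG;

struct rte_mbuf *segs[64];
int nb_segs = rte_gso_segment(pkt, &gso_ctx, segs, 64);
if (nb_segs < 0)
        rte_pktmbuf_free(pkt); /* e.g. insufficient space for the segments */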
TWENTYSEVEN
THE LIBRTE_PDUMP LIBRARY
The librte_pdump library provides a framework for packet capturing in DPDK. The library
performs a complete copy of the Rx and Tx mbufs into a new mempool and hence slows down
the performance of the application, so it is recommended to use this library only for debugging
purposes.
The library provides the following APIs to initialize the packet capture framework, to enable or
disable the packet capture, and to uninitialize it:
• rte_pdump_init(): This API initializes the packet capture framework.
• rte_pdump_enable(): This API enables the packet capture on a given port and queue.
Note: The filter option in the API is a place holder for future enhancements.
• rte_pdump_enable_by_deviceid(): This API enables the packet capture on a
given device id (vdev name or pci address) and queue. Note: The filter option
in the API is a place holder for future enhancements.
• rte_pdump_disable(): This API disables the packet capture on a given port and
queue.
• rte_pdump_disable_by_deviceid(): This API disables the packet capture on a
given device id (vdev name or pci address) and queue.
• rte_pdump_uninit(): This API uninitializes the packet capture framework.
• rte_pdump_set_socket_dir(): This API sets the server and client socket paths.
Note: This API is not thread-safe.
27.1 Operation
The librte_pdump library works on a client/server model. The server is responsible for
enabling or disabling the packet capture and the clients are responsible for requesting the
enabling or disabling of the packet capture.
The packet capture framework, as part of its initialization, creates the pthread and the server
socket in the pthread. The application that calls the framework initialization will have the server
socket created, either under the path that the application has passed or under the default path
i.e. either /var/run/.dpdk for root user or ~/.dpdk for non root user.
Applications that request enabling or disabling of the packet capture will have the client socket
created either under the path that the application has passed or under the default path i.e.
either /var/run/.dpdk for root user or ~/.dpdk for non-root user to send the requests to
the server. The server socket will listen for client requests for enabling or disabling the packet
capture.
The library API rte_pdump_init() initializes the packet capture framework by creating the
pthread and the server socket. The server socket in the pthread context will be listening to the
client requests to enable or disable the packet capture.
The library APIs rte_pdump_enable() and rte_pdump_enable_by_deviceid() enable
the packet capture. On each call to these APIs, the library creates a separate client
socket, creates the “pdump enable” request and sends the request to the server. The server
that is listening on the socket will take the request and enable the packet capture by registering
the Ethernet RX and TX callbacks for the given port or device_id and queue combinations.
Then the server will mirror the packets to the new mempool and enqueue them to the rte_ring
that clients have passed to these APIs. The server also sends the response back to the client
about the status of the request that was processed. After the response is received from the
server, the client socket is closed.
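For illustration, a minimal enable sequence might look as follows; the ring and mempool names/sizes are assumptions, and the filter argument is the placeholder mentioned above.
/* Initialize the capture framework (server socket) in the primary application;
 * NULL selects the default socket directory. */
rte_pdump_init(NULL);

/* Client side: captured packets are mirrored into this ring and mempool. */
struct rte_ring *ring = rte_ring_create("pdump_ring", 4096,
        rte_socket_id(), RING_F_SP_ENQ | RING_F_SC_DEQ);
struct rte_mempool *mp = rte_pktmbuf_pool_create("pdump_pool", 8191, 256,
        0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

/* Enable capture of both RX and TX on port 0, queue 0. */
rte_pdump_enable(0, 0, RTE_PDUMP_FLAG_RXTX, ring, mp, NULL);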
The library APIs rte_pdump_disable() and rte_pdump_disable_by_deviceid() disable
the packet capture. On each call to these APIs, the library creates a separate client
socket, creates the “pdump disable” request and sends the request to the server. The server
that is listening on the socket will take the request and disable the packet capture by removing
the Ethernet RX and TX callbacks for the given port or device_id and queue combinations.
The server also sends the response back to the client about the status of the request that was
processed. After the response is received from the server, the client socket is closed.
The library API rte_pdump_uninit() uninitializes the packet capture framework by closing
the pthread and the server socket.
The library API rte_pdump_set_socket_dir() sets the given path as either the server socket
path or the client socket path based on the type argument of the API. If the given path is NULL, the
default path will be selected, i.e. either /var/run/.dpdk for root user or ~/.dpdk for non-
root user. Clients also need to call this API to set their server socket path if the server socket
path is different from default path.
The DPDK app/pdump tool is developed based on this library to capture packets in DPDK.
Users can use this as an example to develop their own packet capturing tools.
TWENTYEIGHT
MULTI-PROCESS SUPPORT
In the DPDK, multi-process support is designed to allow a group of DPDK processes to work
together in a simple transparent manner to perform packet processing, or other workloads. To
support this functionality, a number of additions have been made to the core DPDK Environ-
ment Abstraction Layer (EAL).
The EAL has been modified to allow different types of DPDK processes to be spawned, each
with different permissions on the hugepage memory used by the applications. For now, there
are two types of process specified:
• primary processes, which can initialize and which have full permissions on shared mem-
ory
• secondary processes, which cannot initialize shared memory, but can attach to pre-initialized
shared memory and create objects in it.
Standalone DPDK processes are primary processes, while secondary processes can only run
alongside a primary process or after a primary process has already configured the hugepage
shared memory for them.
To support these two process types, and other multi-process setups described later, two addi-
tional command-line parameters are available to the EAL:
• --proc-type: for specifying a given process instance as the primary or secondary
DPDK instance
• --file-prefix: to allow processes that do not want to co-operate to have different
memory regions
A number of example applications are provided that demonstrate how multiple DPDK pro-
cesses can be used together. These are more fully documented in the “Multi-process Sample
Application” chapter in the DPDK Sample Application’s User Guide.
The key element in getting a multi-process application working using the DPDK is to ensure that
memory resources are properly shared among the processes making up the multi-process ap-
plication. Once there are blocks of shared memory available that can be accessed by multiple
processes, then issues such as inter-process communication (IPC) become much simpler.
On application start-up in a primary or standalone process, the DPDK records to memory-
mapped files the details of the memory configuration it is using - hugepages in use, the virtual
addresses they are mapped at, the number of memory channels present, etc. When a sec-
ondary process is started, these files are read and the EAL recreates the same memory con-
figuration in the secondary process so that all memory zones are shared between processes
and all pointers to that memory are valid, and point to the same objects, in both processes.
Note: Refer to Multi-process Limitations for details of how Linux kernel Address-Space Layout
Randomization (ASLR) can affect memory sharing.
[Figure: memory sharing between a primary and a secondary DPDK process, where both processes map the same hugepage memory (struct rte_config, struct hugepage[], mbuf pool and IPC queues) alongside their own local data.]
The EAL also supports an auto-detection mode (set by the EAL --proc-type=auto flag),
whereby a DPDK process is started as a secondary instance if a primary instance is already
running.
DPDK multi-process support can be used to create a set of peer processes where each pro-
cess performs the same workload. This model is equivalent to having multiple threads each
running the same main-loop function, as is done in most of the supplied DPDK sample ap-
plications. In this model, the first of the processes spawned should be spawned using the
--proc-type=primary EAL flag, while all subsequent instances should be spawned using
the --proc-type=secondary flag.
The simple_mp and symmetric_mp sample applications demonstrate this usage model. They
are described in the “Multi-process Sample Application” chapter in the DPDK Sample Applica-
tion’s User Guide.
An alternative deployment model that can be used for multi-process applications is to have
a single primary process instance that acts as a load-balancer or server distributing received
packets among worker or client threads, which are run as secondary processes. In this case,
extensive use of rte_ring objects is made, which are located in shared hugepage memory.
The client_server_mp sample application shows this usage model. It is described in the “Multi-
process Sample Application” chapter in the DPDK Sample Application’s User Guide.
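The pattern is illustrated by the following sketch; the object names and sizes are illustrative.
struct rte_ring *r;
struct rte_mempool *mp;

if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
        /* Primary: create the shared objects in hugepage memory. */
        r = rte_ring_create("mp_ring", 4096, rte_socket_id(), 0);
        mp = rte_pktmbuf_pool_create("mp_pool", 8191, 256, 0,
                RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
} else {
        /* Secondary: the objects already exist, attach to them by name. */
        r = rte_ring_lookup("mp_ring");
        mp = rte_mempool_lookup("mp_pool");
}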
In addition to the above scenarios involving multiple DPDK processes working together, it is
possible to run multiple DPDK processes side-by-side, where those processes are all work-
ing independently. Support for this usage scenario is provided using the --file-prefix
parameter to the EAL.
By default, the EAL creates hugepage files on each hugetlbfs filesystem using the rtemap_X
filename, where X is in the range 0 to the maximum number of hugepages -1. Similarly, it cre-
ates shared configuration files, memory mapped in each process, using the /var/run/.rte_config
filename, when run as root (or $HOME/.rte_config when run as a non-root user; if filesystem
and device permissions are set up to allow this). The rte part of the filenames of each of the
above is configurable using the file-prefix parameter.
In addition to specifying the file-prefix parameter, any DPDK applications that are to be run
side-by-side must explicitly limit their memory use. This is done by passing the -m flag to
each process to specify how much hugepage memory, in megabytes, each process can use
(or passing --socket-mem to specify how much hugepage memory on each socket each
process can use).
Note: Independent DPDK instances running side-by-side on a single machine cannot share
any network ports. Any network ports being used by one process should be blacklisted in every
other process.
In the same way that it is possible to run independent DPDK applications side- by-side on a
single system, this can be trivially extended to multi-process groups of DPDK applications run-
ning side-by-side. In this case, the secondary processes must use the same --file-prefix
parameter as the primary process whose shared memory they are connecting to.
Note: All restrictions and issues with multiple independent DPDK processes running side-by-
side apply in this usage scenario also.
There are a number of limitations to what can be done when running DPDK multi-process
applications. Some of these are documented below:
• The multi-process feature requires that the exact same hugepage memory mappings be
present in all applications. The Linux security feature - Address-Space Layout Random-
ization (ASLR) can interfere with this mapping, so it may be necessary to disable this
feature in order to reliably run multi-process applications.
• All DPDK processes running as a single application and using shared memory must have
distinct coremask/corelist arguments. It is not possible to have a primary and secondary
instance, or two secondary instances, using any of the same logical cores. Attempting to
do so can cause corruption of memory pool caches, among other issues.
• The delivery of interrupts, such as Ethernet* device link status interrupts, does not work
in secondary processes. All interrupts are triggered inside the primary process only.
Any application needing interrupt notification in multiple processes should provide its
own mechanism to transfer the interrupt information from the primary process to any
secondary process that needs the information.
• The use of function pointers between multiple processes running based on different com-
piled binaries is not supported, since the location of a given function in one process may
be different from its location in a second. This prevents the librte_hash library from behav-
ing properly in a multi-process instance, since it uses a pointer to the hash function
internally.
To work around this issue, it is recommended that multi-process applications perform the
hash calculations by directly calling the hashing function from the code and then using the
rte_hash_add_with_hash()/rte_hash_lookup_with_hash() functions instead of the functions
which do the hashing internally, such as rte_hash_add()/rte_hash_lookup().
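A minimal sketch of this workaround is shown below, using rte_hash_add_key_with_hash() and rte_hash_lookup_with_hash() (the concrete names of the *_with_hash variants) together with a direct rte_jhash() call; the hash table h and the key are assumed to exist.
/* Compute the signature directly, without dereferencing the hash function
 * pointer stored inside the shared rte_hash structure. */
hash_sig_t sig = rte_jhash(&key, sizeof(key), 0);

/* Add and look up using the precomputed signature. */
int32_t ret = rte_hash_add_key_with_hash(h, &key, sig);
int32_t pos = rte_hash_lookup_with_hash(h, &key, sig);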
• Depending upon the hardware in use, and the number of DPDK processes used, it may
not be possible to have HPET timers available in each DPDK instance. The minimum
number of HPET comparators available to Linux* userspace can be just a single com-
parator, which means that only the first, primary DPDK process instance can open and
mmap /dev/hpet. If the number of required DPDK processes exceeds that of the number
of available HPET comparators, the TSC (which is the default timer in this release) must
be used as a time source across all processes instead of the HPET.
TWENTYNINE
KERNEL NIC INTERFACE
The DPDK Kernel NIC Interface (KNI) allows userspace applications access to the Linux*
control plane.
The benefits of using the DPDK KNI are:
• Faster than existing Linux TUN/TAP interfaces (by eliminating system calls and
copy_to_user()/copy_from_user() operations).
• Allows management of DPDK ports using standard Linux net tools such as ethtool, ifcon-
fig and tcpdump.
• Allows an interface with the kernel network stack.
The components of an application using the DPDK Kernel NIC Interface are shown in Fig. 29.1.
The KNI kernel loadable module provides support for two types of devices:
• A Miscellaneous device (/dev/kni) that:
– Creates net devices (via ioctl calls).
– For single kernel thread mode, maintains a kernel thread context shared by all KNI
instances (simulating the RX side of the net driver).
– For multiple kernel thread mode, maintains a kernel thread context for each KNI
instance (simulating the RX side of the net driver).
• Net device:
– Net functionality provided by implementing several operations such as netdev_ops,
header_ops, ethtool_ops that are defined by struct net_device, including support for
DPDK mbufs and FIFOs.
– The interface name is provided from userspace.
– The MAC address can be the real NIC MAC address or random.
The KNI interfaces are created by a DPDK application dynamically. The interface name and
FIFO details are provided by the application through an ioctl call using the rte_kni_device_info
struct which contains:
• The interface name.
• Physical addresses of the corresponding memzones for the relevant FIFOs.
• Mbuf mempool details, both physical and virtual (to calculate the offset for mbuf pointers).
• PCI information.
• Core affinity.
Refer to rte_kni_common.h in the DPDK source code for more details.
The physical addresses will be re-mapped into the kernel address space and stored in separate
KNI contexts.
The affinity of kernel RX thread (both single and multi-threaded modes) is controlled by
force_bind and core_id config parameters.
The KNI interfaces can be deleted by a DPDK application dynamically after being created.
Furthermore, all those KNI interfaces not deleted will be deleted on the release operation of
the miscellaneous device (when the DPDK application is closed).
To minimize the amount of DPDK code running in kernel space, the mbuf mempool is managed
in userspace only. The kernel module will be aware of mbufs, but all mbuf allocation and free
operations will be handled by the DPDK application only.
Fig. 29.2 shows a typical scenario with packets sent in both directions.
On the DPDK RX side, the mbuf is allocated by the PMD in the RX thread context. This thread
will enqueue the mbuf in the rx_q FIFO. The KNI thread will poll all KNI active devices for the
rx_q. If an mbuf is dequeued, it will be converted to a sk_buff and sent to the net stack via
netif_rx(). The dequeued mbuf must be freed, so the same pointer is sent back in the free_q
FIFO.
The RX thread, in the same main loop, polls this FIFO and frees the mbuf after dequeuing it.
For packet egress the DPDK application must first enqueue several mbufs to create an mbuf
cache on the kernel side.
The packet is received from the Linux net stack, by calling the kni_net_tx() callback. The mbuf
is dequeued (without waiting, due to the cache) and filled with data from the sk_buff. The sk_buff is
then freed and the mbuf sent in the tx_q FIFO.
The DPDK TX thread dequeues the mbuf and sends it to the PMD (via rte_eth_tx_burst()). It
then puts the mbuf back in the cache.
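The userspace side of this exchange can be sketched as follows; port_id, the kni handle and the burst size are illustrative assumptions, and error/overflow handling is omitted.
#define PKT_BURST 32
struct rte_mbuf *pkts[PKT_BURST];

/* NIC -> kernel: received mbufs are placed on the KNI rx_q FIFO. */
uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, pkts, PKT_BURST);
unsigned nb_kni_tx = rte_kni_tx_burst(kni, pkts, nb_rx);
/* ... free any mbufs that the FIFO could not accept ... */

/* kernel -> NIC: mbufs filled by kni_net_tx() are read from tx_q. */
unsigned nb_kni_rx = rte_kni_rx_burst(kni, pkts, PKT_BURST);
uint16_t nb_tx = rte_eth_tx_burst(port_id, 0, pkts, nb_kni_rx);

/* Service ifconfig/ethtool style requests arriving from the kernel side. */
rte_kni_handle_request(kni);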
29.6 Ethtool
Ethtool is a Linux-specific tool with corresponding support in the kernel where each net device
must register its own callbacks for the supported operations. The current implementation uses
the igb/ixgbe modified Linux drivers for ethtool support. Ethtool is not supported in i40e and
VMs (VF or EM devices).
Link state and MTU change are network interface specific operations usually done via ifconfig.
The request is initiated from the kernel side (in the context of the ifconfig process) and handled
by the user space DPDK application. The application polls the request, calls the application
handler and returns the response back into the kernel space.
The application handlers can be registered upon interface creation or explicitly regis-
tered/unregistered in runtime. This provides flexibility in multiprocess scenarios (where the
KNI is created in the primary process but the callbacks are handled in the secondary one).
The constraint is that a single process can register and handle the requests.
THIRTY
THREAD SAFETY OF DPDK FUNCTIONS
The DPDK is comprised of several libraries. Some of the functions in these libraries can be
safely called from multiple threads simultaneously, while others cannot. This section allows the
developer to take these issues into account when building their own application.
The run-time environment of the DPDK is typically a single thread per logical core. In some
cases, it is not only multi-threaded, but multi-process. Typically, it is best to avoid sharing data
structures between threads and/or processes where possible. Where this is not possible, then
the execution blocks must access the data in a thread-safe manner. Mechanisms such as
atomics or locking can be used that will allow execution blocks to operate serially. However,
this can have an effect on the performance of the application.
Applications operating in the data plane are performance sensitive but certain functions within
those libraries may not be safe to call from multiple threads simultaneously. The hash, LPM
and mempool libraries and RX/TX in the PMD are examples of this.
The hash and LPM libraries are, by design, thread unsafe in order to maintain performance.
However, if required the developer can add layers on top of these libraries to provide thread
safety. Locking is not needed in all situations, and in both the hash and LPM libraries, lookups
of values can be performed in parallel in multiple threads. Adding, removing or modifying
values, however, cannot be done in multiple threads without using locking when a single hash
or LPM table is accessed. Another alternative to locking would be to create multiple instances
of these tables allowing each thread its own copy.
The RX and TX of the PMD are the most critical aspects of a DPDK application and it is
recommended that no locking be used as it will impact performance. Note, however, that these
functions can safely be used from multiple threads when each thread is performing I/O on a
different NIC queue. If multiple threads are to use the same hardware queue on the same NIC
port, then locking, or some other form of mutual exclusion, is necessary.
The ring library is based on a lockless ring-buffer algorithm that maintains its original de-
sign for thread safety. Moreover, it provides high performance for either multi- or single-
consumer/producer enqueue/dequeue operations. The mempool library is based on the DPDK
lockless ring library and therefore is also multi-thread safe.
Outside of the performance sensitive areas described in Section 25.1, the DPDK provides a
thread-safe API for most other libraries. For example, malloc and memzone functions are safe
for use in multi-threaded and multi-process environments.
The setup and configuration of the PMD is not performance sensitive, but is not thread safe
either. It is possible that the multiple read/writes during PMD setup and configuration could be
corrupted in a multi-thread environment. Since this is not performance sensitive, the developer
can choose to add their own layer to provide thread-safe setup and configuration. It is expected
that, in most applications, the initial configuration of the network ports would be done by a
single thread at startup.
It is recommended that DPDK libraries are initialized in the main thread at application startup
rather than subsequently in the forwarding threads. However, the DPDK performs checks to
ensure that libraries are only initialized once. If initialization is attempted more than once, an
error is returned.
In the multi-process case, the configuration information of the shared memory will only be initialized
by the master process. Thereafter, both master and secondary processes can allocate/release
any memory objects that ultimately rely on rte_malloc or memzones.
The DPDK works almost entirely in Linux user space in polling mode. For certain infrequent
operations, such as receiving a PMD link status change notification, callbacks may be called
in an additional thread outside the main DPDK processing threads. These function callbacks
should avoid manipulating DPDK objects that are also managed by the normal DPDK threads,
and if they need to do so, it is up to the application to provide the appropriate locking or mutual
exclusion restrictions around those objects.
THIRTYONE
EVENT DEVICE LIBRARY
The DPDK Event device library is an abstraction that provides the application with features to
schedule events. This is achieved using the PMD architecture similar to the ethdev or cryptodev
APIs, which may already be familiar to the reader.
The eventdev framework introduces the event driven programming model. In a polling model,
lcores poll ethdev ports and associated Rx queues directly to look for a packet. By contrast
in an event driven model, lcores call the scheduler that selects packets for them based on
programmer-specified criteria. The Eventdev library adds support for an event driven program-
ming model, which offers applications automatic multicore scaling, dynamic load balancing,
pipelining, packet ingress order maintenance and synchronization services to simplify applica-
tion packet processing.
By introducing an event driven programming model, DPDK can support both polling and event
driven programming models for packet processing, and applications are free to choose what-
ever model (or combination of the two) best suits their needs.
Step-by-step instructions of the eventdev design are available in the API Walk-through section
later in this document.
The eventdev API represents each event with a generic struct, which contains a payload and
metadata required for scheduling by an eventdev. The rte_event struct is a 16 byte C struc-
ture, defined in libs/librte_eventdev/rte_eventdev.h.
The rte_event structure contains the following metadata fields, which the application fills in to
have the event scheduled as required:
• flow_id - The targeted flow identifier for the enq/deq operation.
• event_type - The source of this event, eg RTE_EVENT_TYPE_ETHDEV or CPU.
• sub_event_type - Distinguishes events inside the application, that have the same
event_type (see above)
• op - This field takes one of the RTE_EVENT_OP_* values, and tells the eventdev about
the status of the event - valid values are NEW, FORWARD or RELEASE.
• sched_type - Represents the type of scheduling that should be performed on this event,
valid values are the RTE_SCHED_TYPE_ORDERED, ATOMIC and PARALLEL.
• queue_id - The identifier for the event queue that the event is sent to.
• priority - The priority of this event, see RTE_EVENT_DEV_PRIORITY.
The rte_event struct contains a union for payload, allowing flexibility in what the actual event
being scheduled is. The payload is a union of the following:
• uint64_t u64
• void *event_ptr
• struct rte_mbuf *mbuf
These three items in a union occupy the same 64 bits at the end of the rte_event structure.
The application can utilize the 64 bits directly by accessing the u64 variable, while the event_ptr
and mbuf are provided as convenience variables. For example the mbuf pointer in the union
can be used to schedule a DPDK packet.
31.1.3 Queues
An event queue is a queue containing events that are scheduled by the event device. An event
queue contains events of different flows associated with scheduling types, such as atomic,
ordered, or parallel.
In this case, each stage has a specified scheduling type. The application configures each
queue for a specific type of scheduling, and just enqueues all events to the eventdev. An
example of a PMD of this type is the eventdev software PMD.
The Eventdev API supports the following scheduling types per queue:
• Atomic
• Ordered
• Parallel
Atomic, Ordered and Parallel are load-balanced scheduling types: the output of the queue can
be spread out over multiple CPU cores.
Atomic scheduling on a queue ensures that a single flow is not present on two different CPU
cores at the same time. Ordered allows sending all flows to any core, but the scheduler must
ensure that on egress the packets are returned to ingress order on downstream queue en-
queue. Parallel allows sending all flows to all CPU cores, without any re-ordering guarantees.
There is a SINGLE_LINK flag which allows an application to indicate that only one port will be
connected to a queue. Queues configured with the single-link flag follow a FIFO-like structure,
maintaining ordering, but can only be linked to a single port (see below for port
and queue linking details).
31.1.4 Ports
Ports are the points of contact between worker cores and the eventdev. The general use-case
will see one CPU core using one port to enqueue and dequeue events from an eventdev. Ports
are linked to queues in order to retrieve events from those queues (more details in Linking
Queues and Ports below).
This section will introduce the reader to the eventdev API, showing how to create and configure
an eventdev and use it for a two-stage atomic pipeline with a single core for TX. The diagram
below shows the final state of the application after this walk-through:
W1 W1
WN WN
Fig. 31.1: Sample eventdev usage, with RX, two atomic stages and a single-link to TX.
The eventdev library uses vdev options to add devices to the DPDK application. The --vdev
EAL option allows adding eventdev instances to your DPDK application, using the name of the
eventdev PMD as an argument.
For example, to create an instance of the software eventdev scheduler, the following vdev
arguments should be provided to the application EAL command line:
./dpdk_application --vdev="event_sw0"
In the following code, we configure an eventdev instance with 3 queues and 6 ports as follows.
The 3 queues consist of 2 Atomic and 1 Single-Link, while the 6 ports consist of 4 workers, 1
RX and 1 TX.
const struct rte_event_dev_config config = {
.nb_event_queues = 3,
.nb_event_ports = 6,
.nb_events_limit = 4096,
.nb_event_queue_flows = 1024,
.nb_event_port_dequeue_depth = 128,
.nb_event_port_enqueue_depth = 128,
};
int err = rte_event_dev_configure(dev_id, &config);
Once the eventdev itself is configured, the next step is to configure queues. This is done
by setting the appropriate values in a queue_conf structure, and calling the setup function.
Repeat this step for each queue, starting from 0 and ending at nb_event_queues -1 from
the event_dev config above.
struct rte_event_queue_conf atomic_conf = {
.schedule_type = RTE_SCHED_TYPE_ATOMIC,
.priority = RTE_EVENT_DEV_PRIORITY_NORMAL,
.nb_atomic_flows = 1024,
.nb_atomic_order_sequences = 1024,
};
int dev_id = 0;
int queue_id = 0;
int err = rte_event_queue_setup(dev_id, queue_id, &atomic_conf);
The remainder of this walk-through assumes that the queues are configured as follows:
• id 0, atomic queue #1
• id 1, atomic queue #2
• id 2, single-link queue
Once queues are set up successfully, create the ports as required. Each port should be set
up with its corresponding port_conf type, worker for worker cores, rx and tx for the RX and TX
cores:
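A sketch of one such configuration and setup call is shown below; the depth and threshold values are illustrative, and a similar rte_event_port_conf would be filled in for the rx and tx ports.
struct rte_event_port_conf worker_conf = {
        .new_event_threshold = 4096,
        .dequeue_depth = 16,
        .enqueue_depth = 16,
};
int worker_port_id = 0;
int err = rte_event_port_setup(dev_id, worker_port_id, &worker_conf);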
The final step is to “wire up” the ports to the queues. After this, the eventdev is capable of
scheduling events, and when cores request work to do, the correct events are provided to that
core. Note that the RX core takes input from eg: a NIC so it is not linked to any eventdev
queues.
Linking all workers to atomic queues, and the TX core to the single-link queue can be achieved
like this:
uint8_t port_id = 0;
uint8_t atomic_qs[] = {0, 1};
uint8_t single_link_q = 2;
uint8_t tx_port_id = 5;
uint8_t priority = RTE_EVENT_DEV_PRIORITY_NORMAL;
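/* Illustrative continuation: link the four worker ports (0-3) to both
 * atomic queues, and the TX port to the single-link queue. The error
 * handling shown here is an assumption, not part of the library. */
for (port_id = 0; port_id < 4; port_id++) {
        if (rte_event_port_link(dev_id, port_id, atomic_qs, NULL, 2) != 2)
                rte_panic("Failed to link worker port %d\n", port_id);
}
if (rte_event_port_link(dev_id, tx_port_id, &single_link_q, &priority, 1) != 1)
        rte_panic("Failed to link TX port\n");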
A single function call tells the eventdev instance to start processing events. Note that all queues
must be linked to for the instance to start, as if any queue is not linked to, enqueuing to that
queue will cause the application to backpressure and eventually stall due to no space in the
eventdev.
int err = rte_event_dev_start(dev_id);
Now that the eventdev is set up, and ready to receive events, the RX core must enqueue some
events into the system for it to schedule. The events to be scheduled are ordinary DPDK
packets, received from an eth_rx_burst() as normal. The following code shows how those
packets can be enqueued into the eventdev:
const uint16_t nb_rx = rte_eth_rx_burst(eth_port, 0, mbufs, BATCH_SIZE);
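/* Illustrative continuation: wrap each received mbuf in an rte_event and
 * inject it into the first atomic queue (queue 0). rx_port_id - the eventdev
 * port used by the RX core - is assumed to have been set up earlier. */
struct rte_event ev[BATCH_SIZE];
for (uint16_t i = 0; i < nb_rx; i++) {
        ev[i].flow_id = mbufs[i]->hash.rss;
        ev[i].op = RTE_EVENT_OP_NEW;
        ev[i].sched_type = RTE_SCHED_TYPE_ATOMIC;
        ev[i].queue_id = 0;
        ev[i].event_type = RTE_EVENT_TYPE_ETHDEV;
        ev[i].sub_event_type = 0;
        ev[i].priority = RTE_EVENT_DEV_PRIORITY_NORMAL;
        ev[i].mbuf = mbufs[i];
}
const uint16_t nb_enq = rte_event_enqueue_burst(dev_id, rx_port_id, ev, nb_rx);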
Now that the RX core has injected events, there is work to be done by the workers. Note that
each worker will dequeue as many events as it can in a burst, process each one individually,
and then burst the packets back into the eventdev.
The worker can lookup the events source from event.queue_id, which should indicate to the
worker what workload needs to be performed on the event. Once done, the worker can update
the event.queue_id to a new value, to send the event to the next stage in the pipeline.
int timeout = 0;
struct rte_event events[BATCH_SIZE];
uint16_t nb_rx = rte_event_dequeue_burst(dev_id, worker_port_id, events, BATCH_SIZE, timeout);
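/* Illustrative continuation: process each event and forward it to the
 * next stage of the pipeline by updating queue_id and re-enqueuing. */
for (uint16_t i = 0; i < nb_rx; i++) {
        /* ... application-specific work on events[i].mbuf ... */
        events[i].queue_id++;              /* send to the next stage */
        events[i].op = RTE_EVENT_OP_FORWARD;
}
uint16_t nb_fwd = rte_event_enqueue_burst(dev_id, worker_port_id, events, nb_rx);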
Finally, when the packet is ready for egress or needs to be dropped, we need to inform the
eventdev that the packet is no longer being handled by the application. This can be done by
calling dequeue() or dequeue_burst(), which indicates that the previous burst of packets is no
longer in use by the application.
An event driven worker thread has following typical workflow on fastpath:
while (1) {
rte_event_dequeue_burst(...);
(event processing)
rte_event_enqueue_burst(...);
}
31.3 Summary
The eventdev library allows an application to easily schedule events as it requires, either using
a run-to-completion or pipeline processing model. The queues and ports abstract the logical
functionality of an eventdev, providing the application with a generic method to schedule events.
With the flexible PMD infrastructure, applications benefit from improvements in existing eventdevs
and the addition of new ones without modification.
THIRTYTWO
EVENT ETHERNET RX ADAPTER LIBRARY
The DPDK Eventdev API allows the application to use an event driven programming model for
packet processing. In this model, the application polls an event device port for receiving events
that reference packets instead of polling Rx queues of ethdev ports. Packet transfer between
ethdev and the event device can be supported in hardware or require a software thread to
receive packets from the ethdev port using ethdev poll mode APIs and enqueue these as
events to the event device using the eventdev API. Both transfer mechanisms may be present
on the same platform depending on the particular combination of the ethdev and the event
device.
The Event Ethernet Rx Adapter library is intended for the application code to configure both
transfer mechanisms using a common API. A capability API allows the eventdev PMD to adver-
tise features supported for a given ethdev and allows the application to perform configuration
as per supported features.
This section will introduce the reader to the adapter API. The application has to first instantiate
an adapter which is associated with a single eventdev, next the adapter instance is configured
with Rx queues that are either polled by a SW thread or linked using hardware support. Finally
the adapter is started.
For SW based packet transfers from ethdev to eventdev, the adapter uses a DPDK service
function and the application is also required to assign a core to the service function.
struct rte_event_dev_info dev_info;
struct rte_event_port_conf rx_p_conf;

rte_event_dev_info_get(dev_id, &dev_info);
rx_p_conf.new_event_threshold = dev_info.max_num_events;
rx_p_conf.dequeue_depth = dev_info.max_event_port_dequeue_depth;
rx_p_conf.enqueue_depth = dev_info.max_event_port_enqueue_depth;
err = rte_event_eth_rx_adapter_create(id, dev_id, &rx_p_conf);
If the application desires to have finer control of eventdev port allocation and
setup, it can use the rte_event_eth_rx_adapter_create_ext() function. The
rte_event_eth_rx_adapter_create_ext() function is passed a callback function.
The callback function is invoked if the adapter needs to use a service function and
needs to create an event port for it. The callback is expected to fill the struct
rte_event_eth_rx_adapter_conf structure passed to it.
queue_config.rx_queue_flags = 0;
if (cap & RTE_EVENT_ETH_RX_ADAPTER_CAP_OVERRIDE_FLOW_ID) {
ev.flow_id = 1;
queue_config.rx_queue_flags =
RTE_EVENT_ETH_RX_ADAPTER_QUEUE_FLOW_ID_VALID;
}
queue_config.ev = ev;
queue_config.servicing_weight = 1;
err = rte_event_eth_rx_adapter_queue_add(id,
eth_dev_id,
0, &queue_config);
If the adapter uses a service function, the application is required to assign a service core to
the service function as shown below.
uint32_t service_id;
if (rte_event_eth_rx_adapter_service_id_get(0, &service_id) == 0)
rte_service_map_lcore_set(service_id, RX_CORE_ID);
THIRTYTHREE
QUALITY OF SERVICE (QOS) FRAMEWORK
An example of a complex packet processing pipeline with QoS support is shown in the following
figure.
This pipeline can be built using reusable DPDK software libraries. The main blocks implement-
ing QoS in this pipeline are: the policer, the dropper and the scheduler. A functional description
of each block is provided in the following table.
The hierarchical scheduler block, when present, usually sits on the TX side just before the
transmission stage. Its purpose is to prioritize the transmission of packets from different users
and different traffic classes according to the policy specified by the Service Level Agreements
(SLAs) of each network node.
33.2.1 Overview
The hierarchical scheduler block is similar to the traffic manager block used by network proces-
sors that typically implement per flow (or per group of flows) packet queuing and scheduling. It
typically acts like a buffer that is able to temporarily store a large number of packets just before
their transmission (enqueue operation); as the NIC TX is requesting more packets for trans-
mission, these packets are later on removed and handed over to the NIC TX with the packet
selection logic observing the predefined SLAs (dequeue operation).
The hierarchical scheduler is optimized for a large number of packet queues. When only a
small number of queues are needed, message passing queues should be used instead of this
block. See Worst Case Scenarios for Performance for a more detailed discussion.
The scheduling hierarchy is shown in Fig. 33.3. The first level of the hierarchy is the Ethernet
TX port 1/10/40 GbE, with subsequent hierarchy levels defined as subport, pipe, traffic class
and queue.
Typically, each subport represents a predefined group of users, while each pipe represents an
individual user/subscriber. Each traffic class is the representation of a different traffic type with
specific loss rate, delay and jitter requirements, such as voice, video or data transfers. Each
queue hosts packets from one or multiple connections of the same type belonging to the same
user.
The functionality of each hierarchical level is detailed in the following table.
The rte_sched.h file contains configuration functions for port, subport and pipe.
The port scheduler enqueue API is very similar to the API of the DPDK PMD TX function.
int rte_sched_port_enqueue(struct rte_sched_port *port, struct rte_mbuf **pkts, uint32_t n_pkts);
The port scheduler dequeue API is very similar to the API of the DPDK PMD RX function.
int rte_sched_port_dequeue(struct rte_sched_port *port, struct rte_mbuf **pkts, uint32_t n_pkts);
Usage Example
/* File "application.c" */
#define N_PKTS_RX 64
#define N_PKTS_TX 48
#define NIC_RX_PORT 0
#define NIC_RX_QUEUE 0
#define NIC_TX_PORT 1
#define NIC_TX_QUEUE 0
/* Initialization */
<initialization code>
/* Runtime */
while (1) {
/* Read packets from NIC RX queue */
nb_pkts = rte_eth_rx_burst(NIC_RX_PORT, NIC_RX_QUEUE, pkts_rx, N_PKTS_RX);

/* Hand packets to the scheduler */
rte_sched_port_enqueue(port, pkts_rx, nb_pkts);

/* Pull packets from the scheduler */
nb_pkts = rte_sched_port_dequeue(port, pkts_tx, N_PKTS_TX);

/* Write packets to NIC TX queue */
rte_eth_tx_burst(NIC_TX_PORT, NIC_TX_QUEUE, pkts_tx, nb_pkts);
}
33.2.4 Implementation
Running enqueue and dequeue operations for the same output port from different cores is likely
to cause a significant impact on the scheduler's performance and is therefore not recommended.
The port enqueue and dequeue operations share access to the following data structures:
1. Packet descriptors
2. Queue table
3. Queue storage area
4. Bitmap of active queues
The expected drop in performance is due to:
1. Need to make the queue and bitmap operations thread safe, which requires either using
locking primitives for access serialization (for example, spinlocks/ semaphores) or using
atomic primitives for lockless access (for example, Test and Set, Compare And Swap, and
so on). The impact is much higher in the former case.
2. Ping-pong of cache lines storing the shared data structures between the cache hierar-
chies of the two cores (done transparently by the MESI protocol cache coherency CPU
hardware).
Therefore, the scheduler enqueue and dequeue operations have to be run from the same
thread, which allows the queues and the bitmap operations to be non-thread safe and keeps
the scheduler data structures internal to the same core.
Performance Scaling
Scaling up the number of NIC ports simply requires a proportional increase in the number of
CPU cores to be used for traffic scheduling.
Enqueue Pipeline
The sequence of steps per packet is:
1. Access the mbuf to read the fields that identify the destination queue for the packet (port,
subport, traffic class and queue within traffic class), typically set by the classification stage.
2. Access the queue structure to identify the write location in the queue array. If the queue
is full, then the packet is discarded.
3. Access the queue array location to store the packet (i.e. write the mbuf pointer).
Note the strong data dependency between these steps: steps 2 and 3 cannot start before the
results of steps 1 and 2, respectively, become available, which prevents the processor's out-of-order
execution engine from providing any significant performance optimization.
Given the high rate of input packets and the large number of queues, it is expected that the
data structures accessed to enqueue the current packet are not present in the L1 or L2 data
cache of the current core, so the above 3 memory accesses would (on average) result in L1
and L2 data cache misses. Three L1/L2 cache misses per packet is not acceptable for
performance reasons.
The workaround is to prefetch the required data structures in advance. The prefetch operation
has an execution latency during which the processor should not attempt to access the data
structure currently under prefetch, so the processor should execute other work. The only other
work available is to execute different stages of the enqueue sequence of operations on other
input packets, thus resulting in a pipelined implementation for the enqueue operation.
Fig. 33.5 illustrates a pipelined implementation for the enqueue operation with 4 pipeline stages
and each stage executing 2 different input packets. No input packet can be part of more than
one pipeline stage at a given time.
Fig. 33.5: Prefetch Pipeline for the Hierarchical Scheduler Enqueue Operation
The congestion management scheme implemented by the enqueue pipeline described above
is very basic: packets are enqueued until a specific queue becomes full, then all the packets
destined to the same queue are dropped until packets are consumed (by the dequeue oper-
ation). This can be improved by enabling RED/WRED as part of the enqueue pipeline which
looks at the queue occupancy and packet priority in order to yield the enqueue/drop decision for
a specific packet (as opposed to enqueuing all packets / dropping all packets indiscriminately).
The sequence of steps to schedule the next packet from the current pipe is:
1. Identify the next active pipe using the bitmap scan operation, prefetch pipe.
2. Read pipe data structure. Update the credits for the current pipe and its subport. Identify
the first active traffic class within the current pipe, select the next queue using WRR,
prefetch queue pointers for all the 16 queues of the current pipe.
3. Read next element from the current WRR queue and prefetch its packet descriptor.
4. Read the packet length from the packet descriptor (mbuf structure). Based on the packet
length and the available credits (of current pipe, pipe traffic class, subport and subport
traffic class), take the go/no go scheduling decision for the current packet.
To avoid the cache misses, the above data structures (pipe, queue, queue array, mbufs) are
prefetched in advance of being accessed. The strategy of hiding the latency of the prefetch
operations is to switch from the current pipe (in grinder A) to another pipe (in grinder B) imme-
diately after a prefetch is issued for the current pipe. This gives enough time to the prefetch
operation to complete before the execution switches back to this pipe (in grinder A).
The dequeue pipe state machine exploits the presence of data in the processor cache; therefore,
it tries to send as many packets from the same pipe traffic class and pipe as possible (up to the
available packets and credits) before moving to the next active traffic class of the same pipe (if any) or to
another active pipe.
Fig. 33.6: Pipe Prefetch State Machine for the Hierarchical Scheduler Dequeue Operation
The output port is modeled as a conveyor belt of byte slots that need to be filled by the sched-
uler with data for transmission. For 10 GbE, there are 1.25 billion byte slots that need to be
filled by the port scheduler every second. If the scheduler is not fast enough to fill the slots, pro-
vided that enough packets and credits exist, then some slots will be left unused and bandwidth
will be wasted.
In principle, the hierarchical scheduler dequeue operation should be triggered by NIC TX.
Usually, once the occupancy of the NIC TX input queue drops below a predefined threshold,
the port scheduler is woken up (interrupt based or polling based, by continuously monitoring
the queue occupancy) to push more packets into the queue.
The scheduler needs to keep track of time advancement for the credit logic, which requires
credit updates based on time (for example, subport and pipe traffic shaping, traffic class upper
limit enforcement, and so on).
Every time the scheduler decides to send a packet out to the NIC TX for transmission, the
scheduler will increment its internal time reference accordingly. Therefore, it is convenient
to keep the internal time reference in units of bytes, where a byte signifies the time duration
required by the physical interface to send out a byte on the transmission medium. This way,
as a packet is scheduled for transmission, the time is incremented by (n + h), where n is the
packet length in bytes and h is the number of framing overhead bytes per packet.
The scheduler needs to align its internal time reference to the pace of the port conveyor belt.
The reason is to make sure that the scheduler does not feed the NIC TX with more bytes than
the line rate of the physical medium in order to prevent packet drop (by the scheduler, due to
the NIC TX input queue being full, or later on, internally by the NIC TX).
The scheduler reads the current time on every dequeue invocation. The CPU time stamp can
be obtained by reading either the Time Stamp Counter (TSC) register or the High Precision
Event Timer (HPET) register. The current CPU time stamp is converted from number of CPU
clocks to number of bytes: time_bytes = time_cycles / cycles_per_byte, where cycles_per_byte
is the amount of CPU cycles that is equivalent to the transmission time for one byte on the wire
(e.g. for a CPU frequency of 2 GHz and a 10 GbE port, cycles_per_byte = 1.6).
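As a minimal sketch of this conversion (illustrative only, not the library code; the helper name and the use of rte_get_tsc_hz()/rte_get_tsc_cycles() are assumptions):
#include <stdint.h>
#include <rte_cycles.h>

/* Convert a TSC time stamp from CPU cycles to byte-time for a 10 GbE port
 * (line rate = 1.25e9 bytes/second). */
static inline uint64_t
cycles_to_bytes(uint64_t time_cycles)
{
    /* e.g. 2 GHz CPU: cycles_per_byte = 2e9 / 1.25e9 = 1.6 */
    double cycles_per_byte = (double)rte_get_tsc_hz() / 1250000000.0;

    return (uint64_t)(time_cycles / cycles_per_byte);
}

/* Usage: uint64_t time_bytes = cycles_to_bytes(rte_get_tsc_cycles()); */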
The scheduler maintains an internal time reference of the NIC time. Whenever a packet is
scheduled, the NIC time is incremented with the packet length (including framing overhead).
On every dequeue invocation, the scheduler checks its internal reference of the NIC time
against the current time:
1. If NIC time is in the future (NIC time >= current time), no adjustment of NIC time is
needed. This means that scheduler is able to schedule NIC packets before the NIC
actually needs those packets, so the NIC TX is well supplied with packets;
2. If NIC time is in the past (NIC time < current time), then NIC time should be adjusted by
setting it to the current time. This means that the scheduler is not able to keep up with
the speed of the NIC byte conveyor belt, so NIC bandwidth is wasted due to poor packet
supply to the NIC TX.
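A small sketch of this realignment check (illustrative names; both values are in byte units):
#include <stdint.h>

static inline uint64_t
nic_time_resync(uint64_t nic_time, uint64_t current_time)
{
    /* NIC time in the past: the scheduler could not keep up with the byte
     * conveyor belt of the port, so realign NIC time to the current time. */
    if (nic_time < current_time)
        return current_time;

    /* NIC time in the future: the NIC TX is well supplied, no adjustment. */
    return nic_time;
}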
The scheduler round trip delay (SRTD) is the time (number of CPU cycles) between two con-
secutive examinations of the same pipe by the scheduler.
To keep up with the output port (that is, avoid bandwidth loss), the scheduler should be able to
schedule n packets faster than the same n packets are transmitted by NIC TX.
The scheduler needs to keep up with the rate of each individual pipe, as configured for the pipe
token bucket, assuming that no port oversubscription is taking place. This means that the size
of the pipe token bucket should be set high enough to prevent it from overflowing due to big
SRTD, as this would result in credit loss (and therefore bandwidth loss) for the pipe.
Credit Logic
Scheduling Decision
The scheduling decision to send next packet from (subport S, pipe P, traffic class TC, queue
Q) is favorable (packet is sent) when all the conditions below are met:
• Pipe P of subport S is currently selected by one of the port grinders;
• Traffic class TC is the highest priority active traffic class of pipe P;
• Queue Q is the next queue selected by WRR within traffic class TC of pipe P;
• Subport S has enough credits to send the packet;
• Subport S has enough credits for traffic class TC to send the packet;
• Pipe P has enough credits to send the packet;
• Pipe P has enough credits for traffic class TC to send the packet.
If all the above conditions are met, then the packet is selected for transmission and the nec-
essary credits are subtracted from subport S, subport S traffic class TC, pipe P, pipe P traffic
class TC.
Framing Overhead
As the greatest common divisor for all packet lengths is one byte, the unit of credit is selected
as one byte. The number of credits required for the transmission of a packet of n bytes is equal
to (n+h), where h is equal to the number of framing overhead bytes per packet.
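For example, assuming an Ethernet framing overhead of h = 24 bytes per packet (7-byte preamble, 1-byte start-of-frame delimiter, 4-byte FCS and 12-byte inter-frame gap), scheduling a 64-byte packet consumes 64 + 24 = 88 credits.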
Traffic Shaping
The traffic shaping for subport and pipe is implemented using a token bucket per subport/per
pipe. Each token bucket is implemented using one saturated counter that keeps track of the
number of available credits.
The token bucket generic parameters and operations are presented in Table 33.6 and Table
33.7.
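The following is an illustrative sketch (not the library code) of the per subport/pipe token bucket operations referred to by Table 33.6 and Table 33.7; the field names (tb_time, tb_period, tb_credits_per_period, tb_size, tb_credits) are assumptions, and time and credits are in byte units.
#include <stdint.h>

struct token_bucket {
    uint64_t tb_time;               /* time of the last credit update */
    uint64_t tb_period;             /* time between two credit updates */
    uint64_t tb_credits_per_period; /* credits added per update period */
    uint64_t tb_size;               /* bucket size (saturation limit) */
    uint64_t tb_credits;            /* currently available credits */
};

/* Credit update: add credits for the elapsed full periods, saturate at tb_size. */
static void
tb_credit_update(struct token_bucket *tb, uint64_t time)
{
    uint64_t n_periods = (time - tb->tb_time) / tb->tb_period;

    tb->tb_credits += n_periods * tb->tb_credits_per_period;
    if (tb->tb_credits > tb->tb_size)
        tb->tb_credits = tb->tb_size;
    tb->tb_time += n_periods * tb->tb_period;
}

/* Credit consumption on packet scheduling: n packet bytes plus h framing overhead bytes. */
static int
tb_credit_consume(struct token_bucket *tb, uint64_t pkt_len, uint64_t frame_overhead)
{
    uint64_t pkt_credits = pkt_len + frame_overhead;

    if (tb->tb_credits < pkt_credits)
        return 0; /* not enough credits: packet not scheduled */
    tb->tb_credits -= pkt_credits;
    return 1;
}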
Traffic Classes
Strict priority scheduling of traffic classes within the same pipe is implemented by the pipe
dequeue state machine, which selects the queues in ascending order. Therefore, queues 0..3
(associated with TC 0, highest priority TC) are handled before queues 4..7 (TC 1, lower priority
than TC 0), which are handled before queues 8..11 (TC 2), which are handled before queues
12..15 (TC 3, lowest priority TC).
The traffic classes at the pipe and subport levels are not traffic shaped, so there is no token
bucket maintained in this context. The upper limit for the traffic classes at the subport and
pipe levels is enforced by periodically refilling the subport / pipe traffic class credit counter, out
of which credits are consumed every time a packet is scheduled for that subport / pipe, as
described in Table 33.10 and Table 33.11.
Table 33.10: Subport/Pipe Traffic Class Upper Limit Enforcement Persistent Data Structure
# Subport or pipe field Unit Description
1 tc_time Bytes Time of the next update (upper limit refill) for the 4 TCs of the current subport / pipe. See Section Internal Time Reference for the explanation of why the time is maintained in byte units.
2 tc_period Bytes Time between two consecutive updates for the 4 TCs of the current subport / pipe. This is expected to be many times bigger than the typical value of the token bucket tb_period.
3 tc_credits_per_period Bytes Upper limit for the number of credits allowed to be consumed by the current TC during each enforcement period tc_period.
4 tc_credits Bytes Current upper limit for the number of credits that can be consumed by the current traffic class for the remainder of the current enforcement period.
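A minimal sketch of the upper limit enforcement using the fields from the table above (time and credits in bytes; structure and function names are illustrative):
#include <stdint.h>

struct tc_limit {
    uint64_t tc_time;               /* time of the next upper limit refill */
    uint64_t tc_period;             /* time between two consecutive refills */
    uint64_t tc_credits_per_period; /* credits allowed per enforcement period */
    uint64_t tc_credits;            /* credits left in the current period */
};

static void
tc_credit_update(struct tc_limit *tc, uint64_t time)
{
    if (time >= tc->tc_time) {
        /* Refill: reset the credit counter for the new enforcement period. */
        tc->tc_credits = tc->tc_credits_per_period;
        tc->tc_time = time + tc->tc_period;
    }
}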
The evolution of the WRR design solution from simple to complex is shown in Table 33.12.
Problem Statement
Oversubscription for subport traffic class X is a configuration-time event that occurs when more
bandwidth is allocated for traffic class X at the level of subport member pipes than allocated
for the same traffic class at the parent subport level.
The existence of the oversubscription for a specific subport and traffic class is solely the result
of pipe and subport-level configuration as opposed to being created due to dynamic evolution
of the traffic load at run-time (as congestion is).
When the overall demand for traffic class X for the current subport is low, the existence of
the oversubscription condition does not represent a problem, as demand for traffic class X is
completely satisfied for all member pipes. However, this can no longer be achieved when the
aggregated demand for traffic class X for all subport member pipes exceeds the limit configured
at the subport level.
Solution Space
Several possible approaches for handling this problem were evaluated, with the third approach
selected for implementation.
Implementation Overview
The algorithm computes a watermark, which is periodically updated based on the current de-
mand experienced by the subport member pipes, whose purpose is to limit the amount of traffic
that each pipe is allowed to send for TC 3. The watermark is computed at the subport level at
the beginning of each traffic class upper limit enforcement period and the same value is used
by all the subport member pipes throughout the current enforcement period. The watermark
computed at the subport level at the beginning of each period is thus propagated to all
subport member pipes.
At the beginning of the current enforcement period (which coincides with the end of the pre-
vious enforcement period), the value of the watermark is adjusted based on the amount of
bandwidth allocated to TC 3 at the beginning of the previous period that was not left unused
by the subport member pipes at the end of the previous period.
If there was subport TC 3 bandwidth left unused, the value of the watermark for the current
period is increased to encourage the subport member pipes to consume more bandwidth. Oth-
erwise, the value of the watermark is decreased to enforce equality of bandwidth consumption
among subport member pipes for TC 3.
The increase or decrease in the watermark value is done in small increments, so several
enforcement periods might be required to reach the equilibrium state. This state can change
at any moment due to variations in the demand experienced by the subport member pipes for
TC 3, for example, as a result of demand increase (when the watermark needs to be lowered)
or demand decrease (when the watermark needs to be increased).
When demand is low, the watermark is set high to prevent it from impeding the subport member
pipes from consuming more bandwidth. The highest value for the watermark is picked as the
highest rate configured for a subport member pipe. Table 33.14 and Table 33.15 illustrate the
watermark operation.
Table 33.14: Watermark Propagation from Subport Level to Member Pipes at the Beginning of
Each Traffic Class Upper Limit Enforcement Period
No. Subport Traffic Class Operation Description
1 Initialization
    Subport level: subport_period_id = 0
    Pipe level: pipe_period_id = 0
2 Credit update
    Subport level:
    if (time >= subport_tc_time) {
        subport_wm = water_mark_update();
        subport_tc_time = time + subport_tc_period;
        subport_period_id++;
    }
    Pipe level:
    if (pipe_period_id != subport_period_id) {
        pipe_ov_credits = subport_wm * pipe_weight;
        pipe_period_id = subport_period_id;
    }
3 Credit consumption (on packet scheduling)
    Pipe level:
    pkt_credits = pkt_len + frame_overhead;
    if (pipe_ov_credits >= pkt_credits) {
        pipe_ov_credits -= pkt_credits;
    }
The more queues the scheduler has to examine for packets and credits in order to select one
packet, the lower the performance of the scheduler is.
The scheduler maintains the bitmap of active queues, which skips the non-active queues, but
in order to detect whether a specific pipe has enough credits, the pipe has to be drilled down
using the pipe dequeue state machine, which consumes cycles regardless of the scheduling
result (no packets are produced or at least one packet is produced).
This scenario stresses the importance of the policer for scheduler performance: if the pipe
does not have enough credits, its packets should be dropped as soon as possible (before they
reach the hierarchical scheduler), thus rendering the pipe queues inactive and allowing the
dequeue side to skip the pipe without spending cycles investigating its credits only to find that
there are not enough of them.
The port scheduler performance is optimized for a large number of queues. If the number of
queues is small, then the performance of the port scheduler for the same level of active traffic
is expected to be worse than the performance of a small set of message passing queues.
33.3 Dropper
The purpose of the DPDK dropper is to drop packets arriving at a packet scheduler to avoid
congestion. The dropper supports the Random Early Detection (RED), Weighted Random
Early Detection (WRED) and tail drop algorithms. Fig. 33.7 illustrates how the dropper inte-
grates with the scheduler. The DPDK currently does not support congestion management so
the dropper provides the only method for congestion avoidance.
The dropper uses the Random Early Detection (RED) congestion avoidance algorithm as doc-
umented in the reference publication. The purpose of the RED algorithm is to monitor a packet
queue, determine the current congestion level in the queue and decide whether an arriving
packet should be enqueued or dropped. The RED algorithm uses an Exponential Weighted
Moving Average (EWMA) filter to compute average queue size which gives an indication of the
current congestion level in the queue.
For each enqueue operation, the RED algorithm compares the average queue size to minimum
and maximum thresholds. Depending on whether the average queue size is below, above or in
between these thresholds, the RED algorithm calculates the probability that an arriving packet
should be dropped and makes a random decision based on this probability.
The dropper also supports Weighted Random Early Detection (WRED) by allowing the sched-
uler to select different RED configurations for the same packet queue at run-time. In the case
of severe congestion, the dropper resorts to tail drop. This occurs when a packet queue has
reached maximum capacity and cannot store any more packets. In this situation, all arriving
packets are dropped.
The flow through the dropper is illustrated in Fig. 33.8. The RED/WRED algorithm is exercised
first and tail drop second.
The use cases supported by the dropper are:
• Initialize configuration data
• Initialize run-time data
• Enqueue (make a decision to enqueue or drop an arriving packet)
• Mark empty (record the time at which a packet queue becomes empty)
The configuration use case is explained in Section 2.23.3.1, the enqueue operation is explained
in Section 2.23.3.2 and the mark empty operation is explained in Section 2.23.3.3.
33.3.1 Configuration
In the example shown in Fig. 33.9, q (actual queue size) is the input value, avg (average queue
size) and count (number of packets since the last drop) are run-time values, decision is the
output value and the remaining values are configuration parameters.
The purpose of the EWMA Filter microblock is to filter queue size values to smooth out transient
changes that result from “bursty” traffic. The output value is the average queue size which gives
a more stable view of the current congestion level in the queue.
The EWMA filter has one configuration parameter, filter weight, which determines how quickly
or slowly the average queue size output responds to changes in the actual queue size input.
Higher values of filter weight mean that the average queue size responds more quickly to
changes in actual queue size.
When a packet is enqueued on a non-empty queue, the average queue size is computed
according to Equation 1:
avg = (1 - wq) * avg + wq * q
Where:
• avg = average queue size
• wq = filter weight
• q = actual queue size
Note:
The filter weight, wq = 1/2^n, where n is the filter weight parameter value passed to the dropper module
on configuration (see Section 2.23.3.1).
The EWMA filter does not read time stamps and instead assumes that enqueue operations
will happen quite regularly. Special handling is required when the queue becomes empty as
the queue could be empty for a short time or a long time. When the queue becomes empty,
average queue size should decay gradually to zero instead of dropping suddenly to zero or
remaining stagnant at the last computed value. When a packet is enqueued on an empty
queue, the average queue size is computed using the following formula:
avg = (1 - wq)^m * avg (Equation 2)
Where:
• m = the number of enqueue operations that could have occurred on this queue while the
queue was empty
In the dropper module, m is defined as:
m = (time - qtime) / s
Where:
• time = current time
• qtime = time the queue became empty
• s = typical time between successive enqueue operations on this queue
The time reference is in units of bytes, where a byte signifies the time duration required by the
physical interface to send out a byte on the transmission medium (see Section Internal Time
Reference). The parameter s is defined in the dropper module as a constant with the value:
s=2^22. This corresponds to the time required by every leaf node in a hierarchy with 64K leaf
nodes to transmit one 64-byte packet onto the wire and represents the worst case scenario.
For much smaller scheduler hierarchies, it may be necessary to reduce the parameter s, which
is defined in the red header source file (rte_red.h) as:
#define RTE_RED_S
Since the time reference is in bytes, the port speed is implied in the expression: time-qtime.
The dropper does not have to be configured with the actual port speed. It adjusts automatically
to low speed and high speed links.
Implementation
A numerical method is used to compute the factor (1-wq)^m that appears in Equation 2.
This method is based on the following identity:
(1 - wq)^m = 2^(m * log2(1 - wq))
In the dropper module, a look-up table is used to compute log2(1-wq) for each value of wq
supported by the dropper module. The factor (1-wq)^m can then be obtained by multiplying
the table value by m and applying shift operations. To avoid overflow in the multiplication, the
value, m, and the look-up table values are limited to 16 bits. The total size of the look-up table
is 56 bytes. Once the factor (1-wq)^m is obtained using this method, the average queue size
can be calculated from Equation 2.
Alternative Approaches
Other methods for calculating the factor (1-wq)^m in the expression for computing average
queue size when the queue is empty (Equation 2) were considered. These approaches include:
• Floating-point evaluation
• Fixed-point evaluation using a small look-up table (512B) and up to 16 multiplications
(this is the approach used in the FreeBSD* ALTQ RED implementation)
• Fixed-point evaluation using a small look-up table (512B) and 16 SSE multiplications
(SSE optimized version of the approach used in the FreeBSD* ALTQ RED implementa-
tion)
• Large look-up table (76 KB)
The method that was finally selected (described above) outperforms all
of these approaches in terms of run-time performance and memory requirements and also
achieves accuracy comparable to floating-point evaluation. Table 33.17 lists the performance
of each of these alternative approaches relative to the method that is used in the dropper. As
can be seen, the floating-point implementation achieved the worst performance.
The calculation of the drop probability occurs in two stages. An initial drop probability is calcu-
lated based on the average queue size, the minimum and maximum thresholds and the mark
probability. An actual drop probability is then computed from the initial drop probability. The
actual drop probability takes the count run-time value into consideration so that the actual drop
probability increases as more packets arrive to the packet queue since the last packet was
dropped.
The initial drop probability is calculated using Equation 3:
pb = maxp * (avg - minth) / (maxth - minth)
Where:
• maxp = mark probability
• avg = average queue size
• minth = minimum threshold
• maxth = maximum threshold
The calculation of the packet drop probability using Equation 3 is illustrated in Fig. 33.10. If
the average queue size is below the minimum threshold, an arriving packet is enqueued. If the
average queue size is at or above the maximum threshold, an arriving packet is dropped. If
the average queue size is between the minimum and maximum thresholds, a drop probability
is calculated to determine if the packet should be enqueued or dropped.
If the average queue size is between the minimum and maximum thresholds, then the actual
drop probability is calculated from the following equation.
pa = pb / (2 - count * pb) (Equation 4)
Where:
• Pb = initial drop probability (from Equation 3)
• count = number of packets that have arrived since the last drop
The constant 2 in Equation 4 is the only deviation from the drop probability formulae given in
the reference document, where a value of 1 is used instead. It should be noted that the value of pa
computed from Equation 4 can be negative or greater than 1. If this is the case, then a value of 1 should
be used instead.
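The following is a minimal sketch (not the library code, which uses fixed-point arithmetic) of the drop probability calculation per Equations 3 and 4; the function name and the use of floating point are assumptions for clarity:
#include <stdint.h>

static double
red_drop_probability(double avg, double minth, double maxth,
                     double maxp, unsigned int count)
{
    double pb, pa;

    if (avg < minth)
        return 0.0;  /* always enqueue */
    if (avg >= maxth)
        return 1.0;  /* always drop (tail drop applies once the queue is full) */

    /* Equation 3: initial drop probability */
    pb = maxp * (avg - minth) / (maxth - minth);

    /* Equation 4: actual drop probability (factor 2, see text above) */
    pa = pb / (2.0 - count * pb);

    /* Clamp: pa may be negative or greater than 1 */
    if (pa < 0.0 || pa > 1.0)
        pa = 1.0;
    return pa;
}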
The initial and actual drop probabilities are shown in Fig. 33.11. The actual drop probabil-
ity is shown for the case where the formula given in the reference document is used (blue
curve) and also for the case where the formula implemented in the dropper module, is used
(red curve). The formula in the reference document results in a significantly higher drop rate
compared to the mark probability configuration parameter specified by the user. The choice to
deviate from the reference document is simply a design decision and one that has been taken
by other RED implementations, for example, FreeBSD* ALTQ RED.
Fig. 33.11: Initial Drop Probability (pb), Actual Drop probability (pa) Computed Using a Factor
1 (Blue Curve) and a Factor 2 (Red Curve)
The time at which a packet queue becomes empty must be recorded and saved with the RED
run-time data so that the EWMA filter block can calculate the average queue size on the next
enqueue operation. It is the responsibility of the calling application to inform the dropper mod-
ule through the API that a queue has become empty.
The source files for the DPDK dropper are located at:
• DPDK/lib/librte_sched/rte_red.h
• DPDK/lib/librte_sched/rte_red.c
RED functionality in the DPDK QoS scheduler is disabled by default. To enable it, use the
DPDK configuration parameter:
CONFIG_RTE_SCHED_RED=y
This parameter must be set to y. The parameter is found in the build configuration files in
the DPDK/config directory, for example, DPDK/config/common_linuxapp. RED configuration
parameters are specified in the rte_red_params structure within the rte_sched_port_params
structure that is passed to the scheduler on initialization. RED parameters are specified sep-
arately for four traffic classes and three packet colors (green, yellow and red) allowing the
scheduler to implement Weighted Random Early Detection (WRED).
The DPDK QoS Scheduler Application reads a configuration file on start-up. The configura-
tion file includes a section containing RED parameters. The format of these parameters is
described in Section 2.23.3.1. A sample RED configuration is shown below. In this example,
the queue size is 64 packets.
Note: For correct operation, the same EWMA filter weight parameter (wred weight) should be
used for each packet color (green, yellow, red) in the same traffic class (tc).
; RED params per traffic class and color (Green / Yellow / Red)
[red]
tc 0 wred min = 28 22 16
tc 0 wred max = 32 32 32
tc 0 wred inv prob = 10 10 10
tc 0 wred weight = 9 9 9
tc 1 wred min = 28 22 16
tc 1 wred max = 32 32 32
tc 1 wred inv prob = 10 10 10
tc 1 wred weight = 9 9 9
tc 2 wred min = 28 22 16
tc 2 wred max = 32 32 32
tc 2 wred inv prob = 10 10 10
tc 2 wred weight = 9 9 9
tc 3 wred min = 28 22 16
tc 3 wred max = 32 32 32
tc 3 wred inv prob = 10 10 10
tc 3 wred weight = 9 9 9
With this configuration file, the RED configuration that applies to green, yellow and red packets
in traffic class 0 is shown in Table 33.18.
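As a sketch, assuming the red_params field layout of rte_sched_port_params when RTE_SCHED_RED is enabled, the traffic class 0 rows of the sample configuration above could be expressed programmatically as follows (the remaining scheduler parameters and traffic classes are omitted):
#include <rte_sched.h>
#include <rte_meter.h>
#include <rte_red.h>

/* RED parameters for traffic class 0, colors green/yellow/red: thresholds in
 * packets, maxp_inv = inverse mark probability, wq_log2 = filter weight. */
static struct rte_sched_port_params port_params = {
    /* ... other port parameters ... */
#ifdef RTE_SCHED_RED
    .red_params = {
        [0][e_RTE_METER_GREEN]  = {.min_th = 28, .max_th = 32, .maxp_inv = 10, .wq_log2 = 9},
        [0][e_RTE_METER_YELLOW] = {.min_th = 22, .max_th = 32, .maxp_inv = 10, .wq_log2 = 9},
        [0][e_RTE_METER_RED]    = {.min_th = 16, .max_th = 32, .maxp_inv = 10, .wq_log2 = 9},
        /* ... traffic classes 1..3 configured similarly ... */
    },
#endif
};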
Enqueue API
The arguments passed to the enqueue API are configuration data, run-time data, the current
size of the packet queue (in packets) and a value representing the current time. The time
reference is in units of bytes, where a byte signifies the time duration required by the physical
interface to send out a byte on the transmission medium (see Section 26.2.4.5.1 “Internal Time
Reference” ). The dropper reuses the scheduler time stamps for performance reasons.
Empty API
The arguments passed to the empty API are run-time data and the current time in bytes.
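The following sketch illustrates direct use of the enqueue and empty APIs (check rte_red.h for the exact prototypes; the parameter values mirror the sample configuration above, q is the current queue size in packets and time is the byte-time stamp reused from the scheduler):
#include <stdint.h>
#include <rte_red.h>

static int
red_example(unsigned int q, uint64_t time)
{
    struct rte_red_config red_cfg; /* configuration data */
    struct rte_red red;            /* run-time data for one queue */

    /* wq_log2 = 9, min_th = 28, max_th = 32, maxp_inv = 10 */
    rte_red_config_init(&red_cfg, 9, 28, 32, 10);
    rte_red_rt_data_init(&red);

    /* Enqueue decision: 0 means enqueue the arriving packet, non-zero means drop it. */
    if (rte_red_enqueue(&red_cfg, &red, q, time) != 0)
        return -1; /* drop */

    /* Later, when the queue becomes empty, record the time for the EWMA filter. */
    rte_red_mark_queue_empty(&red, time);
    return 0;
}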
33.4 Traffic Metering
The traffic metering component implements the Single Rate Three Color Marker (srTCM) and
Two Rate Three Color Marker (trTCM) algorithms, as defined by IETF RFC 2697 and 2698
respectively. These algorithms meter the stream of incoming packets based on the allowance
defined in advance for each traffic flow. As result, each incoming packet is tagged as green,
yellow or red based on the monitored consumption of the flow the packet belongs to.
The srTCM algorithm defines two token buckets for each traffic flow, with the two buckets
sharing the same token update rate:
• Committed (C) bucket: fed with tokens at the rate defined by the Committed Information
Rate (CIR) parameter (measured in IP packet bytes per second). The size of the C bucket
is defined by the Committed Burst Size (CBS) parameter (measured in bytes);
• Excess (E) bucket: fed with tokens at the same rate as the C bucket. The size of the E
bucket is defined by the Excess Burst Size (EBS) parameter (measured in bytes).
The trTCM algorithm defines two token buckets for each traffic flow, with the two buckets being
updated with tokens at independent rates:
• Committed (C) bucket: fed with tokens at the rate defined by the Committed Information
Rate (CIR) parameter (measured in bytes of IP packet per second). The size of the C
bucket is defined by the Committed Burst Size (CBS) parameter (measured in bytes);
• Peak (P) bucket: fed with tokens at the rate defined by the Peak Information Rate (PIR)
parameter (measured in IP packet bytes per second). The size of the P bucket is defined
by the Peak Burst Size (PBS) parameter (measured in bytes).
Please refer to RFC 2697 (for srTCM) and RFC 2698 (for trTCM) for details on how tokens are
consumed from the buckets and how the packet color is determined.
For both algorithms, the color blind mode is functionally equivalent to the color aware mode
with input color set as green. For color aware mode, a packet with red input color can only get
the red output color, while a packet with yellow input color can only get the yellow or red output
colors.
The reason why the color blind mode is still implemented distinctly from the color aware mode
is that the color blind mode can be implemented with fewer operations than the color aware mode.
For each input packet, the steps for the srTCM / trTCM algorithms are:
• Update the C and E / P token buckets. This is done by reading the current time (from
the CPU timestamp counter), identifying the amount of time since the last bucket update
and computing the associated number of tokens (according to the pre-configured bucket
rate). The number of tokens in the bucket is limited by the pre-configured bucket size;
• Identify the output color for the current packet based on the size of the IP packet and the
amount of tokens currently available in the C and E / P buckets; for color aware mode
only, the input color of the packet is also considered. When the output color is not red, a
number of tokens equal to the length of the IP packet are subtracted from the C or E /P
or both buckets, depending on the algorithm and the output color of the packet.
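The steps above can be exercised through librte_meter, as in the hedged sketch below; the parameter values are illustrative (CIR in bytes of IP packet per second, CBS/EBS in bytes) and the function names meter_setup/meter_one_packet are assumptions:
#include <stdint.h>
#include <rte_meter.h>
#include <rte_cycles.h>

static void
meter_setup(struct rte_meter_srtcm *m)
{
    struct rte_meter_srtcm_params params = {
        .cir = 1250000, /* 10 Mbps expressed in bytes/second */
        .cbs = 2048,
        .ebs = 2048,
    };

    rte_meter_srtcm_config(m, &params);
}

static enum rte_meter_color
meter_one_packet(struct rte_meter_srtcm *m, uint32_t pkt_len)
{
    /* Color-blind check: updates the C and E buckets from the current TSC
     * time, then determines the output color for a pkt_len byte IP packet. */
    return rte_meter_srtcm_color_blind_check(m, rte_rdtsc(), pkt_len);
}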
THIRTYFOUR
POWER MANAGEMENT
The DPDK Power Management feature allows user space applications to save power by dy-
namically adjusting CPU frequency or entering into different C-States.
• Adjusting the CPU frequency dynamically according to the utilization of RX queue.
• Entering different deeper C-States, according to adaptive algorithms that speculate on
brief periods of time during which the application can be suspended because no packets are received.
The interfaces for adjusting the operating CPU frequency are in the power management library.
C-State control is implemented in applications according to the different use cases.
The Linux kernel provides a cpufreq module for CPU frequency scaling for each lcore. For
example, for cpuX, /sys/devices/system/cpu/cpuX/cpufreq/ has the following sys files for fre-
quency scaling:
• affected_cpus
• bios_limit
• cpuinfo_cur_freq
• cpuinfo_max_freq
• cpuinfo_min_freq
• cpuinfo_transition_latency
• related_cpus
• scaling_available_frequencies
• scaling_available_governors
• scaling_cur_freq
• scaling_driver
• scaling_governor
• scaling_max_freq
• scaling_min_freq
• scaling_setspeed
In the DPDK, scaling_governor is configured in user space. Then, a user space application
can prompt the kernel by writing scaling_setspeed to adjust the CPU frequency according to
the strategies defined by the user space application.
Core state can be altered by speculative sleeps whenever the specified lcore has nothing to
do. In the DPDK, if no packet is received after polling, speculative sleeps can be triggered
according to the strategies defined by the user space application.
Individual cores can be allowed to enter a Turbo Boost state on a per-core basis. This is
achieved by enabling Turbo Boost Technology in the BIOS, then looping through the relevant
cores and enabling/disabling Turbo Boost on each core.
The main methods exported by power library are for CPU frequency scaling and include the
following:
• Freq up: Prompt the kernel to scale up the frequency of the specific lcore.
• Freq down: Prompt the kernel to scale down the frequency of the specific lcore.
• Freq max: Prompt the kernel to scale up the frequency of the specific lcore to the maxi-
mum.
• Freq min: Prompt the kernel to scale down the frequency of the specific lcore to the
minimum.
• Get available freqs: Read the available frequencies of the specific lcore from the sys
file.
• Freq get: Get the current frequency of the specific lcore.
• Freq set: Prompt the kernel to set the frequency for the specific lcore.
• Enable turbo: Prompt the kernel to enable Turbo Boost for the specific lcore.
• Disable turbo: Prompt the kernel to disable Turbo Boost for the specific lcore.
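The methods listed above map onto the librte_power API, as in this hedged sketch (error handling omitted, lcore_id assumed valid):
#include <rte_power.h>

static void
power_example(unsigned int lcore_id)
{
    /* Initialize power management for the lcore (sets up cpufreq sysfs access). */
    rte_power_init(lcore_id);

    /* Scale the lcore frequency up/down by one step, or jump to max/min. */
    rte_power_freq_up(lcore_id);
    rte_power_freq_down(lcore_id);
    rte_power_freq_max(lcore_id);
    rte_power_freq_min(lcore_id);

    /* Enable or disable Turbo Boost for this lcore. */
    rte_power_freq_enable_turbo(lcore_id);
    rte_power_freq_disable_turbo(lcore_id);

    /* Release the resources when done. */
    rte_power_exit(lcore_id);
}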
The power management mechanism is used to save power when performing L3 forwarding.
34.6 References
THIRTYFIVE
PACKET CLASSIFICATION AND ACCESS CONTROL
The DPDK provides an Access Control library that gives the ability to classify an input packet
based on a set of classification rules.
The ACL library is used to perform an N-tuple search over a set of rules with multiple categories
and find the best match (highest priority) for each category. The library API provides the
following basic operations:
• Create a new Access Control (AC) context.
• Add rules into the context.
• For all rules in the context, build the runtime structures necessary to perform packet
classification.
• Perform input packet classifications.
• Destroy an AC context and its runtime structures and free the associated memory.
35.1 Overview
The current implementation allows the user to specify, for each AC context, its own rule definition (set of
fields) over which packet classification will be performed. There are, however, a few restrictions on
the rule field layout:
• The first field in the rule definition has to be one byte long.
• All subsequent fields have to be grouped into sets of 4 consecutive bytes.
This is done mainly for performance reasons: the search function processes the first input byte as
part of the flow setup and then the inner loop of the search function is unrolled to process four
input bytes at a time.
To define each field inside an AC rule, the following structure is used:
struct rte_acl_field_def {
uint8_t type; /*< type - ACL_FIELD_TYPE. */
uint8_t size; /*< size of field 1,2,4, or 8. */
uint8_t field_index; /*< index of field inside the rule. */
uint8_t input_index; /*< 0-N input index. */
uint32_t offset; /*< offset to start of field. */
};
• type The field type is one of three choices:
– _MASK - for fields such as IP addresses that have a value and a mask defining the
number of relevant bits.
– _RANGE - for fields such as ports that have a lower and upper value for the field.
– _BITMASK - for fields such as protocol identifiers that have a value and a bit mask.
• size The size parameter defines the length of the field in bytes. Allowable values are 1,
2, 4, or 8 bytes. Note that due to the grouping of input bytes, 1 or 2 byte fields must be
defined as consecutive fields that make up 4 consecutive input bytes. Also, it is best to
define fields of 8 or more bytes as 4 byte fields so that the build processes can eliminate
fields that are all wild.
• field_index A zero-based value that represents the position of the field inside the rule; 0
to N-1 for N fields.
• input_index As mentioned above, all input fields, except the very first one, must be in
groups of 4 consecutive bytes. The input index specifies the input group to which the field
belongs.
• offset The offset field defines the offset for the field. This is the offset from the beginning
of the buffer parameter for the search.
For example, to define classification for the following IPv4 5-tuple structure:
struct ipv4_5tuple {
uint8_t proto;
uint32_t ip_src;
uint32_t ip_dst;
uint16_t port_src;
uint16_t port_dst;
};
struct rte_acl_field_def ipv4_defs[5] = {
/* first input field - always one byte long. */
{
.type = RTE_ACL_FIELD_TYPE_BITMASK,
.size = sizeof (uint8_t),
.field_index = 0,
.input_index = 0,
.offset = offsetof (struct ipv4_5tuple, proto),
},
/* next input field (IPv4 source address) - 4 consecutive bytes. */
{
.type = RTE_ACL_FIELD_TYPE_MASK,
.size = sizeof (uint32_t),
.field_index = 1,
.input_index = 1,
.offset = offsetof (struct ipv4_5tuple, ip_src),
},
/* next input field (IPv4 destination address) - 4 consecutive bytes. */
{
.type = RTE_ACL_FIELD_TYPE_MASK,
.size = sizeof (uint32_t),
.field_index = 2,
.input_index = 2,
.offset = offsetof (struct ipv4_5tuple, ip_dst),
},
/*
* Next 2 fields (src & dst ports) form 4 consecutive bytes.
* They share the same input index.
*/
{
.type = RTE_ACL_FIELD_TYPE_RANGE,
.size = sizeof (uint16_t),
.field_index = 3,
.input_index = 3,
.offset = offsetof (struct ipv4_5tuple, port_src),
},
{
.type = RTE_ACL_FIELD_TYPE_RANGE,
.size = sizeof (uint16_t),
.field_index = 4,
.input_index = 3,
.offset = offsetof (struct ipv4_5tuple, port_dst),
},
};
For example, a rule over these fields could match any IPv4 packet with protocol ID 17 (UDP),
source address 192.168.1.[0-255], destination address 192.168.2.31, any source port [0-65535]
and destination port 1234.
To define classification for the IPv6 2-tuple: <protocol, IPv6 source address> over the following
IPv6 header structure:
struct ipv6_hdr {
uint32_t vtc_flow; /* IP version, traffic class & flow label. */
uint16_t payload_len; /* IP packet length - includes sizeof(ip_header). */
uint8_t proto; /* Protocol, next header. */
uint8_t hop_limits; /* Hop limits. */
uint8_t src_addr[16]; /* IP address of source host. */
uint8_t dst_addr[16]; /* IP address of destination host(s). */
} __attribute__((__packed__));
{
.type = RTE_ACL_FIELD_TYPE_MASK,
.size = sizeof (uint32_t),
.field_index = 1,
.input_index = 1,
.offset = offsetof (struct ipv6_hdr, src_addr[0]),
},
{
.type = RTE_ACL_FIELD_TYPE_MASK,
.size = sizeof (uint32_t),
.field_index = 2,
.input_index = 2,
.offset = offsetof (struct ipv6_hdr, src_addr[4]),
},
{
.type = RTE_ACL_FIELD_TYPE_MASK,
.size = sizeof (uint32_t),
.field_index = 3,
.input_index = 3,
.offset = offsetof (struct ipv6_hdr, src_addr[8]),
},
{
.type = RTE_ACL_FIELD_TYPE_MASK,
.size = sizeof (uint32_t),
.field_index = 4,
.input_index = 4,
.offset = offsetof (struct ipv6_hdr, src_addr[12]),
},
};
For example, a rule over these fields could match any IPv6 packet with protocol ID 6 (TCP) and
source address inside the range
[2001:db8:1234:0000:0000:0000:0000:0000 - 2001:db8:1234:ffff:ffff:ffff:ffff:ffff].
In the following example, the last element of the search key is only 8 bits long, so it is a case
where the 4 consecutive bytes of an input field are not fully occupied. The structure for the
classification is:
struct acl_key {
uint8_t ip_proto;
uint32_t ip_src;
uint32_t ip_dst;
uint8_t tos; /*< This is partially using a 32-bit input element */
};
{
.type = RTE_ACL_FIELD_TYPE_MASK,
.size = sizeof (uint32_t),
.field_index = 2,
.input_index = 2,
.offset = offsetof (struct acl_key, ip_dst),
},
/*
* Next element of the search key (Type of Service) is only 1 byte long,
* but all 4 consecutive bytes still need to be allocated for it.
*/
{
.type = RTE_ACL_FIELD_TYPE_BITMASK,
.size = sizeof (uint32_t), /* All the 4 consecutive bytes are allocated */
.field_index = 3,
.input_index = 3,
.offset = offsetof (struct acl_key, tos),
},
};
For example, a rule over these fields could match any IPv4 packet with protocol ID 6 (TCP),
source address 192.168.1.[0-255], destination address 192.168.2.31 and ToS 1.
When creating a set of rules, the following additional information must also be supplied for each rule:
• priority: A weight to measure the priority of the rules (higher is better). If the input tuple
matches more than one rule, then the rule with the higher priority is returned. Note that
if the input tuple matches more than one rule and these rules have equal priority, it is
undefined which rule is returned as a match. It is recommended to assign a unique
priority for each rule.
• category_mask: Each rule uses a bit mask value to select the relevant category(s) for
the rule. When a lookup is performed, the result for each category is returned. This ef-
fectively provides a “parallel lookup” by enabling a single search to return multiple results
if, for example, there were four different sets of ACL rules, one for access control, one for
routing, and so on. Each set could be assigned its own category and by combining them
into a single database, one lookup returns a result for each of the four sets.
• userdata: A user-defined value. For each category, a successful match returns the
userdata field of the highest priority matched rule. When no rules match, returned value
is zero.
Note: When adding new rules into an ACL context, all fields must be in host byte order (LSB).
When the search is performed for an input tuple, all fields in that tuple must be in network byte
order (MSB).
The build phase (rte_acl_build()) creates, for a given set of rules, the internal structures used for run-
time traversal. In the current implementation this is a set of multi-bit tries (with stride == 8).
Depending on the rule set, this can consume a significant amount of memory. In an attempt
to conserve space, the ACL build process tries to split the given rule set into several non-
intersecting subsets and construct a separate trie for each of them. Depending on the rule set,
this might reduce the RT memory requirements, but might increase classification time. It is
possible at build time to specify a maximum memory limit for the internal RT structures of a given
AC context. This is done via the max_size field of the rte_acl_config structure. Setting it to
a value greater than zero instructs rte_acl_build() to:
• attempt to minimize the number of tries in the RT table, but
• make sure that the size of the RT table does not exceed the given value.
Setting it to zero makes rte_acl_build() use the default behavior: try to minimize the size of the
RT structures, without imposing any hard limit on it.
This gives the user the ability to make decisions about the performance/space trade-off. For example:
struct rte_acl_ctx * acx;
struct rte_acl_config cfg;
int ret;
/*
* assuming that acx points to already created and
* populated with rules AC context and cfg filled properly.
*/
/* Try to build the AC context, with RT structures limited to 8MB. */
cfg.max_size = 0x800000;
ret = rte_acl_build(acx, &cfg);
/*
* RT structures can't fit into 8MB for the given context.
* Try to build without exposing any hard limit.
*/
if (ret == -ERANGE) {
cfg.max_size = 0;
ret = rte_acl_build(acx, &cfg);
}
After rte_acl_build() has finished successfully for a given AC context, that context can be used to perform
classification - a search for the rule with the highest priority over the input data. There are several
implementations of the classify algorithm:
• RTE_ACL_CLASSIFY_SCALAR: generic implementation, doesn’t require any specific
HW support.
• RTE_ACL_CLASSIFY_SSE: vector implementation, can process up to 8 flows in paral-
lel. Requires SSE 4.1 support.
• RTE_ACL_CLASSIFY_AVX2: vector implementation, can process up to 16 flows in par-
allel. Requires AVX2 support.
It is purely a runtime decision which method to choose; there is no build-time difference. All
implementations operate over the same internal RT structures and use similar principles. The
main difference is that the vector implementations exploit IA SIMD instructions and
process several input data flows in parallel. At startup, the ACL library determines the highest
classify method available for the given platform and sets it as the default one. However, the user can
override the default classifier function for a given ACL context or perform a particular
search using a non-default classify method. In that case, it is the user's responsibility to make sure
that the given platform supports the selected classify implementation.
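As a brief illustration, the default classify method can be overridden for a given context with rte_acl_set_ctx_classify(); acx here is the AC context from the earlier example:
#include <rte_acl.h>

/*
 * Explicitly select the scalar classify method for the acx context (for
 * example, when the application cannot guarantee SSE4.1 or AVX2 support).
 * Returns 0 on success.
 */
int ret = rte_acl_set_ctx_classify(acx, RTE_ACL_CLASSIFY_SCALAR);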
Note: For more details about the Access Control API, please refer to the DPDK API Refer-
ence.
The following example demonstrates IPv4, 5-tuple classification for rules defined above with
multiple categories in more detail.
RTE_ACL_RULE_DEF(acl_ipv4_rule, RTE_DIM(ipv4_defs));
/* destination IPv4 */
.field[2] = {.value.u32 = IPv4(192,168,0,0), .mask_range.u32 = 16,},
/* source port */
.field[3] = {.value.u16 = 0, .mask_range.u16 = 0xffff,},
/* destination port */
.field[4] = {.value.u16 = 0, .mask_range.u16 = 0xffff,},
},
/* destination IPv4 */
.field[2] = {.value.u32 = IPv4(192,168,1,0), .mask_range.u32 = 24,},
/* source port */
.field[3] = {.value.u16 = 0, .mask_range.u16 = 0xffff,},
/* destination port */
.field[4] = {.value.u16 = 0, .mask_range.u16 = 0xffff,},
},
/* source IPv4 */
.field[1] = {.value.u32 = IPv4(10,1,1,1), .mask_range.u32 = 32,},
/* source port */
.field[3] = {.value.u16 = 0, .mask_range.u16 = 0xffff,},
/* destination port */
.field[4] = {.value.u16 = 0, .mask_range.u16 = 0xffff,},
},
};
cfg.num_categories = 2;
cfg.num_fields = RTE_DIM(ipv4_defs);
For a tuple with source IP address: 10.1.1.1 and destination IP address: 192.168.1.15, once
the following lines are executed:
uint32_t results[4]; /* make classify for 4 categories. */
/* "data" points to the input tuple being classified (illustrative name). */
rte_acl_classify(acx, data, results, 1, 4);
then the results will be:
• For category 0, both rules 1 and 2 match, but rule 2 has higher priority, therefore results[0]
contains the userdata for rule 2.
• For category 1, both rules 1 and 3 match, but rule 3 has higher priority, therefore results[1]
contains the userdata for rule 3.
• For categories 2 and 3, there are no matches, so results[2] and results[3] contain zero,
which indicates that no matches were found for those categories.
For a tuple with source IP address: 192.168.1.1 and destination IP address: 192.168.2.11,
once the following lines are executed:
uint32_t results[4]; /* make classify by 4 categories. */
THIRTYSIX
PACKET FRAMEWORK
36.1 Design Objectives
The main design objectives for the DPDK Packet Framework are:
• Provide standard methodology to build complex packet processing pipelines. Provide
reusable and extensible templates for the commonly used pipeline functional blocks;
• Provide capability to switch between pure software and hardware-accelerated implemen-
tations for the same pipeline functional block;
• Provide the best trade-off between flexibility and performance. Hardcoded pipelines usu-
ally provide the best performance, but are not flexible, while flexible frameworks are easier
to develop but usually have lower performance;
• Provide a framework that is logically similar to Open Flow.
36.2 Overview
Packet processing applications are frequently structured as pipelines of multiple stages, with
the logic of each stage glued around a lookup table. For each incoming packet, the table
defines the set of actions to be applied to the packet, as well as the next stage to send the
packet to.
The DPDK Packet Framework minimizes the development effort required to build packet pro-
cessing pipelines by defining a standard methodology for pipeline development, as well as
providing libraries of reusable templates for the commonly used pipeline blocks.
The pipeline is constructed by connecting the set of input ports with the set of output ports
through the set of tables in a tree-like topology. As a result of the lookup operation for the current
packet in the current table, one of the table entries (on lookup hit) or the default table entry (on
lookup miss) provides the set of actions to be applied to the current packet, as well as the next
hop for the packet, which can be either another table, an output port or packet drop.
An example of packet processing pipeline is presented in Fig. 36.1:
Fig. 36.1: Example of Packet Processing Pipeline where Input Ports 0 and 1 are Connected
with Output Ports 0, 1 and 2 through Tables 0 and 1
Table 36.1 is a non-exhaustive list of ports that can be implemented with the Packet Framework.
Each port is unidirectional, i.e. either an input port or an output port. Each input/output port is
required to implement an abstract interface that defines the initialization and run-time operation
of the port. The port abstract interface is described in Table 36.2.
Table 36.3 is a non-exhaustive list of types of tables that can be implemented with the Packet
Framework.
Each table is required to implement an abstract interface that defines the initialization and
run-time operation of the table. The table abstract interface is described in Table 36.4.
Hash tables are important because the key lookup operation is optimized for speed: instead of
having to linearly search the lookup key through all the keys in the table, the search is limited
to only the keys stored in a single table bucket.
Associative Arrays
An associative array is a function that can be specified as a set of (key, value) pairs, with each
key from the possible set of input keys present at most once. For a given associative array, the
possible operations are:
1. add (key, value): When no value is currently associated with key, then the (key, value) as-
sociation is created. When key is already associated with a value value0, then the association
(key, value0) is removed and the association (key, value) is created;
2. delete key: When no value is currently associated with key, this operation has no effect.
When key is already associated with a value, then the association (key, value) is removed;
3. lookup key : When no value is currently associated with key, then this operation returns
void value (lookup miss). When key is associated with value, then this operation returns
value. The (key, value) association is not changed.
The matching criterion used to compare the input key against the keys in the associative array
is exact match, as the key size (number of bytes) and the key value (array of bytes) have to
match exactly for the two keys under comparison.
Hash Function
A hash function deterministically maps data of variable length (key) to data of fixed size (hash
value or key signature). Typically, the size of the key is bigger than the size of the key signature.
The hash function basically compresses a long key into a short signature. Several keys can
share the same signature (collisions).
High quality hash functions have uniform distribution. For large number of keys, when dividing
the space of signature values into a fixed number of equal intervals (buckets), it is desirable
to have the key signatures evenly distributed across these intervals (uniform distribution), as
opposed to most of the signatures going into only a few of the intervals and the rest of the
intervals being largely unused (non-uniform distribution).
Hash Table
A hash table is an associative array that uses a hash function for its operation. The reason for
using a hash function is to optimize the performance of the lookup operation by minimizing the
number of table keys that have to be compared against the input key.
Instead of storing the (key, value) pairs in a single list, the hash table maintains multiple lists
(buckets). For any given key, there is a single bucket where that key might exist, and this bucket
is uniquely identified based on the key signature. Once the key signature is computed and the
hash table bucket identified, the key is either located in this bucket or it is not present in the
hash table at all, so the key search can be narrowed down from the full set of keys currently in
the table to just the set of keys currently in the identified table bucket.
The performance of the hash table lookup operation is greatly improved, provided that the table
keys are evenly distributed among the hash table buckets, which can be achieved by using a
hash function with uniform distribution. The rule to map a key to its bucket can simply be to
use the key signature (modulo the number of table buckets) as the table bucket ID:
bucket_id = f_hash(key) % n_buckets;
By selecting the number of buckets to be a power of two, the modulo operator can be replaced
by a bitwise AND logical operation:
bucket_id = f_hash(key) & (n_buckets - 1);
Considering n_bits as the number of bits set in bucket_mask = n_buckets - 1, this means that
all the keys that end up in the same hash table bucket have the lower n_bits of their signature
identical. In order to reduce the number of keys in the same bucket (collisions), the number of
hash table buckets needs to be increased.
In packet processing context, the sequence of operations involved in hash table operations is
described in Fig. 36.2:
Fig. 36.2: Sequence of Steps for Hash Table Operations in a Packet Processing Context
Flow Classification
Description: The flow classification is executed at least once for each input packet. This oper-
ation maps each incoming packet against one of the known traffic flows in the flow database
that typically contains millions of flows.
Hash table name: Flow classification table
Number of keys: Millions
Key format: n-tuple of packet fields that uniquely identify a traffic flow/connection. Example:
DiffServ 5-tuple of (Source IP address, Destination IP address, L4 protocol, L4 protocol source
port, L4 protocol destination port). For IPv4 protocol and L4 protocols like TCP, UDP or SCTP,
the size of the DiffServ 5-tuple is 13 bytes, while for IPv6 it is 37 bytes.
Key value (key data): actions and action meta-data describing what processing to be applied
for the packets of the current flow. The size of the data associated with each traffic flow can
vary from 8 bytes to kilobytes.
Address Resolution Protocol (ARP)
Description: Once a route has been identified for an IP packet (so the output interface and
the IP address of the next hop station are known), the MAC address of the next hop station is
needed in order to send this packet onto the next leg of the journey towards its destination (as
identified by its destination IP address). The MAC address of the next hop station becomes
the destination MAC address of the outgoing Ethernet frame.
Hash table name: ARP table
Number of keys: Thousands
Key format: The pair of (Output interface, Next Hop IP address), which is typically 5 bytes for
IPv4 and 17 bytes for IPv6.
Key value (key data): MAC address of the next hop station (6 bytes).
Table 36.5 lists the hash table configuration parameters shared by all different hash table types.
Table 36.5: Configuration Parameters Common for All Hash Table Types
# Parameter Details
1 Key size Measured as number of bytes. All keys have the same size.
2 Key value (key data) size Measured as number of bytes.
3 Number of buckets Needs to be a power of two.
4 Maximum number of keys Needs to be a power of two.
5 Hash function Examples: jhash, CRC hash, etc.
6 Hash function seed Parameter to be passed to the hash function.
7 Key offset Offset of the lookup key byte array within the packet meta-data stored in the packet buffer.
On initialization, each hash table bucket is allocated space for exactly 4 keys. As keys are
added to the table, it can happen that a given bucket already has 4 keys when a new key has
to be added to this bucket. The possible options are:
1. Least Recently Used (LRU) Hash Table. One of the existing keys in the bucket is
deleted and the new key is added in its place. The number of keys in each bucket never
grows bigger than 4. The logic to pick the key to be dropped from the bucket is LRU. The
hash table lookup operation maintains the order in which the keys in the same bucket are
hit, so every time a key is hit, it becomes the new Most Recently Used (MRU) key, i.e.
the last candidate for drop. When a key is added to the bucket, it also becomes the new
MRU key. When a key needs to be picked and dropped, the first candidate for drop, i.e.
the current LRU key, is always picked. The LRU logic requires maintaining specific data
structures per each bucket.
2. Extendable Bucket Hash Table. The bucket is extended with space for 4 more keys.
This is done by allocating additional memory at table initialization time, which is used to
create a pool of free keys (the size of this pool is configurable and always a multiple of 4).
On key add operation, the allocation of a group of 4 keys only happens successfully within
the limit of free keys, otherwise the key add operation fails. On key delete operation, a
group of 4 keys is freed back to the pool of free keys when the key to be deleted is the
only key that was used within its group of 4 keys at that time. On key lookup operation,
if the current bucket is in extended state and a match is not found in the first group of 4
keys, the search continues beyond the first group of 4 keys, potentially until all keys in
this bucket are examined. The extendable bucket logic requires maintaining specific data
structures per table and per each bucket.
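The kind of per-bucket LRU bookkeeping described above can be sketched as follows, using a
plain 4-entry array for clarity (the actual tables pack this list into an 8-byte field, two bytes per
entry); the helper name and layout are illustrative, not the DPDK implementation.

#include <stdint.h>

/* Illustrative only: maintain the per-bucket LRU order when key 'pos'
 * (0 .. 3) is hit. lru[0] holds the index of the MRU key and lru[3] the
 * index of the LRU key, i.e. the first candidate for drop. */
static void
bucket_lru_hit(uint8_t lru[4], uint8_t pos)
{
    int i, j;

    /* Find where 'pos' currently sits in the LRU order. */
    for (i = 0; i < 4; i++)
        if (lru[i] == pos)
            break;
    if (i == 4)  /* defensive: a hit key should always be in the list */
        i = 3;

    /* Shift the more recently used entries down by one slot... */
    for (j = i; j > 0; j--)
        lru[j] = lru[j - 1];

    /* ...and make the hit key the new MRU key. */
    lru[0] = pos;
}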
Signature Computation
Table 36.7: Configuration Parameters Specific to Pre-computed Key Signature Hash Table
# Parameter Details
1 Signature Offset of the pre-computed key signature within the packet
offset meta-data.
For specific key sizes, the data structures and algorithm of the key lookup operation can be
specially handcrafted for further performance improvements, so the following options are possible:
1. Implementation supporting configurable key size.
2. Implementation supporting a single key size. Typical key sizes are 8 bytes and 16
bytes.
The performance of the bucket search logic is one of the main factors influencing the perfor-
mance of the key lookup operation. The data structures and algorithm are designed to make
the best use of Intel CPU architecture resources like: cache memory space, cache memory
bandwidth, external memory bandwidth, multiple execution units working in parallel, out of
order instruction execution, special CPU instructions, etc.
The bucket search logic handles multiple input packets in parallel. It is built as a pipeline of
several stages (3 or 4), with each pipeline stage handling two different packets from the burst
of input packets. On each pipeline iteration, the packets are pushed to the next pipeline stage:
for the 4-stage pipeline, two packets (that just completed stage 3) exit the pipeline, two packets
(that just completed stage 2) are now executing stage 3, two packets (that just completed stage
1) are now executing stage 2, two packets (that just completed stage 0) are now executing
stage 1 and two packets (next two packets to read from the burst of input packets) are entering
the pipeline to execute stage 0. The pipeline iterations continue until all packets from the burst
of input packets execute the last stage of the pipeline.
The bucket search logic is broken into pipeline stages at the boundary of the next memory
access. Each pipeline stage uses data structures that are stored (with high probability) into
the L1 or L2 cache memory of the current CPU core and breaks just before the next memory
access required by the algorithm. The current pipeline stage finalizes by prefetching the data
structures required by the next pipeline stage, so given enough time for the prefetch to com-
plete, when the next pipeline stage eventually gets executed for the same packets, it will read
the data structures it needs from L1 or L2 cache memory and thus avoid the significant penalty
incurred by L2 or L3 cache memory miss.
By prefetching the data structures required by the next pipeline stage in advance (before they
are used) and switching to executing another pipeline stage for different packets, the number of
L2 or L3 cache memory misses is greatly reduced, hence one of the main reasons for improved
performance. This is because the cost of L2/L3 cache memory miss on memory read accesses
is high, as usually due to data dependency between instructions, the CPU execution units have
to stall until the read operation is completed from L3 cache memory or external DRAM memory.
By using prefetch instructions, the latency of memory read accesses is hidden, provided that the
prefetch is performed early enough before the respective data structure is actually used.
By splitting the processing into several stages that are executed on different packets (the pack-
ets from the input burst are interlaced), enough work is created to allow the prefetch instruc-
tions to complete successfully (before the prefetched data structures are actually accessed)
and also the data dependency between instructions is loosened. For example, for the 4-stage
pipeline, stage 0 is executed on packets 0 and 1 and then, before same packets 0 and 1 are
used (i.e. before stage 1 is executed on packets 0 and 1), different packets are used: packets
2 and 3 (executing stage 1), packets 4 and 5 (executing stage 2) and packets 6 and 7 (exe-
cuting stage 3). By executing useful work while the data structures are brought into the L1 or
L2 cache memory, the latency of the read memory accesses is hidden. By increasing the gap
between two consecutive accesses to the same data structure, the data dependency between
instructions is loosened; this allows making the best use of the super-scalar and out-of-order
execution CPU architecture, as the number of CPU core execution units that are active (rather
than idle or stalled due to data dependency constraints between instructions) is maximized.
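The sketch below illustrates this prefetch-and-switch idea in a much simplified form: a two-stage
loop that searches the bucket prefetched two iterations earlier while prefetching the bucket
needed by a later lookup. It handles one lookup per stage instead of two and omits the key
comparison details; the structures, names and the signature convention are assumptions for
illustration, not the DPDK code.

#include <stdint.h>
#include <rte_prefetch.h>

#define N_BUCKETS 1024  /* power of two, as required above */

struct bucket {
    uint16_t sig[4];      /* signatures of the 4 keys in the bucket */
    uint32_t key_pos[4];  /* indexes into the key/data arrays */
};

/* Two-stage software pipeline over a burst of pre-computed signatures:
 * while the bucket prefetched two iterations ago is being searched, the
 * bucket for a later lookup is already being prefetched, hiding the
 * memory access latency behind useful work. */
static void
lookup_burst(const struct bucket *tbl, const uint32_t *sig, uint32_t n,
             int *hit)
{
    uint32_t i;

    for (i = 0; i < n + 2; i++) {
        /* Stage 0: prefetch the bucket that lookup i will need. */
        if (i < n)
            rte_prefetch0(&tbl[sig[i] & (N_BUCKETS - 1)]);

        /* Stage 1: search the bucket prefetched two iterations earlier;
         * by now it should already be in L1/L2 cache. */
        if (i >= 2) {
            uint32_t j = i - 2;
            const struct bucket *bkt = &tbl[sig[j] & (N_BUCKETS - 1)];
            uint16_t s = (uint16_t)(sig[j] | 1);  /* bit 0 = valid key */
            int k, m = 0;

            for (k = 0; k < 4; k++)
                m |= (bkt->sig[k] == s);
            hit[j] = m;
        }
    }
}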
The bucket search logic is also implemented without using any branch instructions. This avoids
the important cost associated with flushing the CPU core execution pipeline on every instance
of branch misprediction.
Fig. 36.3, Table 36.8 and Table 36.9 detail the main data structures used to implement con-
figurable key size hash tables (either LRU or extendable bucket, either with pre-computed
signature or “do-sig”).
Fig. 36.3: Data Structures for Configurable Key Size Hash Tables
Table 36.8: Main Large Data Structures (Arrays) used for Configurable Key Size Hash Tables
1. Bucket array: Number of entries: n_buckets (configurable). Entry size (bytes): 32.
Description: Buckets of the hash table.
2. Bucket extensions array: Number of entries: n_buckets_ext (configurable). Entry size
(bytes): 32. Description: This array is only created for extendable bucket tables.
3. Key array: Number of entries: n_keys. Entry size (bytes): key_size (configurable).
Description: Keys added to the hash table.
4. Data array: Number of entries: n_keys. Entry size (bytes): entry_size (configurable).
Description: Key values (key data) associated with the hash table keys.
Table 36.9: Field Description for Bucket Array Entry (Configurable Key Size Hash Tables)
1. Next Ptr/LRU (8 bytes): For LRU tables, this field represents the LRU list for the current
bucket, stored as an array of 4 entries of 2 bytes each. Entry 0 stores the index (0 .. 3) of the
MRU key, while entry 3 stores the index of the LRU key. For extendable bucket tables, this field
represents the next pointer (i.e. the pointer to the next group of 4 keys linked to the current
bucket). The next pointer is not NULL if the bucket is currently extended or NULL otherwise. To
help the branchless implementation, bit 0 (least significant bit) of this field is set to 1 if the
next pointer is not NULL and to 0 otherwise.
2. Sig[0 .. 3] (4 x 2 bytes): If key X (X = 0 .. 3) is valid, then sig X bits 15 .. 1 store the
most significant 15 bits of the key X signature and sig X bit 0 is set to 1. If key X is not valid,
then sig X is set to zero.
3. Key Pos [0 .. 3] (4 x 4 bytes): If key X (X = 0 .. 3) is valid, then Key Pos X represents the
index into the key array where key X is stored, as well as the index into the data array where
the value associated with key X is stored. If key X is not valid, then the value of Key Pos X is
undefined.
Fig. 36.4 and Table 36.10 detail the bucket search pipeline stages (either LRU or extendable
bucket, either with pre-computed signature or “do-sig”). For each pipeline stage, the described
operations are applied to each of the two packets handled by that stage.
Fig. 36.4: Bucket Search Pipeline for Key Lookup Operation (Configurable Key Size Hash
Tables)
Table 36.10: Description of the Bucket Search Pipeline Stages (Configurable Key Size Hash
Tables)
0. Prefetch packet meta-data: Select the next two packets from the burst of input packets.
Prefetch the packet meta-data containing the key and key signature.
1. Prefetch table bucket: Read the key signature from the packet meta-data (for extendable
bucket hash tables) or read the key from the packet meta-data and compute the key signature
(for LRU tables). Identify the bucket ID using the key signature. Set bit 0 of the signature to 1
(to match only signatures of valid keys from the table). Prefetch the bucket.
2. Prefetch table key: Read the key signatures from the bucket. Compare the signature of the
input key against the 4 key signatures from the bucket. As a result, the following is obtained:
match = TRUE if there was at least one signature match and FALSE in the case of no signature
match; match_many = TRUE if there was more than one signature match (there can be up to 4
signature matches in the worst case scenario) and FALSE otherwise; match_pos = the index of the
first key that produced a signature match (only valid if match is TRUE). For extendable bucket
hash tables only, set match_many to TRUE if the next pointer is valid. Prefetch the bucket key
indicated by match_pos (even if match_pos does not point to a valid key).
3. Prefetch table data: Read the bucket key indicated by match_pos and compare it against the
input key; the lookup result is a hit only if the full keys match. On a hit, prefetch the data
(key value) associated with the matched key.
Additional notes:
1. The pipelined version of the bucket search algorithm is executed only if there are at least
7 packets in the burst of input packets. If there are fewer than 7 packets in the burst of input
packets, a non-optimized implementation of the bucket search algorithm is executed.
2. Once the pipelined version of the bucket search algorithm has been executed for all the
packets in the burst of input packets, the non-optimized implementation of the bucket
search algorithm is also executed for any packets that did not produce a lookup hit, but
have the match_many flag set. As a result of executing the non-optimized version, some
of these packets may produce a lookup hit or lookup miss. This does not impact the
performance of the key lookup operation, as the probability of matching more than one
signature in the same group of 4 keys or of having the bucket in extended state (for
extendable bucket hash tables only) is relatively small.
Key Signature Comparison Logic
The key signature comparison logic is described in Table 36.11.
Table 36.12: Collapsed Lookup Tables for Match, Match_Many and Match_Pos
Bit array Hexadecimal value
match 1111_1111_1111_1110 0xFFFELLU
match_many 1111_1110_1110_1000 0xFEE8LLU
match_pos 0001_0010_0001_0011__0001_0010_0001_0000 0x12131210LLU
The pseudo-code for match, match_many and match_pos is:
match = (0xFFFELLU >> mask) & 1;
match_many = (0xFEE8LLU >> mask) & 1;
match_pos = (0x12131210LLU >> (mask << 1)) & 3;
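A C rendering of this branchless logic might look like the sketch below, where mask is the 4-bit
vector of per-slot signature comparison results; the function name and exact types are
illustrative, not the DPDK source.

#include <stdint.h>

/* Branchless evaluation of match, match_many and match_pos using the
 * collapsed lookup tables above. 'sig' holds the 4 signatures stored in
 * the bucket, 'in_sig' the input key signature with its valid bit set. */
static inline void
sig_compare(const uint16_t sig[4], uint16_t in_sig,
            uint32_t *match, uint32_t *match_many, uint32_t *match_pos)
{
    /* Build a 4-bit mask: bit X is 1 if signature X matches. */
    uint32_t mask = ((uint32_t)(sig[0] == in_sig) << 0) |
                    ((uint32_t)(sig[1] == in_sig) << 1) |
                    ((uint32_t)(sig[2] == in_sig) << 2) |
                    ((uint32_t)(sig[3] == in_sig) << 3);

    *match      = (0xFFFELLU >> mask) & 1;            /* at least one match  */
    *match_many = (0xFEE8LLU >> mask) & 1;            /* more than one match */
    *match_pos  = (0x12131210LLU >> (mask << 1)) & 3; /* first matching slot */
}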
Fig. 36.5, Fig. 36.6, Table 36.13 and Table 36.14 detail the main data structures used to im-
plement 8-byte and 16-byte key hash tables (either LRU or extendable bucket, either with
pre-computed signature or “do-sig”).
Table 36.13: Main Large Data Structures (Arrays) used for 8-byte and 16-byte Key Size Hash Tables
1. Bucket array: Number of entries: n_buckets (configurable). Entry size (bytes): 64 + 4 x
entry_size for 8-byte key size, 128 + 4 x entry_size for 16-byte key size. Description: Buckets
of the hash table.
2. Bucket extensions array: Number of entries: n_buckets_ext (configurable). Entry size (bytes):
64 + 4 x entry_size for 8-byte key size, 128 + 4 x entry_size for 16-byte key size. Description:
This array is only created for extendable bucket tables.
Table 36.14: Field Description for Bucket Array Entry (8-byte and 16-byte Key Hash Tables)
1. Valid (8 bytes): Bit X (X = 0 .. 3) is set to 1 if key X is valid or to 0 otherwise. Bit 4 is
only used for extendable bucket tables to help with the implementation of the branchless logic;
in this case, bit 4 is set to 1 if the next pointer is valid (not NULL) or to 0 otherwise.
2. Next Ptr/LRU (8 bytes): For LRU tables, this field represents the LRU list for the current
bucket, stored as an array of 4 entries of 2 bytes each. Entry 0 stores the index (0 .. 3) of the
MRU key, while entry 3 stores the index of the LRU key. For extendable bucket tables, this field
represents the next pointer (i.e. the pointer to the next group of 4 keys linked to the current
bucket). The next pointer is not NULL if the bucket is currently extended or NULL otherwise.
3. Key [0 .. 3] (4 x key_size bytes): Full keys.
4. Data [0 .. 3] (4 x entry_size bytes): Full key values (key data) associated with keys 0 .. 3.
Fig. 36.7 and Table 36.15 detail the bucket search pipeline used to implement 8-byte and 16-byte key hash tables
(either LRU or extendable bucket, either with pre-computed signature or “do-sig”). For each
pipeline stage, the described operations are applied to each of the two packets handled by that
stage.
Fig. 36.7: Bucket Search Pipeline for Key Lookup Operation (Single Key Size Hash Tables)
Table 36.15: Description of the Bucket Search Pipeline Stages (8-byte and 16-byte Key Hash
Tables)
0. Prefetch packet meta-data: Select the next two packets from the burst of input packets.
Prefetch the packet meta-data containing the key and key signature.
Additional notes:
1. The pipelined version of the bucket search algorithm is executed only if there are at least
5 packets in the burst of input packets. If there are fewer than 5 packets in the burst of input
packets, a non-optimized implementation of the bucket search algorithm is executed.
2. For extendable bucket hash tables only, once the pipelined version of the bucket search
algorithm has been executed for all the packets in the burst of input packets, the non-
optimized implementation of the bucket search algorithm is also executed for any packets
that did not produce a lookup hit, but have the bucket in extended state. As a result of
executing the non-optimized version, some of these packets may produce a lookup hit or
lookup miss. This does not impact the performance of the key lookup operation, as the
probability of having the bucket in extended state is relatively small.
To avoid any dependencies on the order in which pipeline elements are created, the connec-
tivity of pipeline elements is defined after all the pipeline input ports, output ports and tables
have been created.
General connectivity rules:
1. Each input port is connected to a single table. No input port should be left unconnected;
2. The table connectivity to other tables or to output ports is regulated by the next hop
actions of each table entry and the default table entry. The table connectivity is fluid, as
the table entries and the default table entry can be updated during run-time.
• A table can have multiple entries (including the default entry) connected to the same
output port. A table can have different entries connected to different output ports.
Different tables can have entries (including default table entry) connected to the
same output port.
• A table can have multiple entries (including the default entry) connected to another
table, in which case all these entries have to point to the same table. This constraint
is enforced by the API and prevents tree-like topologies from being created (allow-
ing table chaining only), with the purpose of simplifying the implementation of the
pipeline run-time execution engine.
An action handler can be assigned to each input/output port to define actions to be executed
on each input packet that is received by the port. Defining the action handler for a specific
input/output port is optional (i.e. the action handler can be disabled).
For input ports, the action handler is executed after RX function. For output ports, the action
handler is executed before the TX function.
The action handler can decide to drop packets.
An action handler to be executed on each input packet can be assigned to each table. Defining
the action handler for a specific table is optional (i.e. the action handler can be disabled).
The action handler is executed after the table lookup operation is performed and the table
entry associated with each input packet is identified. The action handler can only handle the
user-defined actions, while the reserved actions (e.g. the next hop actions) are handled by the
Packet Framework. The action handler can decide to drop the input packet.
Reserved Actions
The reserved actions are handled directly by the Packet Framework without the user being able
to change their meaning through the table action handler configuration. A special category of
the reserved actions is represented by the next hop actions, which regulate the packet flow
between input ports, tables and output ports through the pipeline. Table 36.16 lists the next
hop actions.
User Actions
For each table, the meaning of user actions is defined through the configuration of the table
action handler. Different tables can be configured with different action handlers, therefore the
meaning of the user actions and their associated meta-data is private to each table. Within
the same table, all the table entries (including the table default entry) share the same definition
for the user actions and their associated meta-data, with each table entry having its own set
of enabled user actions and its own copy of the action meta-data. Table 36.17 contains a
non-exhaustive list of user action examples.
A complex application is typically split across multiple cores, with cores communicating through
SW queues. There is usually a performance limit on the number of table lookups and actions
that can be fitted on the same CPU core due to HW constraints like: available CPU cycles,
cache memory size, cache transfer BW, memory transfer BW, etc.
As the application is split across multiple CPU cores, the Packet Framework facilitates the
creation of several pipelines, the assignment of each such pipeline to a different CPU core
and the interconnection of all CPU core-level pipelines into a single application-level complex
pipeline. For example, if CPU core A is assigned to run pipeline P1 and CPU core B pipeline
P2, then the interconnection of P1 with P2 could be achieved by having the same set of SW
queues act like output ports for P1 and input ports for P2.
This approach enables the application development using the pipeline, run-to-completion (clus-
tered) or hybrid (mixed) models.
It is allowed for the same core to run several pipelines, but it is not allowed for several cores to
run the same pipeline.
The threads performing table lookup are actually table writers rather than just readers. Even if
the specific table lookup algorithm is thread-safe for multiple readers (e. g. read-only access
of the search algorithm data structures is enough to conduct the lookup operation), once the
table entry for the current packet is identified, the thread is typically expected to update the
action meta-data stored in the table entry (e.g. increment the counter tracking the number of
packets that hit this table entry), and thus modify the table entry. During the time this thread
is accessing this table entry (either writing or reading; duration is application specific), for data
consistency reasons, no other threads (threads performing table lookup or entry add/delete
operations) are allowed to modify this table entry.
The presence of accelerators is usually detected during the initialization phase by inspecting
the HW devices that are part of the system (e.g. by PCI bus enumeration). Typical devices
with acceleration capabilities are:
• Inline accelerators: NICs, switches, FPGAs, etc;
• Look-aside accelerators: chipsets, FPGAs, etc.
Usually, to support a specific functional block, specific implementation of Packet Framework
tables and/or ports and/or actions has to be provided for each accelerator, with all the imple-
mentations sharing the same API: pure SW implementation (no acceleration), implementation
using accelerator A, implementation using accelerator B, etc. The selection between these
implementations could be done at build time or at run-time (recommended), based on which
accelerators are present in the system, with no application changes required.
THIRTYSEVEN
VHOST LIBRARY
The vhost library implements a user space virtio net server allowing the user to manipulate the
virtio ring directly. In other words, it allows the user to fetch/put packets from/to the VM virtio
net device. To achieve this, a vhost library should be able to:
• Access the guest memory:
For QEMU, this is done by using the -object
memory-backend-file,share=on,... option, which means QEMU will create
a file to serve as the guest RAM. The share=on option allows another process to
map that file, which means it can access the guest RAM.
• Know all the necessary information about the vring:
Information such as where the available ring is stored. Vhost defines some messages
(passed through a Unix domain socket file) to tell the backend all the information it needs
to know how to manipulate the vring.
– RTE_VHOST_USER_DEQUEUE_ZERO_COPY
Dequeue zero copy will be enabled when this flag is set. It is disabled by default.
There are some facts (and limitations) you should be aware of when setting
this flag:
* zero copy is not good for small packets (typically for packet size below 512).
* zero copy is really good for the VM2VM case. For iperf between two VMs, the boost
could be above 70% (when TSO is enabled).
* for VM2NIC case, the nb_tx_desc has to be small enough: <= 64 if virtio
indirect feature is not enabled and <= 128 if it is enabled.
This is because when dequeue zero copy is enabled, guest Tx used vring will
be updated only when corresponding mbuf is freed. Thus, the nb_tx_desc has
to be small enough so that the PMD driver will run out of available Tx descriptors
and free mbufs timely. Otherwise, guest Tx vring would be starved.
* Guest memory should be backed by huge pages to achieve better perfor-
mance. Using a 1 GB page size is best.
When dequeue zero copy is enabled, the guest phys address and host phys
address mapping has to be established. Using non-huge pages means far more
page segments. To make it simple, DPDK vhost does a linear search of those
segments, thus the fewer the segments, the quicker we will get the mapping.
NOTE: this may be sped up by using tree searching in the future.
– RTE_VHOST_USER_IOMMU_SUPPORT
IOMMU support will be enabled when this flag is set. It is disabled by default.
Enabling this flag makes it possible to use the guest vIOMMU to protect vhost from
accessing memory that the virtio device isn't allowed to, when the feature is negotiated
and an IOMMU device is declared.
However, this feature also enables vhost-user's reply-ack protocol feature, whose
implementation is buggy in QEMU v2.7.0-v2.9.0 when multiqueue is used. Enabling this
flag with these QEMU versions results in QEMU being blocked when multiple queue
pairs are declared.
• rte_vhost_driver_set_features(path,features)
This function sets the feature bits the vhost-user driver supports. The vhost-user driver
could be vhost-user net, yet it could be something else, say, vhost-user SCSI.
• rte_vhost_driver_callback_register(path,vhost_device_ops)
This function registers a set of callbacks, to let DPDK applications take the appropriate
action when some events happen. The following events are currently supported:
– new_device(int vid)
This callback is invoked when a virtio device becomes ready. vid is the vhost device
ID.
– destroy_device(int vid)
This callback is invoked when a virtio device is paused or shut down.
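Putting these calls together, a minimal sketch of registering a vhost-user socket with the
callbacks described above could look as follows; the socket path and the callback bodies are
placeholders chosen for illustration.

#include <stdio.h>
#include <rte_vhost.h>

static int
my_new_device(int vid)
{
    printf("vhost device %d is now ready\n", vid);
    return 0;
}

static void
my_destroy_device(int vid)
{
    printf("vhost device %d was paused or shut down\n", vid);
}

static const struct vhost_device_ops my_ops = {
    .new_device     = my_new_device,
    .destroy_device = my_destroy_device,
};

static int
setup_vhost(const char *path)   /* e.g. "/tmp/vhost-user.sock" */
{
    /* Register the vhost-user socket (server mode, no extra flags). */
    if (rte_vhost_driver_register(path, 0) != 0)
        return -1;

    /* Register the callbacks, then start the socket session. */
    if (rte_vhost_driver_callback_register(path, &my_ops) != 0)
        return -1;

    return rte_vhost_driver_start(path);
}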
Vhost-user uses Unix domain sockets for passing messages. This means the DPDK vhost-
user implementation has two options:
• DPDK vhost-user acts as the server.
DPDK will create a Unix domain socket server file and listen for connections from the
frontend.
Note, this is the default mode, and the only mode before DPDK v16.07.
• DPDK vhost-user acts as the client.
Unlike the server mode, this mode doesn't create the socket file; it just tries to connect to
the server (which is responsible for creating the file instead).
When the DPDK vhost-user application restarts, DPDK vhost-user will try to connect to
the server again. This is how the “reconnect” feature works.
Note:
– The “reconnect” feature requires QEMU v2.7 (or above).
– The vhost supported features must be exactly the same before and after the restart.
For example, if TSO is disabled and then enabled, nothing will work and undefined
issues might happen.
No matter which mode is used, once a connection is established, DPDK vhost-user will start
receiving and processing vhost messages from QEMU.
For messages with a file descriptor, the file descriptor can be used directly in the vhost process
as it is already installed by the Unix domain socket.
The supported vhost messages are:
• VHOST_SET_MEM_TABLE
• VHOST_SET_VRING_KICK
• VHOST_SET_VRING_CALL
• VHOST_SET_LOG_FD
• VHOST_SET_VRING_ERR
For the VHOST_SET_MEM_TABLE message, QEMU will send information for each memory region
and its file descriptor in the ancillary data of the message. The file descriptor is used to map
that region.
VHOST_SET_VRING_KICK is used as the signal to put the vhost device into the data plane,
and VHOST_GET_VRING_BASE is used as the signal to remove the vhost device from the data
plane.
When the socket connection is closed, vhost will destroy the device.
For more vhost details and how to support vhost in vSwitch, please refer to the vhost example
in the DPDK Sample Applications Guide.
THIRTYEIGHT
METRICS LIBRARY
The Metrics library implements a mechanism by which producers can publish numeric infor-
mation for later querying by consumers. In practice producers will typically be other libraries or
primary processes, whereas consumers will typically be applications.
Metrics themselves are statistics that are not generated by PMDs. Metric information is pop-
ulated using a push model, where producers update the values contained within the metric
library by calling an update function on the relevant metrics. Consumers receive metric infor-
mation by querying the central metric data, which is held in shared memory.
For each metric, a separate value is maintained for each port id, and when publishing metric
values the producers need to specify which port is being updated. In addition there is a special
id RTE_METRICS_GLOBAL that is intended for global statistics that are not associated with
any individual device. Since the metrics library is self-contained, the only restriction on port
numbers is that they are less than RTE_MAX_ETHPORTS - there is no requirement for the ports
to actually exist.
Before the library can be used, it has to be initialized by calling rte_metrics_init() which
sets up the metric store in shared memory. This is where producers will publish metric infor-
mation to, and where consumers will query it from.
rte_metrics_init(rte_socket_id());
This function must be called from a primary process, but otherwise producers and consumers
can be in either primary or secondary processes.
Metrics must first be registered, which is the way producers declare the names of the metrics
they will be publishing. Registration can either be done individually, or a set of metrics can be
registered as a group. Individual registration is done using rte_metrics_reg_name():
id_1 = rte_metrics_reg_name("mean_bits_in");
id_2 = rte_metrics_reg_name("mean_bits_out");
id_3 = rte_metrics_reg_name("peak_bits_in");
id_4 = rte_metrics_reg_name("peak_bits_out");
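Group registration is done with rte_metrics_reg_names(), which registers a contiguous set of
metrics and returns the key of the first one; the snippet below is a sketch mirroring the four
names registered above.

const char * const names[] = {
    "mean_bits_in", "mean_bits_out",
    "peak_bits_in", "peak_bits_out",
};

id_set = rte_metrics_reg_names(names, 4);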
If the return value is negative, it means registration failed. Otherwise the return value is the key
for the metric, which is used when updating values. A table mapping together these key values
and the metrics’ names can be obtained using rte_metrics_get_names().
Once registered, producers can update the metric for a given port using the
rte_metrics_update_value() function. This uses the metric key that is returned when
registering the metric, and can also be looked up using rte_metrics_get_names().
rte_metrics_update_value(port_id, id_1, values[0]);
rte_metrics_update_value(port_id, id_2, values[1]);
rte_metrics_update_value(port_id, id_3, values[2]);
rte_metrics_update_value(port_id, id_4, values[3]);
If metrics were registered as a single set, they can either be updated indi-
vidually using rte_metrics_update_value(), or updated together using the
rte_metrics_update_values() function:
rte_metrics_update_value(port_id, id_set, values[0]);
rte_metrics_update_value(port_id, id_set + 1, values[1]);
rte_metrics_update_value(port_id, id_set + 2, values[2]);
rte_metrics_update_value(port_id, id_set + 3, values[3]);
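The equivalent batch update, assuming the same four-metric set as above, is a single call (shown
here as a sketch):

rte_metrics_update_values(port_id, id_set, values, 4);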
Consumers can obtain metric values by querying the metrics library using the
rte_metrics_get_values() function, which fills in an array of struct rte_metric_value.
Each entry within this array contains a metric value and its associated key. A key-name
mapping can be obtained using the rte_metrics_get_names() function, which fills in an
array of struct rte_metric_name that is indexed by the key. The following will print out
all metrics for a given port:
void print_metrics(int port_id) {
    struct rte_metric_value *metrics;
    struct rte_metric_name *names;
    int len, ret, i;

    len = rte_metrics_get_names(NULL, 0);
    if (len < 0) {
        printf("Cannot get metrics count\n");
        return;
    }
    if (len == 0) {
        printf("No metrics to display (none have been registered)\n");
        return;
    }
    metrics = malloc(sizeof(struct rte_metric_value) * len);
    names = malloc(sizeof(struct rte_metric_name) * len);
    if (metrics == NULL || names == NULL) {
        printf("Cannot allocate memory\n");
        free(metrics);
        free(names);
        return;
    }
    ret = rte_metrics_get_values(port_id, metrics, len);
    if (ret < 0 || ret > len) {
        printf("Cannot get metrics values\n");
        free(metrics);
        free(names);
        return;
    }
    printf("Metrics for port %i:\n", port_id);
    for (i = 0; i < len; i++)
        printf("  %s: %"PRIu64"\n",
               names[metrics[i].key].name, metrics[i].value);
    free(metrics);
    free(names);
}
The bit-rate library calculates the exponentially-weighted moving average and peak bit-rates
for each active port (i.e. network device). These statistics are reported via the metrics library
using the following names:
• mean_bits_in: Average inbound bit-rate
• mean_bits_out: Average outbound bit-rate
• ewma_bits_in: Average inbound bit-rate (EWMA smoothed)
• ewma_bits_out: Average outbound bit-rate (EWMA smoothed)
• peak_bits_in: Peak inbound bit-rate
• peak_bits_out: Peak outbound bit-rate
Once initialised and clocked at the appropriate frequency, these statistics can be obtained by
querying the metrics library.
38.5.1 Initialization
bitrate_data = rte_stats_bitrate_create();
if (bitrate_data == NULL)
    rte_exit(EXIT_FAILURE, "Could not allocate bitrate data.\n");
rte_stats_bitrate_reg(bitrate_data);
Since the library works by periodic sampling but does not use an internal thread, the application
has to periodically call rte_stats_bitrate_calc(). The frequency at which this function is
called should be the intended sampling rate required for the calculated statistics. For instance
if per-second statistics are desired, this function should be called once a second.
tics_datum = rte_rdtsc();
tics_per_1sec = rte_get_timer_hz();
while( 1 ) {
/* ... */
tics_current = rte_rdtsc();
if (tics_current - tics_datum >= tics_per_1sec) {
/* Periodic bitrate calculation */
for (idx_port = 0; idx_port < cnt_ports; idx_port++)
rte_stats_bitrate_calc(bitrate_data, idx_port);
tics_datum = tics_current;
}
/* ... */
}
The latency statistics library calculates the latency of packet processing by a DPDK application,
reporting the minimum, average, and maximum nano-seconds that packet processing takes,
as well as the jitter in processing delay. These statistics are then reported via the metrics
library using the following names:
• min_latency_ns: Minimum processing latency (nano-seconds)
• avg_latency_ns: Average processing latency (nano-seconds)
• max_latency_ns: Maximum processing latency (nano-seconds)
• jitter_ns: Variance in processing latency (nano-seconds)
Once initialised and clocked at the appropriate frequency, these statistics can be obtained by
querying the metrics library.
38.6.1 Initialization
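As a minimal sketch, initialization and periodic updating could look as follows, assuming the
metrics library has already been initialized with rte_metrics_init(); the sampling interval of 1
and the NULL user callback are illustrative choices.

/* Initialize the latency statistics library. */
if (rte_latencystats_init(1, NULL) != 0)
    printf("Cannot init latency stats\n");

/* Then, periodically from the application's main loop, refresh the
 * reported statistics at the desired sampling frequency. */
rte_latencystats_update();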
THIRTYNINE
PORT HOTPLUG FRAMEWORK
The Port Hotplug Framework provides DPDK applications with the ability to attach and detach
ports at runtime. Because the framework depends on PMD implementation, the ports that
PMDs cannot handle are out of the scope of this framework. Furthermore, after detaching a port
from a DPDK application, the framework doesn't provide a way to remove the device from
the system. For ports backed by a physical NIC, the kernel needs to support the PCI Hotplug
feature.
39.1 Overview
• Attaching a port
“rte_eth_dev_attach()” API attaches a port to a DPDK application, and returns the attached
port number. Before calling the API, the device should be recognized by a userspace
driver I/O framework. The API receives a PCI address like “0000:01:00.0” or a virtual
device name like “net_pcap0,iface=eth0”. In the case of a virtual device name, the format
is the same as the general “--vdev” option of DPDK.
• Detaching a port
“rte_eth_dev_detach()” API detaches a port from a DPDK application, and returns the PCI
address of the detached device or the virtual device name of the device.
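A hedged sketch of using these two calls at runtime is shown below; the PCI address is only an
example and error handling is reduced to returning -1.

#include <rte_ethdev.h>

static int
hotplug_example(void)
{
    uint16_t port_id;
    char devname[RTE_ETH_NAME_MAX_LEN];

    /* Attach the device and obtain the new port number. */
    if (rte_eth_dev_attach("0000:01:00.0", &port_id) != 0)
        return -1;

    /* ... configure, start and use the port ... */

    /* Detach the port; devname receives the detached device name. */
    return rte_eth_dev_detach(port_id, devname);
}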
39.3 Reference
39.4 Limitations
FORTY
SOURCE ORGANIZATION
Note: In the following descriptions, RTE_SDK is the environment variable that points to the
base directory into which the tarball was extracted. See Useful Variables Provided by the Build
System for descriptions of other variables.
Makefiles that are provided by the DPDK libraries and applications are located in
$(RTE_SDK)/mk.
Config templates are located in $(RTE_SDK)/config. The templates describe the options
that are enabled for each target. The config file also contains items that can be enabled and
disabled for many of the DPDK libraries, including debug options. The user should look at
the config file and become familiar with these options. The config file is also used to create a
header file, which will be located in the new build directory.
40.2 Libraries
40.3 Drivers
Drivers are special libraries which provide poll-mode driver implementations for devices: either
hardware devices or pseudo/virtual devices. They are contained in the drivers subdirectory,
classified by type, and each compiles to a library with the format librte_pmd_X.a where X
is the driver name.
The drivers directory has a net subdirectory which contains:
drivers/net
+-- af_packet # Poll mode driver based on Linux af_packet
+-- bonding # Bonding poll mode driver
+-- cxgbe # Chelsio Terminator 10GbE/40GbE poll mode driver
+-- e1000 # 1GbE poll mode drivers (igb and em)
+-- enic # Cisco VIC Ethernet NIC Poll-mode Driver
+-- fm10k # Host interface PMD driver for FM10000 Series
+-- i40e # 40GbE poll mode driver
+-- ixgbe # 10GbE poll mode driver
+-- mlx4 # Mellanox ConnectX-3 poll mode driver
+-- null # NULL poll mode driver for testing
+-- pcap # PCAP poll mode driver
+-- ring # Ring poll mode driver
+-- szedata2 # SZEDATA2 poll mode driver
+-- virtio # Virtio poll mode driver
+-- vmxnet3 # VMXNET3 poll mode driver
Note: Several of the driver/net directories contain a base sub-directory. The base direc-
tory generally contains code that shouldn't be modified directly by the user. Any enhancements
should be done via the X_osdep.c and/or X_osdep.h files in that directory. Refer to the local
README in the base directories for driver specific instructions.
40.4 Applications
Applications are source files that contain a main() function. They are located in the
$(RTE_SDK)/app and $(RTE_SDK)/examples directories.
The app directory contains sample applications that are used to test DPDK (such as autotests)
or the Poll Mode Drivers (test-pmd):
app
+-- chkincs # Test program to check include dependencies
+-- cmdline_test # Test the commandline library
+-- test # Autotests to validate DPDK features
+-- test-acl # Test the ACL library
+-- test-pipeline # Test the IP Pipeline framework
+-- test-pmd # Test and benchmark poll mode drivers
The examples directory contains sample applications that show how libraries can be used:
examples
+-- cmdline # Example of using the cmdline library
Note: The actual examples directory may contain additional sample applications beyond those
shown above. Check the latest DPDK source files for details.
FORTYONE
DEVELOPMENT KIT BUILD SYSTEM
The DPDK requires a build system for compilation activities and so on. This section describes
the constraints and the mechanisms used in the DPDK framework.
There are two use-cases for the framework:
• Compilation of the DPDK libraries and sample applications; the framework generates
specific binary libraries, include files and sample applications
• Compilation of an external application or library, using an installed binary DPDK
After installation, a build directory structure is created. Each build directory contains include
files, libraries, and applications.
A build directory is specific to a configuration that includes architecture + execution environ-
ment + toolchain. It is possible to have several build directories sharing the same sources with
different configurations.
For instance, to create a new build directory called my_sdk_build_dir using the default config-
uration template config/defconfig_x86_64-linuxapp, we use:
cd ${RTE_SDK}
make config T=x86_64-native-linuxapp-gcc O=my_sdk_build_dir
This creates a new my_sdk_build_dir directory. After that, we can compile by doing:
cd my_sdk_build_dir
make
Refer to Development Kit Root Makefile Help for details about make commands that can be
used from the root of DPDK.
Since DPDK is in essence a development kit, the first objective of end users will be to create
an application using this SDK. To compile an application, the user must set the RTE_SDK and
RTE_TARGET environment variables.
export RTE_SDK=/opt/DPDK
export RTE_TARGET=x86_64-native-linuxapp-gcc
cd /path/to/my_app
For a new application, the user must create their own Makefile that includes some .mk files,
such as ${RTE_SDK}/mk/rte.vars.mk, and ${RTE_SDK}/mk/rte.app.mk. This is described in
Building Your Own Application.
Depending on the chosen target (architecture, machine, executive environment, toolchain) de-
fined in the Makefile or as an environment variable, the applications and libraries will com-
pile using the appropriate .h files and will link with the appropriate .a files. These files are
located in ${RTE_SDK}/arch-machine-execenv-toolchain, which is referenced internally by
${RTE_SDK_BIN}.
To compile their application, the user just has to call make. The compilation result will be
located in /path/to/my_app/build directory.
# binary name
APP = helloworld

# all sources are stored in SRCS-y
SRCS-y := main.c

CFLAGS += -O3
CFLAGS += $(WERROR_FLAGS)

include $(RTE_SDK)/mk/rte.extapp.mk
Depending on the .mk file which is included at the end of the user Makefile, the Makefile will
have a different role. Note that it is not possible to build a library and an application in the
same Makefile. For that, the user must create two separate Makefiles, possibly in two different
directories.
In any case, the rte.vars.mk file must be included in the user Makefile as soon as possible.
Application
Library
Generate a .a library.
• rte.lib.mk: Library in the development kit framework
• rte.extlib.mk: external library
• rte.hostlib.mk: host library in the development kit framework
Install
• rte.install.mk: Does not build anything, it is only used to create links or copy files to the
installation directory. This is useful for including files in the development kit framework.
Kernel Module
Objects
• rte.obj.mk: Object aggregation (merge several .o in one) in the development kit frame-
work.
• rte.extobj.mk: Object aggregation (merge several .o in one) outside the development kit
framework.
Misc
app/dpdk-pmdinfogen
dpdk-pmdinfogen scans an object (.o) file for various well known symbol names. These well
known symbol names are defined by various macros and used to export important information
about hardware support and usage for pmd files. For instance the macro:
RTE_PMD_REGISTER_PCI(name, drv)
is one of the macros that dpdk-pmdinfogen scans for. Using this information, other relevant bits
of data can be exported from the object file and used to produce a hardware support description,
which dpdk-pmdinfogen then encodes into a JSON formatted string in the following format:
static char <name_pmd_string>="PMD_INFO_STRING=\"{'name' : '<name>', ...}\"";
These strings can then be searched for by external tools to determine the hardware support of
a given library or application.
• RTE_SDK: The absolute path to the DPDK sources. When compiling the development
kit, this variable is automatically set by the framework. It has to be defined by the user as
an environment variable if compiling an external application.
• RTE_SRCDIR: The path to the root of the sources. When compiling the development kit,
RTE_SRCDIR = RTE_SDK. When compiling an external application, the variable points
to the root of external application sources.
• RTE_OUTPUT: The path to which output files are written. Typically, it is
$(RTE_SRCDIR)/build, but it can be overridden by the O= option in the make command
line.
• RTE_TARGET: A string identifying the target for which we are building. The format is
arch-machine-execenv-toolchain. When compiling the SDK, the target is deduced by the
build system from the configuration (.config). When building an external application, it
must be specified by the user in the Makefile or as an environment variable.
• RTE_SDK_BIN: References $(RTE_SDK)/$(RTE_TARGET).
• RTE_ARCH: Defines the architecture (i686, x86_64). It is the same value as CON-
FIG_RTE_ARCH but without the double-quotes around the string.
• RTE_MACHINE: Defines the machine. It is the same value as CONFIG_RTE_MACHINE
but without the double-quotes around the string.
• RTE_TOOLCHAIN: Defines the toolchain (gcc , icc). It is the same value as CON-
FIG_RTE_TOOLCHAIN but without the double-quotes around the string.
• RTE_EXEC_ENV: Defines the executive environment (linuxapp). It is the same value as
CONFIG_RTE_EXEC_ENV but without the double-quotes around the string.
• RTE_KERNELDIR: This variable contains the absolute path to the kernel sources that
will be used to compile the kernel modules. The kernel headers must be the same as the
ones that will be used on the target machine (the machine that will run the application).
By default, the variable is set to /lib/modules/$(shell uname -r)/build, which is correct
when the target machine is also the build machine.
• RTE_DEVEL_BUILD: Stricter options (stop on warning). It defaults to y in a git tree.
• VPATH: The path list that the build system will search for sources. By default,
RTE_SRCDIR will be included in VPATH.
• CFLAGS: Flags to use for C compilation. The user should use += to append data in this
variable.
• LDFLAGS: Flags to use for linking. The user should use += to append data in this vari-
able.
• ASFLAGS: Flags to use for assembly. The user should use += to append data in this
variable.
• CPPFLAGS: Flags to use to give flags to C preprocessor (only useful when assembling
.S files). The user should use += to append data in this variable.
• LDLIBS: In an application, the list of libraries to link with (for example, -L /path/to/libfoo
-lfoo ). The user should use += to append data in this variable.
• SRC-y: A list of source files (.c, .S, or .o if the source is a binary) in case of application,
library or object Makefiles. The sources must be available from VPATH.
• INSTALL-y-$(INSTPATH): A list of files to be installed in $(INSTPATH). The files must be
available from VPATH and will be copied in $(RTE_OUTPUT)/$(INSTPATH). Can be used
in almost any RTE Makefile.
• SYMLINK-y-$(INSTPATH): A list of files to be installed in $(INSTPATH). The files must be
available from VPATH and will be linked (symbolically) in $(RTE_OUTPUT)/$(INSTPATH).
This variable can be used in almost any DPDK Makefile.
• PREBUILD: A list of prerequisite actions to be taken before building. The user should
use += to append data in this variable.
• POSTBUILD: A list of actions to be taken after the main build. The user should use += to
append data in this variable.
• PREINSTALL: A list of prerequisite actions to be taken before installing. The user should
use += to append data in this variable.
• POSTINSTALL: A list of actions to be taken after installing. The user should use += to
append data in this variable.
• PRECLEAN: A list of prerequisite actions to be taken before cleaning. The user should
use += to append data in this variable.
• POSTCLEAN: A list of actions to be taken after cleaning. The user should use += to
append data in this variable.
• DEPDIRS-$(DIR): Only used in the development kit framework to specify if the build of
the current directory depends on build of another one. This is needed to support parallel
builds correctly.
41.3.6 Variables that can be Set/Overridden by the User on the Command Line
Only
Some variables can be used to configure the build system behavior. They are documented in
Development Kit Root Makefile Help and External Application/Library Makefile Help
• WERROR_CFLAGS: By default, this is set to a specific value that depends on the com-
piler. Users are encouraged to use this variable as follows:
CFLAGS += $(WERROR_CFLAGS)
This avoids the use of different cases depending on the compiler (icc or gcc). Also, this variable
can be overridden from the command line, which allows bypassing of the flags for testing
purposes.
• EXTRA_CFLAGS: The content of this variable is appended after CFLAGS when compil-
ing.
• EXTRA_LDFLAGS: The content of this variable is appended after LDFLAGS when link-
ing.
• EXTRA_LDLIBS: The content of this variable is appended after LDLIBS when linking.
• EXTRA_ASFLAGS: The content of this variable is appended after ASFLAGS when as-
sembling.
• EXTRA_CPPFLAGS: The content of this variable is appended after CPPFLAGS when
using a C preprocessor on assembly files.
FORTYTWO
DEVELOPMENT KIT ROOT MAKEFILE HELP
The DPDK provides a root level Makefile with targets for configuration, building, cleaning, test-
ing, installation and others. These targets are explained in the following sections.
The configuration target requires the name of the target, which is specified using T=mytarget,
and it is mandatory. The list of available targets is in $(RTE_SDK)/config (remove the
defconfig_ prefix).
Configuration targets also support the specification of the name of the output directory, using
O=mybuilddir. This is an optional parameter, the default output directory is build.
• Config
This will create a build directory and generate a configuration from a template. A Make-
file is also created in the new build directory.
Example:
make config O=mybuild T=x86_64-native-linuxapp-gcc
Build targets support the optional specification of the name of the output directory, using
O=mybuilddir. The default output directory is build.
• all, build or just make
Build the DPDK in the output directory previously created by a make config.
Example:
make O=mybuild
• clean
Clean all objects created using make build.
Example:
make clean O=mybuild
• %_sub
Build a subdirectory only, without managing dependencies on other directories.
Example:
make lib/librte_eal_sub O=mybuild
• %_clean
Clean a subdirectory only.
Example:
make lib/librte_eal_clean O=mybuild
• Install
The list of available targets is in $(RTE_SDK)/config (remove the defconfig_ prefix).
The GNU standards variables may be used: http://gnu.org/prep/standards/html_node/
Directory-Variables.html and http://gnu.org/prep/standards/html_node/DESTDIR.html
Example:
make install DESTDIR=myinstall prefix=/usr
• test
Launch automatic tests for a build directory specified using O=mybuilddir. It is optional,
the default output directory is build.
Example:
make test O=mybuild
• doc
Generate the documentation (API and guides).
• doc-api-html
Generate the Doxygen API documentation in html.
• doc-guides-html
Generate the guides documentation in html.
• doc-guides-pdf
Generate the guides documentation in pdf.
• help
Show a quick help.
All targets described above are called from the SDK root $(RTE_SDK). It is possible to run the
same Makefile targets inside the build directory. For instance, the following command:
cd $(RTE_SDK)
make config O=mybuild T=x86_64-native-linuxapp-gcc
make O=mybuild
is equivalent to:
cd $(RTE_SDK)
make config O=mybuild T=x86_64-native-linuxapp-gcc
cd mybuild
make
To compile the DPDK and sample applications with debugging information included and the
optimization level set to 0, the EXTRA_CFLAGS environment variable should be set before
compiling as follows:
export EXTRA_CFLAGS='-O0 -g'
FORTYTHREE
EXTENDING THE DPDK
This chapter describes how a developer can extend the DPDK to provide a new library or to
support a new target.
Declaration is in foo.h:
extern void foo(void);
4. Update lib/Makefile:
vi ${RTE_SDK}/lib/Makefile
# add:
# DIRS-$(CONFIG_RTE_LIBFOO) += libfoo
5. Create a new Makefile for this library, for example, derived from mempool Makefile:
cp ${RTE_SDK}/lib/librte_mempool/Makefile ${RTE_SDK}/lib/libfoo/
vi ${RTE_SDK}/lib/libfoo/Makefile
# replace:
# librte_mempool -> libfoo
# rte_mempool -> foo
6. Update mk/rte.app.mk, and add -lfoo in the LDLIBS variable when the option is enabled.
This will automatically add this flag when linking a DPDK application.
7. Build the DPDK with the new library (we only show a specific target here):
cd ${RTE_SDK}
make config T=x86_64-native-linuxapp-gcc
make
The test application is used to validate all functionality of the DPDK. Once you have added a
library, a new test case should be added in the test application.
• A new test_foo.c file should be added; it includes foo.h and calls the foo() function from
test_foo(). When the test passes, the test_foo() function should return 0 (a minimal
sketch is shown at the end of this list).
• Makefile, test.h and commands.c must be updated also, to handle the new test case.
• Test report generation: autotest.py is a script that is used to generate the test re-
port that is available in the ${RTE_SDK}/doc/rst/test_report/autotests directory. This
script must be updated also. If libfoo is in a new test family, the links in
${RTE_SDK}/doc/rst/test_report/test_report.rst must be updated.
• Build the DPDK with the updated test application (we only show a specific target here):
cd ${RTE_SDK}
make config T=x86_64-native-linuxapp-gcc
make
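A minimal sketch of the new test case file referenced above, assuming the test application's
REGISTER_TEST_COMMAND macro is used for registration (the exact registration mechanism should be
checked against commands.c in your DPDK tree):

#include "test.h"

#include <foo.h>

/* The test passes when test_foo() returns 0. */
static int
test_foo(void)
{
    foo();
    return 0;
}

REGISTER_TEST_COMMAND(foo_autotest, test_foo);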
FORTYFOUR
BUILDING YOUR OWN APPLICATION
When compiling a sample application (for example, hello world), the following variables must
be exported: RTE_SDK and RTE_TARGET.
~/DPDK$ cd examples/helloworld/
~/DPDK/examples/helloworld$ export RTE_SDK=/home/user/DPDK
~/DPDK/examples/helloworld$ export RTE_TARGET=x86_64-native-linuxapp-gcc
~/DPDK/examples/helloworld$ make
CC main.o
LD helloworld
INSTALL-APP helloworld
INSTALL-MAP helloworld.map
The sample application (Hello World) can be duplicated in a new directory as a starting point
for your development:
~$ cp -r DPDK/examples/helloworld my_rte_app
~$ cd my_rte_app/
~/my_rte_app$ export RTE_SDK=/home/user/DPDK
~/my_rte_app$ export RTE_TARGET=x86_64-native-linuxapp-gcc
~/my_rte_app$ make
CC main.o
LD helloworld
INSTALL-APP helloworld
INSTALL-MAP helloworld.map
The default makefile provided with the Hello World sample application is a good starting point.
It includes:
Some variables can be defined to customize Makefile actions. The most common are listed
below. Refer to the Makefile Description section in the Development Kit Build System
chapter for details.
• VPATH: The path list where the build system will search for sources. By default,
RTE_SRCDIR will be included in VPATH.
• CFLAGS_my_file.o: The specific flags to add for C compilation of my_file.c.
• CFLAGS: The flags to use for C compilation.
• LDFLAGS: The flags to use for linking.
• CPPFLAGS: The flags to use to provide flags to the C preprocessor (only useful when
assembling .S files)
• LDLIBS: A list of libraries to link with (for example, -L /path/to/libfoo -lfoo)
FORTYFIVE
EXTERNAL APPLICATION/LIBRARY MAKEFILE HELP
External applications or libraries should include specific Makefiles from RTE_SDK, located in
mk directory. These Makefiles are:
• ${RTE_SDK}/mk/rte.extapp.mk: Build an application
• ${RTE_SDK}/mk/rte.extlib.mk: Build a static library
• ${RTE_SDK}/mk/rte.extobj.mk: Build objects (.o)
45.1 Prerequisites
Build targets support the specification of the name of the output directory, using O=mybuilddir.
This is optional; the default output directory is build.
• all, “nothing” (meaning just make)
Build the application or the library in the specified output directory.
Example:
make O=mybuild
• clean
Clean all objects created using make build.
Example:
make clean O=mybuild
• help
It is possible to run the Makefile from another directory, by specifying the output and the source
dir. For example:
export RTE_SDK=/path/to/DPDK
export RTE_TARGET=x86_64-native-linuxapp-icc
make -f /path/to/my_app/Makefile S=/path/to/my_app O=/path/to/build_dir
FORTYSIX
PERFORMANCE OPTIMIZATION GUIDELINES
46.1 Introduction
The following sections describe optimizations used in DPDK and optimizations that should be
considered for new applications.
They also highlight the performance-impacting coding techniques that should, and should not,
be used when developing an application using the DPDK.
Finally, they give an introduction to application profiling using a Performance Analyzer from
Intel to optimize the software.
FORTYSEVEN
WRITING EFFICIENT CODE
This chapter provides some tips for developing efficient code using the DPDK. For additional
and more general information, please refer to the Intel® 64 and IA-32 Architectures Optimiza-
tion Reference Manual which is a valuable reference to writing efficient code.
47.1 Memory
This section describes some key memory considerations when developing applications in the
DPDK environment.
Many libc functions are available in the DPDK, via the Linux* application environment. This
can ease the porting of applications and the development of the configuration plane. However,
many of these functions are not designed for performance. Functions such as memcpy() or
strcpy() should not be used in the data plane. To copy small structures, the preference is for
a simpler technique that can be optimized by the compiler. Refer to the VTune™ Performance
Analyzer Essentials publication from Intel Press for recommendations.
For specific functions that are called often, it is also a good idea to provide a self-made opti-
mized function, which should be declared as static inline.
The DPDK API provides an optimized rte_memcpy() function.
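For example, a small structure can be copied through a static inline helper and plain structure
assignment, which the compiler can turn into a handful of loads and stores; the structure and
helper below are illustrative only.

#include <stdint.h>

struct flow_key {
    uint32_t ip_src;
    uint32_t ip_dst;
    uint16_t port_src;
    uint16_t port_dst;
};

/* Small, frequently used helper declared static inline: the compiler can
 * optimize the copy instead of emitting a call to memcpy(). */
static inline void
flow_key_copy(struct flow_key *dst, const struct flow_key *src)
{
    *dst = *src;
}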
Other functions of libc, such as malloc(), provide a flexible way to allocate and free memory. In
some cases, using dynamic allocation is necessary, but it is really not advised to use malloc-
like functions in the data plane because managing a fragmented heap can be costly and the
allocator may not be optimized for parallel allocation.
If you really need dynamic allocation in the data plane, it is better to use a memory pool of
fixed-size objects. This API is provided by librte_mempool. This data structure provides several
services that increase performance, such as memory alignment of objects, lockless access to
objects, NUMA awareness, bulk get/put and per-lcore cache. The rte_malloc() function uses
a similar concept to mempools.
Read-Write (RW) access operations by several lcores to the same memory area can generate
a lot of data cache misses, which are very costly. It is often possible to use per-lcore variables,
for example, in the case of statistics. There are at least two solutions for this:
• Use RTE_PER_LCORE variables. Note that in this case, data on lcore X is not available
to lcore Y.
• Use a table of structures (one per lcore). In this case, each structure must be cache-
aligned.
Read-mostly variables can be shared among lcores without performance losses if there are no
RW variables in the same cache line.
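Both approaches can be sketched as below; the statistics structure and counter names are
illustrative only.

#include <stdint.h>
#include <rte_config.h>
#include <rte_lcore.h>
#include <rte_memory.h>

/* Option 1: a true per-lcore variable; data on lcore X is not visible
 * from lcore Y. */
RTE_DEFINE_PER_LCORE(uint64_t, rx_pkts);

static inline void
count_rx_per_lcore(void)
{
    RTE_PER_LCORE(rx_pkts)++;
}

/* Option 2: a table of cache-aligned structures, one entry per lcore,
 * which can also be read from another lcore (e.g. by a stats display). */
struct lcore_stats {
    uint64_t rx_pkts;
} __rte_cache_aligned;

static struct lcore_stats stats[RTE_MAX_LCORE];

static inline void
count_rx_table(void)
{
    stats[rte_lcore_id()].rx_pkts++;
}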
47.1.4 NUMA
On a NUMA system, it is preferable to access local memory since remote memory access
is slower. In the DPDK, the memzone, ring, rte_malloc and mempool APIs provide a way to
create a pool on a specific socket.
Sometimes, it can be a good idea to duplicate data to optimize speed. For read-mostly vari-
ables that are often accessed, it should not be a problem to keep them in one socket only,
since data will be present in cache.
Modern memory controllers have several memory channels that can load or store data in par-
allel. Depending on the memory controller and its configuration, the number of channels and
the way the memory is distributed across the channels varies. Each channel has a bandwidth
limit, meaning that if all memory access operations are done on the first channel only, there is
a potential bottleneck.
By default, the Mempool Library spreads the addresses of objects among memory channels.
The underlying operating system is allowed to load/unload memory pages at its own discretion.
These page loads could impact the performance, as the process is on hold when the kernel
fetches them.
To avoid this, pages can be pre-loaded and locked into memory with the mlockall() call.
if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
RTE_LOG(NOTICE, USER1, "mlockall() failed with error \"%s\"\n",
strerror(errno));
}
The ring supports bulk and burst access, meaning that it is possible to read several elements
from the ring with only one costly atomic operation (see Ring Library ). Performance is greatly
improved when using bulk access operations.
The code that dequeues messages may look similar to the following:
#define MAX_BULK 32

while (1) {
    /* Process as many elements as can be dequeued. */
    count = rte_ring_dequeue_burst(ring, obj_table, MAX_BULK, NULL);
    if (unlikely(count == 0))
        continue;

    my_process_bulk(obj_table, count);
}
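The producer side can be sketched in the same way; the helper below (my_free_object() is a
hypothetical callback, in the spirit of my_process_bulk() above) enqueues a batch of objects
with a single atomic operation and releases any objects the ring could not accept:

#include <rte_ring.h>
#include <rte_branch_prediction.h>

static void
send_bulk(struct rte_ring *ring, void **obj_table, unsigned int count)
{
    unsigned int sent;

    /* One atomic operation enqueues up to 'count' objects. */
    sent = rte_ring_enqueue_burst(ring, obj_table, count, NULL);

    /* Objects that did not fit must be dropped or retried. */
    while (unlikely(sent < count))
        my_free_object(obj_table[sent++]);
}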
The DPDK Poll Mode Driver (PMD) is also able to work in bulk/burst mode, allowing some
per-call code in the send or receive functions to be factorized.
Avoid partial writes. When PCI devices write to system memory through DMA, it costs less if
the write operation is on a full cache line as opposed to part of it. In the PMD code, actions
have been taken to avoid partial writes as much as possible.
Traditionally, there is a trade-off between throughput and latency. An application can be tuned
to achieve a high throughput, but the end-to-end latency of an average packet will typically
increase as a result. Similarly, the application can be tuned to have, on average, a low end-to-
end latency, at the cost of lower throughput.
To achieve higher throughput, the DPDK amortizes the cost of processing each packet
individually by processing packets in bursts.
Using the testpmd application as an example, the burst size can be set on the command line
to a value of 16 (also the default value). This allows the application to request 16 packets at
a time from the PMD. The testpmd application then immediately attempts to transmit all the
packets that were received, in this case, all 16 packets.
The packets are not transmitted until the tail pointer is updated on the corresponding TX queue
of the network port. This behavior is desirable when tuning for high throughput because the
cost of tail pointer updates to both the RX and TX queues can be spread across 16 packets,
effectively hiding the relatively slow MMIO cost of writing to the PCIe* device. However, this
is not very desirable when tuning for low latency because the first packet that was received
must also wait for another 15 packets to be received. It cannot be transmitted until the other
15 packets have also been processed because the NIC will not know to transmit the packets
until the TX tail pointer has been updated, which is not done until all 16 packets have been
processed for transmission.
To consistently achieve low latency even under heavy system load, the application developer
should avoid processing packets in large bursts. The testpmd application can be configured
from the command line to use a burst value of 1, allowing a single packet to be processed at
a time and providing lower latency, but at the cost of lower throughput.
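For example, assuming a build of testpmd and a typical EAL core/memory-channel configuration
(the exact EAL options depend on the platform), the burst size can be set on the command line
as follows:

# Tune for latency: process one packet at a time.
./testpmd -l 0-3 -n 4 -- -i --burst=1

# Tune for throughput: process packets in bursts of 16.
./testpmd -l 0-3 -n 4 -- -i --burst=16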
Atomic operations imply a lock prefix before the instruction, causing the processor’s LOCK#
signal to be asserted during execution of the following instruction. This has a big impact on
performance in a multicore environment.
Performance can be improved by avoiding lock mechanisms in the data plane. It can often be
replaced by other solutions like per-lcore variables. Also, some locking techniques are more
efficient than others. For instance, the Read-Copy-Update (RCU) algorithm can frequently
replace simple rwlocks.
Small functions can be declared as static inline in the header file. This avoids the cost of a call
instruction (and the associated context saving). However, this technique is not always efficient;
it depends on many factors including the compiler.
The Intel® C/C++ Compiler (icc)/gcc built-in helper functions likely() and unlikely() allow the
developer to indicate if a code branch is likely to be taken or not. For instance:
if (likely(x > 1))
    do_stuff();
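Combining the two techniques, a small helper can be declared static inline in a header file with
its error path marked as unlikely; the sketch below (check_len() is a hypothetical helper) shows
the idea:

#include <stdint.h>
#include <rte_branch_prediction.h>

/* In a header file: small, frequently called helper, inlined at the
 * call site; the error branch is marked as unlikely. */
static inline int
check_len(uint32_t len, uint32_t max_len)
{
    if (unlikely(len > max_len))
        return -1;
    return 0;
}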
48 Profile Your Application
The following sections describe methods of profiling DPDK applications on different architec-
tures.
Intel processors provide performance counters to monitor events. Some tools provided by Intel,
such as Intel® VTune™ Amplifier, can be used to profile and benchmark an application. See
the VTune Performance Analyzer Essentials publication from Intel Press for more information.
For a DPDK application, this can be done in a Linux* application environment only.
The main situations that should be monitored through event counters are:
• Cache misses
• Branch mis-predicts
• DTLB misses
• Long latency instructions and exceptions
Refer to the Intel Performance Analysis Guide for details about application profiling.
Iterations that yielded no RX packets (empty cycles, wasted iterations) can be analyzed using
VTune Amplifier. This profiling employs the Instrumentation and Tracing Technology (ITT) API
feature of VTune Amplifier and requires only reconfiguring the DPDK library; no changes to the
DPDK application are needed.
To trace wasted iterations on RX queues, first reconfigure DPDK with
CONFIG_RTE_ETHDEV_RXTX_CALLBACKS and
CONFIG_RTE_ETHDEV_PROFILE_ITT_WASTED_RX_ITERATIONS enabled.
Then rebuild DPDK, specifying paths to the ITT header and library, which can be found in any
VTune Amplifier distribution in the include and lib directories respectively:
make EXTRA_CFLAGS=-I<path to ittnotify.h> \
     EXTRA_LDLIBS="-L<path to libittnotify.a> -littnotify"
Finally, to see wasted iterations in your performance analysis results, select the “Analyze user
tasks, events, and counters” checkbox in the “Analysis Type” tab when configuring analysis via
the VTune Amplifier GUI. Alternatively, when running VTune Amplifier from the command line,
specify the -knob enable-user-tasks=true option.
Collected regions of wasted iterations will be marked on VTune Amplifier’s timeline as ITT
tasks. These ITT tasks have predefined names, containing Ethernet device and RX queue
identifiers.
The ARM64 architecture provides performance counters to monitor events. The Linux perf
tool can be used to profile and benchmark an application. In addition to the standard events,
perf can be used to profile arm64-specific PMU (Performance Monitor Unit) events through
raw events (-e rXX).
For more details, refer to the ARM64-specific PMU events enumeration.
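As an illustration, perf could be invoked as follows (the application name and EAL options are
placeholders, and rXX stands for a platform-specific raw event number):

# Standard events
perf stat -e cycles,instructions,cache-misses ./your_dpdk_app [EAL options]

# Raw arm64 PMU event
perf stat -e rXX ./your_dpdk_app [EAL options]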
The default cntvct_el0 based rte_rdtsc() provides a portable means to get a wall clock
counter in user space. Typically it runs at <= 100MHz.
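For example, a code section can be timed with rte_rdtsc() and converted to wall clock time
using the counter frequency; a minimal sketch:

#include <stdio.h>
#include <stdint.h>
#include <rte_cycles.h>

static void
measure_section(void)
{
    uint64_t start, cycles;

    start = rte_rdtsc();
    /* ... code section to be measured ... */
    cycles = rte_rdtsc() - start;

    printf("section took %.3f us\n",
           (double)cycles * 1e6 / rte_get_tsc_hz());
}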
The alternative method to enable rte_rdtsc() for a high resolution wall clock counter is
through the armv8 PMU subsystem. The PMU cycle counter runs at the CPU frequency. However,
access to the PMU cycle counter from user space is not enabled by default in the arm64 Linux
kernel. It is possible to enable the cycle counter for user space access by configuring the PMU
from privileged mode (kernel space).
By default the rte_rdtsc() implementation uses the portable cntvct_el0 scheme. Applications
can choose the PMU based implementation with CONFIG_RTE_ARM_EAL_RDTSC_USE_PMU.
The example below shows the steps to configure the PMU based cycle counter on an armv8
machine.
git clone https://github.com/jerinjacobk/armv8_pmu_cycle_counter_el0
cd armv8_pmu_cycle_counter_el0
make
sudo insmod pmu_el0_cycle_counter.ko
cd $DPDK_DIR
make config T=arm64-armv8a-linuxapp-gcc
echo "CONFIG_RTE_ARM_EAL_RDTSC_USE_PMU=y" >> build/.config
make
Warning: The PMU based scheme is useful for high accuracy performance profiling with
rte_rdtsc(). However, this method cannot be used in conjunction with Linux userspace
profiling tools like perf, as this scheme alters the PMU register state.
49 Glossary
HPET High Precision Event Timer; a hardware timer that provides a precise time reference on
x86 platforms.
ID Identifier
IOCTL Input/Output Control
I/O Input/Output
IP Internet Protocol
IPv4 Internet Protocol version 4
IPv6 Internet Protocol version 6
lcore A logical execution unit of the processor, sometimes called a hardware thread.
KNI Kernel Network Interface
L1 Layer 1
L2 Layer 2
L3 Layer 3
L4 Layer 4
LAN Local Area Network
LPM Longest Prefix Match
master lcore The execution unit that executes the main() function and that launches other
lcores.
mbuf An mbuf is a data structure used internally to carry messages (mainly network packets).
The name is derived from BSD stacks. To understand the concepts of packet buffers or
mbuf, refer to TCP/IP Illustrated, Volume 2: The Implementation.
MESI Modified Exclusive Shared Invalid (CPU cache coherency protocol)
MTU Maximum Transmission Unit
NIC Network Interface Card
OOO Out Of Order (execution of instructions within the CPU pipeline)
NUMA Non-uniform Memory Access
PCI Peripheral Component Interconnect
PHY An abbreviation for the physical layer of the OSI model.
pktmbuf An mbuf carrying a network packet.
PMD Poll Mode Driver
QoS Quality of Service
RCU Read-Copy-Update algorithm, an alternative to simple rwlocks.
Rd Read
RED Random Early Detection
RSS Receive Side Scaling
RTE Run Time Environment. Provides a simple framework for fast packet processing in a
lightweight environment, running as a Linux* application and using Poll Mode Drivers (PMDs)
to increase speed.
Rx Reception
Slave lcore Any lcore that is not the master lcore.
Socket A physical CPU that includes several cores.
SLA Service Level Agreement
srTCM Single Rate Three Color Marking
SRTD Scheduler Round Trip Delay
SW Software
Target In the DPDK, the target is a combination of architecture, machine, executive environ-
ment and toolchain. For example: i686-native-linuxapp-gcc.
TCP Transmission Control Protocol
TC Traffic Class
TLB Translation Lookaside Buffer
TLS Thread Local Storage
trTCM Two Rate Three Color Marking
TSC Time Stamp Counter
Tx Transmission
TUN/TAP TUN and TAP are virtual network kernel devices.
VLAN Virtual Local Area Network
Wr Write
WRED Weighted Random Early Detection
WRR Weighted Round Robin
Figures
Fig. 2.1 Core Components Architecture
Fig. 3.1 EAL Initialization in a Linux Application Environment
Fig. 3.2 Example of a malloc heap and malloc elements within the malloc library
Fig. 5.1 Ring Structure
Fig. 5.2 Enqueue first step
Fig. 5.3 Enqueue second step
Fig. 5.4 Enqueue last step
Fig. 5.5 Dequeue first step
Fig. 5.6 Dequeue second step
Fig. 5.7 Dequeue last step
Fig. 36.7 Bucket Search Pipeline for Key Lookup Operation (Single Key Size Hash Tables)
Fig. 17.1 Load Balancing Using Front End Node
Fig. 17.2 Consistent Hashing
Fig. 17.3 Table Based Flow Distribution
Fig. 17.4 Searching for Perfect Hash Function
Fig. 17.5 Divide and Conquer for Millions of Keys
Fig. 17.6 EFD as a Flow-Level Load Balancer
Fig. 17.7 Group Assignment
Fig. 17.8 Perfect Hash Search - Assigned Keys & Target Value
Fig. 17.9 Perfect Hash Search - Satisfy Target Values
Fig. 17.10 Finding Hash Index for Conflict Free lookup_table
Fig. 17.11 EFD Lookup Operation
Fig. 18.1 Example Usages of Membership Library
Fig. 18.2 Bloom Filter False Positive Probability
Fig. 18.3 Detecting Routing Loops Using BF
Fig. 18.4 Vector Bloom Filter (vBF) Overview
Fig. 18.5 vBF for Flow Scheduling to Worker Thread
Fig. 18.6 Using HTSS for Attack Signature Matching
Fig. 18.7 Using HTSS with False Negatives for Wild Card Classification
Tables
Table 33.1 Packet Processing Pipeline Implementing QoS
Table 33.2 Infrastructure Blocks Used by the Packet Processing Pipeline
Table 33.3 Port Scheduling Hierarchy
Table 33.4 Scheduler Internal Data Structures per Port
Table 33.5 Ethernet Frame Overhead Fields
Table 33.6 Token Bucket Generic Operations
Table 33.7 Token Bucket Generic Parameters
Table 33.8 Token Bucket Persistent Data Structure
Table 33.9 Token Bucket Operations
Table 33.10 Subport/Pipe Traffic Class Upper Limit Enforcement Persistent Data Structure
Table 33.11 Subport/Pipe Traffic Class Upper Limit Enforcement Operations
Table 33.12 Weighted Round Robin (WRR)
Table 33.13 Subport Traffic Class Oversubscription
Table 33.14 Watermark Propagation from Subport Level to Member Pipes at the Beginning of
Each Traffic Class Upper Limit Enforcement Period