DPDK Cookbook - Intel® Developer Zone
Overview
The DPDK Cookbook modules teach you everything you need to know to be productive with the Data Plane Development
Kit (DPDK). Here’s an overview of the topics covered:
I highly recommend that you devour the Architecture Overview section of the Programmer’s Guide at dpdk.org. This
excellent document, authored by architects and designers, goes into both the how and the why of DPDK design.
Change is the only constant in this fast-moving field, with some of these components delivering new releases every three
months. Please refer to the related user guides and release notes to be sure you use the latest version when applying
these cookbook recipes. I provide links to many resources, and some of those will inevitably change as well, so please
accept my apology in advance if you encounter a broken link.
Acknowledgements
I’m grateful to many people for their valuable input, including early access customers, architects, design engineers,
managers, platform application engineers, DPDK community, and Intel® Network Builders. In particular, this cookbook is
possible only due to the encouragement, support, and reviews from Jim St Leger, Venky Venkatesan, Tim O’Driscoll, John
DiGiglio, John Morgan, Cristian Dumitrescu, Sujata Tibrewala, Debbie Graham, Ray Kinsella, Jasvinder Singh, Deepak Jain,
Steve Cunming, Heqing Zhu, Dave Hunt, Kannan Ramia Babu, Walt Gilmore, Mike Glynn, Curran Greg, Ai Bee Lim, Larry
Wang, Nancy Yadav, Chiu-Pi Shih, Deepak S, Anand Jyoti, Dirk Blevins, Andrew Duignan, Todd Langley, Joel Auernheimer,
Joel Schuetze, and Eric Heaton.
auto-config-h.sh
check-git-log.sh
check-maintainers.sh
checkpatches.sh
cocci.sh
depdirs-rule.sh
gen-build-mk.sh
gen-config-h.sh
load-devel-config.sh
relpath.sh
test-build.sh
test-null.sh
validate-abi.sh
We built a DPDK-in-a-Box using the MinnowBoard Turbot* Dual Ethernet Dual-Core, which is a low-cost, portable platform
based on the Intel Atom® processor E3826. For the OS, we installed Ubuntu* 16.04 client with DPDK. The instructions in
this document are tested on our DPDK-in-a-Box, an Intel® Core™ i7-5960X processor Extreme Edition brand desktop, and
an Intel® Xeon® Scalable processor. You can use any Intel® architecture platform to build your own device.
For the traffic generator, we use the TRex* realistic traffic generator. The TRex package is self-contained and can be easily
installed.
Any Intel® processor-based platform will work—desktop, server, laptop, or embedded system.
Software
Ubuntu 16.04 Client OS with DPDK installed
TRex Realistic Traffic Generator
Hardware
Our DPDK-in-a-Box uses a MinnowBoard Turbot Dual Ethernet Dual-Core single board computer:
Out of the three Ethernet ports, the two at the bottom are for the traffic generator (dual gigabit Intel® Ethernet
Controller I350). Connect a loopback cable between them.
Connect the third Ethernet port to the Internet (to download the TRex package).
Connect the keyboard and mouse to the USB ports.
Connect a display to the HDMI Interface.
The MinnowBoard Turbot* Dual Ethernet Dual-Core
Insert the microSD card into the microSD slot. Do not use the SD adapter.
Power on the DPDK-in-a-Box system. Ubuntu will be up and running right away.
Choose the username test and assign the password tester (or use the username and password specified by the Quick
Start Guide that comes with the platform).
Log on as root, confirm that you are in the /home/test directory, and list the network interfaces with the following commands:
# sudo su
# ls
# ifconfig
You will see the following output. Note down that for port 0 the MAC address is 00:30:18:CB:F2:70 and for port 2
the MAC address is 00:30:18:CB:F2:71.
Note that the first port in the screenshot below, enp2s0, is the port connected to the Internet. No need to make a
note of this.
Here’s what I recorded after these two steps:
Fill in the following table with the information you gathered from your specific platform:
Port | MAC address
If you succeeded in using ifconfig to get the port information described above, skip the next section and move on to
the section titled Install the Traffic Generator.
Root Cause
ifconfig is not showing the two ports below. Why?
ifconfig may be unable to find the two ports because a DPDK application was previously run and aborted without
releasing the ports, or because a DPDK script runs automatically after boot and claims the ports. Regardless of the
reason, the solution below will enable ifconfig to show both ports.
Solution
1. Run ./setup.sh in the directory /home/test/dpdk/tools.
You can see that two ports are claimed by the DPDK driver.
Select option 30 and then enter the PCI address of the device to unbind:
Success!
Above you will see the first port, 0000:30:00.0, bound back to the kernel.
Repeat steps 3–5 to unbind the second port, 0000:30:00.1, from igb_uio and bind it back to the kernel igb driver.
Use the ifconfig command to show that both ports are bound back to the kernel.
Install the Traffic Generator
In the following sections, we will assume that you successfully found the ports and have noted down the MAC addresses.
Keeping in mind my earlier note that change is the only constant thing in this fast-moving field, refer to the current TRex
user manual to make sure you have the latest script names, directory structure, and release information relevant to this
recipe. Enter the following commands:
# pwd
# mkdir trex
# cd trex
# wget --no-cache http://trex-tgn.cisco.com/trex/release/latest
You should see that the install is complete and saved in /home/test/trex/latest:
The next step is to untar the package:
# tar -xzvf latest
Below you see that version 2.08 is the latest version at the time of this screen capture:
# ls -al
You will see the directory with the version installed. In this exercise, the directory is v2.08, as shown below in response to
the ls -al command. Change directory to the version installed on your system; for example, cd <dir name with
version installed>:
# cd v2.08
# ls -al
You will see the file t-rex-64, which is the traffic generator executable:
# cat /proc/cpuinfo will give you the logical core (lcore) information as shown in the Exercises section.
Why is this information useful?
The command line below that runs the traffic generator uses the -c option to specify the number of lcores to be used for
the traffic generator, so you want to know how many lcores exist in the platform. Issuing cat /proc/cpuinfo and
eyeballing the number of available lcores will therefore be helpful.
Screen output showing traffic during run (15 packets so far Tx and Rx).
Congratulations! By completing the above hands-on exercise, you have successfully built your own DPDK based traffic
generator.
Next Steps
As a next step, you can connect back-to-back two DPDK-in-a-Box platforms, and use one as a traffic generator and the
other as a DPDK application development and test vehicle.
Exercises
1. How would you configure the traffic generator for different packet lengths?
2. To run the traffic generator forever, what should be the value of -d?
3. How would you measure latency (assuming you have more cores)?
4. Reason out the root cause and find the solution by looking up the error, “Note that the uio or vfio kernel modules
to be used should be loaded into the kernel before running the dpdk-devbind.py script,” in Chapter 3 of the
dpdk.org document Getting Started Guide for Linux.
For your system, you can use any Intel® platform. The instructions in this article have been tested with an Intel® Xeon®
processor-based desktop, server, and laptop using either the DPDK traffic generator you built from scratch or a
commercially available DPDK-in-a-Box. The latter is a low-cost, portable platform based on an Intel Atom E3826 processor. At the
time this article was published, it was possible to purchase a DPDK-in-a-Box; look online if you’re interested in this option.
If you are new to DPDK, spend some time reading the DPDK Programmer’s Guide at dpdk.org.
Because the start of the data path involves polling for packets received, data plane applications need a traffic
generator. But here we have only one platform. How do you configure the application for this traffic?
Solution
This is where the testpmd tx_first parameter comes in handy. When testpmd is started with the
tx_first parameter, the TX function gets executed first—hence the name tx_first—and with an
external cable connecting RX and TX, those packets are now available for the RX function to poll. Thus, you
have achieved running traffic through testpmd without an external traffic generator.
The following screenshots show how to start testpmd and run tx_first with a loopback cable in place.
Starting testpmd
./x86_64-native-linuxapp-gcc/app/testpmd -- -i starts testpmd. The -i option stands for interactive. Please refer
to the Testpmd Application User Guide at dpdk.org and to the article Testing DPDK Performance and Features with
TestPMD on Intel® Developer Zone for more information about how to build and run testpmd.
While you can use the above command in your specific platform, it is available as a script named run.sh in DPDK-in-a-Box.
./run.sh starts testpmd, as shown below, yielding the testpmd prompt. Look at the flowchart. What does the
initial portion of testpmd do? And what does the runtime portion of testpmd do? Our next step is to initialize
testpmd.
Initialization
Initialization consists of three steps as shown below.
1. EAL (Environment Abstraction Layer)—Find the number of cores and probe PCI devices
2. Initialize memory zones and memory pools
3. Configure the ports for data path operation
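To make these three steps concrete, here is a minimal C sketch of the same initialization sequence, assuming a single port with one RX and one TX queue. The pool size, descriptor counts, and names such as NB_MBUF are illustrative defaults, not testpmd's exact values:

#include <stdlib.h>
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_debug.h>

#define NB_MBUF 8192   /* illustrative mbuf pool size */
#define NB_RXD   128   /* RX ring descriptors */
#define NB_TXD   512   /* TX ring descriptors */

int main(int argc, char **argv)
{
    struct rte_eth_conf port_conf = { 0 };   /* default port configuration */
    struct rte_mempool *mbuf_pool;
    const uint8_t port = 0;

    /* Step 1: EAL init discovers the lcores and probes the PCI devices */
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL initialization failed\n");

    /* Step 2: create a pool of mbufs on the local socket */
    mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", NB_MBUF, 256, 0,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (mbuf_pool == NULL)
        rte_exit(EXIT_FAILURE, "mbuf pool creation failed\n");

    /* Step 3: configure the port with one RX and one TX queue, then start it */
    rte_eth_dev_configure(port, 1, 1, &port_conf);
    rte_eth_rx_queue_setup(port, 0, NB_RXD, rte_eth_dev_socket_id(port),
            NULL, mbuf_pool);                 /* NULL = driver default rx_conf */
    rte_eth_tx_queue_setup(port, 0, NB_TXD, rte_eth_dev_socket_id(port),
            NULL);                            /* NULL = driver default tx_conf */
    rte_eth_dev_start(port);
    rte_eth_promiscuous_enable(port);

    return 0;
}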
Operation
After initialization, data path operations start and continue in a loop. The Poll Mode Driver (PMD) section of the DPDK
Programmer’s Guide is a must-read chapter. It will help you understand and appreciate how you can get the juice out of
your system and achieve the desired throughput.
For example, you can note that the RX descriptor count is 128, whereas the TX descriptor count is shown as 512. Why
are they not equal? And why is the TX descriptor count four times the RX descriptor count? Also, analyze the values indicated
for the RX threshold registers: pthresh = 8, hthresh = 8, and wthresh = 4; whereas for the TX registers: pthresh =
8, hthresh = 1, and wthresh = 16.
Reading the data sheet, and more importantly the optimization white papers of the Intel® 82599 10-GbE Ethernet
Controller, at least the Receive and Transmit sections, will help you to understand and make the best use of your knobs.
As shown above, at the testpmd> prompt, enter start tx_first to auto-generate the traffic. Packets are already being
transmitted by the time the testpmd> prompt returns, and they continue to be generated until you stop the run.
In this case, let it run for 10 to 20 seconds, and then stop the run.
testpmd> stop
Below you can see the RX and TX total packets per port as well as accumulated totals for both ports.
Starting testpmd Without tx_first
What happens when you start testpmd without tx_first? The flowchart below shows that RX starts polling first, so in
this case you need a traffic generator.
We saw that with tx_first, TX is executed first. This emits packets before polling starts, thus auto-generating the
traffic and allowing us to run testpmd with traffic on a single platform.
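The tx_first flow can be sketched in C as follows. This is not testpmd's actual code, only an illustration of the control flow: seed the wire with one TX burst, then keep polling RX and retransmitting what the loopback cable returns. The port, queue, and burst size are illustrative, and a real application would build valid packet contents before the first transmit:

struct rte_mbuf *burst[32];
uint16_t nb;
unsigned i;

/* TX first: allocate and send an initial burst (testpmd fills in real packets here) */
for (i = 0; i < 32; i++)
    burst[i] = rte_pktmbuf_alloc(mbuf_pool);
rte_eth_tx_burst(0 /* port */, 0 /* queue */, burst, 32);

/* Then the usual polling loop: whatever comes back on RX is sent out again */
for (;;) {
    nb = rte_eth_rx_burst(0, 0, burst, 32);
    if (nb > 0)
        rte_eth_tx_burst(0, 0, burst, nb);
}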
I highly recommend that you explore hands-on the functionality of interest to you and record what you learn in the Exercises section.
Once you are done, quit by using the quit command.
quit, as shown below, does the following:
Summary
At this point, you’ve configured a single system to run the DPDK application and generated traffic with testpmd using
tx_first. Test your knowledge with the exercises below.
Exercises
1. tx_first auto-generates traffic. How are the parameters of the traffic programmed in this case?
2. You saw the flowchart for the case of without tx_first. Draw the flowchart for the case with tx_first.
3. Note each command-line option and functionality you tried with testpmd, and list what you learned about each one,
with any suggestions you may have.
4. What is the difference between detaching a port and closing a port? Where will you use detaching a port? Where will
you use closing a port?
5. What is the difference between dpdk_nic_bind.py and dpdk-devbind.py? Explain.
6. Search the Internet and the dpdk.org dev mailing list to analyze the root cause and find the solution for the error. Note
that the uio or vfio kernel modules to be used should be loaded into the kernel before running the dpdk-devbind.py
script.
Introduction
The title of this module might just as well be “Build Your Own Software Defined DPDK
Application.” The DPDK packet framework uses a modular building block approach
defined by a configuration file to build complex DPDK applications. For an overview of
the value of the DPDK packet framework, watch the short video Deep Dive into the
Architecture of a Pipeline Stage before you get started with this module.
Here you will build a DPDK packet framework with just two cores—one for master core
tasks and the other to perform DPDK application functions.
For hardware, you can use any IA platform—Intel Xeon brand or Intel Atom brand
desktop, server, or laptop. We will use DPDK-in-a-Box here. This is a low-cost, portable
platform based on the Intel Atom E3826 processor.
Here we are connecting the ports to an external DPDK packet Generator, so we will set the MAC addresses in the DPDK
traffic generator to match those of the external system that will run the DPDK packet framework.
It is left as an exercise for the developer to find out the MAC addresses and set them in the traffic generator
configuration file.
Update the Configuration File for the DPDK Packet Framework
The DPDK packet framework configuration files provide a software defined modular way to implement complex DPDK
applications. If your system has multiple lcores available for packet processing, you can implement both run-to-completion
as well as pipelined applications. Here, since we have only one core for packet processing with DPDK-in-a-box, we will
showcase the run-to-completion implementation.
Summary
Application developers will benefit from understanding DPDK’s assumptions about the roles and responsibilities of applications.
They need to comprehend the scope of DPDK’s roles and responsibilities to begin with. This helps them architect correctly from
the get-go, obeying DPDK’s assumptions in terms of thread safety, lockless API call usage, multiprocessor synchronization,
and control plane and data plane synchronization.
Exercise
1. Draw your software-defined application block diagram.
For hardware, you can use any IA platform—Intel Xeon brand or Intel Atom brand desktop, server, or laptop. We will use
DPDK-in-a-Box here. This is a low-cost, portable platform based on the Intel Atom E3826 processor.
To build your own DPDK-in-a-Box, please see the earlier module in this cookbook, Build Your Own DPDK Traffic
Generator—DPDK-In-A-Box.
Likewise, what happens when you pull out a transceiver, say a 10 Gig module, and insert a completely different one, say a 1 Gig module?
If a change of hardware device requires a release of the instance of the device that was removed and creation of a device
instance for the new one, what do you do with threads that are still accessing the data structures of the original instance?
Releasing resources requires coordination with the user space applications using those resources. Applications can be
using a single core or multiple cores. If the resource being released is used by multiple cores, we need to request an
acknowledgement handshake from each of those cores, indicating that they are all finished with the resource and that it can be
released safely.
Please note that this is only part of the story. The other part is synchronizing with data plane applications that are still running,
that is, waiting until a resource is free so that the reconfiguration APIs can be called. We will look at that and point to the
relevant DPDK documentation and source code.
Before we get into these details, let’s step back and look at the big picture:
In the case of a single RX queue per port, only one core at a time can do RX processing.
When you have multiple RX queues per port, each queue can be polled by only one lcore at a time. Thus, if you have 4 RX
queues per port, you can have four cores simultaneously polling the port if you’ve configured one core per queue.
Can You Have Eight Cores and Four RX Queues per Port?
No, since that assigns more than one core per RX queue.
Can You Have Four Cores with Eight RX Queues per Port?
We can only answer this question with full configuration details. Even though you have more RX queues than cores,
configuring two cores for any single RX queue is not allowed. The key is never to assign more than one core to the same
RX queue, irrespective of how many queues are available compared to the number of cores.
Note that one lcore can poll multiple RX queues, and those queues need not be consecutive. This is clear from the figure
below: lcore 0 polls RX Queue 0 and RX Queue 2, but does not poll RX Queue 1 or RX Queue 3.
[Figure: lcore 0 polling RX queues across Ports 0–3; it polls RX Queue 0 and RX Queue 2 but not RX Queue 1 or RX Queue 3]
Who is Responsible for Mutual Exclusion so that Multiple Cores Don’t Work on the Same Receive Queue?
The one-line answer is—you—the application developer. All the functions of the Ethernet Device API exported by a PMD
are lock-free functions which are not to be invoked in parallel on different logical cores to work on the same target object.
For instance, the receive function of a PMD cannot be invoked in parallel on two logical cores to poll the same RX queue
[on the same port].
Of course, this function can be invoked in parallel by different logical cores on different RX queues.
Please note and be aware that it is the responsibility of the upper-level application to enforce this rule.
If you don’t design your application to enforce this exclusion, allowing multiple cores to step on each other while accessing
the device, you will get segmentation errors and crashes for sure. DPDK goes with lockless accesses for high performance
and assumes that you, as a higher-level application developer, will ensure that multiple cores do not work on the same
receive queue.
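A minimal sketch of how an application typically enforces this rule: each forwarding lcore is launched with its own queue index and polls only that queue, so rte_eth_rx_burst() is never called on the same (port, queue) pair from two lcores. The lcore-to-queue mapping shown here is illustrative:

static int lcore_rx_loop(void *arg)
{
    const uint16_t queue = (uint16_t)(uintptr_t)arg;  /* queue owned by this lcore only */
    struct rte_mbuf *pkts[32];
    uint16_t nb;

    for (;;) {
        /* Lock-free by contract: no other lcore ever polls this (port, queue) pair */
        nb = rte_eth_rx_burst(0 /* port */, queue, pkts, 32);
        /* ... process the nb received packets ... */
    }
    return 0;
}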
TX Port: Why Should Each Core be Able to Transmit on Each and Every Transmit Port?
We saw that for an RX queue, an lcore can only poll a subset of RX ports, but what about TX ports? Can an lcore connect
only to a subset of TX ports in the system? Or should each and every lcore connect to all TX ports?
The answer is that a forwarding operation running on an lcore may result in a packet destined for any TX port in the
system. Because of this, each lcore should be able to transmit to each and every TX port.
An lcore can poll only a subset of RX ports, but can transmit to any TX port in the system.
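One lock-free way to satisfy this, sketched below, is to configure every port with as many TX queues as there are forwarding lcores and let each lcore transmit only on its own queue index. The names dst_port, pkts, and nb are illustrative:

/* Each lcore owns TX queue number "lcore_index" on every port, so any lcore
 * can transmit to any port without taking a lock. */
int lcore_index = rte_lcore_index(rte_lcore_id());
rte_eth_tx_burst(dst_port, (uint16_t)lcore_index, pkts, nb);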
While the Data Plane can be Parallel, the Control Plane is Sequential
Control plane operations like device configuration, queue (RX and TX) setup, and device start depend on certain sequences
to be followed. Hence, they are sequential.
After that, the network application can invoke, in any order, the functions exported by the Ethernet API to get the MAC
address of a given device, the speed and the status of a device physical link, receive/transmit packet bursts, and so on.
Summary
Application developers will benefit from understanding DPDK assumptions regarding application roles and responsibilities.
To start, it’s important to comprehend the scope of DPDK’s roles and responsibilities. This will help you to correctly
architect from the get-go in terms of thread safety, lockless API call usage, multiprocessor synchronization, and control
plane and data plane synchronization.
Next Steps
Architect a couple of your own usage models of the data plane coexisting with the control and management plane. Look
for similar approaches used by testpmd and other applications, and described by the DPDK HowTo Guides. Test them out.
Exercises
1. Can you have eight cores per port with four RX queues per port?
2. Can you have four cores per port with eight RX queues per port?
3. What are the implications of multiple cores transmitting on one transmit port—in terms of control plane and data
plane synchronization?
4. Should control plane operations be done in interrupt context itself, or as a deferred procedure?
For cookbook-style instructions on how to do hands-on performance profiling of your DPDK code with VTune™ tools, refer
to the module Profiling DPDK Code with Intel VTune Amplifier.
Once a particular hotspot has been addressed, the application is profiled again to find the next hotspot in the system.
This methodology is repeated until the desired performance is achieved.
The performance optimization involves a gamut of considerations shown in the checklist below:
BIOS Settings
To get repeatable performance, DPDK L3fwd performance numbers are achieved with the following BIOS settings:
NUMA ENABLED
Refer to Intel document #557159, Intel Xeon processor E7-8800/4800 v3 Product Family, for a detailed understanding
of BIOS settings and their performance implications.
Platform Optimizations
Platform optimizations include (1) configuring memory and (2) configuring I/O (NIC cards) to take advantage of affinity and achieve
lower latency.
For example, as shown below, CPU0 is reading 256 bytes (four cache lines). With the BIOS NUMA setting DISABLED, the
memory controller interleaves the accesses across the sockets. Out of 256 bytes, 128 bytes are read from local memory and
128 bytes are read from remote memory.
The remote memory accesses end up crossing the Intel QPI link. The impact of this is increased time for accessing remote
memory, resulting in lower performance.
Solution: As shown below, with BIOS setting NUMA = Enabled, all the accesses go to the same socket (local) memory and
there is no crossing of Intel QPI. This results in improved performance due to lower memory access latency.
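In DPDK code this shows up mainly as NUMA-aware allocation: keep the mbuf pool and the descriptor rings on the socket the NIC is attached to. A small sketch, with the pool size values being illustrative:

/* Allocate the mbuf pool on the NIC's socket so buffers stay in local memory */
int nic_socket = rte_eth_dev_socket_id(port);
struct rte_mempool *pool = rte_pktmbuf_pool_create("POOL", 8192, 256, 0,
        RTE_MBUF_DEFAULT_BUF_SIZE, nic_socket);

/* Queue setup with the same socket id keeps the descriptor rings local too */
rte_eth_rx_queue_setup(port, 0, 128, nic_socket, NULL, pool);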
Linux* Optimizations
Reducing Context Switches with isolcpus
To reduce the possibility of context switches, it is desirable to give the kernel a hint to refrain from scheduling other
user space tasks onto the cores used by DPDK application threads. The isolcpus Linux kernel parameter serves this
purpose. For example, if DPDK applications are to run on logical cores 1, 2, and 3, add the following to the
kernel parameter list:
isolcpus=1,2,3
Note: Even with the isolcpus hint, the scheduler may still schedule kernel threads on the isolated cores. Please note that
isolcpus requires a reboot.
Recommendation: The good news is that each sample application comes with not only an optimized code flow but also
optimized parameter settings as default values. The recommendation is to use a similar ratio between Tx and Rx
resources. The following references and recommendations are for the Intel® 82599 10 Gigabit Ethernet Controller. For
other NIC controllers, please refer to the corresponding data sheets.
The following graph (from the above white paper) indicates that you should not use more than two to four queues per
port since the performance degrades with a higher number of queues.
For the best-case scenario, the recommendation is to use one queue per port. In case more are needed, two queues per
port can be considered, but not more than that.
Ratio of the forwarding rate varying the number of hardware queues per port.
Can Tx Resources be Allocated the Same Size as Rx Resources?
Please use as per the default values that are used in the application. For example, for Intel 82599 10-GbE Ethernet
Controller, the default values are not equal; whereas for XL710, both RX and TX descriptors are of equal size.
Intel 82599 10-GbE Ethernet Controller: It is a natural tendency to allocate equal-sized resources for Tx and Rx. However,
please note that http://git.dpdk.org/dpdk/tree/examples/l3fwd/main.c shows that optimal default size for the number of
Tx ring descriptors is 512 as opposed to Rx ring descriptors being 128. Thus, the number of Tx ring descriptors is four times
that of the Rx ring descriptors.
The recommendation is to choose Tx ring descriptors four times the size of Rx ring descriptors and not to have them both
equal size. The reasoning for this is left as an exercise for the readers to find out.
Please refer to Intel 82599 10-Gigabit Ethernet Controller: Datasheet for detailed explanations.
Rx_Free_Thresh—In Detail
As shown below, packets received by the hardware are communicated to software using a circular buffer of packet descriptors.
There can be up to 64K - 8 (65,528) descriptors in the circular buffer. Hardware maintains a shadow copy that includes those
descriptors completed but not yet stored in memory.
The Receive Descriptor Head register (RDH) indicates the in-progress descriptor.
The Receive Descriptor Tail register (RDT) identifies the location beyond the last descriptor that the hardware can process.
This is the location where software writes the first new descriptor.
During runtime, the software processes the descriptors and, upon completion of a descriptor, increments the Receive
Descriptor Tail (RDT) register. However, updating the RDT after each packet has been processed has a cost, as it
increases the number of PCIe operations.
Rx_free_thresh represents the maximum number of free descriptors that the DPDK software will hold before sending them
back to the hardware. Hence, by processing batches of packets before updating the RDT, we can reduce the PCIe cost of
this operation.
Fine-tune the parameters of the rte_eth_rx_queue_setup() function for your configuration:
ret = rte_eth_rx_queue_setup(portid, 0, nb_rxd,
                             socketid, &rx_conf,
                             mbufpool[socketid]);
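A sketch of how rx_free_thresh can be raised so that freed descriptors are handed back to the NIC in batches, reducing RDT writes over PCIe; the threshold value of 32 is illustrative, not a recommended setting:

struct rte_eth_dev_info dev_info;
struct rte_eth_rxconf my_rx_conf;

rte_eth_dev_info_get(portid, &dev_info);      /* start from the driver defaults */
my_rx_conf = dev_info.default_rxconf;
my_rx_conf.rx_free_thresh = 32;               /* return free descriptors 32 at a time */

ret = rte_eth_rx_queue_setup(portid, 0, nb_rxd,
                             socketid, &my_rx_conf,
                             mbufpool[socketid]);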
Compile With the Correct Optimization Flags
Apply the corresponding solution: software prefetch for memory-bound applications, block mode for I/O-bound applications, and
Intel HT Technology for CPU-bound applications.
Software prefetch for memory helps to hide memory latency and thus improves memory-bound tasks in data plane
applications.
PREFETCHW
Prefetch data into cache in anticipation of write: PREFETCHW, a new instruction from Intel® Xeon® processor E5-2650 v3
onward, hides memory latency and improves the network stack. PREFETCHW prefetches data into the cache in anticipation
of a write.
PREFETCHWT1
Prefetch hint T1 (temporal with respect to first-level cache) with intent to write: PREFETCHWT1 fetches the data into a
location in the cache hierarchy specified by the locality hint (T1 means temporal data with respect to the first-level cache),
with an intent-to-write hint so that the line is brought into the Exclusive state via a request for ownership.
T1 (temporal data with respect to first-level cache)—prefetches data into the second-level cache.
For more information about these instructions refer to the Intel® 64 and IA-32 Architectures Developer’s Manual.
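In DPDK code the same idea is usually expressed with rte_prefetch0(), which maps to the architecture's prefetch instruction. A sketch of the common pattern in an RX burst loop, where process_packet() is a hypothetical per-packet handler:

struct rte_mbuf *pkts[32];
uint16_t nb, i;

nb = rte_eth_rx_burst(port, 0, pkts, 32);
for (i = 0; i < nb; i++) {
    /* Prefetch the next packet's data while we work on the current one */
    if (i + 1 < nb)
        rte_prefetch0(rte_pktmbuf_mtod(pkts[i + 1], void *));
    process_packet(pkts[i]);
}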
Problem: One may mistakenly assume that core 0 and core 1 are neighboring cores and choose the coremask
accordingly in the DPDK command-line parameter. Please note that these logical core numbers, and their mapping to
specific cores on specific NUMA sockets, can vary from platform to platform. While on one platform core 0 and core 1 may
be neighbors, on another platform core 0 and core 1 may end up on different sockets.
For instance, in a single-socket machine (screenshot shown below), lcore 0 and lcore 4 are siblings of the same physical
core (core 0). So, the communication cost between lcore 0 and lcore 4 will be less than the communication cost between
lcore 0 and lcore 1.
Solution: Because of this, it is recommended that the core layout for each platform be considered when choosing the
coremask to use in each case.
Tools—dpdk/tools/cpu_layout.py
Use ./cpu_layout.py in the tools directory to find out the socket ID, the physical core ID, and the logical core ID (processor
ID). From this information, correctly fill in the coremask parameter with locality of processors in mind.
The list of physical cores is [0, 1, 2, 3, 4, 8, 9, 10, 11, 16, 17, 18, 19, 20, 24, 25, 26, 27]
Please note that physical core numbers 5, 6, 7, 12, 13, 14, 15, 21, 22, 23 are not in the list. This indicates that one cannot
assume that the physical core numbers are sequential.
How do you find out from cpu_layout which lcores are Intel HT Technology siblings?
In the picture below, Lcore 1 and lcore 37 are hyper threads in socket 0. Assigning intercommunicating tasks to lcore 1 and
lcore 37 will have lower cost and higher performance compared to assigning tasks to lcore 1 with any other core (other
than lcore 37).
Save core 0 for Linux use and do not use core 0 for the DPDK.
Refer below to the initialization of the DPDK application; core 0 is used as the master core.
Do not use core 0 for DPDK applications because it is used by Linux and serves as the master core. For example, l3fwd -c 0x1
... should be avoided, since that would also use core 0 (which serves the master core functionality) for the l3fwd DPDK
application.
Instead, the command l3fwd -c 0x2 ... can be used so that the l3fwd application uses core 1.
In realistic use cases like Open vSwitch* with DPDK, a control plane thread pins to the master core and is responsible for
responding to control plane commands from the user or the SDN controller. So, the DPDK application should not use the
master core (core 0), and the core bit mask in the DPDK command line should not set bit 0 for the coremask.
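A sketch of what this looks like in code: the master core (the first core in the coremask) stays on control plane work while the data plane loop is launched only on the worker lcores. Here lcore_rx_loop is a placeholder for your own polling function:

unsigned lcore_id;

/* Launch the polling loop on every lcore except the master */
RTE_LCORE_FOREACH_SLAVE(lcore_id) {
    rte_eal_remote_launch(lcore_rx_loop, NULL, lcore_id);
}

/* The master core stays here to handle control plane commands */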
The following are a few sample capabilities of distributor micro-benchmarks for performance evaluation.
Time_cache_line_switch ()
How can I measure the time taken for a cache line round-trip between two cores and back again?
Perf_test()
How can I measure the processing time per packet?
Running ring_perf_auto_test in /app/test gives the number of CPU cycles, which enables you to study the performance
difference between single producer/single consumer and multi-producer/multi-consumer. It also shows the differences for
different bulk sizes. See the following screenshot output.
The key takeaway: Using sp/sc with higher bulk sizes gives higher performance.
Please note that even though the default ring_perf_autotest runs through the performance test with bulk sizes of 8 and
32, you can update the source code to include other desired sizes (modify the bulk_sizes[] array to include the bulk sizes of
interest). For instance, find below the output with bulk sizes 1, 2, 4, 8, 16, and 32.
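The edit itself is small; in the DPDK 16.x test_ring_perf.c the array looks roughly like the first line below, and extending it is enough (rebuild the test application afterwards):

/* As shipped (approximately) */
static const unsigned bulk_sizes[] = { 8, 32 };

/* Extended to cover the additional bulk sizes of interest */
static const unsigned bulk_sizes[] = { 1, 2, 4, 8, 16, 32 };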
Hash Function: a) jhash, b) rte_hash_crc
Operation: a) Add on Empty, b) Add Update, c) Lookup
Key Size (bytes): a) 16, b) 32, c) 48, d) 64
Entries: a) 1,024, b) 1,048,576
Entries per Bucket: a) 1, b) 2, c) 4, d) 8, e) 16
The Detailed Test Output section contains detailed test output and the commands you can use to evaluate
performance with your platform. The summary of the result is tabulated and charted below:
DPDK Micro-Benchmarks and Auto-Tests
Focus Area to Improve / Micro-Benchmarks and Auto-Tests to Use

1. Ring for Inter-Core Communication
   Performance comparison of bulk enqueue/bulk dequeue versus single enqueue/single dequeue on a single core
   Measure and compare performance between Intel® HT Technology, cores, and sockets doing bulk enqueue/bulk dequeue on pairs of cores
   Performance of dequeue from an empty ring: http://git.dpdk.org/dpdk/tree/test/test/test_ring_perf.c
   Tx Burst: http://git.dpdk.org/dpdk/tree/test/test/test_ring.c
   Rx Burst: http://git.dpdk.org/dpdk/tree/test/test/test_pmd_ring.c
2. Memcopy
   Cache to cache, cache to memory, memory to memory, memory to cache
   http://git.dpdk.org/dpdk/tree/test/test/test_memcpy_perf.c
3. Mempool
   "n_get_bulk", "n_put_bulk"
   1 core, 2 cores, max cores with cache objects; 1 core, 2 cores, max cores without cache objects
   http://git.dpdk.org/dpdk/tree/test/test/test_mempool.c
5. Hash
   rte_jhash, rte_hash_crc; Add, Lookup, Update
   http://git.dpdk.org/dpdk/tree/test/test/test_hash_perf.c
6. ACL
   Lookup: http://git.dpdk.org/dpdk/tree/test/test/test_acl.c
7. LPM
   Rule with depth > 24: 1) Add, 2) Lookup, 3) Delete
   http://git.dpdk.org/dpdk/tree/test/test/test_lpm.c
   http://git.dpdk.org/dpdk/tree/test/test/test_lpm6.c
   Large route tables: http://git.dpdk.org/dpdk/tree/test/test/test_lpm6_data.h
8. Packet Distribution
   http://git.dpdk.org/dpdk/tree/test/test/test_distributor_perf.c
9. NIC I/O Benchmark
   Measure Tx only, measure Rx only, measure Tx and Rx
   Benchmarks the network I/O pipe: NIC h/w + PMD
   http://git.dpdk.org/dpdk/tree/test/test/test_pmd_perf.c
10. NIC I/O + Increased CPU Processing
   Increased CPU processing: NIC h/w + PMD + hash/LPM, examples/l3fwd
11. Atomic Operations / Lock-rd/wr
   http://git.dpdk.org/dpdk/tree/test/test/test_atomic.c
   http://git.dpdk.org/dpdk/tree/test/test/test_rwlock.c
12. SpinLock
   Takes the global lock, displays something, then releases the global lock
   Takes a per-lcore lock, displays something, then releases the per-lcore lock
   http://git.dpdk.org/dpdk/tree/test/test/test_spinlock.c
13. Software Prefetch
   http://git.dpdk.org/dpdk/tree/test/test/test_prefetch.c
   Usage: http://git.dpdk.org/dpdk/tree/lib/librte_table/rte_table_hash_ext.c
14. Packet Distribution
   http://git.dpdk.org/dpdk/tree/test/test/test_distributor_perf.c
15. Reorder and Seq. Window
   http://git.dpdk.org/dpdk/tree/test/test/test_reorder.c
16. Software Load Balancer
   http://git.dpdk.org/dpdk/tree/examples/load_balancer
17. ip_pipeline
   Using the packet framework to build a pipeline: http://git.dpdk.org/dpdk/tree/test/test/test_table.c
   ACL using the packet framework: http://git.dpdk.org/dpdk/tree/test/test/test_table_acl.c
18. Re-entrancy
   http://git.dpdk.org/dpdk/tree/test/test/test_func_reentrancy.c
19. mbuf
   http://git.dpdk.org/dpdk/tree/test/test/test_mbuf.c
20. memzone
   http://git.dpdk.org/dpdk/tree/test/test/test_memzone.c
21. Virtual PMD
   http://git.dpdk.org/dpdk/tree/test/test/virtual_pmd.c
22. QoS
   http://git.dpdk.org/dpdk/tree/test/test/test_meter.c
   http://git.dpdk.org/dpdk/tree/test/test/test_red.c
   http://git.dpdk.org/dpdk/tree/test/test/test_sched.c
23. Link Bonding
   http://git.dpdk.org/dpdk/tree/test/test/test_link_bonding.c
24. KNI
   1. Transmit, 2. Receive to/from kernel space, 3. Kernel requests
   http://git.dpdk.org/dpdk/tree/test/test/test_kni.c
25. Malloc
   http://git.dpdk.org/dpdk/tree/test/test/test_malloc.c
26. Debug
   http://git.dpdk.org/dpdk/tree/test/test/test_debug.c
27. Timer
   http://git.dpdk.org/dpdk/tree/test/test/test_cycles.c
28. Alarm
   http://git.dpdk.org/dpdk/tree/test/test/test_alarm.c
Compiler Optimizations
Reference: Pyster, Compiler Design and Construction: “Adding optimizations to a compiler is a lot like eating
chicken soup when you have a cold. Having a bowl full never hurts, but who knows if it really helps. If the
optimizations are structured modularly so that the addition of one does not increase compiler complexity, the
temptation to fold in another is hard to resist. How well the techniques work together or against each other is hard
to determine."
Challenge: If you are writing code that bypasses these standard synchronization primitives for optimization purposes,
carefully consider which memory barrier your code requires.
Consideration: x86 provides a processor-ordering memory model in which writes from a given CPU are seen in order
by all CPUs. This is in contrast to weak consistency models, which permit arbitrary reordering limited only by explicit
memory-barrier instructions.
The smp_mb(), smp_rmb(), and smp_wmb() primitives also force the compiler to avoid any optimizations that would
have the effect of reordering memory accesses across the barriers.
Some Intel® Streaming SIMD Extensions (SSE) instructions are weakly ordered (clflush and the non-temporal move
instructions). CPUs that have SSE can use mfence for smp_mb(), lfence for smp_rmb(), and sfence for smp_wmb().
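DPDK wraps these barriers in rte_smp_mb(), rte_smp_rmb(), and rte_smp_wmb() (rte_atomic.h). A classic publish/consume sketch using them; the data and ready variables and the use of the payload are illustrative:

#include <rte_atomic.h>

static volatile int ready = 0;
static int data;

/* Producer core */
static void publish(void)
{
    data = 42;        /* write the payload                       */
    rte_smp_wmb();    /* make the payload visible before the flag */
    ready = 1;        /* publish                                  */
}

/* Consumer core */
static int consume(void)
{
    while (ready == 0)
        rte_pause();  /* spin politely                            */
    rte_smp_rmb();    /* flag read must not pass the data read    */
    return data;
}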
The key takeaway: the RX+TX cost per packet in the poll mode driver test (pmd_perf_autotest) is 54 cycles
with 4 ports and -n 4 memory channels.
What if you need to find the cycles taken for only RX? Or only TX?
To find RX-only time, use the command set_rxtx_anchor rxonly before issuing the command pmd_perf_autotest.
Similarly, to find TX-only time, use the command set_rxtx_anchor txonly before issuing the command
pmd_perf_autotest.
Cache: a) with cache object, b) without cache object
Kept objects (n_keep): a) 32, b) 128
Cores: a) one core, b) two cores, c) max cores
n_get_bulk: a) 1, b) 4, c) 32
n_put_bulk: a) 1, b) 4, c) 32
Number of timers: a) 0, b) 100, c) 1,000, d) 10,000, e) 100,000, f) 1,000,000
Operations measured: Appending, Callback, Resetting
Introduction
Performance is a key factor in designing and shipping best of class products. Optimizing performance
requires visibility into system behavior. In this module, we’ll learn how to use Intel VTune Amplifier to
profile Data Plane Development Kit (DPDK) code.
You will find this module to be a comprehensive reference for installing and using Intel VTune Amplifier. You will also
learn how to run some DPDK micro-benchmarks as an example of how to get deep visibility into the system, cores,
inter-core communication, and core pipeline usage.
Extensive screenshots are provided for comparison with your output, and the commands are given so that you can
copy and paste them wherever possible.
Outline
This module walks you through the following steps to get started using Intel VTune Amplifier with a DPDK
application.
• Install Linux
• Install Data Plane Development Kit (DPDK)
• Install the tools
o Source editor
o Intel VTune Amplifier
• Install and profile the application of your choice
o Distributor application
o Ring tests application
• Conclusion and next steps
Install Linux
Install from the Linux DVD with an ISO image:
http://old-releases.ubuntu.com/releases/15.04/ubuntu-15.04-desktop-amd64.iso
Prior to Install
If you have a laptop installed with Windows* 8, go to safe mode (SHIFT+RESTART). Once in safe mode,
choose boot option # 1 to boot from the external USB DVD drive. Restart and install.
After Install
1. Verify whether the kernel version installed is the correct version as per the DPDK release notes.
$ uname -a
The above output verifies the kernel release as 3.19.0-59-generic, the version number as #66, and the
distro as Ubuntu 64 bit.
$ uname -v
Install DPDK
Download the DPDK
3. Get the latest DPDK release, as shown below and in the screenshot.
$ sudo wget www.dpdk.org/browse/dpdk/snapshot/dpdk-16.04.tar.xz
You will find the DPDK tar file downloaded, as shown below.
$ ls
Install CSCOPE.
$ sudo apt-get install cscope
3. Add the following line to /etc/apt/sources.list.d/ddebs.list as shown below and save it.
deb http://ddebs.ubuntu.com/ vivid main restricted universe multiverse
4. Update the system to load the package list from the new repository.
If you don’t see the resolution error in your system, skip the instructions that follow and proceed to
the next section.
Add the two name servers to the file, as seen in the example below, and save the file.
6. Restart the service. It is necessary to do this before the step that follows, or you’ll still see the resolve
error.
$ sudo /etc/init.d/resolvconf restart
With the above steps, access to http://ddebs.ubuntu.com has been resolved. However, there is a new error, a GPG error,
as shown at the bottom of the screenshot below.
9. With the repository added, the next step is to install the symbol package by running the following
command:
apt-get install linux-image-<release>-dbgsym=<release>.<version>
Please note that the above resulted in an error because it could not locate the package
linux-image-3.19.0-59-generic-dbgsym. If you want to set breakpoints by function name and view local
variables, this error must be resolved.
Algorithm Analysis
Run Basic Hotspots analysis type to understand application flow and identify sections of
code that get a lot of execution time (hotspots).
Use the Advanced Hotspots analysis to extend Basic Hotspots analysis by
collecting call stacks and analyzing the CPI (Cycles Per Instruction) metric. NEW: You can
also use this analysis type to profile native or Java* applications running in a Docker*
container on a Linux system.
Use Memory Consumption analysis for your native Linux or Python* targets to explore
RAM usage over time and identify memory objects allocated and released during the
analysis run.
Run Concurrency analysis to estimate parallelization in your code and understand how
effectively your application uses available cores.
Run Locks and Waits analysis to identify synchronization objects preventing effective
utilization of processor resources.
Microarchitecture Analysis
Run General Exploration analysis to triage hardware issues in your application. This
type collects a complete list of events for analyzing a typical client application.
Use Memory Access analysis to identify memory-related issues, like NUMA problems
and bandwidth limited accesses, and attribute performance events to memory objects
(data structures), which is provided due to instrumentation of memory allocations/de-
allocations and getting static/global variables from symbol information.
For systems with the Intel® Software Guard Extensions (Intel® SGX) feature enabled, run
SGX Hotspots analysis to identify performance-critical program units inside security
enclaves. This analysis type uses the INST_RETIRED.PREC_DIST hardware event that
emulates precise clock ticks, which is mandatory for the analysis on the systems with
Intel SGX enabled.
Run System Overview analysis to review general behavior of a target Linux or Android*
system and correlate power and performance metrics with the interrupt request (IRQ).
Run CPU/GPU Concurrency analysis to identify code regions where your application is
CPU- or GPU-bound.
Use GPU Hotspots analysis to identify GPU tasks with high GPU utilization, and estimate
the effectiveness of this utilization.
For GPU-bound applications running on Intel® HD Graphics, collect GPU hardware
events to estimate how effectively the processor graphics are used.
Collect data on ftrace* events on Android and Linux targets and Atrace* events on
Android targets.
Analyze hot Intel® Media SDK programs and OpenCL™ kernels running on a GPU. For
OpenCL application analysis, use the architecture diagram to explore GPU hardware
metrics per GPU architecture blocks.
Run Disk Input and Output analysis to monitor utilization of the disk subsystem, CPU,
and processor buses. This analysis type provides a consistent view of the storage
subsystem combined with hardware events and an easy-to-use method to match user-
level source code with I/O packets executed by the hardware.
Compute-Intensive Applications Analysis
Run HPC Performance Characterization analysis to identify how effectively your high-
performance computing application uses CPU, memory, and floating-point operation
hardware resources. This analysis type provides additional scalability metrics for
applications that use OpenMP* or Intel® MPI Library runtimes.
Run an algorithm analysis type with the Analyze OpenMP regions option enabled to
collect OpenMP or Intel MPI data for applications using OpenMP or Intel MPI runtime
libraries. Note that HPC Performance Characterization analysis has the option enabled
by default.
For OpenMP applications, analyze the collected performance data to identify
inefficiencies in parallelization. Review the potential gain metric values per OpenMP
region to understand the maximum time that could be saved if the OpenMP region is
optimized to have no load imbalance, assuming no runtime overhead.
For hybrid OpenMP and Intel MPI applications, explore OpenMP efficiency metrics by
Intel MPI processes lying on the critical path.
Source Analysis
Double-click a hotspot function to drill down to the source code and analyze
performance per source line or assembler instruction. By default, the hottest line is
highlighted.
For help on an assembly instruction, right-click the instruction in the Assembly pane and
select Instruction Reference from the context menu.
Configure target options for managed code analysis in the native, managed, or mixed
mode:
Windows host only: Event-based sampling (EBS) analysis for Windows Store C/C++, C#
and JavaScript* applications running in the Attach or System-wide mode.
EBS or user-mode sampling and tracing analysis for Java applications running in the
Launch Application or Attach mode.
Basic Hotspots and Locks and Waits analysis for Python applications running in the
Launch Application and Attach to Process modes.
Custom Analysis
Select the Custom Analysis branch in the analysis tree to create your own analysis
configurations using any of the available VTune Amplifier data collectors.
Run your own custom collector from the VTune Amplifier to get the aggregated
performance data from your custom collection and VTune Amplifier analysis in the
same result.
Import performance data collected by your own or third-party collector into the VTune
Amplifier result collected in parallel with your external collection. Use the Import from
CSV button to integrate the external data to the result.
Collect data from a remote virtual machine by configuring KVM guest OS profiling,
which makes use of the Linux Perf KVM feature. Select Analyze KVM guest OS from the
Advanced options.
(Linux and Windows targets) Native performance analysis with the VTune Amplifier
graphical or command line interface installed on the target system. Analysis is started
directly on the target system.
(Linux and Windows targets) Native hardware event-based sampling analysis with the
VTune Amplifier's Sampling Enabling Product (SEP) installed on the target embedded
system.
So here, even before running the DPDK application, we run top -H to see where the CPU is spending its
cycles without our specific application running.
Below you will see the VTune Amplifier showing top -H and the Firefox* web browser running. Now,
top is something you just ran, whereas Firefox is something you don’t want taking CPU cycles while you
evaluate your application of interest. Similarly, you may find some unwanted daemons. So at this point,
stop any unwanted applications, daemons, and other components.
Pointing to the Source Directory
The following screenshot shows how to point to the source directory of the software components of
interest in VTune Amplifier. You can add multiple directories.
Profiling DPDK Code with VTune Amplifier
1. First, we’ll reserve huge pages. Note that we’ve chosen 128 huge pages here to accommodate a
possible memory constraint when testing on a laptop. If you’re using a server or desktop, you can
specify 1024 huge pages.
$ cd /home/dpdk/dpdk-16.04
$ sudo su
$ sudo bash
$ mkdir -p -v /mnt/huge [-v for verbose, so you can see the response from the system below]
$ mount -t hugetlbfs nodev /mnt/huge
Make the mount point permanent across reboots by adding the following line to the /etc/fstab file:
nodev /mnt/huge hugetlbfs defaults 0 0
Look at /etc/fstab to confirm that /mnt/huge was successfully created and mounted. See the example below:
3. Build the DPDK test application and DPDK library:
$ export RTE_SDK=/home/dpdk/dpdk-16.04
$ export RTE_TARGET=x86_64-native-linuxapp-gcc
$ sudo su
$ ./test
The test will issue the RTE>> prompt as shown below. Enter ? for help and the list of available tests.
Profiling Distributor Perf Autotest
Our first test will be the distributor_perf_autotest. A diagram describing this application is
below.
See below for command window output during the test run.
The VTune Amplifier summary highlights the CPI rate, indicating it is beyond the normal range. It also highlights
Back-End Bound, indicating that the application is memory bound. See these results in the screen capture
below.
Analysis Details
Function/Call Stack indicates that rte_distributor_poll_pkt has a CPI rate of 3.720 and
mm_pause has a CPI rate of 3.867.
You can observe that rte_distributor_get_pkt runs with a CPI rate of 26.30. However, it is not
highlighted, since it uses fewer clock ticks than the highlighted functions.
You will see other functions listed here along with the CPI each one uses, for example:
rte_distributor_process, rte_distributor_request_pkt,
time_cache_line_switch.
Profiling Rings
Communication between cores, as well as between cores and the NIC, happens through rings and descriptors.
While the NIC hardware already optimizes by batching through the report status (RS) bit and the descriptor done (DD) bit,
DPDK further amortizes cost by offering APIs for bulk communication through rings. The graphic below illustrates ring
communication.
The ring tests show that single producer/single consumer (SP/SC) with bulk sizes for both enqueue and dequeue gives
the best performance compared to multiple producer/multiple consumer (MP/MC). Below are the steps.
Profiling ring_perf_autotest
In RTE, select ring_perf_autotest. Test output is shown in the cmd window below.
VTune Amplifier output for ring_perf_autotest shows in detail that the code is backend-bound. You
can see the call stack showing results for SP/SC with bulk sizes as well as MP/MC.
To appreciate the relative performance of SP/SC versus MP/MC, with both single and bulk data sizes, refer to the
following graph. Please note the impact of core placement: a) siblings, b) within the same socket, c) across sockets.
Conclusion and Next Steps
Practice profiling on additional sample DPDK applications. With the experience you gather, extend profiling
and optimization to the applications you are building on top of DPDK.
Get plugged in to the DPDK community to learn the latest from developers and architects and keep your
products highly optimized. Register at https://www.dpdk.org/contribute/.
References
Enabling Internet connectivity: http://askubuntu.com/questions/641591/internet-connection-not-working-
in-ubuntu-15-04
Additional Tools
The previous module helped you to understand how VTune Amplifier can help analyze performance of your
DPDK application. In this module we describe two other tools that you might find helpful.
Intel® Memory Latency Checker
Memory latency has to do with the time used by an application to fetch data from the processor’s cache
hierarchy and memory subsystem. Intel® Memory Latency Checker (Intel® MLC) measures memory latency
and bandwidth under load, with options for more detailed analysis of memory latency between a set of
cores to memory or cache.
Features
By default, Intel MLC identifies system topology and generates the following:
A matrix of idle memory latencies for requests originating from each of the sockets and addressed to
each of the available sockets.
Peak memory bandwidth measurement of requests containing varying numbers of reads and writes to
local memory.
A matrix of memory bandwidth values for requests originating from each of the sockets and addressed
to each of the available sockets.
Latencies at different bandwidth points.
Cache to cache data transfer latencies.
For more information on basic operation of Intel MLC as well as coverage of the command options that
enable finer-grained analysis, read the article Intel Memory Latency Checker v3.5. It describes the
functionality of the most recent version of Intel MLC in detail, and includes download and installation
instructions.
Screenshots
Processor Counter Monitor* (PCM) is an open source project that includes a programming API as well as
several command-line utilities for gathering real-time performance and power metrics for Intel® Core™
processors, Intel Xeon processors, Intel Atom processors, and Intel® Xeon Phi™ processors. It supports
Linux, Windows, and several other operating systems. For detailed information, and to download, visit the
PCM GitHub* repository.
Of the several tools included as part of PCM, which are recommended for use with DPDK? The list below
offers some suggestions. If your application is:
Screenshots
Intel MLC and PCM are handy, easy to use tools that you might find useful. VTune Amplifier is much more
powerful and versatile. If you haven’t used VTune Amplifier, download a free trial copy at the Intel VTune
Amplifier home page.
Acknowledgements
This cookbook is possible only with the whole team’s effort and all the encouragement, support, and review
from each and every one in the internal divisions as well as early access customers, network developers,
and managers.
Notices
Intel technologies’ features and benefits depend on system configuration and may require enabled
hardware, software or service activation. Performance varies depending on system configuration.
Check with your system manufacturer or retailer or learn more at intel.com.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is
granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied
warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any
warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All
information provided here is subject to change without notice. Contact your Intel representative to
obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause
deviations from published specifications. Current characterized errata are available on request.
Copies of documents which have an order number and are referenced in this document may be
obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.
This sample source code is released under the Intel Sample Source Code License Agreement.
Intel, the Intel logo, Intel Atom, Intel Core, Intel SpeedStep, Intel Xeon Phi, VTune, and Xeon are
trademarks of Intel Corporation in the U.S. and/or other countries.
Java is a registered trademark of Oracle and/or its affiliates. OpenCL and the OpenCL logo are
trademarks of Apple Inc. used by permission by Khronos.
Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft
Corporation in the United States and/or other countries.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
*Other names and brands may be claimed as the property of others.
© 2018 Intel Corporation