
DPDK Cookbook

Featuring:

Solution-Oriented Mini Sections

User-Friendly Screenshots

Links to Videos and Online Content

Overview

The DPDK Cookbook modules teach you everything you need to know to be productive with the Data Plane Development
Kit (DPDK). Here’s an overview of the topics covered:

• Build Your Own DPDK Traffic Generator—DPDK-In-A-Box
• DPDK Transmit and Receive—DPDK-In-A-Box
• Build Your Own DPDK Packet Framework with DPDK-In-A-Box
• DPDK Data Plane—Multicores and Control Plane Synchronization
• DPDK Performance Optimization Guidelines White Paper
• Profiling DPDK Code with Intel® VTune™ Amplifier
• References

I highly recommend that you devour the Architecture Overview section of the Programmer’s Guide at dpdk.org. This
excellent document, authored by architects and designers, goes into both the how and the why of DPDK design.
Change is the only constant in this fast-moving field, with some of these components delivering new releases every three
months. Please refer to the related user guides and release notes to be sure you use the latest version when applying
these cookbook recipes. I provide links to many resources, and some of those will inevitably change as well, so please
accept my apology in advance if you encounter a broken link.

Acknowledgements
I’m grateful to many people for their valuable input, including early access customers, architects, design engineers,
managers, platform application engineers, DPDK community, and Intel® Network Builders. In particular, this cookbook is
possible only due to the encouragement, support, and reviews from Jim St Leger, Venky Venkatesan, Tim O’Driscoll, John
DiGiglio, John Morgan, Cristian Dumitrescu, Sujata Tibrewala, Debbie Graham, Ray Kinsella, Jasvinder Singh, Deepak Jain,
Steve Cunming, Heqing Zhu, Dave Hunt, Kannan Ramia Babu, Walt Gilmore, Mike Glynn, Curran Greg, Ai Bee Lim, Larry
Wang, Nancy Yadav, Chiu-Pi Shih, Deepak S, Anand Jyoti, Dirk Blevins, Andrew Duignan, Todd Langley, Joel Auernheimer,
Joel Schuetze, and Eric Heaton.

About the Author


Muthurajan Jayakumar (M Jay) has worked with the DPDK team since 2009. He joined Intel in 1991 and
has worked in various roles and divisions with Intel, including roles as a 64-bit CPU front side bus
architect, and as a 64-bit HAL developer. M Jay holds 21 US patents, both individually and jointly, all
issued while working at Intel. M Jay was awarded the Intel Achievement Award in 2016, Intel's highest
honor based on innovation and results. Before joining Intel, M Jay architected a CPU node board for a
1000-node machine design in India. M Jay won a gold medal for graduating first in his university's ECE class of 1984
at TCE, Madurai.

Please send your feedback about the DPDK Cookbook to Muthurajan.Jayakumar@intel.com.

Getting Started: Documentation and Tools


Key Documentation
The dpdk.org site contains a rich set of documentation. The table below highlights some of the guides and other materials that
you’ll find useful to familiarize yourself with DPDK programming. You can read the guides online or download many of
them in PDF form.
Programmer's Guide — The guide you must read first.
Quick Start Guide — Simple forwarding test with the pcap PMD, which works with any NIC.
API Documentation — All libraries and APIs.
Supported NICs — The list of supported NICs grows with each new release of DPDK. Refer to this document for the latest list.
Network Interface Controller Drivers — Poll mode drivers for supported NICs, virtual as well as physical.
DPDK Sample Application User Guide — More than 40 sample applications; find the closest match to your application.
DPDK Testpmd Application User Guide — The key DPDK tool with port, NIC set, and show commands.
Release Notes — The latest features, issues addressed, and issues to be addressed in the future.
Getting Started Guide for Linux* — Build, install, and getting started.
How-To Guides — Covers topics such as live migration of a VM with SR-IOV VF, live migration of a VM with Virtio on a host running vhost_user, a flow bifurcation guide, and more.
Crypto Device Drivers — Contains crypto device supported functionality matrices and details about support for many drivers, including:
  • AES-NI Multi Buffer Crypto Poll Mode Driver
  • AES-NI GCM Crypto Poll Mode Driver
  • KASUMI Crypto Poll Mode Driver
  • Null Crypto Poll Mode Driver
  • SNOW 3G Crypto Poll Mode Driver
  • Intel® QuickAssist Crypto Poll Mode Driver
FAQ — Frequently asked questions.
Getting Started Guide for FreeBSD — The FreeBSD counterpart of the Linux Getting Started Guide.
Contributor's Guidelines — Do you want to contribute code and/or documentation to the DPDK community?

Frequently Used Tools and Scripts


The scripts listed in this section can be found in the tools or scripts subdirectories of your DPDK install. Below
are some frequently used tools and scripts to study.

./tools/setup.sh — Menu-driven setup script (tools subdirectory)
./tools/dpdk_nic_bind.py — Binds a NIC to a driver (tools subdirectory)
./tools/cpu_layout.py — Shows lcore numbers and layout (tools subdirectory)
./tools/pmdinfo.py — Shows PMD information (tools subdirectory)
http://dpdk.org/doc/dts/gsg — DPDK Test Suite (DTS) Getting Started Guide

Tool Usage Examples
Finding Memory Information with Linux* Command /proc/meminfo
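For example, the following standard Linux commands (not DPDK-specific) print the memory counters, or just the huge page fields:

# cat /proc/meminfo
# grep -i huge /proc/meminfo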
Finding Huge Page Information with ./setup.sh
./setup.sh has an option to list huge page information from /proc/meminfo (option 29 in the version of DPDK shown
here).

Binding/Unbinding a NIC with ./dpdk_nic_bind.py
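A quick sketch of typical usage, run from the DPDK root directory (--status and --bind are the script's standard options; the PCI address shown is illustrative):

# ./tools/dpdk_nic_bind.py --status
# ./tools/dpdk_nic_bind.py --bind=igb_uio 03:00.0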


Finding CPU layout with ./cpu_layout.py
CPU info can also be found with DPDK script ./cpu_layout.py.

More Scripts in the scripts Subdirectory


Fill in the description for each script:

auto-config-h.sh
check-git-log.sh
check-maintainers.sh
checkpatches.sh
cocci.sh
depdirs-rule.sh
gen-build-mk.sh
gen-config-h.sh
load-devel-config.sh
relpath.sh
test-build.sh
test-null.sh
validate-abi.sh

Build Your Own DPDK Traffic Generator—DPDK-In-A-Box


Introduction
The purpose of this cookbook module is to guide you through the steps required to build a Data Plane Development Kit
(DPDK) based traffic generator.

We built a DPDK-in-a-Box using the MinnowBoard Turbot* Dual Ethernet Dual-Core, which is a low cost, portable platform
based on the Intel Atom® processor E3826. For the OS, we installed Ubuntu* 16.04 client with DPDK. The instructions in
this document are tested on our DPDK-in-a-Box, an Intel® Core™ i7-5960X processor Extreme Edition brand desktop, and
an Intel® Xeon® Scalable processor. You can use any Intel® architecture platform to build your own device.
For the traffic generator, we use the TRex* realistic traffic generator. The TRex package is self-contained and can be easily
installed.

Any Intel® processor-based platform will work—desktop, server, laptop, or embedded system.

The DPDK Traffic Generator


Block Diagram

Software
• Ubuntu 16.04 Client OS with DPDK installed
• TRex Realistic Traffic Generator

Hardware
Our DPDK-in-a-Box uses a MinnowBoard Turbot Dual Ethernet Dual-Core single board computer:

• Of the three Ethernet ports, the two at the bottom are for the traffic generator (dual gigabit Intel® Ethernet Controller I350). Connect a loopback cable between them.
• Connect the third Ethernet port to the Internet (to download the TRex package).
• Connect the keyboard and mouse to the USB ports.
• Connect a display to the HDMI interface.
The MinnowBoard Turbot* Dual Ethernet Dual-Core

The MinnowBoard includes a microSD card and an SD adapter.

• Insert the microSD card into the microSD slot. (Ignore the SD adapter; it is not needed.)
• Power on the DPDK-in-a-Box system. Ubuntu will be up and running right away.

Install and Configure the TRex* Traffic Generator

Choose the username test and assign the password tester (or use the username and password specified by the Quick
Start Guide that comes with the platform).

• Log on as root and verify that you are in the /home/test directory with the following two commands:
# sudo su
# ls

Note NIC Information


The configuration file for the traffic generator needs the PCI bus-related information and the MAC address. Note this
information first using Linux* commands, because once the DPDK or packet generator is run, these ports are unavailable to
Linux.

1. For PCI bus-related NIC information, type the following command:


# lspci
You will see the following output. Note down that for port 0 the bus:device.function information is
03:00.0, and for port 1 the information is 03:00.1.
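On our I350-based system, the relevant lspci lines look similar to the following (exact wording varies by platform):

03:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection
03:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection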

2. Find the MAC address with this command:

# ifconfig

You will see the following output. Note down that for port 0 the MAC address is 00:30:18:CB:F2:70 and for port 1
the MAC address is 00:30:18:CB:F2:71.

Note that the first port in the screenshot below, enp2s0, is the port connected to the Internet. No need to make a
note of this.
Here’s what I recorded after these two steps:

Item Port 0 Port 1

PCI Bus-related NIC info (from lspci) 03:00.0 03:00.1

MAC address 00:30:18:CB:F2:70 00:30:18:CB:F2:71

Fill the following table with the information you gathered from your specific platform:

Item Port 0 Port 1

PCI Bus-related NIC info (from lspci)

MAC address
If you succeeded in using ifconfig to get the port information described above, skip the next section and move on to
the section titled Install the Traffic Generator.

Troubleshooting – Ports Not Found


What if you don't see both ports in response to the ifconfig command? One possible reason is that a DPDK-based
application was run previously and claimed those ports, making them unavailable to the kernel. In that case, you need
to unbind the ports from the DPDK driver so that the kernel can claim them and you can find the MAC address with the
ifconfig command.

Root Cause
ifconfig is not showing the two ports below. Why?

The reason that ifconfig is unable to find the two ports is possibly because the DPDK application was previously run
and was aborted without releasing the ports, or it might be that a DPDK script runs automatically after boot and claims the
ports. Regardless of the reason, the solution below will enable ifconfig to show both ports.

Solution
1. Run ./setup.sh in the directory /home/test/dpdk/tools.

2. Display current Ethernet device settings.


Select Display current Ethernet device settings (option 23 in this case).

You can see that two ports are claimed by the DPDK driver.

The NICs in use by DPDK (specifically IGB-UIO)


3. Unbind the first port from IGB UIO.

Select option 30 and then enter the PCI address of the device to unbind:

4. Bind the kernel driver igb to the device:

If the inputs entered are correct, the script acknowledges OK.


5. Verify by displaying current Ethernet device settings.

Success!

Above you will see the first port, 0000:03:00.0, bound to the kernel.

Repeat steps 3–5 to unbind the second port, 0000:03:00.1, from IGB UIO and bind it to IGB.

Use the ifconfig command to show that both ports are bound back to the kernel.
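If you prefer not to use the menu, the same unbind/rebind can be done directly with the binding script (a sketch, assuming the PCI addresses noted earlier and the igb kernel driver used above):

# ./tools/dpdk_nic_bind.py --bind=igb 0000:03:00.0
# ./tools/dpdk_nic_bind.py --bind=igb 0000:03:00.1
# ./tools/dpdk_nic_bind.py --status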
Install the Traffic Generator
In the following sections, we will assume that you successfully found the ports and have noted down the MAC addresses.

Keeping in mind my earlier note that change is the only constant thing in this fast-moving field, refer to the current TRex
user manual to make sure you have the latest script names, directory structure, and release information relevant to this
recipe. Enter the following commands:
# pwd
# mkdir trex
# cd trex
# wget --no-cache http://trex-tgn.cisco.com/trex/release/latest

You should see that the install is complete and saved in /home/test/trex/latest:
The next step is to untar the package:
# tar -xzvf latest

Below you see that version 2.08 is the latest version at the time of this screen capture:

# ls -al

You will see the directory with the version installed. In this exercise, the directory is v2.08, as shown below in response to
the ls -al command. Change directory to the version installed on your system; for example, cd <dir name with
version installed>:
# cd v2.08

# ls -al
You will see the file t-rex-64, which is the traffic generator executable:

Configure the Traffic Generator


The good news is that the TRex package comes with a sample config file cfg/simple_cfg.yaml. Copy that to
/etc/trex_cfg.yaml and edit the file by issuing the following commands, making sure that you’re in your
/home/test/trex/<your version> directory:
# pwd
# cp cfg/simple_cfg.yaml /etc/trex_cfg.yaml
# gedit /etc/trex_cfg.yaml
Edit the file as shown below with the applicable NIC information you gathered in previous steps:

Below is a line-by-line description of the configuration information required for /etc/trex_cfg.yaml:

• port_limit should be 2 (since DPDK-in-a-Box has two ports).
• version should be 2.
• interfaces should list the PCI bus ports you gathered using lspci. In this exercise they are ["03:00.0", "03:00.1"].
• The port information contains a dest_mac, src_mac pair for each port, which will appear in the headers of the
  generated traffic. The first pair is for port 0. Since port 0 is connected to port 1, the first dest_mac is the MAC
  address of port 1. The second pair is for port 1. Since port 1 is connected to port 0, the second dest_mac is the
  MAC address of port 0. Please note that when you connect an appliance into which traffic must be injected, the
  dest_mac addresses will be those of the appliance.
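Putting the pieces together, the edited /etc/trex_cfg.yaml looks roughly like the sketch below (values taken from the table recorded earlier; key names and the MAC address syntax vary between TRex releases, so verify against the TRex manual for your version):

- port_limit    : 2
  version       : 2
  interfaces    : ["03:00.0", "03:00.1"]
  port_info     :
    - dest_mac  : "00:30:18:cb:f2:71"   # MAC of port 1 (loopback peer of port 0)
      src_mac   : "00:30:18:cb:f2:70"
    - dest_mac  : "00:30:18:cb:f2:70"   # MAC of port 0 (loopback peer of port 1)
      src_mac   : "00:30:18:cb:f2:71"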

Note Platform lcore Count


This section is for informational purposes only.

# cat /proc/cpuinfo will give you the logical core (lcore) information as shown in the Exercises section.
Why is this information useful?

The command line below that runs the traffic generator uses the -c option to specify the number of lcores to be used for
the traffic generator. You want to know how many lcores exist in the platform. Hence, issuing cat /proc/cpuinfo
and eyeballing the number of lcores available in the system will be helpful.

Run the Traffic Generator


# sudo ./t-rex-64 -f cap2/dns.yaml -c 1 -d 100

What are the parameters -f, -c, and -d?

-f specifies the YAML traffic configuration file.
-c specifies the number of cores. Monitor the CPU percentage of TRex—it should be ~50 percent. Add cores accordingly.
-d specifies the duration of the test in seconds (default: 0).
Below are three output screens: 1) During the traffic run, 2) Linux top command output, and 3) Final output after the
completion of the run.

Screen output showing traffic during run (15 packets so far Tx and Rx).

Output of top -H command during the run.


Screen output after completing the run (100 packets Tx and Rx).

Congratulations! By completing the above hands-on exercise, you have successfully built your own DPDK based traffic
generator.

Next Steps
As a next step, you can connect two DPDK-in-a-Box platforms back to back, and use one as a traffic generator and the
other as a DPDK application development and test vehicle.

Exercises
1. How would you configure the traffic generator for different packet lengths?
2. To run the traffic generator forever, what should be the value of -d?
3. How would you measure latency (assuming you have more cores)?
4. Reason out the root cause and find the solution by looking up the error "Note that the uio or vfio kernel modules
to be used should be loaded into the kernel before running the dpdk-devbind.py script" in Chapter 3 of the
dpdk.org document Getting Started Guide for Linux.

DPDK Transmit & Receive Loopback—DPDK-In-A-Box


Introduction

In the previous module, Build Your Own DPDK Traffic Generator—DPDK-In-A-Box, you learned how to build a DPDK traffic
generator. Once you've done this, the next step is to connect two platforms back to back, and use one as the DPDK
traffic generator and the other as a DPDK application development and test vehicle. But what if you have just one
system? Read on to learn how to generate traffic and run your DPDK application on the same machine.

Traffic and the DPDK Application on a Single System


The purpose of this article is to show how to configure a single system to run the DPDK application and provide auto-
generated traffic. To provide the traffic, we will showcase testpmd, which many DPDK developers and customers consider
to be the stethoscope of a DPDK developer.

For your system, you can use any Intel® platform. The instructions in this article have been tested with an Intel® Xeon®
processor-based desktop, server, and laptop using either the DPDK traffic generator you built from scratch, or a
commercially available DPDK-in-a-Box. The latter is a low-cost, portable platform based on the Intel Atom E3826
processor. At the time this article was published, it was possible to purchase a DPDK-in-a-Box; look online if you're
interested in this option.

If you are new to DPDK, spend some time reading the DPDK Programmer’s Guide at dpdk.org.

Stethoscope of DPDK Developer—Testpmd

Auto-Generating Traffic with tx_first Parameter


Challenge
Data plane applications need a traffic generator. The DPDK provides both RX and TX functionality, and DPDK
applications are built on poll mode drivers that use the RX and TX libraries. With a DPDK poll mode driver, the RX
path of the driver polls for ingress traffic, and after applicable processing, the TX path of the driver transmits
the processed data to the egress interface.

Because the start of the data path involves polling for packets received, data plane applications need a traffic
generator. But here we have only one platform. How do you configure the application for this traffic?

Solution
This is where the testpmd tx_first parameter comes in handy. When testpmd is started with the
tx_first parameter, the TX function gets executed first—hence the name tx_first—and with an
external cable connecting RX and TX, those packets are now available for the RX function to poll. Thus, you
have achieved running traffic through testpmd without an external traffic generator.
The following screenshots show how to start testpmd and run tx_first with a loopback cable in place.
Starting testpmd
./x86_64-native-linuxapp-gcc/app/testpmd -- -i starts testpmd. -i stands for interactive. Please refer
to the Testpmd Application User Guide at dpdk.org and to the article Testing DPDK Performance and Features with
TestPMD on Intel® Developer Zone for more information about how to build and run testpmd.
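A typical full invocation also passes EAL options before the -- separator, for example (the coremask and memory-channel count below are illustrative placeholders for your platform):

# ./x86_64-native-linuxapp-gcc/app/testpmd -c 0x3 -n 4 -- -i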

While you can use the above command in your specific platform, it is available as a script named run.sh in DPDK-in-a-Box.

./run.sh starts testpmd, as shown below, yielding the testpmd prompt. Look at the flowchart. What does the
initial portion of testpmd do? And what does the runtime portion of testpmd do? Our next step is to initialize
testpmd.
Initialization
Initialization consists of three steps as shown below.

1. EAL (Environment Abstraction Layer)—Find the number of cores and probe PCI devices
2. Initialize memory zones and memory pools
3. Configure the ports for data path operation
Operation
After initialization, data path operations start and continue in a loop. The Poll Mode Driver (PMD) section of the DPDK
Programmer’s Guide is a must-read chapter. It will help you understand and appreciate how you can get the juice out of
your system and achieve the desired throughput.

Optimization Knobs You Should Understand


You may want to read through the documentation to fully understand the optimization knobs like those shown in the last
paragraph of the output from the start tx_first command.

For example, you can note that the RX descriptor count is 128, whereas the TX descriptor count is shown as 512. Why
are they not equal? And why is the TX descriptor count four times the RX descriptor count? Also, analyze the values
indicated for the RX threshold registers: pthresh = 8, hthresh = 8, and wthresh = 4; whereas for the TX registers:
pthresh = 8, hthresh = 1, and wthresh = 16.

Reading the data sheet, and more importantly the optimization white papers of the Intel® 82599 10-GbE Ethernet
Controller, at least the Receive and Transmit sections, will help you to understand and make the best use of your knobs.

Running testpmd Using tx_first Option


testpmd> start tx_first

As shown above, at the testpmd> prompt, enter start tx_first to auto-generate the traffic. Packets are transmitted
when the testpmd> command returns. Please note that packets continue to be generated until they are stopped.

In this case, let it run for 10 to 20 seconds, and then stop the run.

testpmd> stop

Below you can see the RX and TX total packets per port as well as accumulated totals for both ports.
Starting testpmd Without tx_first
What happens when you start testpmd without tx_first? The flowchart below shows that RX starts polling first, so in
this case you need a traffic generator.

Flowchart for the case of without tx_first.

We saw that with tx_first, TX gets executed first. This emits packets before polling begins, thus auto-generating the
traffic. This allows us to run testpmd with traffic using a single platform.

Now, draw the flowchart for using tx_first.

Learn More About testpmd


Use -h or --help to find the available command-line options and thus the available functionality of testpmd.

I highly recommend that you learn hands-on the functionalities of interest to you and note them in the Exercise section.
Once you are done, quit by using the quit command.
quit, as shown below, does the following:

• Stops the ports
• Closes the ports

Summary
At this point, you've configured a single system to run the DPDK application and auto-generated traffic with the
testpmd tx_first option. Test your knowledge with the exercises below.

Exercises
1. tx_first auto-generates traffic. How are the parameters of the traffic programmed in this case?
2. You saw the flowchart for the case of without tx_first. Draw the flowchart for the case with tx_first.
3. Note each command-line option and functionality you tried with testpmd, and list what you learned about each one,
with any suggestions you may have.
4. What is the difference between detaching a port and closing a port? Where will you use detaching a port? Where will
you use closing a port?
5. What is the difference between dpdk_nic_bind.py and dpdk-devbind.py? Explain.
6. Search the Internet and the dpdk.org dev mailing list to analyze the root cause and find the solution for the error
"Note that the uio or vfio kernel modules to be used should be loaded into the kernel before running the
dpdk-devbind.py script."

Build Your Own DPDK Packet Framework with DPDK-In-A-Box

Introduction
The title of this module might just as well be “Build Your Own Software Defined DPDK
Application.” The DPDK packet framework uses a modular building block approach
defined by a configuration file to build complex DPDK applications. For an overview of
the value of the DPDK packet framework, watch the short video Deep Dive into the
Architecture of a Pipeline Stage before you get started with this module.

Here you will build a DPDK packet framework with just two cores—one for master core
tasks and the other to perform DPDK application functions.

For hardware, you can use any IA platform—Intel Xeon brand or Intel Atom brand
desktop, server, or laptop. We will use DPDK-in-a-Box here. This is a low-cost, portable
platform based on the Intel Atom E3826 processor.

To build your own DPDK-in-a-Box, or learn where to purchase a DPDK-in-a-Box, please see the earlier module in this
cookbook, Build Your Own DPDK Traffic Generator—DPDK-In-A-Box.

Set DPDK Traffic Generator MAC Addresses


If you remember, when we built a DPDK traffic generator, the configuration file of the DPDK traffic generator was set
with its own ports' MAC addresses, since we looped the ports back on themselves.

Here we are connecting the ports to an external system running the DPDK packet framework, so we will set the MAC
addresses in the DPDK traffic generator configuration to match those of that external system.

It is left as an exercise for the developer to find out the MAC addresses and set the traffic generator configuration
file with them.
Update the Configuration File for the DPDK Packet Framework
The DPDK packet framework configuration files provide a software defined modular way to implement complex DPDK
applications. If your system has multiple lcores available for packet processing, you can implement both run-to-completion
as well as pipelined applications. Here, since we have only one core for packet processing with DPDK-in-a-box, we will
showcase the run-to-completion implementation.
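The packet framework is showcased by the ip_pipeline sample application; below is a minimal sketch of the kind of configuration file it consumes (section and key names follow the DPDK sample application guide of that era; the exact syntax varies by DPDK version). Here the master core runs on core 0 and a single run-to-completion pass-through pipeline runs on core 1:

[PIPELINE0]
type = MASTER
core = 0

[PIPELINE1]
type = PASS-THROUGH
core = 1
pktq_in = RXQ0.0 RXQ1.0
pktq_out = TXQ0.0 TXQ1.0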

Building and Installing DPDK Packet Framework


We’ll now go through the steps for building and installing DPDK packet framework, as described in the DPDK sample
application user’s guide.

Running the Traffic Through DPDK Packet Framework


Connect the systems together. Run the traffic generator continuously (finding the command-line option for this was
part of the exercise in the traffic generator cookbook section). This runs the traffic through the DPDK packet framework.

Run Your Application that is Software Defined by Packet Framework


• Run your application that is software defined by the packet framework.
• Use profilers to find out where the CPU is spending most of its cycles and whether that is in line with your expectations.
• Write up your observations and share them with the community at dpdk.org.

Summary
Application developers will benefit from understanding DPDK's assumptions about the roles and responsibilities of
applications. Comprehending the scope of what DPDK does and does not manage helps you architect correctly from the
get-go with respect to thread safety, lockless API call usage, multiprocessor synchronization, and control plane and
data plane synchronization.

Exercise
1. Draw your software-defined application block diagram.

DPDK Data Plane—Multicores and Control Plane Synchronization


Introduction
Many developers and customers are under the impression that DPDK documentation and sample applications include only
data plane applications. In a real-life scenario, it is necessary to integrate the data plane with the control and management
plane. The purpose of this cookbook module is to describe some simple multiplane scenarios.

For hardware, you can use any IA platform—Intel Xeon brand or Intel Atom brand desktop, server, or laptop. We will use
DPDK-in-a-Box here. This is a low-cost, portable platform based on the Intel Atom E3826 processor.

To build your own DPDK-in-a-Box, please see the earlier module in this cookbook, Build Your Own DPDK Traffic
Generator—DPDK-In-A-Box.

Simple Scenarios—Data Plane and Control Plane Interactions


Every product or appliance will have its own share of data plane and control plane functionality. Here we will
illustrate two simple run-time scenarios:

1. A NIC port configuration change
2. Changing the port itself

Scenario 1: Change of Hardware


In a multiple core server with as many as 72 lcores (lcore stands for logical core), with multiple NIC ports performing
packet processing in parallel, how do you synchronize the control plane operations with the data plane? What do you need
to understand in order to play by the DPDK rules of the game?
What must be synchronized in order to pull out a transceiver and insert it again—during runtime?

Likewise, what must be synchronized to pull out a transceiver, say 10 Gig, and insert a completely different one, say 1 Gig?

If a change of hardware device requires a release of the instance of the device that was removed and creation of a device
instance for the new one, what do you do with threads that are still accessing the data structures of the original instance?

Releasing resources requires coordination with the user space applications using those resources. Applications can be
using a single core or multiple cores. If the resource being released is used by multiple cores, we need to request an
acknowledgement handshake from each core in use, indicating that they are all finished with the resource and it can be
released safely.

Scenario 2: No Change of Hardware but Change of Parameter


Assume you are not changing any hardware during runtime. But you do want to change some global parameter—say MTU.
This may be a lightweight initialization compared to the previous case of heavy weight initialization. So, when you use an
API that does a lightweight initialization, which parameters can you expect to be persistent across the operation and which
parameters can you not assume will remain the same? This is very useful information to know in order to correctly change
parameters during runtime.

Please note that this is only part of the story. The other part is synchronizing with data plane applications that are running,
that is, waiting for a resource to be available, so that APIs to reconfigure can be called. We will look at that and refer to
pointers available in DPDK documentation and source code.

Before we get into these details, let’s step back and look at the big picture:

1. What are the core assumptions DPDK makes in terms of concurrency?


2. What are the boundaries of what DPDK controls and what the application must manage to ensure
synchronization?

Rules for Polling Queues


Can Multiple Cores Poll One RX Queue Simultaneously?
By design, the receive function of a PMD CANNOT be invoked in parallel on multiple, that is, two or more logical cores to
poll the same RX queue [of the same port]. What is the benefit of this design? It is that all the functions of the Ethernet
Device API exported by a PMD are lock-free functions. This is possible because the receive function will not be invoked in
parallel on different logical cores to work on the same target object.
The PMD RX function CANNOT be invoked in parallel on two or more lcores to poll the same queue [of the same port].

In the case of a single RX queue per port, only one core at a time can do RX processing.

When you have multiple RX queues per port, each queue can be polled by only one lcore at a time. Thus, if you have 4 RX
queues per port, you can have four cores simultaneously polling the port if you’ve configured one core per queue.

Can You Have Eight Cores and Four RX Queues per Port?
No, since that assigns more than one core per RX queue.

Can You Have Four Cores with Eight RX Queues per Port?
We can only answer this question by knowing full configuration details. Even though you have more RX queues than cores,
if you have configured two cores for any single RX queue, that is not allowed. The key is not having more than one core per
RX queue, irrespective of more queues in total available, compared to the number of cores.

Can One lcore Poll Multiple RX Queues?


Yes. One lcore can poll multiple RX queues. What is the maximum number of RX queues that one lcore can poll? That
depends on performance requirements and how much headroom should be available for applications after servicing some
number of queues. Packet size and packet arrival rates also constrain the cycle budget available on the core.

Note that the RX queues one lcore polls need not be consecutive. This is clear from the figure below: lcore 0 polls
RX Queue 0 and RX Queue 2, but does not poll RX Queue 1 and RX Queue 3.
One lcore polling multiple RX queues.

Who is Responsible for Mutual Exclusion so that Multiple Cores Don’t Work on the Same Receive Queue?
The one-line answer is—you—the application developer. All the functions of the Ethernet Device API exported by a PMD
are lock-free functions which are not to be invoked in parallel on different logical cores to work on the same target object.

For instance, the receive function of a PMD cannot be invoked in parallel on two logical cores to poll the same RX queue
[on the same port].

Of course, this function can be invoked in parallel by different logical cores on different RX queues.

Please note and be aware that it is the responsibility of the upper-level application to enforce this rule.

If you don’t design your application to enforce this exclusion, allowing multiple cores to step on each other while accessing
the device, you will get segmentation errors and crashes for sure. DPDK goes with lockless accesses for high performance
and assumes that you, as a higher-level application developer, will ensure that multiple cores do not work on the same
receive queue.

What if Your Design Requires Multiple Cores to Share Queues?


If needed, parallel accesses to shared queues by multiple logical cores must be explicitly protected by dedicated
inline lock-aware functions built on top of the corresponding lock-free functions of the PMD API.

TX Port: Why Should Each Core be Able to Transmit on Each and Every Transmit Port?
We saw that for an RX queue, an lcore can only poll a subset of RX ports, but what about TX ports? Can an lcore connect
only to a subset of TX ports in the system? Or should each and every lcore connect to all TX ports?
The answer is that a forwarding operation running on an lcore may result in a packet destined for any TX port in the
system. Because of this, each lcore should be able to transmit to each and every TX port.

An lcore can poll only a subset of RX ports, but can transmit to any TX port in the system.

While the Data Plane can be Parallel, the Control Plane is Sequential
Control plane operations like device configuration, queue (RX and TX) setup, and device start depend on certain sequences
to be followed. Hence, they are sequential.

Device Setup Sequence


To set up a device, follow this sequence:
rte_eth_dev_configure()
rte_eth_tx_queue_setup()
rte_eth_rx_queue_setup()
rte_eth_dev_start()

After that, the network application can invoke, in any order, the functions exported by the Ethernet API to get the MAC
address of a given device, the speed and the status of a device physical link, receive/transmit packet bursts, and so on.
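A minimal sketch of this sequence for one port with one RX and one TX queue is shown below (default configurations; mempool creation and error handling are elided, and the 512/128 descriptor ring sizes are the illustrative defaults discussed later in this cookbook):

#include <rte_ethdev.h>

static int
setup_port(uint16_t port_id, struct rte_mempool *mbuf_pool)
{
    struct rte_eth_conf port_conf = {0};  /* default device configuration */

    if (rte_eth_dev_configure(port_id, 1 /* RX queues */, 1 /* TX queues */,
                              &port_conf) < 0)
        return -1;
    if (rte_eth_tx_queue_setup(port_id, 0, 512 /* TX ring descriptors */,
                               rte_eth_dev_socket_id(port_id), NULL) < 0)
        return -1;
    if (rte_eth_rx_queue_setup(port_id, 0, 128 /* RX ring descriptors */,
                               rte_eth_dev_socket_id(port_id), NULL,
                               mbuf_pool) < 0)
        return -1;
    return rte_eth_dev_start(port_id);    /* device is now ready for I/O */
}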

Summary
Application developers will benefit from understanding DPDK assumptions regarding application roles and responsibilities.
To start, it’s important to comprehend the scope of DPDK’s roles and responsibilities. This will help you to correctly
architect from the get-go in terms of thread safety, lockless API call usage, multiprocessor synchronization, and control
plane and data plane synchronization.

Next Steps
Architect a couple of your own usage models of the data plane coexisting with the control and management plane. Look
for similar approaches used by testpmd and other applications, and described by the DPDK HowTo Guides. Test them out.

Exercises
1. Can you have eight cores per port with four RX queues per port?
2. Can you have four cores per port with eight RX queues per port?
3. What are the implications of multiple cores transmitting on one transmit port—in terms of control plane and data
plane synchronization?
4. Control plane operations—should they be done in interrupt context itself or as a deferred procedure?

DPDK Performance Optimization Guidelines White Paper


Abstract
This paper illustrates best-known methods and performance optimizations used in the Data Plane Development Kit
(DPDK). DPDK application developers will benefit by implementing these optimization guidelines in their applications. A
problem well stated is a problem half solved, thus the paper starts with profiling methodology to help identify the
bottleneck in an application. Once the type of bottleneck is identified, this module will help you determine the
optimization mechanism that DPDK uses to overcome the bottleneck. Specifically, we refer to the respective sample
application and code snippet that implements the corresponding performance optimization technique. The module
concludes with a checklist flowchart that DPDK developers and users can use to ensure they follow the guidelines given
here.

For cookbook-style instructions on how to do hands-on performance profiling of your DPDK code with VTune™ tools, refer
to the module Profiling DPDK Code with Intel VTune Amplifier.

Strategy and Methodology


A chain is really only as strong as its weakest link. So, the strategy is to use profiling tools to identify hotspots in the
system. Once a hotspot is identified, look up the corresponding optimization technique, along with the sample application
and code snippet that show how it is already solved and implemented in the DPDK. At this stage, implement those specific
optimization techniques in your application. You can also run the respective micro-benchmarks and unit tests provided
with the DPDK.

Once the particular hotspot has been addressed, the application is again profiled to find the next hotspot in the system.
The above methodology is repeated to the point of satisfaction in terms of achieving desired performance.

The performance optimization involves a gamut of considerations shown in the checklist below:

1. Optimize the BIOS settings.


2. Efficiently partition non-uniform memory access (NUMA) resources with improved locality in mind.
3. Optimize the Linux configuration.
4. To validate each configuration change, run l3fwd—as is with default settings—and compare with published
performance numbers.
5. Run micro-benchmarks to pick and choose optimum high-performance components (for example, bulk
enqueue/bulk dequeue as opposed to single enqueue/single dequeue).
6. Pick a sample application that is similar to the target appliance, using the already fine-tuned optimum default
settings (for example, more TX buffer resources than Rx).
7. Adapt and update the sample application (for example, # of queues). Compile with the correct optimization flag
levels.
8. Profile the chosen sample application in order to have a known good comparison base.
9. Run with optimized command-line options, keeping improved locality and concurrency in mind.
10. Match the application and algorithm to the underlying architecture: profile to determine whether the application
is memory-bound, I/O-bound, or CPU-bound.
11. Apply the corresponding solution: software prefetch if memory-bound, block mode if I/O-bound, and a considered
decision on whether to use Intel® Hyper-Threading Technology (Intel® HT Technology) if CPU-bound.
12. Rerun profiling—Front-end pipeline stall? Back-end pipeline stall?
13. Apply corresponding solution. Write efficient code—branch prediction, loop unroll, compiler optimization, and so
on.
14. Still don't have desired performance? Back to #9.
15. Record best-known methods and share in dpdk.org.
Recommended Pre-reading
It is recommended that you read, at a minimum, the DPDK Programmer’s Guide, and refer to the DPDK Sample Application
User Guides before proceeding.

Please refer to other DPDK documents as needed.

BIOS Settings
To get repeatable performance, DPDK L3fwd performance numbers are achieved with the following BIOS settings:

NUMA — ENABLED
Enhanced Intel SpeedStep® technology — DISABLED
Processor C3 — DISABLED
Processor C6 — DISABLED
Intel® Hyper-Threading Technology — ENABLED
Intel® Virtualization Technology for Directed I/O — DISABLED
Intel® Memory Latency Checker (Intel® MLC) Streamer — ENABLED
Intel® MLC Spatial Prefetcher — ENABLED
DCU Data Prefetcher — ENABLED
DCU Instruction Prefetcher — ENABLED
CPU Power and Performance Policy — Performance
Memory Power Optimization — Performance Optimized
Memory RAS and Performance Configuration -> NUMA Optimized — ENABLED


Please note that if the DPDK power management feature is to be used, Enhanced Intel SpeedStep® technology must be
enabled. In addition, C3 and C6 should be enabled. However, to start with, it is recommended that you use the BIOS
settings as shown in the table and run basic L3fwd to ensure that the BIOS, platform, and Linux settings are optimal for
performance.

Refer to Intel document #557159, Intel Xeon processor E7-8800/4800 v3 Product Family, for a detailed understanding of
BIOS settings and their performance implications.

Platform Optimizations
Platform optimizations include (1) configuring memory, and (2) I/O (NIC Cards), to take advantage of affinity to achieve
lower latency.

Platform Optimizations—NUMA and Memory Controller


Below is an example of a multi (dual) socket system. For the threads that run on CPU0, all the memory accesses going to
memory local to socket 0 result in lower latency. Any accesses that cross Intel® QuickPath Interconnect (Intel® QPI) to
access remote memory (that is, memory local to socket 1) incurs additional latency and should be avoided.
Problem: What happens when NUMA is set to DISABLED in the BIOS? When NUMA is disabled in the BIOS, the memory
controller interleaves the accesses across the sockets.

For example, as shown below, CPU0 is reading 256 bytes (four cache lines). With the BIOS NUMA state set to DISABLED,
memory controller interleaves the access across the sockets. Out of 256 bytes, 128 bytes are read from local memory and
128 bytes are read from remote memory.

The remote memory accesses end up crossing the Intel QPI link. The impact of this is increased time for accessing remote
memory, resulting in lower performance.

Solution: As shown below, with BIOS setting NUMA = Enabled, all the accesses go to the same socket (local) memory and
there is no crossing of Intel QPI. This results in improved performance due to lower memory access latency.

Key Takeaway

Be sure to set NUMA = Enabled in the BIOS.


Platform optimizations—PCIe* layout and IOU affinity.

Linux* Optimizations
Reducing Context Switches with isolcpus
To reduce the possibility of context switches, it is desirable to give a hint to the kernel to refrain from scheduling other
user space tasks on to the cores used by DPDK application threads. The isolcpus Linux kernel parameter serves this
purpose. For example, if DPDK applications are to run on logical cores 1, 2, and 3, the following should be added to the
kernel parameter list:
isolcpus=1,2,3
Note: Even with the isolcpus hint, the scheduler may still schedule kernel threads on the isolated cores. Please note that
isolcpus requires a reboot.

Adapt and Update the Sample Application


Now that the relevant sample application has been identified as a starting point to build the end product, the
following are the next set of questions to be answered.
Configuration Questions
How to Configure the Application for Best Performance?
For example:

• How many queues can be configured per port?


• Can the same number of Tx and Rx resources be allocated?
• What are the optimal settings for threshold values?

Recommendation: The good news is that each sample application comes with not only optimized code flow but also
optimized parameters settings as default values. The recommendation is to use a similar ratio between resources for Tx
and Rx. The following are the references and recommendations for the Intel® 82599 10 Gigabit Ethernet Controller. For
other NIC controllers, please refer to the corresponding data sheets.

How Many Queues can be Configured per Port?


Please refer to the white paper Evaluating the Suitability of Server Network Cards for Software Routers for detailed test
setup and configuration on this topic.

The following graph (from the above white paper) indicates that you should not use more than two to four queues per
port since the performance degrades with a higher number of queues.

For the best-case scenario, the recommendation is to use one queue per port. In case more are needed, two queues per
port can be considered, but not more than that.

Ratio of the forwarding rate varying the number of hardware queues per port.
Can Tx Resources be Allocated the Same Size as Rx Resources?
Please use as per the default values that are used in the application. For example, for Intel 82599 10-GbE Ethernet
Controller, the default values are not equal; whereas for XL710, both RX and TX descriptors are of equal size.

Intel 82599 10-GbE Ethernet Controller: It is a natural tendency to allocate equal-sized resources for Tx and Rx. However,
please note that http://git.dpdk.org/dpdk/tree/examples/l3fwd/main.c shows that optimal default size for the number of
Tx ring descriptors is 512 as opposed to Rx ring descriptors being 128. Thus, the number of Tx ring descriptors is four times
that of the Rx ring descriptors.

The recommendation is to choose Tx ring descriptors four times the size of Rx ring descriptors and not to have them both
equal size. The reasoning for this is left as an exercise for the readers to find out.

Intel® 82599 10-GbE Ethernet Controller

However, for XL710 NIC [Equal Size RX and TX Descriptors]


What are the Optimal Settings for Threshold Values?
For instance, http://git.dpdk.org/dpdk/tree/test/test/test_pmd_perf.c uses the following optimized default parameters for
the Intel 82599 10-Gigabit Ethernet Controller.

Please refer to Intel 82599 10-Gigabit Ethernet Controller: Datasheet for detailed explanations.

Rx_Free_Thresh—A Quick Summary and Key Takeaway


The key takeaway: the cost of the PCIe* operation of updating the hardware register is amortized by processing batches
of packets before updating the register.

Rx_Free_Thresh—In Detail

As shown below, communication of packets received by the hardware is done using a circular buffer of packet
descriptors. There can be up to 64K−8 descriptors in the circular buffer. Hardware maintains a shadow copy that
includes those descriptors completed but not yet stored in memory.

The Receive Descriptor Head register (RDH) indicates the in-progress descriptor.

The Receive Descriptor Tail register (RDT) identifies the location beyond the last descriptor that the hardware can process.
This is the location where software writes the first new descriptor.

During runtime, the software processes the descriptors and upon completion of a descriptor, increments the Receive
Descriptor Tail (RDT) registers. However, updating the RDT after each packet has been processed by the software has a
cost, as it increases PCIe operations.

Rx_free_thresh represents the maximum number of free descriptors that the DPDK software will hold before sending them
back to the hardware. Hence, by processing batches of packets before updating the RDT, we can reduce the PCIe cost of
this operation.

Fine-tune with the parameters in the rte_eth_rx_queue_setup() function for your configuration:

ret = rte_eth_rx_queue_setup(portid, 0, nb_rxd, socketid,
                             &rx_conf, mbufpool[socketid]);
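For instance, a sketch of how rx_free_thresh is supplied through the rx_conf argument (the value 32 is illustrative; tune it for your workload and NIC):

struct rte_eth_rxconf rx_conf = {
    .rx_free_thresh = 32,  /* hand free descriptors back to HW in batches of 32 */
};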
Compile with the Correct Optimization Flags
Then apply the solution that corresponds to your bottleneck: software prefetch if memory-bound, block mode if
I/O-bound, and Intel HT Technology (or not) if CPU-bound.

Software prefetch for memory helps to hide memory latency and thus improves memory-bound tasks in data plane
applications.
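A common pattern throughout the DPDK sample applications is to prefetch the data of upcoming packets while processing the current one. A minimal sketch using DPDK's rte_prefetch0() helper:

#include <rte_mbuf.h>
#include <rte_prefetch.h>

/* Process a burst of packets, prefetching the next packet's data
 * while working on the current one to hide memory latency. */
static void
process_burst(struct rte_mbuf **bufs, uint16_t nb_rx)
{
    for (uint16_t i = 0; i < nb_rx; i++) {
        if (i + 1 < nb_rx)
            rte_prefetch0(rte_pktmbuf_mtod(bufs[i + 1], void *));
        /* ... process bufs[i] here ... */
    }
}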

PREFETCHW
Prefetch data into cache in anticipation of write: PREFETCHW, a new instruction from Intel® Xeon® processor E5-2650 v3
onward, hides memory latency and improves the network stack. PREFETCHW prefetches data into the cache in anticipation
of a write.

PREFETCHWT1
Prefetch hint T1 (temporal, with respect to the first-level cache) with intent to write: PREFETCHWT1 fetches the data
into the location in the cache hierarchy specified by the locality hint, with an intent-to-write hint so that the line
is brought into the Exclusive state via a request for ownership.

T1 (temporal data with respect to first-level cache) prefetches data into the second-level cache.

For more information about these instructions, refer to the Intel® 64 and IA-32 Architectures Software Developer's Manual.

Running with Optimized Command-Line Options


Optimize the application using command-line options to improve affinity, locality, and concurrency.

coremask Parameter and (Wrong) Assumption of Neighboring Cores


The coremask parameter is used with the DPDK application to specify the cores on which to run the application. For higher
performance, reducing inter-processor communication cost is of key importance. The coremask should be selected such
that the communicating cores are physical neighbors.

Problem: One may mistakenly assume that core 0 and core 1 are neighboring cores and choose the coremask accordingly in
the DPDK command-line parameters. Please note that these logical core numbers, and their mapping to specific cores on
specific NUMA sockets, can vary from platform to platform. While on one platform core 0 and core 1 may be neighbors,
on another platform core 0 and core 1 may end up on different sockets.

For instance, in a single-socket machine (screenshot shown below), lcore 0 and lcore 4 are siblings of the same physical
core (core 0). So, the communication cost between lcore 0 and lcore 4 will be less than the communication cost between
lcore 0 and lcore 1.
Solution: Because of this, it is recommended that the core layout for each platform be considered when choosing the
coremask to use in each case.

Tools—dpdk/tools/cpu_layout.py

Use ./cpu_layout.py in the tools directory to find out the socket ID, the physical core ID, and the logical core ID (processor
ID). From this information, correctly fill in the coremask parameter with locality of processors in mind.

Below is the cpu_layout of a dual-socket machine.

The list of physical cores is [0, 1, 2, 3, 4, 8, 9, 10, 11, 16, 17, 18, 19, 20, 24, 25, 26, 27]

Please note that physical core numbers 5, 6, 7, 12, 13, 14, 15, 21, 22, 23 are not in the list. This indicates that one cannot
assume that the physical core numbers are sequential.

How do you find out which lcores are Intel HT Technology siblings from the cpu_layout?

In the picture below, lcore 1 and lcore 37 are hyperthread siblings of the same physical core in socket 0. Assigning
intercommunicating tasks to lcore 1 and lcore 37 will have lower cost and higher performance than pairing lcore 1 with
any other core (other than lcore 37).
Save core 0 for Linux use and do not use core 0 for the DPDK.
Refer below for the initialization of the DPDK application. Core 0 is being used by the master core.
Do not use core 0 for DPDK applications because it is used by Linux as the master core. For example, l3fwd -c 0x1 …
should be avoided, since that would use core 0 (which is serving the functionality of the master core) for the l3fwd
DPDK application as well.

Instead, the command l3fwd -c 0x2 … can be used so that the l3fwd application uses core 1.

In realistic use cases like Open vSwitch* with DPDK, a control plane thread pins to the master core and is responsible for
responding to control plane commands from the user or the SDN controller. So, the DPDK application should not use the
master core (core 0), and the core bit mask in the DPDK command line should not set bit 0 for the coremask.

Correct Use of the Channel Parameter


Be sure to make correct use of the EAL channel parameter -n, which should match the number of memory channels in the
system. For example, use -n 3 for a 3-channel memory system.
DPDK Micro-Benchmarks and Auto-Tests
DPDK micro-benchmarks and auto-tests are available as part of DPDK applications and examples. Developers use these
micro-benchmarks to do focused measurements for evaluating performance.

The auto-tests are used for functionality verification.

The following are a few sample capabilities of distributor micro-benchmarks for performance evaluation.

Time_cache_line_switch ()
How can I measure the time taken for a cache line round-trip between two cores and back again?

The time_cache_line_switch() function in http://git.dpdk.org/dpdk/tree/test/test/test_distributor_perf.c can be used
to time the number of cycles needed to round-trip a cache line between two cores and back again.

Perf_test()
How can I measure the processing time per packet?

The perf_test() function in http://git.dpdk.org/dpdk/tree/test/test/test_distributor_perf.c sends 32 packets at a time
to the distributor, verifies at the end that the worker thread got all of them, and finally reports how long the
processing per packet took.
ring_perf_auto_test
How can I find the performance difference between single producer/single consumer (sp/sc) and multi-producer/multi-
consumer (mp/mc)?

Running ring_perf_auto_test in /app/test gives the number of CPU cycles, which enables you to study the performance
difference between single producer/single consumer and multi-producer/multi-consumer. It also shows the differences for
different bulk sizes. See the following screenshot output.

The key takeaway: Using sp/sc with higher bulk sizes gives higher performance.

Please note that even though the default ring_perf_autotest runs through the performance test with block sizes of 8 and
32, one can update the source code to include other desired sizes (modify the array bulk_sizes[] to include bulk sizes of
interest). For instance, find below the output with the block sizes 1, 2, 4, 8, 16, and 32.
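For example, in test_ring_perf.c the change is a one-line edit to the bulk_sizes array (the sizes below are whatever you want to measure):

const unsigned bulk_sizes[] = { 1, 2, 4, 8, 16, 32 };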

Two-Socket System—Huge Page Size = 2 Meg


hash_perf_autotest runs through 1,000,000 iterations for each test, varying the following parameters, and reports
Ticks/Op for each combination:

Hash Function: a) jhash, b) rte_hash_crc
Operation: a) Add on Empty, b) Add Update, c) Lookup
Key Size (bytes): a) 16, b) 32, c) 48, d) 64
Entries: a) 1024, b) 1048576
Entries per Bucket: a) 1, b) 2, c) 4, d) 8, e) 16

The Detailed Test Output section contains detailed test output and the commands you can use to evaluate
performance with your platform. The summary of the result is tabulated and charted below:
DPDK Micro-Benchmarks and Auto-Tests
Each entry below pairs a focus area to improve with the micro-benchmarks and auto-tests to use.

1. Ring for Inter-Core Communication
 Performance comparison of bulk enqueue/bulk dequeue versus single enqueue/single dequeue on a single core
 Measure and compare performance between Intel® HT Technology, cores, and sockets doing bulk enqueue/bulk dequeue on pairs of cores
 Performance of dequeue from an empty ring: http://git.dpdk.org/dpdk/tree/test/test/test_ring_perf.c
 Single producer, single consumer – 1 object, 2 objects, MAX_BULK objects – enqueue/dequeue
 Multi-producer, multi-consumer – 1 object, 2 objects, MAX_BULK objects – enqueue/dequeue
 Tx burst: http://git.dpdk.org/dpdk/tree/test/test/test_ring.c
 Rx burst: http://git.dpdk.org/dpdk/tree/test/test/test_pmd_ring.c

2. Memcpy
 Cache to cache, cache to memory, memory to memory, memory to cache
 http://git.dpdk.org/dpdk/tree/test/test/test_memcpy_perf.c

3. Mempool
 “n_get_bulk”, “n_put_bulk”
 1 core, 2 cores, max cores with cache objects
 1 core, 2 cores, max cores without cache objects
 http://git.dpdk.org/dpdk/tree/test/test/test_mempool.c

5. Hash
 rte_jhash, rte_hash_crc; add, lookup, update
 http://git.dpdk.org/dpdk/tree/test/test/test_hash_perf.c

6. ACL
 Lookup: http://git.dpdk.org/dpdk/tree/test/test/test_acl.c

7. LPM
 Rule with depth > 24: 1) add, 2) lookup, 3) delete
 http://git.dpdk.org/dpdk/tree/test/test/test_lpm.c
 http://git.dpdk.org/dpdk/tree/test/test/test_lpm6.c
 Large route tables: http://git.dpdk.org/dpdk/tree/test/test/test_lpm6_data.h

8. Packet Distribution
 http://git.dpdk.org/dpdk/tree/test/test/test_distributor_perf.c

9. NIC I/O Benchmark
 Measure Tx only, Rx only, and Tx and Rx
 Benchmarks the network I/O pipe – NIC h/w + PMD
 http://git.dpdk.org/dpdk/tree/test/test/test_pmd_perf.c

10. NIC I/O + Increased CPU Processing
 Increased CPU processing – NIC h/w + PMD + hash/LPM: examples/l3fwd

11. Atomic Operations / Read-Write Lock
 http://git.dpdk.org/dpdk/tree/test/test/test_atomic.c
 http://git.dpdk.org/dpdk/tree/test/test/test_rwlock.c

12. Spinlock
 Takes the global lock, displays something, then releases the global lock
 Takes the per-lcore lock, displays something, then releases the per-lcore lock
 http://git.dpdk.org/dpdk/tree/test/test/test_spinlock.c

13. Software Prefetch
 http://git.dpdk.org/dpdk/tree/test/test/test_prefetch.c
 Example of its usage: http://git.dpdk.org/dpdk/tree/lib/librte_table/rte_table_hash_ext.c

14. Packet Distribution
 http://git.dpdk.org/dpdk/tree/test/test/test_distributor_perf.c

15. Reorder and Sequence Window
 http://git.dpdk.org/dpdk/tree/test/test/test_reorder.c

16. Software Load Balancer
 http://git.dpdk.org/dpdk/tree/examples/load_balancer

17. ip_pipeline
 Using the Packet Framework to build a pipeline: http://git.dpdk.org/dpdk/tree/test/test/test_table.c
 ACL using the Packet Framework: http://git.dpdk.org/dpdk/tree/test/test/test_table_acl.c

18. Re-entrancy
 http://git.dpdk.org/dpdk/tree/test/test/test_func_reentrancy.c

19. mbuf
 http://git.dpdk.org/dpdk/tree/test/test/test_mbuf.c

20. memzone
 http://git.dpdk.org/dpdk/tree/test/test/test_memzone.c

21. Virtual PMD
 http://git.dpdk.org/dpdk/tree/test/test/virtual_pmd.c

22. QoS
 http://git.dpdk.org/dpdk/tree/test/test/test_meter.c
 http://git.dpdk.org/dpdk/tree/test/test/test_red.c
 http://git.dpdk.org/dpdk/tree/test/test/test_sched.c

23. Link Bonding
 http://git.dpdk.org/dpdk/tree/test/test/test_link_bonding.c

24. KNI
 1. Transmit, 2. Receive to/from kernel space, 3. Kernel requests
 http://git.dpdk.org/dpdk/tree/test/test/test_kni.c

25. Malloc
 http://git.dpdk.org/dpdk/tree/test/test/test_malloc.c

26. Debug
 http://git.dpdk.org/dpdk/tree/test/test/test_debug.c

27. Timer
 http://git.dpdk.org/dpdk/tree/test/test/test_cycles.c

28. Alarm
 http://git.dpdk.org/dpdk/tree/test/test/test_alarm.c
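
All of these tests run from the same interactive test binary built along with the DPDK. As an illustration (the path assumes the x86_64-native-linuxapp-gcc build target used later in this cookbook):

$ cd $RTE_SDK/x86_64-native-linuxapp-gcc/app
$ sudo ./test
RTE>> ring_perf_autotest
RTE>> memcpy_perf_autotest
RTE>> quit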

Compiler Optimizations
Reference: Pyster, Compiler Design and Construction—“Adding optimizations to a compiler is a lot like eating
chicken soup when you have a cold. Having a bowl full never hurts, but who knows if it really helps. If the
optimizations are structured modularly so that the addition of one does not increase compiler complexity, the
temptation to fold in another is hard to resist. How well the techniques work together or against each other is hard
to determine.”

Performance Optimization and Weakly Ordered Considerations


Background: Linux kernel synchronization primitives contain needed memory barriers as shown below (both
uniprocessor and multiprocessor versions):

smp_mb()                     Memory barrier

smp_rmb()                    Read memory barrier

smp_wmb()                    Write memory barrier

smp_read_barrier_depends()   Forces subsequent operations that depend on prior operations to be ordered

mmiowb()                     Ordering on MMIO writes that are guarded by global spinlocks
Code that uses standard synchronization primitives (spinlocks, semaphores, read copy updates) should not need
explicit memory barriers, since any required barriers are already present in these primitives.

Challenge: If you are writing code that bypasses these standard synchronization primitives for optimization purposes,
then carefully consider which memory barrier your code requires.

Consideration: x86 provides a process-ordering memory model, in which writes from a given CPU are seen in order
by all CPUs, as opposed to weak consistency, which permits arbitrary reordering, limited only by explicit memory-barrier
instructions.

The smp_mb(), smp_rmb(), and smp_wmb() primitives also force the compiler to avoid any optimizations that would
have the effect of reordering memory operations across the barriers.

Some Intel® Streaming SIMD Extensions (SSE) instructions are weakly ordered (clflush and the non-temporal move
instructions). CPUs that have SSE can use mfence for smp_mb(), lfence for smp_rmb(), and sfence for smp_wmb().
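
DPDK wraps these fences in portable macros—rte_smp_mb(), rte_smp_rmb(), and rte_smp_wmb() in rte_atomic.h. As a minimal sketch (illustrative names, not DPDK code) of the lock-free publish pattern that needs them:

#include <rte_atomic.h>

static int payload;          /* data written by the producer */
static volatile int ready;   /* flag the consumer polls */

static void producer(void)
{
    payload = 42;
    rte_smp_wmb();   /* make the payload visible before the flag */
    ready = 1;
}

static int consumer(void)
{
    while (!ready)
        ;            /* spin until the producer publishes */
    rte_smp_rmb();   /* order the flag read before the payload read */
    return payload;
}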

Detailed Test Output


pmd_perf_autotest
To evaluate your platform’s performance, run /app/test/pmd_perf_autotest.

The key takeaway: the RX+TX cost per packet in this polled-mode driver test is 54 cycles
with 4 ports and -n 4 memory channels.
What if you need to find the cycles taken for only RX? Or only TX?

To find RX-only time, use the command set_rxtx_anchor rxonly before issuing the command pmd_perf_autotest.
Similarly, to find TX-only time, use the command set_rxtx_anchor txonly before issuing the command
pmd_perf_autotest.
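
For example, the full sequence at the test prompt looks like this (these are the commands named above):

RTE>> set_rxtx_anchor rxonly
RTE>> pmd_perf_autotest
RTE>> set_rxtx_anchor txonly
RTE>> pmd_perf_autotest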

Packet size = 64B; # of memory channels (-n) = 4

Cycles per packet with four ports: TX+RX cost = 54 cycles; TX-only cost = 21 cycles; RX-only cost = 31 cycles


Below is the screen output for the rxonly and txonly cost, respectively.
Hash Table Performance Test Results
To evaluate the performance on your platform, run /app/test/hash_perf_autotest.
Memcpy_perf_autotest Test Results
To evaluate the performance on your platform, run /app/test/memcpy_perf_autotest, for both the 32-byte
aligned and unaligned cases.
Mempool_perf_autotest Test Results

mempool_perf_autotest varies the following parameters:

Core Configuration: a) one core, b) two cores, c) max. cores
Cache Object Configuration: a) with cache objects, b) without cache objects
Bulk Get Size: a) 1, b) 4, c) 32
Bulk Put Size: a) 1, b) 4, c) 32
# of Kept Objects: a) 32, b) 128

To evaluate the performance on your platform, run /app/test/mempool_perf_autotest.
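
The pattern being timed is the bulk get/put pair. Here is a minimal sketch (assuming EAL is initialized and mp was created earlier with rte_mempool_create()):

#include <rte_mempool.h>

/* Bulk get/put pattern timed by the autotest; the bulk size of 32
 * mirrors one value from the parameter list above. */
static void mempool_sketch(struct rte_mempool *mp)
{
    void *objs[32];

    if (rte_mempool_get_bulk(mp, objs, 32) == 0)   /* n_get_bulk = 32 */
        rte_mempool_put_bulk(mp, objs, 32);        /* n_put_bulk = 32 */
}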


Timer_perf_autotest Test Results

# of Timers: a) 0, b) 100, c) 1,000, d) 10,000, e) 100,000, f) 1,000,000
Operations Timed: appending, callback, resetting

To evaluate the performance on your platform, run /app/test/timer_perf_autotest.
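
For reference, the timed operations map onto the rte_timer API roughly as in this sketch (assuming EAL is initialized; a real application calls rte_timer_manage() periodically from its main loop):

#include <rte_timer.h>
#include <rte_cycles.h>
#include <rte_lcore.h>

static struct rte_timer tim;

static void cb(struct rte_timer *t, void *arg)
{
    /* the "callback" operation timed by the test */
    (void)t;
    (void)arg;
}

static void timer_sketch(void)
{
    rte_timer_subsystem_init();
    rte_timer_init(&tim);
    /* the "resetting" operation: arm a one-shot timer, one second
     * out, on the current lcore */
    rte_timer_reset(&tim, rte_get_timer_hz(), SINGLE,
                    rte_lcore_id(), cb, NULL);
    rte_timer_manage();   /* runs expired timers' callbacks */
}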


For cookbook-style instructions on how to do hands-on performance profiling of your DPDK code with
VTune tools, continue to the module Profiling DPDK Code with Intel VTune Amplifier.

Performance Profiling Resources

 Document #5571159: Intel Xeon processor E7-8800/4800 v3 Performance Tuning Guide
 Optimizing Non-Sequential Data Processing Applications – Brian Forde and John Browne
 Measuring Cache and Memory Latency and CPU to Memory Bandwidth – For use with Intel Architecture – Joshua Ruggiero
 Tuning Applications Using a Top-down Microarchitecture Analysis Method
 Intel® Processor Trace architecture details can be found in the Intel 64 and IA-32 Architectures Software Developer Manuals
 Evaluating the Suitability of Server Network Cards for Software Routers
 Low Latency Performance Tuning Guide for Red Hat Enterprise Linux 6 – Jeremy Eder, Senior Software Engineer
 Red Hat Enterprise Linux 6 Performance Tuning Guide
 Memory Ordering in Modern Microprocessors – Paul E. McKenney, draft of 2007/09/19
 What is RCU, Fundamentally?

Profiling DPDK Code with Intel® VTune™ Amplifier

Introduction
Performance is a key factor in designing and shipping best-in-class products. Optimizing performance
requires visibility into system behavior. In this module, we’ll learn how to use Intel VTune Amplifier to
profile Data Plane Development Kit (DPDK) code.

This module is a comprehensive reference for installing and using Intel VTune Amplifier. You will learn
how to run some DPDK micro-benchmarks as an example of how to get deep visibility into the system,
cores, communication, and the core pipeline and its usage.

Extensive screenshots are provided for comparison with your output. Commands are also given so that
readers can copy and paste them wherever possible.

Outline
This module walks you through the following steps to get started using Intel VTune Amplifier with a DPDK
application.

• Install Linux
• Install Data Plane Development Kit (DPDK)
• Install the tools
o Source editor
o Intel VTune Amplifier
• Install and profile the application of your choice
o Distributor application
o Ring tests application
• Conclusion and next steps

Install Linux
Install from the Linux DVD with an ISO image:
http://old-releases.ubuntu.com/releases/15.04/ubuntu-15.04-desktop-amd64.iso

Prior to Install
If you have a laptop installed with Windows* 8, go to safe mode (SHIFT+RESTART). Once in safe mode,
choose boot option # 1 to boot from the external USB DVD drive. Restart and install.

After Install
1. Verify whether the kernel version installed is the correct version as per the DPDK release notes.
$ uname -a

The above output verifies the kernel release as 3.19.0-59-generic, the version number as #66, and the
distro as 64-bit Ubuntu.
$ uname -v

Displays the version number (#66), as shown below.


$ lsb_release -c

Shows the code name; here the code name is vivid, as shown below.


2. Verify Internet connectivity. In some cases, the network-manager service has to be restarted for the
Ethernet service to be operational.

$ sudo service network-manager restart

Install DPDK
Download the DPDK

3. Get the latest DPDK release, as shown below and in the screenshot.
$ sudo wget www.dpdk.org/browse/dpdk/snapshot/dpdk-16.04.tar.xz

The response for the above command is as shown below.

You will find the DPDK tar file downloaded, as shown below.
$ ls

4. Extract the tar ball.


$ tar xf dpdk-16.04.tar.xz

You will find that the directory dpdk-16.04 was created.


$ ls
5. Change to the DPDK directory to list the files.
$ cd dpdk-16.04
$ ls -al

Install the Tools


Install the source editor of your choice. Here, CSCOPE is chosen.

1. First, check to see whether the correct repository is enabled.

Check that the universe repository is enabled by inspecting /etc/apt/sources.list


$ sudo gedit /etc/apt/sources.list

As highlighted below, you may see that the archive is restricted.


If this is the case, edit the file by replacing restricted with universe.

Now save the file.

2. Update the system.


$ sudo apt-get update

The system is updated as shown below.

Install CSCOPE.
$ sudo apt-get install cscope

As shown above, CSCOPE 15.8a-2 is installed.

Install Kernel Debug Symbols


1. The first step is to add the repository containing debugging symbols. For that, create a new, empty file,
ddebs.list (if it does not exist already). Note that the redirection must run as root:
$ sudo sh -c 'cat /dev/null > /etc/apt/sources.list.d/ddebs.list'

2. Edit the file.


$ sudo gedit /etc/apt/sources.list.d/ddebs.list

3. Add the following line to /etc/apt/sources.list.d/ddebs.list as shown below and save it.
deb http://ddebs.ubuntu.com/ vivid main restricted universe
multiverse

4. Update the system to load the package list from the new repository.

$ sudo apt-get update


In this example, the system gave the following error:

If you don’t see the resolution error in your system, skip the instructions that follow and proceed to
the next section.

5. To resolve name servers:

$ sudo gedit /etc/resolvconf/resolv.conf.d/tail

Add the two name servers to the file, as seen in the example below, and save the file.
6. Restart the service. It is necessary to do this before the step that follows, or you’ll still see the resolve
error.
$ sudo /etc/init.d/resolvconf restart

After a system shutdown and restart, restart the service again.

7. Update the system.

$ sudo apt-get update

With the above steps, access to http://ddebs.ubuntu.com has been resolved. However there is a new
error, GPG error, as shown at the bottom of the screenshot below.

8. Add the GPG key.


$ sudo apt-key adv --keyserver pool.sks-keyservers.net --recv-keys
C8CAB6595FDFF622

9. With the repository added, the next step is to install the symbol package by running the following
command:
apt-get install linux-image-<release>-dbgsym=<release>.<version>

With release 3.19.0-59-generic and version 66, this is:

$ apt-get install linux-image-3.19.0-59-generic-dbgsym=3.19.0-59.66

Please note that the above resulted in an error because it could not locate the package
linux-image-3.19.0-59-generic-dbgsym. If you want to set breakpoints by function name and view local
variables, this error must be resolved.

10. Install the Linux Source Package.


$ sudo apt-get install linux-source-3.19.0=3.19.0-59.66
11. With the package now installed, go to /usr/src/linux-source-3.19.0 and unpack the source tarball.
$ cd /usr/src/linux-source-3.19.0

$ tar xjf linux-source-3.19.0.tar.bz2

Now you’re ready to install Intel VTune Amplifier to profile DPDK.

Getting Started With Intel VTune Amplifier

If you don’t have Intel VTune Amplifier installed, click
https://software.intel.com/en-us/intel-vtune-amplifier-xe to get to the Intel VTune Amplifier download
page. Download Intel VTune Amplifier 2018, which is the current version at the time this article was
written. The articles Intel VTune Amplifier Installation Guide - Linux Host and Getting Started with Intel
VTune Amplifier 2018 will guide you through the process and provide links to additional resources.
Key Features
Now that you have VTune Amplifier installed, let’s see what it can do. Here are some key features.

Algorithm Analysis

 Run Basic Hotspots analysis type to understand application flow and identify sections of
code that get a lot of execution time (hotspots).
 Use the algorithm Advanced Hotspots analysis to extend Basic Hotspots analysis by
collecting call stacks and analyze the CPI (Cycles Per Instructions) metric. NEW: You can
also use this analysis type to profile native or Java* applications running in a Docker*
container on a Linux system.

 Use Memory Consumption analysis for your native Linux or Python* targets to explore
RAM usage over time and identify memory objects allocated and released during the
analysis run.

 Run Concurrency analysis to estimate parallelization in your code and understand how
effectively your application uses available cores.
 Run Locks and Waits analysis to identify synchronization objects preventing effective
utilization of processor resources.
Microarchitecture Analysis

 Run General Exploration analysis to triage hardware issues in your application. This
type collects a complete list of events for analyzing a typical client application.
 Use Memory Access analysis to identify memory-related issues, like NUMA problems
and bandwidth limited accesses, and attribute performance events to memory objects
(data structures), which is provided due to instrumentation of memory allocations/de-
allocations and getting static/global variables from symbol information.
 For systems with the Intel® Software Guard Extensions (Intel® SGX) feature enabled, run
SGX Hotspots analysis to identify performance-critical program units inside security
enclaves. This analysis type uses the INST_RETIRED.PREC_DIST hardware event that
emulates precise clock ticks, which is mandatory for analysis on systems with
Intel SGX enabled.

 For the Intel processors supporting Intel® Transactional Synchronization Extensions


(Intel® TSX), run the TSX Exploration and TSX Hotspots analysis types to measure
transactional success and analyze causes of transactional aborts.
Platform Analysis

 Run System Overview analysis to review general behavior of a target Linux or Android*
system and correlate power and performance metrics with the interrupt request (IRQ).
 Run CPU/GPU Concurrency analysis to identify code regions where your application is
CPU- or GPU-bound.
 Use GPU Hotspots analysis to identify GPU tasks with high GPU utilization, and estimate
the effectiveness of this utilization.
 For GPU-bound applications running on Intel® HD Graphics, collect GPU hardware
events to estimate how effectively the processor graphics are used.
 Collect data on ftrace* events on Android and Linux targets and Atrace* events on
Android targets.
 Analyze hot Intel® Media SDK programs and OpenCL™ kernels running on a GPU. For
OpenCL application analysis, use the architecture diagram to explore GPU hardware
metrics per GPU architecture blocks.
 Run Disk Input and Output analysis to monitor utilization of the disk subsystem, CPU,
and processor buses. This analysis type provides a consistent view of the storage
subsystem combined with hardware events and an easy-to-use method to match user-
level source code with I/O packets executed by the hardware.
Compute-Intensive Applications Analysis

 Run HPC Performance Characterization analysis to identify how effectively your high-
performance computing application uses CPU, memory, and floating-point operation
hardware resources. This analysis type provides additional scalability metrics for
applications that use OpenMP* or Intel® MPI Library runtimes.
 Run an algorithm analysis type with the Analyze OpenMP regions option enabled to
collect OpenMP or Intel MPI data for applications using OpenMP or Intel MPI runtime
libraries. Note that HPC Performance Characterization analysis has the option enabled
by default.
 For OpenMP applications, analyze the collected performance data to identify
inefficiencies in parallelization. Review the potential gain metric values per OpenMP
region to understand the maximum time that could be saved if the OpenMP region is
optimized to have no load imbalance, assuming no runtime overhead.
 For hybrid OpenMP and Intel MPI applications, explore OpenMP efficiency metrics by
Intel MPI processes lying on the critical path.
Source Analysis

 Double-click a hotspot function to drill down to the source code and analyze
performance per source line or assembler instruction. By default, the hottest line is
highlighted.
 For help on an assembly instruction, right-click the instruction in the Assembly pane and
select Instruction Reference from the context menu.

Managed Code Analysis

Configure target options for managed code analysis in the native, managed, or mixed
mode:
 Windows host only: Event-based sampling (EBS) analysis for Windows Store C/C++, C#
and JavaScript* applications running in the Attach or System-wide mode.
 EBS or user-mode sampling and tracing analysis for Java applications running in the
Launch Application or Attach mode.
 Basic Hotspots and Locks and Waits analysis for Python applications running in the
Launch Application and Attach to Process modes.

Custom Analysis

 Select the Custom Analysis branch in the analysis tree to create your own analysis
configurations using any of the available VTune Amplifier data collectors.

 Run your own custom collector from the VTune Amplifier to get the aggregated
performance data from your custom collection and VTune Amplifier analysis in the
same result.

 Import performance data collected by your own or third-party collector into the VTune
Amplifier result collected in parallel with your external collection. Use the Import from
CSV button to integrate the external data to the result.

 Collect data from a remote virtual machine by configuring KVM guest OS profiling,
which makes use of the Linux Perf KVM feature. Select Analyze KVM guest OS from the
Advanced options.

Remote Collection Modes


You can collect data on your Linux, Windows, or Android system using any of the following
modes:
 (Linux and Android targets) Remote analysis via SSH/ADB communication with VTune
Amplifier graphical and command-line interface (amplxe-cl) installed on the host and
VTune Amplifier target package installed on the remote target system. Recommended
for resource-constrained, embedded platforms (with insufficient disk space, memory, or
CPU power).
 (Android targets) Disconnected analysis via SSH/ADB communication with VTune
Amplifier installed on the host and the VTune Amplifier target package installed on the
remote Android system. The analysis is initiated from the host system, but data
collection does not begin until the device is unplugged from the host system. The
results are finalized after the device is reconnected to the host system.

 (Linux and Windows targets) Native performance analysis with the VTune Amplifier
graphical or command line interface installed on the target system. Analysis is started
directly on the target system.

 (Linux and Windows targets) Native hardware event-based sampling analysis with the
VTune Amplifier's Sampling Enabling Product (SEP) installed on the target embedded
system.

Stepping Back to See the Big Picture


It’s a good idea to step back and see the big picture first: what other components exist in the system.
If some unrelated components are consuming resources and we focus only on measuring our
specific application, we may come to a wrong conclusion because of partial information.

So here, even before running the DPDK application, we run top -H to see where the CPU is spending its
cycles without our specific application running.

Below you will see VTune Amplifier showing top -H and the Firefox* web browser running. Now,
top is something you just ran, whereas Firefox is something you don’t want taking CPU cycles while you
evaluate your application of interest. Similarly, you may find some unwanted daemons. So at this point,
stop any unwanted applications, daemons, and other components.
Pointing to the Source Directory
The following screenshot shows how to point to the source directory of the software components of
interest in VTune Amplifier. You can add multiple directories.
Profiling DPDK Code with VTune Amplifier
1. First, we’ll reserve huge pages. Note that we’ve chosen 128 huge pages here to accommodate a
possible memory constraint when testing on a laptop. If you’re using a server or desktop, you can
specify 1024 huge pages.
$ cd /home/dpdk/dpdk-16.04

$ sudo su

$ echo 128 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
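
You can confirm the reservation took effect with:

$ grep Huge /proc/meminfo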

2. Create /mnt/huge and mount it as hugetlbfs.

$ sudo bash
$ mkdir -p -v /mnt/huge    (-v for verbose, so you can see the system's response below)
$ mount -t hugetlbfs nodev /mnt/huge

Make the mount point permanent across reboots by adding the following line to the /etc/fstab file:
nodev /mnt/huge hugetlbfs defaults 0 0

Look at /etc/fstab to confirm that /mnt/huge was successfully created and mounted. See the example below:
3. Build the DPDK test application and DPDK library:

$ export RTE_SDK=/home/dpdk/dpdk-16.04

$ export RTE_TARGET=x86_64-native-linuxapp-gcc

$ export EXTRA_CFLAGS='-g'    (for DPDK symbols)

$ make install T=x86_64-native-linuxapp-gcc DESTDIR=install

The output of the build will complete successfully, as shown below.


4. Load uio modules to enable userspace IO for DPDK.
$ sudo modprobe uio

$ sudo insmod x86_64-native-linuxapp-gcc/kmod/igb_uio.ko

5. Add the path to the DPDK test application symbols in VTune Amplifier.

The image below illustrates this step.


You can verify the symbols in the above directory in test.map, as shown in the image below.
At this point, you are ready to get started profiling your DPDK code with VTune Amplifier.

Profiling DPDK Code with VTune Amplifier


Now we will run a handful of micro benchmarks. To start, cd to the directory below and run ./test.
$ cd /home/dpdk/dpdk-16.04/x86_64-native-linuxapp-gcc/app

$ sudo su

$ ./test

The test will issue the prompt RTE>>, as shown below. Enter ? for help and the list of available tests.
Profiling Distributor Perf Autotest
Our first test will be the distributor_perf_autotest. A diagram describing this application is
below.

Select the test from the options offered by RTE.


RTE>> distributor_perf_autotest

See below for command window output during the test run.
The VTune Amplifier summary highlights the CPI rate, indicating it is beyond the normal range. It also highlights
Back-End Bound, indicating the application is memory bound. See these results in the screen capture
below.
Analysis Details
 Function/Call Stack indicates rte_distributor_poll_pkt runs at a CPI rate of 3.720 and
_mm_pause at a CPI rate of 3.867.

You can observe that rte_distributor_get_pkt runs with a CPI rate of 26.30. However, it is not
highlighted, since it uses fewer clock ticks than the highlighted functions.

You will see other functions listed here along with the CPI each one incurs, for example:
rte_distributor_process, rte_distributor_request_pkt,
time_cache_line_switch.
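
For context, these functions come from the distributor worker loop. A rough sketch of that loop against the DPDK 16.04-era single-packet API (not the test's exact code) shows where the polling, and hence the _mm_pause spinning, occurs:

#include <rte_distributor.h>
#include <rte_mbuf.h>

/* Sketch of a distributor worker; d and worker_id come from the
 * application's setup code. */
static void worker_loop(struct rte_distributor *d, unsigned worker_id)
{
    struct rte_mbuf *pkt = NULL;

    for (;;) {
        /* returns the previous packet and polls for the next one;
         * this poll is where rte_distributor_poll_pkt and _mm_pause
         * accumulate their cycles */
        pkt = rte_distributor_get_pkt(d, worker_id, pkt);
        /* ... process pkt ... */
    }
}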
Profiling Rings
Cores communicate with each other, and with the NIC, through rings and descriptors.

While the NIC hardware already optimizes by batching, using the report status (RS) bit and the descriptor
done (DD) bit, DPDK further amortizes per-operation cost by offering APIs for bulk communication
through rings. The graphic below illustrates ring communication.
The ring tests show that single producer/single consumer (SP/SC) with bulk sizes on both
enqueue and dequeue gives the best performance compared with multiple producers/multiple consumers
(MP/MC). Below are the steps.

Profiling ring_perf_autotest

In RTE, select ring_perf_autotest. The test output is shown in the command window below.

VTune Amplifier output for ring_perf_autotest shows in detail that the code is backend-bound. You
can see the call stack showing results for SP/SC with bulk sizes as well as MP/MC.
To appreciate the relative performance of SP/SC versus MP/MC, each with single and bulk data sizes, refer
to the following graph. Please note the impact of core placement: a) siblings, b) within the same socket,
c) across multiple sockets.
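
For reference, the API under test looks roughly like this sketch of an SP/SC bulk ring (DPDK 16.04-era signatures; later releases add an extra out-parameter to the bulk calls, so check your release's rte_ring.h):

#include <rte_ring.h>

/* SP/SC ring with bulk enqueue/dequeue, the best-performing case in
 * the graph above. Assumes EAL is initialized; the enqueued pointers
 * are stand-ins for real objects. */
static void ring_sketch(void)
{
    struct rte_ring *r = rte_ring_create("sketch", 1024, 0,
                                         RING_F_SP_ENQ | RING_F_SC_DEQ);
    void *burst[32] = { NULL };

    if (rte_ring_enqueue_bulk(r, burst, 32) == 0)   /* bulk enqueue */
        rte_ring_dequeue_bulk(r, burst, 32);        /* bulk dequeue */
}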
Conclusion and Next Steps
Practice profiling on additional sample DPDK applications. With the experience you gather, extend profiling
and optimization to the applications you are building on top of DPDK.

Get plugged in to the DPDK community to learn the latest from developers and architects and keep your
products highly optimized. Register at https://www.dpdk.org/contribute/.

References
Enabling Internet connectivity: http://askubuntu.com/questions/641591/internet-connection-not-working-in-ubuntu-15-04

Getting kernel symbols/sources on Ubuntu Linux: http://sysprogs.com/VisualKernel/tutorials/setup/ubuntu/

How to debug libraries in Ubuntu: http://stackoverflow.com/questions/14344654/how-to-use-debug-libraries-on-ubuntu

How to install a package that contains Ubuntu kernel debug symbols: http://askubuntu.com/questions/197016/how-to-install-a-package-that-contains-ubuntu-kernel-debug-symbols

Debug Symbol Packages: https://wiki.ubuntu.com/Debug%20Symbol%20Packages

Ask Ubuntu on apt-get update failing to fetch: http://askubuntu.com/questions/135932/apt-get-update-failure-to-fetch-cant-connect-to-any-sources

DNS name server IP address: http://www.cyberciti.biz/faq/ubuntu-linux-configure-dns-nameserver-ip-address/

How to fix the “public key is not available” issue: https://chrisjean.com/fix-apt-get-update-the-following-signatures-couldnt-be-verified-because-the-public-key-is-not-available/

Ubuntu key server: http://keyserver.ubuntu.com:11371/

Installing CSCOPE*: http://cscope.sourceforge.net

Performance optimization: http://www.agner.org/optimize/instruction_tables.pdf

Using Intel VTune Amplifier with a virtual machine: https://software.intel.com/en-us/node/638180

Additional Tools
The previous module helped you to understand how VTune Amplifier can help analyze performance of your
DPDK application. In this module we describe two other tools that you might find helpful.
Intel® Memory Latency Checker
Memory latency refers to the time an application takes to fetch data from the processor’s cache
hierarchy and memory subsystem. Intel® Memory Latency Checker (Intel® MLC) measures memory latency
and bandwidth under load, with options for more detailed analysis of the memory latency between a set of
cores and memory or cache.

Features
By default, Intel MLC identifies system topology and generates the following:

 A matrix of idle memory latencies for requests originating from each of the sockets and addressed to
each of the available sockets.
 Peak memory bandwidth measurement of requests containing varying numbers of reads and writes to
local memory.
 A matrix of memory bandwidth values for requests originating from each of the sockets and addressed
to each of the available sockets.
 Latencies at different bandwidth points.
 Cache to cache data transfer latencies.

For more information on basic operation of Intel MLC as well as coverage of the command options that
enable finer-grained analysis, read the article Intel Memory Latency Checker v3.5. It describes the
functionality of the most recent version of Intel MLC in detail, and includes download and installation
instructions.
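
For example, running the tool with no arguments performs the full default sequence described above, while individual measurements can be selected with options such as these (option names per the Intel MLC documentation):

$ sudo ./mlc
$ sudo ./mlc --idle_latency
$ sudo ./mlc --latency_matrix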

Screenshots

The screenshots below illustrate basic operation of Intel MLC.


Local memory latencies and cross-socket memory latencies can vary significantly on multisocket systems
where NUMA is enabled. Intel MLC is a useful tool for measuring these latencies, as well as memory
bandwidth, and can help you in the task of profiling your application’s performance.

Processor Counter Monitor* (PCM)

Processor Counter Monitor* (PCM) is an open source project that includes a programming API as well as
several command-line utilities for gathering real-time performance and power metrics for Intel® Core™
processors, Intel Xeon processors, Intel Atom processors, and Intel® Xeon Phi™ processors. It supports
Linux, Windows, and several other operating systems. For detailed information, and to download, visit the
PCM GitHub* repository.

Using PCM to Evaluate a DPDK Application

Of the several tools included as part of PCM, which are recommended for use with DPDK? The list below
offers some suggestions. If your application is:

 CPU intensive, run PCM-x
 Memory intensive, run PCM-memory
 I/O intensive, run PCM-iio
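
As an illustration of typical invocations (binary names vary across PCM versions; older releases append .x, for example pcm.x):

$ sudo ./pcm 1           # core and socket counters, 1-second refresh
$ sudo ./pcm-memory 1    # per-channel memory bandwidth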

Screenshots

The screenshots below illustrate PCM runtime output.


Summary

Intel MLC and PCM are handy, easy-to-use tools that you might find useful. VTune Amplifier is much more
powerful and versatile. If you haven’t used VTune Amplifier, download a free trial copy at the Intel VTune
Amplifier home page.
Acknowledgements
This cookbook is possible only with the whole team’s effort and all the encouragement, support, and review
from each and every one in the internal divisions as well as early access customers, network developers,
and managers.

Notices
Intel technologies’ features and benefits depend on system configuration and may require enabled
hardware, software or service activation. Performance varies depending on system configuration.
Check with your system manufacturer or retailer or learn more at intel.com.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is
granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied
warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any
warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All
information provided here is subject to change without notice. Contact your Intel representative to
obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause
deviations from published specifications. Current characterized errata are available on request.
Copies of documents which have an order number and are referenced in this document may be
obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.
This sample source code is released under the Intel Sample Source Code License Agreement.
Intel, the Intel logo, Intel Atom, Intel Core, Intel SpeedStep, Intel Xeon Phi, VTune, and Xeon are
trademarks of Intel Corporation in the U.S. and/or other countries.
Java is a registered trademark of Oracle and/or its affiliates. OpenCL and the OpenCL logo are
trademarks of Apple Inc. used by permission by Khronos.
Microsoft, Windows, and the Windows logo are trademarks, or registered trademarks of Microsoft
Corporation in the United States and/or other countries.
*Other names and brands may be claimed as the property of others.
© 2018 Intel Corporation
