LinuxCon 2015 Linux Kernel Networking Walkthrough
Kernel Networking Walkthrough
LinuxCon 2015, Seattle
Thomas Graf
Kernel & Open vSwitch Team
Noiro Networks (Cisco)
Agenda
● Getting packets from/to the NIC
● NAPI, Busy Polling, RSS, RPS, XPS, GRO, TSO
● Packet processing
● RX Handler, IP Processing, TCP Processing, TCP Fast Open
● Queuing from/to userspace
● Socket Buffers, Flow Control, TCP Small Queues
● Q&A
Touring the Network Stack
Expectation vs. Reality
How does a packet get in and out of
the Network Stack?
Receive & Transmit Process
Ring Buffer
DMA
Parse
L2 & IP
Parse
TCP/UDP
Socket Buffer
Task /
Container
read()
Ring Buffer
Construct
IP
Construct
TCP/UDP
Local?
Socket Buffer
Forward
Route?
write()
NIC Network Stack
(Kernel Space)
Process
(User Space)
The 3 ways into the Network Stack
Ring Buffer
Network
Stack
Interrupt Driven
A
Ring Buffer
Network
Stack
NAPI based Polling poll()
B
Ring Buffer Network
Stack
Busy Polling busy_poll()
Task
C
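The NAPI path (B) can be sketched as a toy model: the interrupt handler only masks further interrupts, and a later poll() drains the ring up to a budget. The class name, the `budget` default of 64, and the list standing in for the network stack are illustrative assumptions, not kernel code.

```python
from collections import deque


class NapiDevice:
    """Toy model of a NIC RX ring serviced by NAPI-style polling."""

    def __init__(self, budget=64):
        self.ring = deque()       # stand-in for the RX ring buffer
        self.irq_enabled = True   # hardware interrupt state
        self.budget = budget      # max packets handled per poll() call
        self.delivered = []       # packets handed to the network stack

    def hard_irq(self):
        """Interrupt handler: mask further IRQs and defer the work to poll()."""
        self.irq_enabled = False

    def poll(self):
        """Process up to `budget` packets from the ring.

        If fewer than `budget` packets were found, the ring is considered
        drained and interrupts are re-enabled; otherwise polling continues.
        """
        done = 0
        while self.ring and done < self.budget:
            self.delivered.append(self.ring.popleft())
            done += 1
        if done < self.budget:
            self.irq_enabled = True
        return done
```

Under load the device stays in polling mode and takes no interrupts; once traffic drops off, it falls back to interrupt-driven operation.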
RSS – Receive Side Scaling
● The NIC distributes packets across multiple RX queues,
allowing for parallel processing.
● Each RX queue has its own IRQ, so the queue choice selects
the CPU that runs the hardware interrupt handler.
RX-queue-1
RX-queue-2
RX-queue-3
RX-queue-4
CPU 1
CPU 2
CPU 1
CPU 2
filter
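The filter step can be sketched as a hash over the flow's addresses and ports, followed by a lookup in an indirection table. This is only an illustration: `zlib.crc32` stands in for the hash (NICs typically use a Toeplitz hash), and the function name and the 8-entry table are assumptions.

```python
import zlib


def rss_queue(src_ip, dst_ip, src_port, dst_port, indirection_table):
    """Map a flow's 4-tuple to an RX queue via a hash and an indirection table."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    flow_hash = zlib.crc32(key)  # stand-in hash, not what real hardware computes
    return indirection_table[flow_hash % len(indirection_table)]


# Hypothetical 8-entry table spreading flows over 4 RX queues.
table = [0, 1, 2, 3, 0, 1, 2, 3]
queue = rss_queue("10.0.0.1", "10.0.0.2", 40000, 80, table)
```

Because the hash depends only on the flow identifiers, all packets of one flow land on the same queue and therefore on the same CPU.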
RPS – Receive Packet Steering
● Software filter to select CPU # for processing
● Use it to ...
RX-queue-1
RX-queue-2
RX-queue-3
RX-queue-4
CPU 1
CPU 2
CPU 3
CPU 1
CPU 2
CPU 3
... redo the queue-to-CPU mapping
... distribute a single queue to multiple CPUs
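RPS is configured per RX queue by writing a hexadecimal CPU bitmask to the queue's `rps_cpus` file in sysfs. The helpers below are a sketch for building that mask and path; the function names are made up for illustration, and the sketch assumes CPU numbers below 32 so that a single hex word is enough.

```python
def rps_cpu_mask(cpus):
    """Return the hex bitmask string selecting the given CPU numbers."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError("this sketch only handles CPUs 0..31")
        mask |= 1 << cpu
    return format(mask, "x")


def rps_sysfs_path(dev, rx_queue):
    """Return the sysfs file that holds the RPS CPU mask for one RX queue."""
    return f"/sys/class/net/{dev}/queues/rx-{rx_queue}/rps_cpus"


# Steer packets from eth0's first RX queue to CPUs 0, 1 and 2.
mask = rps_cpu_mask([0, 1, 2])
path = rps_sysfs_path("eth0", 0)
```

Writing `mask` to `path` (as root) would apply the setting; the sketch deliberately stops short of touching the system.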
Hardware Offload
● RX/TX Checksumming
● Perform CPU-intensive checksumming in
hardware.
● Virtual LAN filtering and tag stripping
● Strip 802.1Q header and store VLAN ID
in network packet metadata.
● Filter out unsubscribed VLANs.
● Segmentation Offload
Generic Receive Offload
(ethtool -K eth0 gro on)
Ring Buffer
Network
Stack
poll()
NAPI based GRO
MTU
GRO
Up to 64K
It is more efficient to process one 64 KB packet
than roughly 40 packets of 1500 bytes each.
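The merging idea can be sketched as follows. This is a simplified model, not the kernel's GRO: it only merges a segment into the most recently seen packet, treats a flow as an opaque ID, and the function name and tuple layout are assumptions.

```python
def gro_coalesce(segments, max_size=65535):
    """Merge back-to-back segments of one flow into packets of up to max_size bytes.

    segments: list of (flow_id, seq, payload) tuples in arrival order.
    """
    merged = []
    for flow, seq, payload in segments:
        if merged:
            m_flow, m_seq, m_payload = merged[-1]
            # Only merge if the segment continues the previous one exactly.
            contiguous = m_flow == flow and m_seq + len(m_payload) == seq
            if contiguous and len(m_payload) + len(payload) <= max_size:
                merged[-1] = (m_flow, m_seq, m_payload + payload)
                continue
        merged.append((flow, seq, payload))
    return merged


# 40 MTU-sized segments of one flow collapse into a single large packet.
segs = [("flow-a", i * 1448, b"x" * 1448) for i in range(40)]
packets = gro_coalesce(segs)
```

The stack above then runs its per-packet work (header parsing, lookups, hooks) once instead of 40 times.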
Segmentation Offload
(ethtool -K eth0 tso on)
(ethtool -K eth0 gso on)
Ring Buffer
Network
Stack
Generic Segmentation Offload (GSO)
ethtool -K eth0 gso on
MTU
TCP Segmentation Offload (TSO)
ethtool -K eth0 tso on
MTU
Up to 64K
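Segmentation is the reverse operation: the stack builds one packet of up to 64K and it is cut into MTU-sized pieces as late as possible, in software just before the driver (GSO) or in the NIC (TSO). A minimal sketch of the cutting step, assuming a payload size per segment of 1448 bytes (1500-byte MTU minus IP and TCP headers with timestamps) and ignoring header replication:

```python
def segment(payload, mss=1448):
    """Split one large payload into chunks of at most `mss` bytes."""
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]


big = b"x" * 65000
chunks = segment(big)
```

In the real path each chunk also gets its own copy of the headers with adjusted sequence numbers and checksums, which this sketch leaves out.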
How does a packet get through the
Network Stack?
(c) Karen Sagovac
Packet Processing
Link Layer
Ingress QoS
Proto Handler
IPv4
IPv6
ARP
IPX
...
Drop
The Feast!
RX Handler
Open vSwitch
Team
Bonding
Bridge
macvlan
macvtap
Packet Socket
ETH_P_ALL
tcpdump
IP Processing
IP
Handler Route Lookup
PREROUTING
IPv4
Construction
Route Lookup
Local Output
OUTPUT
POSTROUTING
Link Layer
FORWARD
Forwarding
L4
(TCP, ...)
Local Delivery
INPUT
User
Space
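The three paths through the diagram can be written down as the sequence of netfilter hooks a packet traverses. The function and its argument names are illustrative assumptions; only the hook names come from the slide.

```python
def netfilter_path(origin, dst_is_local=False):
    """Return the netfilter hooks a packet passes, in order.

    origin: "rx" for a packet received from the link layer,
            "local" for a packet generated by a local socket.
    dst_is_local: result of the route lookup for received packets.
    """
    if origin == "local":
        return ["OUTPUT", "POSTROUTING"]
    if origin == "rx":
        if dst_is_local:
            return ["PREROUTING", "INPUT"]
        return ["PREROUTING", "FORWARD", "POSTROUTING"]
    raise ValueError(f"unknown origin: {origin!r}")
```

For example, `netfilter_path("rx", dst_is_local=False)` describes a forwarded packet, which never reaches INPUT or OUTPUT.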
TCP Processing
IP
Socket Filter
Receive TCP
Parse TCP
Lookup Socket
Backlog
socket locked
Receive Socket Buffer
Prequeue
task exists
process context ← softirq
Task
poll() / read()
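The queue selection in the diagram can be read as a small decision function. This is a sketch of the slide's logic only; the function and parameter names are assumptions, and details such as sysctls that bypass the prequeue are ignored.

```python
def tcp_rx_queue(socket_locked_by_user, reader_waiting):
    """Pick the queue an incoming TCP segment is placed on (softirq context).

    socket_locked_by_user: a task currently holds the socket lock.
    reader_waiting: a task is blocked in read() on this socket.
    """
    if socket_locked_by_user:
        # Processed later by the lock holder when it releases the socket.
        return "backlog"
    if reader_waiting:
        # TCP processing is deferred to the reading task's process context.
        return "prequeue"
    # Processed right away in softirq and queued for a later read().
    return "receive"
```

The backlog and prequeue both move work from softirq into process context, which is what the "process context ← softirq" label in the diagram refers to.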
TCP Fast Open
(net.ipv4.tcp_fastopen)
Regular (Client ↔ Server)
1st Req: SYN → SYN+ACK → ACK+HTTP GET → Data (2x RTT)
2nd Req: SYN → SYN+ACK → ACK+HTTP GET → Data (2x RTT)
Fast Open (Client ↔ Server)
1st Req: SYN → SYN+ACK+Cookie → ACK+HTTP GET → Data (2x RTT)
2nd Req: SYN+Cookie+HTTP GET → SYN+ACK+Data (1x RTT)
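From an application's point of view, Fast Open is opted into per socket. The sketch below shows one plausible way to do that from Python on Linux; the function names, queue length, and buffer sizes are assumptions, the numeric fallbacks are the Linux header values, and the `net.ipv4.tcp_fastopen` sysctl must additionally allow the client (bit 1) and server (bit 2) roles.

```python
import socket

# Linux values, used if this Python build does not export the constants.
TCP_FASTOPEN = getattr(socket, "TCP_FASTOPEN", 23)
MSG_FASTOPEN = getattr(socket, "MSG_FASTOPEN", 0x20000000)


def open_tfo_listener(host, port, qlen=16):
    """Create a listening TCP socket that accepts data carried in the SYN."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(128)
    # qlen bounds the number of pending Fast Open connections.
    srv.setsockopt(socket.IPPROTO_TCP, TCP_FASTOPEN, qlen)
    return srv


def tfo_request(host, port, data):
    """Connect and send `data` in one call; returns the first reply chunk."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # sendto() with MSG_FASTOPEN replaces the usual connect() + send().
        sock.sendto(data, MSG_FASTOPEN, (host, port))
        return sock.recv(4096)
    finally:
        sock.close()
```

The first call to a given server still costs a full handshake because no cookie is cached yet; only later calls can carry the request in the SYN.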
Memory Accounting & Flow Control
Socket Buffers & Flow Control
(net.ipv4.tcp_{r|w}mem)
ssh
TX Ring Buffer
TCP/IP
Socket Buffer
wmem
overlimit?
Block or EWOULDBLOCK
wmem += packet-size
ssh
RX Ring Buffer
TCP/IP
Socket Buffer
rmem -= packet-size
rmem
overlimit?
Reduce TCP Window
rmem += packet-size
wmem -= packet-size
write()
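The accounting in the diagram can be sketched as a counter per direction. This is a simplified model with hypothetical names: the kernel's real checks and window calculation are more involved, and the limits would come from `net.ipv4.tcp_wmem` and `net.ipv4.tcp_rmem`.

```python
import errno


class SendBuffer:
    """Toy model of send-side socket memory accounting."""

    def __init__(self, limit):
        self.limit = limit  # upper bound, in bytes
        self.wmem = 0       # bytes queued but not yet transmitted

    def write(self, size):
        """Queue `size` bytes, or fail like a non-blocking write() when over limit."""
        if self.wmem + size > self.limit:
            raise BlockingIOError(errno.EWOULDBLOCK, "send buffer full")
        self.wmem += size

    def tx_complete(self, size):
        """Called once the packet has left the host: release its memory."""
        self.wmem -= size


def advertised_window(limit, rmem):
    """Receive side: the fuller the buffer, the smaller the window offered."""
    return max(limit - rmem, 0)
```

A slow reader therefore throttles the remote sender through the shrinking window, while a slow link throttles the local writer through `wmem`.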
TCP Small Queues
(net.ipv4.tcp_limit_output_bytes)
ssh
TX Ring Buffer
Driver
TCP/IP
Socket Buffer
write()
Queuing Discipline
torrent
Socket Buffer
write()
TSQ: max 128 KB in flight per socket (queued in qdisc and driver)
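The effect of the limit can be illustrated with a toy calculation: however much a socket has buffered, only a bounded amount of it may sit in the queuing discipline and driver at once. The function name and the dictionary layout are assumptions, and the kernel's actual check is less strict than a hard cap.

```python
TSQ_LIMIT = 128 * 1024  # bytes per socket, the figure quoted on the slide


def bytes_below_socket(pending, limit=TSQ_LIMIT):
    """Return how many bytes each socket may place in the qdisc and driver.

    pending: mapping of socket name to bytes waiting in its socket buffer.
    """
    return {name: min(size, limit) for name, size in pending.items()}


queued = bytes_below_socket({"torrent": 4 * 1024 * 1024, "ssh": 200})
```

With the bulk sender capped this way, an interactive packet from ssh waits behind a bounded amount of queued data rather than behind megabytes of it.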
Q&A
Contact:
● E-Mail: tgraf@suug.ch
● Twitter: @tgraf__
