Using fastpath software to boost performance of Linux-based home network routers

Editor’s Note: Anton Mikanovich of Promwad describes how to use a fastpath implementation in the Linux OS to boost the performance of a Small Office/Home Office traffic router design built around Marvell’s ARMv5TE-based Kirkwood processor.

High-speed data transfer networks are ubiquitous in today’s world. We use them while working on a computer, making phone calls, watching digital TV, withdrawing money from an ATM, and in any other situation where digital information must be transferred. The greater the volume of information and the number of its recipients, the more stringent the speed and throughput requirements become.
The de facto standards for data transfer in most computer networks are Ethernet and TCP/IP. These protocols allow for different topologies, with routers dividing large networks into subnets. The simplest way of building a network is shown in Figure 1:

Figure 1: A basic network with a router

When information flows from Computer A to Computer B, the traffic arrives in packets at the router interface eth0, which forwards each packet to the operating system. There it passes through the layers of the TCP/IP protocol stack and is parsed to determine its onward path. After extracting the destination address and applying the forwarding rules, the operating system repacks the packet according to the protocol used and sends it out via the eth1 interface.
Only some of the header fields change; the bulk of the packet remains the same. The faster a packet passes through all these stages, the higher the throughput the router can achieve. While enhancing router performance was not a big issue when networks ran at 100 Mbit/s, the advent of gigabit speeds has created a need for more efficient equipment.

It is easy to see that this thorough traffic processing is redundant for most packets of known types. By sifting out and redirecting packets not intended for the device itself at an early stage, you can greatly reduce the traffic processing time. This early processing is most often performed before the packet reaches the operating system, which reduces latency.

This technology minimizes the packet path, hence the name fastpath. Since this acceleration method is based on the low-level part of the network stack and involves information exchange with the network driver, the specific fastpath implementation technology depends on the equipment used.

This article describes how to implement such a scheme using Marvell’s Kirkwood processor architecture, a system-on-chip (SoC) based on the ARMv5TE-compatible Sheeva architecture. Processors based on this architecture are designed specifically for use in network devices such as routers, access points, STB devices, network drives, media servers and plug computers.

The Kirkwood line includes processors with one or two cores and an extensive set of peripherals. Operating frequencies range from 600 MHz to 2 GHz. The entire line has 256 KB of L2 cache on board. The higher-end dual-core models also include an FPU.

The basic features of the Marvell Kirkwood processors are given in Table 1 below.

Table 1: Marvell Kirkwood processor datasheet

Network Fast Processing
Since the Kirkwood processor family targets applications that include traffic-forwarding devices, Marvell also faced the need to implement fastpath in its devices. To solve this problem, its engineers added Network Fast Processing (NFP) to the HAL part of the platform support driver in the Linux 2.6.31.8 kernel.

The relationship between Marvell NFP and other parts of the Linux operating system is shown in Figure 2 below:

Figure 2: Marvell NFP in the Linux operating system

NFP is implemented as a layer between the gigabit interface driver and the operating system network stack. In short, the basic principle of traffic transfer acceleration is to sift incoming routed traffic packets and output them through the required interface, bypassing the operating system. Packets addressed to the device itself, as well as packets that cannot be processed in fastpath, are forwarded to the Linux kernel for processing through standard means.
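
Conceptually, the hand-off works like a hook in the driver’s receive path: NFP gets the first look at every frame and either transmits it itself or passes it up. The following minimal C sketch shows only this control flow; all names in it (nfp_receive, pass_to_linux_stack, rx_handler) are hypothetical stand-ins, not symbols from Marvell’s driver or HAL.

  /* Conceptual sketch of the driver/NFP hand-off; all names are
   * hypothetical stand-ins, not Marvell driver or HAL symbols. */
  #include <stdio.h>

  enum nfp_verdict {
      NFP_HANDLED,  /* NFP rewrote the frame and transmitted it itself */
      NFP_TO_STACK  /* local delivery, no rule yet, unsupported format... */
  };

  /* Stub: a real fastpath would parse the frame, look up a rule and,
   * on a hit, send the rewritten frame out the target interface. */
  static enum nfp_verdict nfp_receive(const void *frame, int len)
  {
      (void)frame; (void)len;
      return NFP_TO_STACK;  /* pretend nothing matched */
  }

  static void pass_to_linux_stack(const void *frame, int len)
  {
      (void)frame;
      printf("frame of %d bytes goes to the kernel network stack\n", len);
  }

  /* Called by the gigabit interface driver for every received frame. */
  static void rx_handler(const void *frame, int len)
  {
      if (nfp_receive(frame, len) == NFP_TO_STACK)
          pass_to_linux_stack(frame, len);
      /* otherwise the kernel network stack never sees the frame */
  }

  int main(void)
  {
      char frame[64] = { 0 };
      rx_handler(frame, (int)sizeof frame);
      return 0;
  }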

Marvell’s fastpath implementation does not process all possible packet formats, only the most popular protocols up to the transport layer of the ISO/OSI model. The chain of supported protocols can be roughly presented as follows:

Ethernet (802.3) → [ VLAN (802.1) ] → [ PPPoE ] → IPv4 → [ IPSEC ] → TCP/UDP

Support for higher-level protocols is not necessary because that information is not used for routing; transport protocol headers must be analyzed only for NAT.
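
To make the chain concrete, here is a minimal C sketch that walks these headers and extracts the fields a fastpath needs: addresses for routing and ports for NAT (IPsec handling is omitted). The EtherType and PPP protocol constants are the standard ones; the structure and function names are illustrative assumptions, not NFP code.

  /* Sketch: walking Ethernet -> [VLAN] -> [PPPoE] -> IPv4 -> TCP/UDP.
   * Constants are standard; names and layout are illustrative only. */
  #include <stdint.h>
  #include <stddef.h>

  static uint16_t rd16(const uint8_t *p) { return (uint16_t)((p[0] << 8) | p[1]); }

  struct flow_key {
      uint32_t saddr, daddr;   /* IPv4 source/destination address */
      uint16_t sport, dport;   /* TCP/UDP source/destination port */
      uint8_t  proto;          /* 6 = TCP, 17 = UDP */
  };

  /* Returns 0 and fills *key on success, -1 if the chain is unsupported. */
  int parse_frame(const uint8_t *f, size_t len, struct flow_key *key)
  {
      size_t off = 12;                      /* skip destination + source MAC */
      if (len < 14) return -1;
      uint16_t type = rd16(f + off); off += 2;

      if (type == 0x8100) {                 /* 802.1Q VLAN tag */
          if (off + 4 > len) return -1;
          off += 2;                         /* skip priority/VLAN ID */
          type = rd16(f + off); off += 2;
      }
      if (type == 0x8864) {                 /* PPPoE, session stage */
          if (off + 8 > len) return -1;
          if (rd16(f + off + 6) != 0x0021)  /* PPP protocol must be IPv4 */
              return -1;
          off += 8;                         /* 6-byte PPPoE hdr + PPP proto */
          type = 0x0800;
      }
      if (type != 0x0800 || off + 20 > len) return -1;   /* IPv4 only */

      const uint8_t *ip = f + off;
      size_t ihl = (size_t)(ip[0] & 0x0f) * 4;   /* IP header length */
      key->proto = ip[9];
      key->saddr = ((uint32_t)rd16(ip + 12) << 16) | rd16(ip + 14);
      key->daddr = ((uint32_t)rd16(ip + 16) << 16) | rd16(ip + 18);
      if (key->proto != 6 && key->proto != 17) return -1;
      if (off + ihl + 4 > len) return -1;
      key->sport = rd16(ip + ihl);          /* ports lead both TCP and UDP */
      key->dport = rd16(ip + ihl + 2);
      return 0;
  }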

A modular structure makes it possible to select which parts are used at Linux kernel compile time. The optional parts are:

  • FDB_SUPPORT – a hash table mapping MAC addresses to interfaces
  • PPP – PPPoE support
  • NAT_SUPPORT – IP address translation support
  • SEC – IPSec encryption protocol support
  • TOS – replacing the Type of Service field in the IP header based on iptables rules

The forwarding database (FDB) is a traffic-forwarding database located in the Linux kernel. Unlike the routing table, FDB is optimized for fast entry lookup. Marvell’s fastpath implementation uses its own local ruleDB table, whose entries are added and deleted from within the Linux network stack; the stack code is modified accordingly.

For fast lookup, ruleDB is a hash table of key-value pairs: the value is typically a rule for redirecting packets with a specific destination address, and the key is an index generated from the source and destination addresses by a special hash function. A well-designed hash function maximizes the chance that each index corresponds to exactly one rule.

Initially FDB (and, consequently, ruleDB) is empty, which is why the first packet of each flow (a packet with no existing FDB entry) goes to the kernel, where a rule is created after processing. After a specified timeout, the entry is removed from FDB and from ruleDB in NFP.
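
The following is a minimal sketch of how such a table might look; the layout, hash function, and timeout handling are illustrative assumptions, not Marvell’s actual ruleDB format. It reflects the lifecycle just described: a lookup miss means "first packet, hand it to the kernel," an insert is what the modified stack code would perform, and stale entries are aged out.

  /* Sketch of a ruleDB-style hash table keyed on the source/destination
   * address pair. Layout and hash are illustrative assumptions. */
  #include <stdint.h>
  #include <time.h>

  #define RULEDB_BUCKETS 1024   /* power of two for cheap index masking */
  #define RULE_TIMEOUT   60     /* seconds before an entry is aged out */

  struct rule {
      uint32_t saddr, daddr;    /* key: source/destination IPv4 address */
      int      out_ifindex;     /* value: interface to send the packet to */
      time_t   last_used;       /* for timeout-based removal */
      int      valid;
  };

  static struct rule ruledb[RULEDB_BUCKETS];

  /* Mix both addresses into one bucket index; a good mix spreads flows
   * evenly so that most lookups land on exactly one matching rule. */
  static unsigned hash_flow(uint32_t saddr, uint32_t daddr)
  {
      uint32_t h = saddr ^ (daddr * 2654435761u);  /* multiplicative hash */
      return (h ^ (h >> 16)) & (RULEDB_BUCKETS - 1);
  }

  /* NULL means "first packet of the flow": hand it to the Linux kernel,
   * which will process it and install a rule for the packets to follow. */
  struct rule *ruledb_lookup(uint32_t saddr, uint32_t daddr)
  {
      struct rule *r = &ruledb[hash_flow(saddr, daddr)];
      if (!r->valid || r->saddr != saddr || r->daddr != daddr)
          return NULL;
      if (time(NULL) - r->last_used > RULE_TIMEOUT) {  /* entry expired */
          r->valid = 0;
          return NULL;
      }
      r->last_used = time(NULL);
      return r;
  }

  /* Called from the (modified) network stack once the kernel has routed
   * the first packet of a flow. Collisions simply overwrite: a dropped
   * rule only costs one extra trip through the kernel to re-learn it. */
  void ruledb_insert(uint32_t saddr, uint32_t daddr, int out_ifindex)
  {
      struct rule *r = &ruledb[hash_flow(saddr, daddr)];
      r->saddr = saddr;
      r->daddr = daddr;
      r->out_ifindex = out_ifindex;
      r->last_used = time(NULL);
      r->valid = 1;
  }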

The traffic handling process is as follows (a condensed code sketch of this decision chain appears after the list):

  1. The raw data of the packet received is sent to the NFP input.
  2. If the packet is intended for a multicast MAC address, it is sent to the TCP/IP OS stack.
  3. If FDB is used and the table does not include an entry for this MAC address, the packet is sent to the OS stack.
  4. An entry for this MAC address is extracted from FDB. If the address is not marked as local, the system recognizes it as connected in bridge mode and sends the packet through the interface specified in the FDB table entry.
  5. If the system detects a VLAN or PPPoE header, it strips the header and computes a pointer to the beginning of the IP header.
  6. Packets marked as fragments are sent to the OS network stack.
  7. If the packet contains ICMP data, it is forwarded to the OS stack.
  8. Packets whose TTL has expired are sent to the OS stack. These packets must ultimately be discarded, but the ICMP Time Exceeded reply has to be generated by the operating system.
  9. The system checks for an IPsec header and, if one is present, processes the packet accordingly, including a certificate check.
  10. The system launches a search for the Destination NAT rule to determine the Destination IP address of the packet.
  11. If no such destination address exists, the packet is sent to the network stack. These packets should also be dropped, but the corresponding ICMP response must be generated by the operating system.
  12. The system launches a search for the Source NAT rule and updates the IP and TCP/UDP header fields according to the DNAT and SNAT rules.
  13. Based on the routing table, the system determines the interface through which the packet will be sent out.
  14. If the outgoing interface requires PPP tunneling, the IP packet is wrapped in a PPPoE header, after first reducing the TTL and updating the Ethernet header. In this case the IP header checksum cannot be computed in hardware, so it must be recalculated. However, because we know the old checksum and the change in the packet data, we do not have to redo the whole calculation; we only adjust the sum by the required amount (see the incremental-checksum sketch after Figure 3). If the packet size exceeds the maximum, the packet is forwarded to the operating system stack.
  15. In all other cases, the process involves updating the Ethernet header, the checksum, and the Type of Service field (if necessary and if there is a matching iptables entry).
  16. The resulting Ethernet packet is sent out through the required network interface.
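
Condensed into C, the chain of checks looks roughly as follows. The packet descriptor here is a toy: each flag stands in for the outcome of the corresponding check on a real frame, so only the control flow of the list above is modeled; none of this is actual NFP code.

  #include <stdbool.h>
  #include <stdio.h>

  /* Toy packet descriptor: each field is the precomputed outcome of the
   * corresponding check in the list above. */
  struct pkt {
      bool multicast_mac, in_fdb, fdb_local, fragment, icmp,
           ttl_expired, has_dnat_rule, out_pppoe, oversized;
  };

  enum verdict { TO_STACK, FORWARDED };

  static enum verdict nfp_process(const struct pkt *p)
  {
      if (p->multicast_mac) return TO_STACK;             /* step 2 */
      if (!p->in_fdb)       return TO_STACK;             /* step 3 */
      if (!p->fdb_local)    return FORWARDED;            /* step 4: bridged */
      /* step 5: VLAN/PPPoE headers are stripped here */
      if (p->fragment)      return TO_STACK;             /* step 6 */
      if (p->icmp)          return TO_STACK;             /* step 7 */
      if (p->ttl_expired)   return TO_STACK;             /* step 8 */
      /* step 9: IPsec processing would happen here */
      if (!p->has_dnat_rule) return TO_STACK;            /* steps 10-11 */
      /* step 12: apply DNAT/SNAT header rewrites */
      /* step 13: routing table lookup picks the output interface */
      if (p->out_pppoe && p->oversized) return TO_STACK; /* step 14 */
      /* steps 14-15: PPPoE encapsulation or Ethernet header rewrite,
       * plus the checksum and ToS fix-ups */
      return FORWARDED;                                  /* step 16 */
  }

  int main(void)
  {
      struct pkt routed = { .in_fdb = true, .fdb_local = true,
                            .has_dnat_rule = true };
      puts(nfp_process(&routed) == FORWARDED ? "fastpath forwarded"
                                             : "sent to kernel stack");
      return 0;
  }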


This sequence of checks is shown in Figure 3.

Figure 3: Processing a packet in NFP
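
The checksum adjustment mentioned in step 14 is the classic incremental-update technique (standardized in RFC 1624): instead of summing the whole header again, fold the old value of the changed 16-bit word out of the checksum and fold the new value in. Below is a self-contained sketch; the sample values, including the assumption that the changed word is the TTL/protocol pair, are purely illustrative.

  /* Incremental IPv4 header checksum update per RFC 1624, eqn. 3:
   * check' = ~(~check + ~old_word + new_word). */
  #include <stdint.h>
  #include <stdio.h>

  static uint16_t csum_update(uint16_t check, uint16_t old_word,
                              uint16_t new_word)
  {
      uint32_t sum = (uint16_t)~check;
      sum += (uint16_t)~old_word;
      sum += new_word;
      sum = (sum & 0xffff) + (sum >> 16);  /* fold carries back in */
      sum = (sum & 0xffff) + (sum >> 16);
      return (uint16_t)~sum;
  }

  int main(void)
  {
      /* TTL and protocol share one 16-bit header word: decrement the
       * TTL from 64 to 63 with protocol 6 (TCP). */
      uint16_t old_word = (64 << 8) | 6;
      uint16_t new_word = (63 << 8) | 6;
      uint16_t check = 0xb1e6;             /* an arbitrary prior checksum */
      printf("0x%04x -> 0x%04x\n", check,
             csum_update(check, old_word, new_word));
      return 0;
  }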

Traffic processing in NFP is basically a set of checks for the most common special cases. This is not a solution for all types of packets. However, in most cases, such a set of protocols is sufficient to achieve a tangible increase in network routing performance.

There are drawbacks to Marvell’s fastpath implementation. For example, we cannot avoid directing traffic to the operating system kernel whenever ICMP packets must be generated, which increases the load on the router during network attacks or any surge in ICMP traffic. Likewise, a large amount of multicast traffic increases the load, because such traffic is not handled by NFP and passes through the OS network stack. Also, this implementation does not support IPv6, although the developers have made provision for adding it in the future.

A drawback of fastpath in general is that it shares CPU time with the Linux operating system and therefore cannot claim all available resources. However, this problem can be addressed with Marvell multiprocessor solutions, such as the quad-core Armada XP processors.

Measuring the change in router performance
What impact does Network Fast Processing have on router performance in real systems? To answer this question, we should measure the rate at which packets pass through the router with NFP enabled and with NFP disabled.

As a test unit, we’ll consider a router based on the Marvell Kirkwood 88F6282 SoC with a clock frequency of 1 GHz (Figure 4). This processor features two 1000Base-TX network interfaces, which makes it a good choice for this type of device.

Figure 4: Marvell Kirkwood 88F6282 architecture

Data traffic on most networks is not constant over time, so evaluating real performance requires a hardware or software traffic generator.

PackETH is a utility with a graphical interface for generating Ethernet frames. There are versions of this utility for Linux, Windows, and Mac. This is one of the easiest-to-use tools for traffic generation. It features the following capabilities:

  • Generation of Ethernet II, Ethernet 802.3, 802.1q and QinQ frames or user-defined frames
  • Supported protocols: ARP, IPv4, IPv6, UDP, TCP, ICMP, ICMPv6, IGMP and RTP (with the option of setting the payload), or user-defined protocols
  • Generation of Jumbo frames (if supported by the driver)
  • Sending a queue of packets with adjustable delay and packet count
  • The option of saving the settings

The PackETH graphical interface has the structure shown in Figure 5.

Figure 5: PackETH interface

iperf is another solution for traffic generation. It is more popular than PackETH but offers virtually no control over packet formats. This command-line utility can measure network performance by generating and accepting TCP and UDP packets.

You can start using it by simply running one copy of the application in server mode through the command:

  # iperf -s

Then we should run another copy on a second machine in client mode, specifying the server’s address or host name:

  # iperf -c server_host

Within ten seconds, the program will measure the network throughput and display the result.

In addition, the pktgen kernel module provides a way to generate UDP traffic directly. We can configure the parameters of the generated packets through the /proc/net/pktgen directory of the procfs file system. The simplest configuration is defined as follows:

  # echo "add_device eth0" > /proc/net/pktgen/kpktgend_0
  # echo "count 1000" > /proc/net/pktgen/eth0
  # echo "dst 192.168.1.1" > /proc/net/pktgen/eth0
  # echo "pkt_size 1000" > /proc/net/pktgen/eth0
  # echo "delay 50" > /proc/net/pktgen/eth0

We start the generator:

  # echo "start" > /proc/net/pktgen/pgctrl

After the generator completes its work, the transmission rate can be read from the status file /proc/net/pktgen/eth0.

The main advantage of pktgen is that it builds the packet to be sent only once and then transmits copies of it, which helps achieve higher speeds. There are other solutions for traffic generation and network throughput measurement, such as brute, netperf, mpstat, and sprayd.

Since we do not need to check all possible cases, the capabilities of iperf will be enough. We will transmit 1400-byte TCP and UDP packets in two modes, with Network Fast Processing disabled and enabled. NFP can be controlled directly through procfs using the /proc/net/mv_eth_tool interface. For example, to disable NFP, simply send the command “c 0” to it:

  # echo "c 0" > /proc/net/mv_eth_tool

where “c” is the command code and “0” is the NFP status to set.

Now we measure the network performance in these modes and enter the results into Table 2.

Table 2: The results of measuring the network performance

Since the actual throughput depends heavily on the configuration of the device and the applications running on it, we should not focus on the absolute values obtained. What interests us most is the performance gain when NFP is used. For TCP traffic, the throughput almost doubles (a 96 percent increase). For UDP packets the effect is weaker, a 63 percent gain, but this is still a good result.

Anton Mikanovich is a software engineer at Promwad Innovation Company, focused on developing Linux-based networking products. He works with scripting languages in embedded systems and web interfaces for customer premises equipment.
