Accelerating network packet processing in Linux

As a general-purpose operating system, Linux is not designed for high-throughput network traffic. Moreover, when Linux runs on a multicore platform, system throughput falls well short of what the SoC hardware can deliver. This under-utilization arises because the standard Linux network stack is a generic stack with high overhead, which reduces overall system performance relative to the capability of the underlying hardware. This article discusses these issues and proposes a solution. We also present an example of the improved software architecture, implemented on dual-core processor hardware and evaluated with actual packet data.

Linux is widely used in network products. Linux and other general-purpose operating systems need to cater to many different types of applications, and generality always comes at a price.

For example, when Linux forwards an IP packet, the packet passes through many layers of software between reception and transmission. These higher layers of the operating system don't have full control over the hardware (processor and accelerator) features, and each layer adds cycles. Because so many layers of software are involved, cache utilization is poor. On top of that, context switching and locks (on multicore processors) in these generic layers use up even more core cycles.

An OS networking stack uses OS services and inherits the OS's limitations, such as preemption, threads, timers, and locking. These limitations lead to performance bottlenecks such as L1/L2 CPU cache misses, branch mispredictions, and a lack of scalability due to locks. They also bring additional development complexity, such as the use of locks in the kernel or of RCU (read-copy-update), which is available in Linux 2.6 kernels.

You can apply several kinds of optimization to Linux to enhance its overall packet-processing capability: optimizing the network stack, tuning for specific hardware, mitigating interrupt-handling overhead, and so forth. Although an optimized Linux networking stack for SMP architectures can significantly improve performance, it cannot scale, because all packets are still managed by the OS stack, with its many limitations. The overhead is hard to estimate, but the problem is easy to observe: as you increase the number of cores, performance does not increase linearly. The best example is a Linux IPsec benchmark, which shows that the stack does not scale beyond two cores. This issue is not specific to Linux. The growing development complexity also slows the efficient migration of networking applications to multicore CPUs.

However, even with all of these optimizations, it remains impossible to bring application-specific network performance up to the capacity of the underlying hardware and so justify its cost.

Data-path processing
The straightforward solution to these problems would be to rewrite the Linux networking stack, but that is not an easy job. And given the ever-wider acceptance of operating systems like Linux, it may not be feasible to introduce another OS just for network processing.

Something different has to be done here. In a typical network-processing application, there can be thousands of flows. All flows are created equal. After the initial setup/verification, most of these flows require only simple, deterministic processing. By recognizing and caching such flows and processing their packets in a separate, highly optimized context, these flows can be put on a fast track.

Any packet processing can be broken into two paths:

  • The control path requires more processing and has more inherent latency than the data path. It accounts for 90% of the code but is used 10% of the time.
  • The data path requires quick and efficient processing of packets. It accounts for just 10% of the code but is used 90% of the time.

The obvious question is how to accelerate the 10% of the code that handles data-path processing.

A fast-lane/fastpath or data-path accelerator is the answer to this problem. Fastpath is nothing new; networking vendors have used it for a long time. A fastpath is specific to the applications running in a networking device. For example, a fastpath can be implemented for firewall, IPsec VPN, QoS, and forwarding, among others.

There are different types of fastpath implementations in the industry today.

  • ASIC-based fastpath implementations.
  • Network-processor-based fastpath implementations.
  • Control-plane/data-plane split, with some cores running the control plane (CP) and other cores running a bare-metal data plane (DP) or executive fastpath.
  • Linux-network-driver-based fastpath for devices that use Linux SMP.

Figure 1: Types of fastpath

The different fastpath approaches each have pros and cons, but the idea remains the same: do the routine work of packet processing in the fastpath and let the normal path handle connection setup, special connections, control packets, the core database, and so on. Hardware-based solutions come at a cost, in both manufacturing and power consumption. The performance gap between an executive-based fastpath and a Linux-based software fastpath can be bridged by intelligent design.

(Note: We limit the discussions in this paper to Linux-based software fastpath only.)

Data-path acceleration using fastpath
Keep the following requirements in mind when designing a data-path acceleration module in software, in other words, a software-based fastpath:

  • It can be application specific: one size does not fit all.
  • Avoid multiple lookups: keep all the flow classification and processing information in one place.
  • Leverage hardware functionality such as hashing, checksum calculation, cryptography, classification, and scheduling to provide higher throughput. The fastpath runs close to the underlying hardware and should have access to all hardware features.
  • The fastpath is always optimized for a specific type of hardware, but a layered architecture can also be created. It should take advantage of all hardware features without worrying about losing generality. Because it should have a small memory and code footprint, generality across all hardware devices can be compromised.
  • Follow a run-to-completion model. This reduces context switching and improves cache usage.
  • Keep the footprint small, so that most of the fastpath code might fit in the processor's L1 cache.
  • Locks are the main enemy of performance. Make the implementation as lockless as possible. Use RCU and other intelligent data structures to avoid locking in the packet-processing path.
  • Use methods like interrupt coalescing and packet steering to keep the system properly balanced.
  • Leverage buffer recycling for faster memory operations.
  • Use cache stashing and cache locking to enhance performance.
  • Use an efficient, optimized table-lookup algorithm with instruction prediction to reduce latency in database lookups. Cache alignment of the table structures also helps.
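
As one illustration of the buffer-recycling and lockless points above, here is a minimal sketch of a per-core buffer pool. It assumes that each core owns its own pool and runs packets to completion, so no lock is needed. All names (fp_pool, fp_buf_get, fp_buf_put) and sizes are hypothetical and are not part of ASF or the Linux kernel.

```c
#include <assert.h>
#include <stddef.h>

#define FP_POOL_SIZE 4      /* buffers per core; tiny, for illustration only */
#define FP_BUF_BYTES 2048   /* room for one Ethernet frame */

struct fp_buf {
    unsigned char data[FP_BUF_BYTES];
};

/* One pool per core: only the owning core touches it, so no lock is taken. */
struct fp_pool {
    struct fp_buf storage[FP_POOL_SIZE];
    struct fp_buf *free_list[FP_POOL_SIZE];  /* LIFO stack of free buffers */
    size_t free_count;
};

static void fp_pool_init(struct fp_pool *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < FP_POOL_SIZE; i++)
        p->free_list[i] = &p->storage[i];
    p->free_count = FP_POOL_SIZE;
}

/* Take a buffer for an incoming packet; NULL means the pool is exhausted. */
static struct fp_buf *fp_buf_get(struct fp_pool *p)
{
    if (p->free_count == 0)
        return NULL;
    return p->free_list[--p->free_count];
}

/* Recycle a buffer after transmit instead of freeing it. LIFO order hands
 * the most recently used buffer out again first, while it is likely still
 * warm in the cache. */
static void fp_buf_put(struct fp_pool *p, struct fp_buf *b)
{
    assert(p->free_count < FP_POOL_SIZE);
    p->free_list[p->free_count++] = b;
}
```

A receive routine would call fp_buf_get() for each incoming frame and the transmit-completion handler would call fp_buf_put(), so the general-purpose allocator stays out of the per-packet path.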

ASF–application-specific fastpath
The purpose of the application-specific fastpath (ASF) is not to replace the complete Linux networking stack but to accelerate packet processing for the most commonly used functions, such as IPv4 forwarding, NAT, firewall, IPsec, and QoS. Stateful, intelligent network processing continues to happen in the Linux networking stack, while stateless fast packet processing is done in ASF. This does not, however, stop one from using a stateful ASF.

Figure 2: ASF layered architecture

An ASF implementation can be divided into three components:

1. ASF packet engine: The actual data-packet processor, which interacts closely with the network and security drivers for packet handling and processing.
2. ASF configuration APIs: Used to configure control information in the ASF packet engine. These APIs provide a generic interface for ASF configuration, so that the ASF can be of any type and the ASF control logic can interface with any other networking stack or OS.
3. ASF control logic: Interfaces with the Linux networking stack to offload the required packet-control information, and uses the ASF configuration APIs to configure the ASF packet engine.

All packets entering the system are forwarded from the Ethernet driver to the ASF module. In the ASF module, the packet's flow (identified by L2/L3/L4 header information) is looked up in the ASF lookup tables, which are populated through the ASF control module. If a matching flow is found for the received packet, the packet is processed according to the action configured for that flow and is either forwarded to the configured egress interface or terminated locally.

Figure 3: Software model of ASF

Packets for flows that are not configured in ASF are forwarded to the Linux network stack for normal-path processing. There, each packet is checked for a matching entry in the various Linux lookup tables, such as the socket, routing FIB, routing cache, IPsec policy, and SA databases. If rules are found, the packet is processed by the Linux implementation. In addition, if the flow meets the offload criteria, it is offloaded to ASF so that subsequent packets of the same flow are processed through ASF.

The basic function of the ASF module is to hold a set of configured rules and to process incoming packets on the basis of those rules. When a packet is received at the ASF receive interface, it is parsed and a lookup is done (as shown in Figure 3) to check whether it matches any existing rule. If it matches, the packet is processed through ASF; otherwise it is returned to the slow path through the Linux network stack.
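
The receive-side decision described above can be sketched roughly as follows. This is a simplified illustration, not the actual ASF code: the ft_* names, the hash function, and the direct-mapped table (one flow per bucket, with colliding flows simply left on the slow path) are all assumptions made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FT_BUCKETS 256   /* power of two, so the hash can be masked */

struct ft_key {          /* fields parsed from the L3/L4 headers */
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

struct ft_entry {
    struct ft_key key;
    int in_use;
    int egress_ifindex;  /* configured action: where matching packets go */
};

struct ft_table {
    struct ft_entry slot[FT_BUCKETS];
};

enum ft_verdict { FT_FAST_PATH, FT_SLOW_PATH };

static int ft_key_equal(const struct ft_key *a, const struct ft_key *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->proto == b->proto;
}

/* Illustrative hash only; real hardware may compute this for us. */
static uint32_t ft_hash(const struct ft_key *k)
{
    uint32_t h = k->src_ip ^ (k->dst_ip * 2654435761u);
    h ^= ((uint32_t)k->src_port << 16) | k->dst_port;
    h ^= k->proto;
    h ^= h >> 16;
    return h & (FT_BUCKETS - 1);
}

/* Called by the control logic once the slow path has validated a flow.
 * Returns 0 on success, -1 if the bucket is taken by another flow. */
static int ft_offload(struct ft_table *t, const struct ft_key *k,
                      int egress_ifindex)
{
    assert(t != NULL && k != NULL);
    struct ft_entry *e = &t->slot[ft_hash(k)];
    if (e->in_use && !ft_key_equal(&e->key, k))
        return -1;       /* collision: this flow stays on the slow path */
    e->key = *k;
    e->in_use = 1;
    e->egress_ifindex = egress_ifindex;
    return 0;
}

/* Called for every received packet after parsing. */
static enum ft_verdict ft_receive(const struct ft_table *t,
                                  const struct ft_key *k,
                                  int *egress_ifindex)
{
    const struct ft_entry *e = &t->slot[ft_hash(k)];
    if (e->in_use && ft_key_equal(&e->key, k)) {
        *egress_ifindex = e->egress_ifindex;
        return FT_FAST_PATH;
    }
    return FT_SLOW_PATH; /* no rule: hand the packet to the Linux stack */
}
```

The first packet of a flow gets FT_SLOW_PATH; once the control logic has called ft_offload() for that flow, later packets with the same key get FT_FAST_PATH and never enter the Linux stack.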

Table 1 gives examples of networking applications in which the data path is handled in ASF and the control path in the Linux networking stack:

Table 1: Data path is handled in ASF; control path handled in Linux networking stack.

Control information offloads from Linux network stack
The ASF control module interfaces with the Linux networking stack at several function points:

1. Routing/neighbor table: For IPv4 forwarding information
2. IP conntrack module: For L3/L4 connections tracking
3. XFRM framework: IPsec key manager for IPsec session offloads
4. QoS framework: For offloading queue-discipline configuration. This includes a scheduler, policer and classification rules, and other parameter configurations in ASF.

The existing Linux user-space tools, such as iproute2, ipsec-tools, iptables, tc, and vconfig, can be used seamlessly to configure the Linux networking stack. The ASF control plane either receives the required information from the above-mentioned function points in the Linux kernel or taps the user-space configuration at the Netlink interface.

Let's examine in more detail some of these examples:

IPv4 forwarding: IPFwd is an IPv4 route-based forwarding application. It can forward packets from one Ethernet interface to another based on the source IP address, destination IP address, and ToS fields. Routes can be added or deleted dynamically or statically.

In Linux, whenever a packet that is to be forwarded hits the routing cache, the forwarding information is extracted from the routing-table entry. At this point the forwarding flow is offloaded to ASF, if that has not already been done. In the absence of an existing framework for obtaining the control information, hooks can be added to the IP route modules in Linux (such as route.c).

ASF, for its part, performs the following steps on each packet that it forwards over IPv4:

  • Parsing
  • Lookup and classification
  • TTL decrement with IP checksum
  • ToS-based forwarding
  • Addition or update of the Layer 2 header (Ethernet, etc.)
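
The "TTL decrement with IP checksum" step does not need a full checksum recomputation. A minimal sketch, assuming the incremental update of RFC 1624 (equation 3) applied to a raw IPv4 header in network byte order, is shown below. The function names are hypothetical; this illustrates the technique and is not the ASF implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offsets within an IPv4 header (RFC 791). */
enum { IP_OFF_TTL = 8, IP_OFF_PROTO = 9, IP_OFF_CSUM = 10 };

/* Full header checksum: one's-complement sum of big-endian 16-bit words.
 * Run over a header that already carries a valid checksum, it returns 0. */
static uint16_t ip_csum_full(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Incremental update, RFC 1624 eqn. 3: HC' = ~(~HC + ~m + m'),
 * where m is the old 16-bit word and m' the new one. */
static uint16_t ip_csum_update(uint16_t hc, uint16_t old_word,
                               uint16_t new_word)
{
    uint32_t sum = (uint16_t)~hc;
    sum += (uint16_t)~old_word;
    sum += new_word;
    sum = (sum & 0xFFFFu) + (sum >> 16);
    sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Decrement the TTL in place and patch the checksum to match.
 * Returns 0 on success, or -1 if the TTL would reach zero; such a packet
 * is left untouched so the slow path can deal with it. */
static int ip_dec_ttl(uint8_t *hdr)
{
    assert(hdr != NULL);
    if (hdr[IP_OFF_TTL] <= 1)
        return -1;

    /* TTL and protocol share one 16-bit word of the header. */
    uint16_t old_word = (uint16_t)((hdr[IP_OFF_TTL] << 8) | hdr[IP_OFF_PROTO]);
    hdr[IP_OFF_TTL]--;
    uint16_t new_word = (uint16_t)((hdr[IP_OFF_TTL] << 8) | hdr[IP_OFF_PROTO]);

    uint16_t hc = (uint16_t)((hdr[IP_OFF_CSUM] << 8) | hdr[IP_OFF_CSUM + 1]);
    hc = ip_csum_update(hc, old_word, new_word);
    hdr[IP_OFF_CSUM] = (uint8_t)(hc >> 8);
    hdr[IP_OFF_CSUM + 1] = (uint8_t)(hc & 0xFFu);
    return 0;
}
```

Only three 16-bit additions are needed per packet, regardless of header length, which is why this step fits comfortably in a fastpath.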

NAT/NAPT/Firewall: It's not necessary to handle each and every flow in ASF. The most common TCP and UDP flows can be processed in ASF, and the remaining flows can be handled in the Linux networking stack. TCP/UDP flows that require special processing can still be handled in Linux; for example, packets that depend on an application layer gateway.

The ASF control module offloads NAT flows from Linux to ASF. Each flow is identified by its 5-tuple: source IP, destination IP, source port, destination port, and protocol. All packets entering the ASF module are parsed and matched against these flows. If a matching flow is found, the packet's L2 and L3 headers are modified as configured for that flow, and the packet is forwarded to the configured interface.
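
To make the match-and-rewrite step concrete, here is a small sketch of a source-NAT flow applied to an already parsed packet. It assumes a simple source NAPT rule, so the source port is rewritten along with the addresses, and it uses a linear scan where a real engine would use a hash lookup. The structures and function names are hypothetical, and a real implementation would also have to update the IP and TCP/UDP checksums.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct nat_pkt {                      /* header fields of a parsed packet */
    uint8_t  dst_mac[6], src_mac[6];  /* L2 */
    uint32_t src_ip, dst_ip;          /* L3 */
    uint16_t src_port, dst_port;      /* L4 */
    uint8_t  proto;
};

struct nat_flow {
    /* Match: the 5-tuple offloaded by the control module. */
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
    /* Action: translation and forwarding data (assumed layout). */
    uint32_t new_src_ip;
    uint16_t new_src_port;
    uint8_t  next_hop_mac[6];
    uint8_t  egress_mac[6];
    int      egress_ifindex;
};

static int nat_match(const struct nat_flow *f, const struct nat_pkt *p)
{
    return f->src_ip == p->src_ip && f->dst_ip == p->dst_ip &&
           f->src_port == p->src_port && f->dst_port == p->dst_port &&
           f->proto == p->proto;
}

/* Rewrite the packet according to the first matching flow.
 * Returns the egress interface index, or -1 when no flow matches and the
 * packet has to be handed to the Linux stack unchanged. */
static int nat_process(const struct nat_flow *flows, size_t n,
                       struct nat_pkt *p)
{
    assert(flows != NULL && p != NULL);
    for (size_t i = 0; i < n; i++) {
        const struct nat_flow *f = &flows[i];
        if (!nat_match(f, p))
            continue;
        p->src_ip = f->new_src_ip;                /* L3 rewrite */
        p->src_port = f->new_src_port;            /* NAPT port rewrite */
        memcpy(p->dst_mac, f->next_hop_mac, 6);   /* L2 rewrite */
        memcpy(p->src_mac, f->egress_mac, 6);
        return f->egress_ifindex;
    }
    return -1;
}
```

Return traffic would need a second flow entry with the reverse translation, since the two directions of a connection carry different 5-tuples.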

The ASF control logic registers with the Linux conntrack notifier subsystem and listens for connection events. When a connection becomes assured, the control logic extracts the relevant information from the notifier's event structure to program the classifier parameters, TCP state tracking, and timestamp-checking mechanism.

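/* Callback invoked by conntrack for each connection event */
static int asfctrl_conntrack_event(unsigned int events, struct nf_ct_event *ptr);

static struct nf_ct_event_notifier asfctrl_conntrack_event_nb = {
	.fcn = asfctrl_conntrack_event
};
…..
need_ipv4_conntrack();
/* Register the notifier so that the callback starts receiving events */
nf_conntrack_register_notifier(&asfctrl_conntrack_event_nb);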

Figure 4: Connection offload from Linux

The ASF control module gets notifier events from the netfilter/IP conntrack system whenever new connections are established or existing connections are updated, destroyed, or aged out by the Linux conntrack system. For each new connection on which the bidirectional handshake has completed and which conntrack has put in the 'assured' state, the control module receives an assured event, which triggers its internal mechanism to offload the connection to ASF. Because the notifier event carries only L3/L4 information, the control module looks up the kernel routing and neighbor/ARP tables to get the L2 information, such as MAC addresses and outgoing/incoming ports, that is necessary for offloading the connection.

For each assured-connection event, the ASF control module also gets the current state and sequence information of the connection, which is required for stateful firewall inspection and TCP sequence checking. This information is received for each of the two directional flows that together make up a connection.

The teardown of established connections is handled in the following way: once a connection is destroyed by the IP conntrack system, the ASF control module receives a DESTROY event and schedules the corresponding work queue to remove the entry from the ASF local flow table.
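
The offload and teardown logic can be modeled outside the kernel with a small event handler. The sketch below is a user-space stand-in under stated assumptions: the CT_EV_* bits are illustrative values rather than the kernel's conntrack event definitions, the table is a fixed array, and removal happens inline instead of through a work queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the conntrack event bits; values are illustrative only. */
#define CT_EV_ASSURED (1u << 0)
#define CT_EV_DESTROY (1u << 1)

#define CT_MAX_FLOWS 4   /* tiny table, for illustration only */

struct ct_slot {
    uint32_t conn_id;    /* hypothetical handle identifying a connection */
    int in_use;
};

struct ct_table {        /* models the fastpath's local flow table */
    struct ct_slot slot[CT_MAX_FLOWS];
};

static struct ct_slot *ct_find(struct ct_table *t, uint32_t conn_id)
{
    for (int i = 0; i < CT_MAX_FLOWS; i++)
        if (t->slot[i].in_use && t->slot[i].conn_id == conn_id)
            return &t->slot[i];
    return NULL;
}

/* Handle one notifier event. Returns 0 on success, or -1 when the table
 * is full and the connection has to stay on the slow path. */
static int ct_handle_event(struct ct_table *t, unsigned int events,
                           uint32_t conn_id)
{
    assert(t != NULL);
    struct ct_slot *s = ct_find(t, conn_id);

    if (events & CT_EV_DESTROY) {
        if (s)
            s->in_use = 0;          /* remove the offloaded entry */
        return 0;
    }
    if (events & CT_EV_ASSURED) {
        if (s)
            return 0;               /* already offloaded */
        for (int i = 0; i < CT_MAX_FLOWS; i++) {
            if (!t->slot[i].in_use) {
                t->slot[i].conn_id = conn_id;
                t->slot[i].in_use = 1;
                return 0;
            }
        }
        return -1;
    }
    return 0;                       /* other events are ignored here */
}

static int ct_is_offloaded(struct ct_table *t, uint32_t conn_id)
{
    return ct_find(t, conn_id) != NULL;
}
```

Checking the destroy bit first means that an event carrying both bits never leaves a stale entry behind.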

IPsec processing: The ASF control module handles the offloading of IPsec-related policies (SPD), security associations (SAs), and flow-related information. It registers with the IPsec key manager to receive policy and SA add/delete/modify events, extracts the relevant information, and configures the policy container and SA information in ASF using the ASF API. It also provides helper functions for the IPv4 forwarding and firewall modules. These helper functions check the flow being offloaded against the IPsec policies.

/* Key-manager callbacks through which XFRM reports policy and SA events */
static struct xfrm_mgr fsl_key_mgr = {
	.id             = "fsl_key_mgr",
	.notify         = fsl_send_notify,
	.acquire        = fsl_send_acquire,
	.compile_policy = fsl_compile_policy,
	.new_mapping    = fsl_send_new_mapping,
	.notify_policy  = fsl_send_policy_notify,
	.migrate        = fsl_send_migrate,
};
………..
xfrm_register_km(&fsl_key_mgr);

Local IP termination: ASF can also be used to accelerate packet processing for specified traffic that terminates or originates locally in user space. ASF can provide a zero-copy termination interface in user space, with direct packet access to the Ethernet driver, for the configured UDP ports/IP addresses.

Best practices
Accelerating packet performance on a multicore device is a multidimensional problem. It is much easier to produce excellent performance benchmarks for a simplified, specific application, such as bidirectional IP forwarding. But field applications are much more complex to handle in terms of functionality and performance. Efficient packet processing requires a combination of the best design practices, which include:

  • An optimized and dedicated fastpath-based architecture able to linearly scale over the number of cores.
  • A transparent synchronization between the control plane, slow path, and fast path.

The methods described in this article were implemented and evaluated on the Freescale multicore QorIQ platform series. These experiments show that, with the approaches described above, network throughput can be accelerated by around 2x to 10x for smaller packet sizes. For larger packet sizes, the benefit was observed as a reduction in overall CPU utilization. An improvement in multicore scaling in the range of 10 to 30% was also observed.


Hemant Agrawal is a software architect for the Networking Processor Division of Freescale, working on the QorIQ product line. He holds a bachelor's degree in electrical engineering from the Institute of Technology, BHU, India.

Manish Dev is a software design manager for the Networking Processor Division of Freescale, working on the QorIQ product line. He holds a bachelor's degree in electrical engineering from Delhi College of Engineering, Delhi, India.
