Multicore networking in Linux user space with no performance overhead

In this Product How-To Design article, the Freescale authors discuss multicore network SoCs and how to leverage them efficiently for data path processing, the limitations of current software programming models, and how to use the VortiQa zero-overhead user space software framework in designs based on the QorIQ processor family.

System-on-chip architectures incorporating multiple general purpose CPU cores along with specialized accelerators have become increasingly common in the networking and communications industry.

These multi-core SoCs are used in network equipment including layer 2/3 switches and routers, load balancing devices, wireless base stations, and security appliances, among others. Network equipment vendors have traditionally used ASICs or network processors for datapath processing but are migrating to multi-core SoCs.

Multi-core SoCs offer high performance and scalability, and include multiple general purpose cores and acceleration engines with in-chip distribution of workloads. However, exploiting their capabilities requires intimate knowledge of SoC hardware and deep software expertise.

In this article we discuss multi-core SoC capabilities and how to leverage them efficiently for data path processing, the limitations of current software programming models, and finally a zero-overhead user space software framework.

Multicore SoC Hardware Elements
As shown in Figure 1 below, a multicore SoC has multiple general purpose cores that run application software. It also has hardware units that assist with data path acceleration. Incoming packets are usually directed toward the general purpose cores, where application processing takes place.



Figure 1. A multicore SoC has multiple general purpose cores that run application software. It has hardware units that assist with data path acceleration.

Application cores make use of hardware accelerator engines to offload standard processing functions. Implementing networking applications on multi-core SoCs requires the SoC to meet certain basic requirements:

1. Partitioning: The SoC must provide the flexibility to partition the available general purpose cores to run multiple application modules, or even different applications.

2. Parsing, classification and distribution: Once partitioned, there must be flexibility and intelligence in the hardware to parse and classify incoming packets, and then direct them to appropriate partitions and/or cores.

3. Queuing and scheduling: Once a packet has been parsed and classified, the system needs a mechanism to direct it to the desired processing unit or core. This requires a queuing and scheduling unit within the hardware.

4. Look-aside processing: The queuing and scheduling unit must manage the flow of packets between cores and acceleration engines. Cryptography, pattern matching, compression/decompression, de-duplication, timer management, and protocol processing (IPsec, SSL, PDCP etc.) are some standard examples of acceleration units in multicore SoCs.

5. Egress processing: The queuing and scheduling unit must direct packets to their destination interfaces at a very high rate. Here, QoS algorithms for shaping and congestion avoidance are required in hardware, to offload these standard tasks from the application cores.

6. Buffer management: Packet buffers need to be allocated by hardware, and often freed by hardware as packets leave the SoC. Therefore hardware packet buffer pool managers are a necessity.

7. Interfaces to cores: The multi-core SoC architecture needs to present a unified interface to the cores for working with the packet processing units.

8. Semi-autonomous processing: Semi-autonomous processing of flows without intervention from cores is desired to offload some processing tasks from the cores. A few multi-core SoCs provide programmable micro engines to enable ingress acceleration on the incoming packets, to do functions such as IP reassembly, TCP LRO or IPsec, before packets are given to the cores.
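The buffer management requirement (item 6) can be pictured with a small software model. This is only a sketch of the free-list idea behind a buffer pool, not the interface of any real SoC buffer manager; the class name BufferPool and its methods are hypothetical.

```python
from collections import deque


class BufferPool:
    """Toy model of a fixed-size packet buffer pool (hypothetical API)."""

    def __init__(self, count: int, size: int) -> None:
        # Pre-allocate every buffer once, so the hot path never allocates.
        self._buffers = [bytearray(size) for _ in range(count)]
        self._free = deque(range(count))  # indices of the free buffers

    def acquire(self):
        """Return a free buffer index, or None when the pool is depleted."""
        return self._free.popleft() if self._free else None

    def release(self, index: int) -> None:
        """Return a buffer to the pool, e.g. once the packet has been sent."""
        self._free.append(index)

    def view(self, index: int) -> bytearray:
        """Give access to the storage behind a buffer index."""
        return self._buffers[index]

    @property
    def available(self) -> int:
        return len(self._free)


pool = BufferPool(count=2, size=2048)
first = pool.acquire()
second = pool.acquire()
exhausted = pool.acquire()  # the pool is empty at this point
pool.release(first)         # models the buffer being freed on egress
```

In hardware, the acquire step would happen as a packet arrives and the release step as it leaves, so the cores never spend cycles on either.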

Multicore SoC Software Programming Methods
Two models are prevalent in software programming for packet processing on the cores. One is pipeline processing, where functionality is split across cores and packets are processed in pipeline fashion from one set of cores to the next, as shown in Figure 2 below.

Figure 2. The pipeline processing model splits functionality across cores, and packets pass from one set of cores to the next.

The more popular model is the run-to-completion model, where each core or set of cores executes the same processing on a packet, as shown in Figure 3 below.


Figure 3. In the run-to-completion network execution model, each core or a set of cores executes the same processing on a packet.

In the run-to-completion model, effective load balancing of packets across cores is important. It is also important to preserve packet ordering in flows, as network devices are not expected to cause re-ordering of packets in a flow.

This means that the packet scheduling unit should be intelligent enough to support a mechanism that ensures that packets of a flow are not sent to more than one core at the same time.

Otherwise the cores could complete processing of those packets at slightly different times and send them out in a different order. Thus order preservation mechanisms are an important part of the hardware scheduling unit, which can be leveraged by run-to-completion applications that are flow order sensitive.
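One simple way to picture order preservation is static flow-to-core distribution: hash the fields that identify a flow and always pick the same core for the same hash. The sketch below assumes a 5-tuple flow key and uses CRC32 purely for illustration; real scheduling hardware may use different hash functions or more flexible mechanisms that still keep a flow on one core at a time. The names FiveTuple, flow_hash and select_core are hypothetical.

```python
import zlib
from collections import namedtuple

FiveTuple = namedtuple("FiveTuple", "src_ip dst_ip src_port dst_port proto")


def flow_hash(flow: FiveTuple) -> int:
    """Deterministic hash over the flow key fields (CRC32 chosen for the sketch)."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key)


def select_core(flow: FiveTuple, num_cores: int) -> int:
    """Map every packet of a flow to the same core, so it cannot be reordered."""
    return flow_hash(flow) % num_cores


flow_a = FiveTuple("10.0.0.1", "10.0.0.2", 40000, 80, 6)
flow_b = FiveTuple("10.0.0.3", "10.0.0.2", 40001, 443, 6)

# Every packet of flow_a lands on one core, whenever it arrives.
cores_for_a = {select_core(flow_a, 4) for _ in range(100)}
core_for_b = select_core(flow_b, 4)
```

Different flows may still share a core; the property that matters here is that a single flow is never spread over several cores.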

It is often possible to combine pipelining with run-to-completion, where a group of cores are dedicated for certain application functions, another group to another set of functions and so on.

Within a group, all cores perform the same application processing on every packet and, once completed, hand the packet off to the next group of cores, which implements a different set of application functions.

High-performance Data Plane Processing
The software architecture of networking equipment typically comprises data, control, and management planes. The data plane, also called the fast path, handles packet flows that have been validated and admitted into the system, and avoids expensive per-packet policy processing.

Packets representing flows in the data plane pass through an efficient and optimized processing path, including some hardware accelerators. For example, a web download of a music file may move through the data plane of a device in the network path, after the device has established the flow as valid by processing the initial packets of that download in the control plane.

The control plane checks and enforces policy decisions that can result in establishing, removing or modifying flows in the data plane. It runs protocols or portions of protocols that deal with these aspects.

The management plane handles configuration of the device, such as installing policies or creating or removing virtual instances. It also manages other operational information and notifications such as device alerts. The rest of the article mainly concentrates on data plane processing.

Data plane processing on the network
Data plane processing in different network devices tends to use similar types of operations. Multicore SoCs accelerate and substantially improve performance of data plane processing, by providing mechanisms that address common data path processing elements. Typical data plane processing involves steps from ingress to egress, as illustrated in Figure 4 below.


Figure 4. In the typical network, data plane processing involves execution of multiple steps from ingress to egress.

Packet ingress involves parsing, classification and activating the right application module to handle the packet. This is now facilitated in hardware, for example by a parse/classify/distribute unit. Packet (protocol) integrity checks may also be conducted at this stage.

The next step is core-based packet processing, which starts by locating the context or flow associated with the packet within the data plane. Much of the policy-related processing by application modules need not happen per packet. Instead, in many cases only the first packet (or first few packets) of a flow needs to be processed this way.

When a flow context is not found in the data plane, the packet is sent to the control plane for policy lookup and enforcement. If policy allows, the control plane creates a flow context within the data plane. Further packets of the flow are matched against this context and are processed fully within the data plane.

A flow is typically defined by an N-tuple, a set of fields extracted from the packet. A hash table lookup using these fields is the most common way to implement a flow lookup and find its context. Both the extraction of the necessary fields and the required hash computation can be offloaded to the hardware parsing unit of an SoC.
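The split between the first packet of a flow and the packets that follow can be sketched with an ordinary hash table. This is a minimal model under stated assumptions: the flow key is a 5-tuple, the control plane is reduced to a policy callback, and the names FlowKey and FlowTable are hypothetical.

```python
from collections import namedtuple

FlowKey = namedtuple("FlowKey", "src_ip dst_ip src_port dst_port proto")


class FlowTable:
    """Toy data plane flow table with a control plane fallback."""

    def __init__(self, policy) -> None:
        self._flows = {}         # dict acts as the hash table keyed by N-tuple
        self._policy = policy    # stand-in for the control plane policy check
        self.slow_path_hits = 0  # packets that had to visit the control plane

    def process(self, key: FlowKey) -> str:
        ctx = self._flows.get(key)
        if ctx is None:
            # Miss: the packet goes to the control plane for a policy lookup.
            self.slow_path_hits += 1
            if not self._policy(key):
                return "drop"
            # Policy allows the flow, so a context is created in the data plane.
            ctx = self._flows[key] = {"packets": 0}
        # Hit, or freshly created context: handled fully in the data plane.
        ctx["packets"] += 1
        return "forward"


table = FlowTable(policy=lambda key: key.dst_port == 80)
web = FlowKey("10.0.0.1", "10.0.0.2", 40000, 80, 6)
results = [table.process(web) for _ in range(3)]
```

Only the first of the three packets reaches the policy callback; the other two match the stored context.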

Within data plane processing stages there can be multiple application modules that need to process the packet in sequence. Each of these modules in the data plane may have its own control plane module that handles application specific flows.

An efficient communication mechanism between data plane and control plane modules is therefore required. This is essentially a core-to-core communication mechanism, facilitated by the hardware.

Each application module (that involves standard protocols) may implement some standard processing algorithms. Many of these algorithms, methods and even protocols are common enough to be implemented in look-aside hardware accelerators. An application module can then make use of these accelerators during appropriate stages of its own processing, by directing packets to those engines, and collecting responses.

One thing common to all data plane processing is handling of statistics. Statistics counters such as byte and packet counters and application specific counters often need to be kept per flow, and also per higher abstraction levels as required by applications. Therefore a large number of counters can be expected in higher end devices.

Since synchronized multi-core access to shared counters is costly, a multi-core SoC can also provide a statistics acceleration mechanism, one that makes incrementing millions of counters very efficient.
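Where no statistics accelerator is available, a common software technique for avoiding contention is to give each core its own counter slot and add the slots up only when the value is read. The sketch below shows that idea; it is not a description of any SoC statistics unit, and the class name PerCoreCounter is hypothetical.

```python
class PerCoreCounter:
    """One counter slot per core; writers never touch each other's slot."""

    def __init__(self, num_cores: int) -> None:
        self._slots = [0] * num_cores

    def add(self, core: int, value: int = 1) -> None:
        # Each core updates only its own slot, so no lock is needed here.
        self._slots[core] += value

    def read(self) -> int:
        # Readers pay the cost of summing, which is rare compared to updates.
        return sum(self._slots)


stats = PerCoreCounter(num_cores=4)
for core in range(4):
    for _ in range(1000):
        stats.add(core, 64)  # e.g. a 64-byte packet counted on this core
total_bytes = stats.read()
```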

Once a packet is processed through all necessary application modules in the data plane, the packet is sent out to the egress interface. Typical processing here requires scheduling and shaping (or rate limiting). Since standard QoS algorithms are generally used, these functions can also be offloaded to hardware units, so that the application modules need only enqueue packets to egress processing units.
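As an illustration of the kind of shaping that can be offloaded, the sketch below implements a token bucket, a widely used rate limiting algorithm. The article does not say which algorithms a given SoC implements, so this choice is an assumption, and the class name TokenBucket is hypothetical. Time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Rate limiter: tokens accumulate at a fixed rate up to a burst limit."""

    def __init__(self, rate_bps: float, burst_bytes: float) -> None:
        self.rate = rate_bps / 8.0  # refill rate in bytes per second
        self.burst = burst_bytes    # maximum number of stored tokens
        self.tokens = burst_bytes   # start with a full bucket
        self.last = 0.0             # time of the previous update, in seconds

    def allow(self, size: int, now: float) -> bool:
        """Return True if a packet of `size` bytes may be sent at time `now`."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


shaper = TokenBucket(rate_bps=8000, burst_bytes=1500)  # 1000 bytes per second
sent_burst = shaper.allow(1500, now=0.0)  # uses up the initial burst
sent_early = shaper.allow(100, now=0.0)   # no tokens left yet
sent_later = shaper.allow(1000, now=1.0)  # one second refills 1000 bytes
```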

A zero-overhead user space network software framework
Multi-core SoC software developers have been challenged with writing applications that suit specific SoC families and their derivatives. This means writing specialized code that is specific to a given SoC.

For example, queuing, buffer management, statistics, accelerators and other technologies each have a very specific modus operandi that applications must follow. There are also specific methods of interfacing with the hardware to receive and send packets, and for distributing work.

These constraints also result in the application software architecture being dictated to some extent by the SoC. Migrating software across SoCs, even within the same SoC product family, can be a large and expensive development effort, and a burden on maintenance and support.

There is a need for a software framework that is able to leverage the features provided by a multicore SoC without requiring in-depth expertise in how the hardware operates.

Applications need to be portable, and be able to leverage different SoCs and families without software application changes, essentially through configuration of features and an abstracted execution environment.

Much like a traditional (i.e. non-embedded) operating system that hides many hardware details from applications, what is needed is a network software framework that hides SoC-specific details and offers a consistent programming model to applications.

Limits of Linux in the dataplane
Direct use of the Linux kernel for data plane implementations has limitations. Linux kernels provide abstractions for disk I/O, USB, processor features and other hardware elements. However, scaling to millions of flows/sessions in the datapath is not easy in Linux kernel space. Kernel-resident applications suffer from limited memory and from an environment that is hard to develop and debug in. Vendors also have GPL concerns with Linux kernel modules.

To overcome these limitations, applications need to execute in user space with virtually zero overhead and with direct access to the SoC hardware. They also need a software framework that caters to the various needs of networking applications without requiring knowledge of hardware-specific details.

Such a framework needs to support layer 2, layer 3 and higher layer processing, orchestrate packet flow, manage packet buffers, and provide access to hardware accelerators, timers, and statistics.

It also needs to support inter-application communication, and provide multiple execution models for applications. A network software framework in user space that leverages the advantages of the Linux OS and overcomes its limitations is essential for the next generation of networking and embedded applications.

VortiQa Platform Services Package (VortiQa PSP)
An example of such a framework is Freescale’s VortiQa Platform Services Package (VortiQa PSP), which provides an infrastructure (Figure 5, below) for direct access to SoC hardware from user space, together with well-defined layers, methods and hooks that facilitate packet processing. It is a multi-core, high performance, SoC-agnostic, direct hardware access user-space execution environment for networking applications.



Figure 5. VortiQa PSP is a software framework that delivers a user space programming framework for networking applications on Freescale QorIQ hardware platforms.

VortiQa PSP provides an infrastructure for direct access to QorIQ-based hardware from user space, along with well-defined layers, methods and hooks that facilitate packet processing. As illustrated in the schematic, a VortiQa PSP instance is represented by a user-space process.

The process has direct access to hardware, through user-space memory mapping. Linux is used mainly to host the user-space VortiQa PSP application with threads that may be bound to cores. This implementation supports a zero-overhead Linux user-space execution with minimal scheduler intervention.
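The general idea of user-space memory mapping can be sketched with the standard mmap call. The example below is not VortiQa PSP code: a temporary file stands in for the device node that a kernel mechanism would expose, and the register offsets REG_CTRL and REG_STATUS are invented for illustration.

```python
import mmap
import struct
import tempfile

REGION_SIZE = mmap.PAGESIZE  # map one page of "register" space
REG_CTRL = 0x00              # hypothetical control register offset
REG_STATUS = 0x04            # hypothetical status register offset

# A temporary file stands in for a hardware register window here.
with tempfile.TemporaryFile() as backing:
    backing.truncate(REGION_SIZE)  # zero-filled region of the mapped size
    with mmap.mmap(backing.fileno(), REGION_SIZE) as regs:
        # Once mapped, a register write is just a store into the mapping.
        struct.pack_into("=I", regs, REG_CTRL, 0x1)
        # A register read is likewise a plain load, with no system call.
        ctrl_value = struct.unpack_from("=I", regs, REG_CTRL)[0]
        status_value = struct.unpack_from("=I", regs, REG_STATUS)[0]
```

After the mapping is set up, the loads and stores do not enter the kernel, which is what keeps the per-packet overhead low.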

PSP layers and functions
A VortiQa PSP instance can be viewed as having multiple layers, with a VortiQa PSP core layer providing essential services. Central to the VortiQa PSP core is the VortiQa PSP execution engine, which can consist of multiple worker threads, each of which may have affinity to one or more cores of the SoC.

The execution engine orchestrates packet processing with load balancing, handling of events including communications with other entities of the system, and generally drives application code execution. It directly accesses hardware packet I/O interfaces, by polling or in combination with Linux mechanisms like UIO.

Packet processing internally uses whatever hardware mechanisms are available to receive and send packets, hiding these details from application modules. The execution engine handles initialization and termination of the instance, including release of resources back to Linux.

The engine also provides a range of utilities, such as timers, locking mechanisms, RCU, buffer management, statistics counters and so on, that leverage SoC capabilities but can be used by applications in a hardware-agnostic way.

The packet execution engine drives the network service layer. This layer provides essential support for layer 2 & layer 3 packet processing, including neighbor table learning, bridging, and IP forwarding with fragmentation/re-assembly.

The network databases, such as the routing, ARP and interface databases, are synced with the kernel databases, so that the routing and bridging control plane (routing protocols, spanning tree etc.) can continue to work as usual. Applications may register hooks with the network service layer to intercept and process packets, similar to netfilter hooks in the Linux kernel.
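The hook mechanism can be pictured as a priority-ordered chain of callbacks, each returning a verdict. The sketch below is a generic model of that pattern and does not reproduce the VortiQa PSP or netfilter interfaces; HookChain, the verdict names and the packet dictionary layout are all hypothetical.

```python
ACCEPT, DROP = "accept", "drop"


class HookChain:
    """Runs registered hooks in priority order until one drops the packet."""

    def __init__(self) -> None:
        self._hooks = []  # list of (priority, callback) pairs

    def register(self, priority: int, callback) -> None:
        self._hooks.append((priority, callback))
        self._hooks.sort(key=lambda item: item[0])  # lower value runs first

    def run(self, packet: dict) -> str:
        for _, callback in self._hooks:
            if callback(packet) == DROP:
                return DROP  # later hooks never see a dropped packet
        return ACCEPT


seen = []


def log_hook(packet: dict) -> str:
    seen.append(packet["dst"])
    return ACCEPT


def block_telnet(packet: dict) -> str:
    return DROP if packet["dport"] == 23 else ACCEPT


chain = HookChain()
chain.register(20, log_hook)
chain.register(10, block_telnet)  # runs before log_hook
verdict_web = chain.run({"dst": "10.0.0.2", "dport": 80})
verdict_telnet = chain.run({"dst": "10.0.0.2", "dport": 23})
```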

To access SoC components like accelerators, a driver layer exists between the packet engine and the network services layer. This layer provides the capability to work with the hardware directly from user-space through memory mapped configuration and operational address spaces. These components thus represent user-space versions of hardware drivers.

Above the network services layer, applications may register with the L4 services layer. This layer supports session/flow management commonly required for stateful networking applications such as firewalls, QoS functions, intrusion detection, VPNs and so on.

The session manager manages these flows, allows application modules to store module-specific state in them, and orchestrates packet flow through the modules. The session manager also provides support for network address translation and proper sequencing of TCP packets. Application modules may work with hooks registered with the session manager to receive, process and send out packets without dealing with hardware specifics.

Conclusions
The required number of VortiQa PSP instances, hardware and resource configurations, and execution modes are all controlled through policy files. Depending on the specific hardware platform and application requirements, policy files may be created to manage execution under the VortiQa PSP environment.

Applications do not have to change when moving from one hardware platform to another, whether to a different family or to another member of the same family. Either the VortiQa PSP code underneath, its policy files, or both would change to facilitate this migration.

VortiQa PSP provides an efficient operating environment for many classes of networking applications. The framework is highly optimized for performance and leverages the Freescale QorIQ processor family, giving the software application developer an easy-to-use, performance-optimized framework in Linux user space.

Subramanyam Dronamraju is responsible for Software Business Development with Freescale. He has more than 22 years of experience in the software industry and holds degrees in electrical engineering and management.

Srini Addepalli is Chief Software Architect at Freescale Semiconductor and previously served at HP, NEC and Holontech. He has over 20 years of experience mainly in networking and security technologies.

John Rekesh is a Software Architect with Freescale and holds an MS degree from Indian Institute of Technology, Madras. He has over 20 years of experience in the software industry.
