Making packet processing more efficient with a network-optimized multicore design: Part 1 - Embedded.com

With the advent of the latest generation of multi-core processors it has become feasible from the performance as well as from the power consumption point of view to build complete packet processing applications using general purpose architecture processors, rather than dedicated ASIC and ASSP SoCs.

Architects and developers in the industry are now considering these processors as an attractive choice for implementing a wide range of networking applications, as performance levels that could previously be obtained only with network processors (NPUs) or ASICs can now also be achieved with multi-core architecture processors, but without incurring the disadvantages of the former.

Why multicore?
Ideally, a single core processor should be powerful enough to handle all the application processing. However, a single core cannot keep up with the constant demand for ever increased computing performance.

The impact of improving the core internal architecture or moving to the latest manufacturing process is limited. Higher clock frequencies also result in considerably higher energy consumption and further widen the processor-memory frequency gap.

Multi-core processors remove this bottleneck by scaling performance across cores rather than pushing a single core harder, which is what now makes general purpose processors a viable platform for complete packet processing applications.

Control plane versus data plane: core partitioning
In such multicore-based networking applications, the data plane, also called the forwarding plane or the fast path, handles the bulk of the incoming traffic that enters the current network node.

This traffic is processed according to the rules identified during the classification stage and is then sent back to the network. The packet processing pipeline typically includes stages such as parsing, classification, policing, forwarding, editing, queuing and scheduling.

In terms of computing, the data plane is synonymous with real time packet processing. The real time constraints are due to the fact that the amount of processing applied per packet needs to fit into the packet budget, which is a direct consequence of the input packet rate.

In other words, each stage along the pipeline must finish processing the current packet before the next packet in the input stream arrives; if this timing constraint is not met, the processor must start dropping packets to reduce the input rate down to the rate it can sustain.

Due to the tight packet budget, the processing applied per packet needs to be straightforward and deterministic. The number of different branches that can be pursued during execution should be minimized, so that the processing is quasi-identical for each input packet, and the algorithm should be optimized for the critical path, identified as the path taken by the majority of the incoming packets.

In contrast with the data plane, the control plane is responsible for handling the overhead packets used to relay control information between the network nodes. The control plane packets destined for the current node are extracted from the input stream and consumed locally, as opposed to the bulk of the traffic, which is returned to the network.

The reception of such packets is a rare event compared with the reception of user packets (which is why control plane packets are also called exception packets), so their processing does not have to be real time. Compared to the fast path, the processing applied is complex, as a consequence of the inherent complexity built into the control plane protocol stacks; hence this path is referred to as the slow path.

As the processing requirements of the two network planes are so different, it is recommended practice to dedicate separate cores to data plane processing and to control plane processing. As the application layer requires the same type of non-real-time processing as the control plane, it usually shares cores with the latter.

When the same cores handle both the data plane and the control plane/application layer processing, both may suffer. If the control plane is given higher priority than the data plane, the handling of input packets is delayed; packet queues grow as the network interfaces keep them well supplied with packets received from the network, which eventually ends in congestion and packet discards.

If instead the data plane gets higher priority than the control plane, then the delay incurred in handling the hardware events (e.g. link up/down) or the control plane indications (e.g. route add/delete) results in analyzing them when they are already obsolete (the link that was previously reported down might be up by now).

This behavior usually degrades the overall quality of the system (packets are still discarded even though a route for them is pending addition) and results in a non-deterministic system with hidden stability flaws.

The role of the operating system
It is standard practice to have the cores allocated to control plane/application layer running under the control of an operating system, as these tasks do not have any real time constraints attached to them with regard to packet processing. In fact, the complex processing which has to be applied and the need to reuse the existing code base make the interaction with the OS a prerequisite.

On the other hand, there are strong reasons to discourage the use of an OS for the cores in charge of the data plane processing. First of all, no user is present, so there is no need to use an OS to provide services to the user or to restrict the access to the hardware. One of the important OS roles is to regulate the user's access to hardware resources (e.g. device registers) through user space / kernel space partitioning.

Typically, the operating system allows the user application to access the hardware only through a standardized API (system calls) whose behavior cannot be modified by the user at run-time.

The user does not need to interact directly with the fast path, as the packet forwarding takes place automatically without any need for user's run-time input.

The user might influence the run-time behavior of the fast path indirectly, by interacting with the control plane to trigger updates of the data structures shared between the fast path and the slow path. However, the code that updates these data structures is typically kernel code running on the control plane cores, which already has full access to the hardware.

Secondly, the main functions typically handled by an OS are not required:

1) Process management is not required, as there is only one, very well defined task: packet forwarding. Even if a programming model with several tasks synchronizing among themselves were imagined, the cost of task scheduling in processor cycles would be prohibitive and would severely eat into the packet budget with no real value added in return.

2) Memory management is usually very simple, as it relies on the usage of pre-allocated buffer pools with all the buffers from the same pool having the same size. There is usually no need to support the dynamic allocation/release of variable size memory blocks, as implemented by the classical malloc/free mechanism.

3) File management is not required, as typically there is no file system.

4) Device management is usually done through the use of low-level device API functions. The set of existing devices is fixed (network interfaces, accelerators) and small, there is no need to discover the peripherals at run-time or to support hot-pluggable devices.

As there is little commonality among the fast path devices, there is little practical gain in implementing a common device interface or a device file system.

Sometimes, out of pure convenience or due to the need to support legacy code, an OS might also be used for the data plane cores. In this case, it might be useful to use a mechanism called para-partitioning to create two separate partitions for the control plane and the data plane respectively.

This mechanism requires firmware support to partition the resources of a single physical system into multiple logical systems while still maintaining a 1:1 mapping between the logical and the physical resources. Each partition boots its own OS which is aware only of the resources statically assigned to it by the firmware.

Pipeline versus cluster modelling
In a networking application it is important to carefully assess the programming model to be used; the main candidates are the pipeline and the cluster models.

Pipeline model. In this model (Figure 1, below ), each stage of the packet processing pipeline is mapped to a different core/thread, with the packet being sent from one stage to the next one in the pipeline.

Figure 1. The Pipeline Model

Each core/thread has its fixed place in the pipeline and is the owner of a specific functionality which it applies on a single packet at a time, so the number of packets currently under processing is equal to the number of pipeline stages.

It is often the case that certain functionalities require more processing power than a single core can provide. The way to work around this problem in the pipeline model is to split the associated processing over several pipeline stages, each one executing just a subset of the operations of the initial stage.

The cluster model. The disadvantages of the pipeline model are addressed by the cluster model (Figure 2, below ), which combines several cores/threads into a cluster running the full (i.e. not fragmented) functionality of the packet processing pipeline.

Figure 2. The Cluster Model

All the cores/threads within the cluster execute the same software image on packets read from the same input stream and written to the same output stream. From the outside, the number of cluster members is transparent and therefore the cluster looks like a single super-core. The full functionality of the packet processing pipeline is applied on the input packets by a single stage (the cluster) as opposed to sending the packets from one stage to another.

However, the cluster model is not problem-free, as it introduces the delicate problem of synchronization between the cluster members when accessing the shared resources of the cluster like the input/output packet streams, the shared data structures, etc.

Figure 3. The Hybrid Model: Pipeline of Interconnected Clusters

Of course, the hybrid model (Figure 3, above ) is also possible, in which case the advantages of both models are combined by mapping the packet processing pipeline to a pipeline of interconnected clusters.

To read Part 2 in this series, go to Minimizing and hiding latency.

Cristian F. Dumitrescu is a Senior Software Engineer with the Embedded and Communications Group at Intel. He is the author of Design Patterns for Packet Processing Applications on Multicore Intel Architecture Processors from which this article is derived. He has worked extensively in the past with network and communications processors from AMCC, C-Port, Freescale and Intel. He is currently focusing on delivering packet processing performance on multi-core Intel Architecture processors.
