Making packet processing more efficient with network-optimized multicore designs: Part 2 -

Making packet processing more efficient with network-optimized multicore designs: Part 2


Apart from common operations performed by any network processing intensive application (see Part 1 ), packet processing involves some specific operations that cannot be efficiently implemented with a general purpose instruction set. Typical examples include:

Table lookup using algorithms like Longest Prefix Match (LPM) or range matching, required during the classification or forwarding stages of IPv4 routing applications.

Pattern matching , used by deep packet inspection applications that filter the stream of incoming packets by applying regular expressions stored in the rule data-base over the packet payload.

Cryptography , used by security applications to decrypt/encrypt the payload of incoming/outgoing packets.

CRC or checksum calculation , used by many networking protocols to indicate the integrity of the packets received from the network.

The size of the available on-chip memory is usually small, if any. As a result, some operations require a significant number of accesses to external memory.

Although several techniques are available to minimize their number (e.g. by using on-chip cache memories) or optimize them (e.g. by using DMA engines or memory bank interleaving), the penalty incurred by these accesses cannot be completely eliminated.

Their impact on the packet budget is significant, and sometimes the memory accesses per packet can even exceed the budget of processor cycles per packet. Examples of such operations are:

1) Reading the packet descriptor to local memory and writing it back to external memory (header caching); and,

2) Software implementations of table lookup algorithms.

To work around these two problems, several measures can be considered. Some of these are described below.

Use parallel processing to meet the packet budget
When using the pipeline model, each stage of the pipeline must still meet the packet budget, but the amount of processing that needs to fit into this number of clock cycles is limited to the one of the current pipeline stage as opposed to the full packet processing implemented by the entire system.

As a result of each pipeline stage processing a different packet, the packet budget is effectively multiplied with the number of pipeline stages.

When using the cluster model, the cluster is still required to process one packet per each packet budget interval, but by using internally several cores running in parallel, each one processing a different packet, the packet budget is effectively multiplied with the number of cores within the cluster.

Hiding the latency of accesses to external memory
No increase in the processing power is achieved by simply using the multi-threading. As the threads run in time sharing mode, all threads running on the same core use up a share of the processing cycles of their core.

However, multi-threading can be an important instrument in minimizing the waste of processing cycles of the core. In a cooperative multi-threading system, the current thread can decide to relinquish the control of the core to other threads immediately after sending a request to the external memory controller.

The memory access takes place while the thread which launched it is dormant and other threads are using the core to perform meaningful work on other packets. The thread will not resume the processing of the packet until the memory access has completed, thus the core is not wasting processing cycles as it does not busy-wait for the memory access to complete.

Therefore, as no increase in the processing power is achieved, multi-threading cannot minimize the latency of complex operations, but it can be an effective mechanism to hide this latency from the cores and thus increase the overall efficiency of the cores.

Using accelerators to offload operations
One of the roles of accelerators is to reduce the latency of the complex or memory intensive operations. For example, by using a specialized hardware block, the decryption (or encryption) of the current packet can be performed in just a few clock cycles, which is not possible when implementing the same operation with a general purpose instruction set.

The complex mathematical operations associated with cryptography, if not offloaded to a specialized engine, can effectively choke the packet processing cores.

The other role that may be assigned to accelerators is to hide the latency of the I/O or memory intensive operations from the cores in charge of running the packet processing pipeline.

The latency of the operation is not reduced, but by offloading it to a specialized block, the cores are given the chance to do some other useful work, i.e. process some other packets meanwhile and resume the processing of this packet after the result from the accelerator becomes available.

This approach aims at maximizing the core utilization and thus minimizing the number of necessary cores. The alternative would be to block while busy-waiting for the memory transfers to complete, which would require adding more cores/threads just to keep up with the incoming packet stream, which results in an inefficient setup due to big idle time percentage for the cores.

Dedicating specific cores to acceleration
When accelerators are not available to offload a complex operation from the processing cores, it is always a good practice to segregate the code that implements this operation from the code that deals with regular processing and assign them to different cores. The cores are basically divided in two separate categories: packet processing cores and I/O cores.

The purpose of this strategy is to eliminate the busy-waiting operations from the processing cores by moving them to dedicated cores. The performance of the overall system becomes more predictable, as the number of cores dedicated to I/O processing is a direct consequence of the number of I/O operations per packet, which can be determined upfront. This way, the regular processing is not prevented by busy-waiting from utilizing the core idle time.

The I/O cores can be looked at by the packet processing cores as accelerators, as the I/O cores are effectively implementing complex or memory intensive tasks which are this way offloaded from the packet processing cores.

The communication between the two can be implemented with message passing with the packet processing cores (the clients) sending requests to the I/O cores (the servers) and eventually receiving responses from them when the result of the operation becomes available.

One notable example feasible to implement using this approach is the preprocessing of traffic received by the network interfaces, as well as its preparation for transmission.

On the reception side, the network interface (e.g. a Gigabit Ethernet MAC) receives a packet from the network, stores it into a memory buffer and writes the buffer handle into a ring read by the processor cores.

Building the packet descriptor in a proper format that is understood by the rest of the pipeline might be the job of one or more I/O cores that block while waiting for a new packet to be received by the MAC and implement the low-level handshaking mechanism with the MAC. Similar processing is required on the transmission side.

To support higher bandwidth (e.g. 10 Gigabit Ethernet), the Intel MACs, for example, support a feature called Receive Side Scaling (RSS), which applies a MAC built-in hashing function on selected fields of the input packet to uniformly generate the index of an I/O core out of a pool of cores assigned for this purpose. This way, a fair load-balancing between the cores is performed in order to achieve the high input bandwidth of the network interface.

Another good example is the implementation of the table lookup operation used during the classification and the forwarding stages. The software algorithms for LPM require several memory accesses into tree-like data structures for each table lookup.

When the I/O core implementing it receives a request to find the LPM match for a specific key (IP address plus prefix), it performs the sequence of memory reads while blocking after each read until completed. When the final result is available, it is returned as the response to the processing cores. Now it is time to look at the efficiency of the various flavors of the cluster model.

Run-to-completion cluster model
This model assumes the existence of multi-threading on the processing cores. Each packet entering the cluster is assigned a single thread that processes the packet completely, i.e. it handles the packet from the moment it enters the cluster until it exits the cluster. No other thread can process this packet.

In a cooperative multi-threading system, the core applies the regular processing on the packet and sends it to specialized accelerators for complex operations while releasing the control of the processor to other threads processing different packets.

The processing of the current packet is resumed by the same thread after the result from the accelerator becomes available. As the packet descriptor is accessed by a single thread, this model has the advantage that the packet descriptor can be stored in the private memory of the assigned thread.

The disadvantage is that the number of packets that can be processed at a given time cannot exceed the number of available threads, so when all the threads of a core are dormant waiting for response from the accelerators, the core cannot read new packets from the input stream even though it is currently idle.

Request-based Model
This model does not require multi-threading. It uses the processing cores as dispatchers between accelerators. Each core reads a different packet, applies the regular processing and sends the packet to specialized accelerators for complex operations. While the accelerator is working, the core goes to process another packet.

The difference from the previous model is that any core can resume the processing of the current packet once the job of the accelerator is completed. This approach assumes that all the processing cores are sharing the interrupt requests coming from the accelerators.

This model has the advantage that the number of packets under processing is not limited by the number of available threads, but it has the drawback that the packet descriptors must be stored in the shared memory space to make them available to all the cores. Table 1 below summarizes the differences between the two models.

Table 1. Run-to-completion vs. Request-based

The data plane's Inter-core Synchronization Problem
The cluster model introduces the problem of synchronization between the cores/threads for accessing the cluster shared resources like the input and output packet streams, the specialized accelerators and the shared data structures. Similarly, the pipeline model also must handle the synchronization problem for accessing the data structures that are shared between multiple pipeline stages.

Several types of data structures can be identified for packet processing:

Data structures per packet : the packet descriptor.

Data structures per flow/connection : one entry per flow in various tables like flow tables, policy tables, routing tables, etc.

Data structures per protocol : global structures storing the current context (i.e. the state information) for the processing state machines associated with every networking protocol implemented by the application, data structures with statistics counters, etc.

The access to these shared data structures must be protected through semaphores or similar synchronization mechanisms. In general, it is recommended to design the data structures of the application for minimal contention between the cores/threads accessing them.

In some cases, the packet descriptors are not shared between several cores/threads: 1) for the pipeline model, if header caching is used, or, 2) for the cluster run-to-completion model.

The typical communication mechanism between cores/threads or between cores/threads and the accelerators is message passing using interrupt requests or queues of request/response messages. If no hardware support for queues is available, they can be implemented as linked lists protected by semaphores.

When queues of packets are used as the communication mechanism between the various processing entities, an additional requirement may be to preserve the packet order within the same flow. This means that packets belonging to the same flow/connection should exit the processor in the same order they entered the processor.

The most widely used method to achieve this is to assign a unique sequence number to each packet when it enters the processor, allow the packets to be processed out of order for efficiency and have them re-ordered based on their sequence numbers just before their return to the network.

To read Part 1, go to Pipeline versus clustering cores for networking

Next in Part 3 – PRODUCT HOW-TO: Using Intel Multicore CPUs in packet processing

Cristian F. Dumitrescu is a Senior Software Engineer with the Embedded and Communications Group at Intel. He is the author of Design Patterns for Packet Processing Applications on Multicore Intel Architecture Processors from which this article is derived. He has worked extensively in the past with network and communications processors from AMCC, C-Port, Freescale and Intel. He is currently focusing on delivering packet processing performance on multi-core Intel Architecture processors.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.