Making packet processing more efficient with network-optimized multicore designs: Part 2

Cristian F. Dumitrescu

February 5, 2010


Using accelerators to offload operations
One of the roles of accelerators is to reduce the latency of complex or memory-intensive operations. For example, by using a specialized hardware block, the decryption (or encryption) of the current packet can be performed in just a few clock cycles, which is not possible when the same operation is implemented with a general-purpose instruction set.

The complex mathematical operations associated with cryptography, if not offloaded to a specialized engine, can effectively choke the packet processing cores.

The other role that may be assigned to accelerators is to hide the latency of I/O or memory-intensive operations from the cores in charge of running the packet processing pipeline.

The latency of the operation itself is not reduced, but by offloading it to a specialized block, the cores get the chance to do other useful work, i.e. process other packets in the meantime and resume the processing of this packet once the result from the accelerator becomes available.

This approach aims at maximizing core utilization and thus minimizing the number of cores required. The alternative would be to block while busy-waiting for the memory transfers to complete, which would require adding more cores/threads just to keep up with the incoming packet stream and would result in an inefficient setup, with the cores spending a large percentage of their time idle.
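
As an illustration of this pattern, the toy C program below simulates the submit-and-resume flow on a single core. The accelerator is modeled as a fixed completion latency, and the names (accel_job, ACCEL_LAT and so on) are illustrative, not taken from any real driver API; the point is simply that the core keeps submitting and resuming packets instead of blocking on any single one.

#include <stdio.h>

#define N_PKTS    8
#define ACCEL_LAT 3   /* simulated accelerator latency, in loop iterations */

/* One in-flight offload request: packet id plus the time its result is ready. */
struct accel_job { int pkt_id; int done_at; int busy; };

int main(void)
{
    struct accel_job jobs[N_PKTS] = { {0} };
    int next_pkt = 0, completed = 0, tick = 0;

    while (completed < N_PKTS) {
        /* 1. Reap completions: resume the packets whose result is ready. */
        for (int i = 0; i < N_PKTS; i++) {
            if (jobs[i].busy && jobs[i].done_at <= tick) {
                printf("tick %d: resume packet %d\n", tick, jobs[i].pkt_id);
                jobs[i].busy = 0;
                completed++;
            }
        }

        /* 2. Offload the next packet and move on instead of busy-waiting. */
        if (next_pkt < N_PKTS) {
            jobs[next_pkt].pkt_id  = next_pkt;
            jobs[next_pkt].done_at = tick + ACCEL_LAT;
            jobs[next_pkt].busy    = 1;
            printf("tick %d: offload packet %d\n", tick, next_pkt);
            next_pkt++;
        }

        tick++;   /* in a real pipeline, this slot is spent on other useful work */
    }
    return 0;
}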

Dedicating specific cores to acceleration
When accelerators are not available to offload a complex operation from the processing cores, it is always good practice to segregate the code that implements this operation from the code that deals with regular processing and assign them to different cores. The cores are basically divided into two separate categories: packet processing cores and I/O cores.

The purpose of this strategy is to eliminate the busy-waiting operations from the processing cores by moving them to dedicated cores. The performance of the overall system becomes more predictable, as the number of cores dedicated to I/O processing is a direct consequence of the number of I/O operations per packet, which can be determined upfront. This way, busy-waiting no longer prevents the regular processing from making full use of its cores.
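
As a rough, purely illustrative dimensioning example (all figures are assumptions chosen only to show how the core count follows from the per-packet I/O budget): at 10 Gbit/s with minimum-size Ethernet frames, a packet arrives roughly every 67 ns; if each packet needs, say, four dependent memory accesses of about 75 ns each, i.e. around 300 ns of blocking per packet, then on the order of 300/67 ≈ 4.5, or about five cores, would have to be dedicated to this I/O work just to keep pace with the line rate.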

The packet processing cores can regard the I/O cores as accelerators, since the I/O cores effectively implement the complex or memory-intensive tasks that are offloaded from the packet processing cores in this way.

The communication between the two can be implemented with message passing, with the packet processing cores (the clients) sending requests to the I/O cores (the servers) and eventually receiving responses from them when the result of the operation becomes available.
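
A minimal sketch of such a client/server channel is shown below, assuming a simple single-producer/single-consumer ring in shared memory. The ring layout, its size and the helper names are illustrative rather than taken from a particular framework, the server's "slow operation" is reduced to a placeholder computation, and the two sides run sequentially here only to keep the example self-contained.

#include <stdio.h>

#define RING_SIZE 16   /* power of two so the index wraps with a simple mask */

struct ring {
    unsigned head, tail;
    int      slot[RING_SIZE];
};

static int ring_put(struct ring *r, int v)
{
    if (r->head - r->tail == RING_SIZE) return -1;       /* full  */
    r->slot[r->head++ & (RING_SIZE - 1)] = v;
    return 0;
}

static int ring_get(struct ring *r, int *v)
{
    if (r->head == r->tail) return -1;                    /* empty */
    *v = r->slot[r->tail++ & (RING_SIZE - 1)];
    return 0;
}

int main(void)
{
    struct ring req = {0}, rsp = {0};
    int key, result;

    /* Client side (packet processing core): post lookup requests. */
    for (key = 0; key < 4; key++)
        ring_put(&req, key);

    /* Server side (I/O core): consume requests, run the slow operation,
     * post the result back on the response ring. */
    while (ring_get(&req, &key) == 0)
        ring_put(&rsp, key * 100 /* stand-in for the real lookup result */);

    /* Client side again: pick up responses when they become available. */
    while (ring_get(&rsp, &result) == 0)
        printf("response: %d\n", result);
    return 0;
}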

One notable example that can be implemented with this approach is the preprocessing of traffic received from the network interfaces, as well as its preparation for transmission.

On the reception side, the network interface (e.g. a Gigabit Ethernet MAC) receives a packet from the network, stores it into a memory buffer and writes the buffer handle into a ring read by the processor cores.

Building the packet descriptor in the format understood by the rest of the pipeline can be the job of one or more I/O cores, which block while waiting for a new packet to be received by the MAC and implement the low-level handshaking mechanism with the MAC. Similar processing is required on the transmission side.
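
The sketch below illustrates this translation step on the receive side, assuming a made-up MAC RX entry layout and packet descriptor format (the real register and descriptor definitions are device specific): the I/O core walks the hardware ring, filters out bad frames and rewrites each entry into the descriptor format the packet processing cores expect.

#include <stdint.h>
#include <stdio.h>

struct mac_rx_entry {      /* what the hypothetical MAC writes per packet      */
    uint32_t buf_handle;   /* handle of the memory buffer holding the data     */
    uint16_t length;
    uint16_t status;       /* bit 0 used here as a "frame OK" flag             */
};

struct pkt_descriptor {    /* what the packet processing cores expect          */
    uint32_t buf_handle;
    uint16_t length;
    uint8_t  port;
    uint8_t  flags;
};

int main(void)
{
    /* Simulated MAC RX ring with two received frames. */
    struct mac_rx_entry rx_ring[] = {
        { .buf_handle = 0x1000, .length = 64,   .status = 1 },
        { .buf_handle = 0x2000, .length = 1514, .status = 1 },
    };

    for (unsigned i = 0; i < sizeof(rx_ring) / sizeof(rx_ring[0]); i++) {
        if (!(rx_ring[i].status & 1))
            continue;                              /* drop bad frames */

        struct pkt_descriptor d = {
            .buf_handle = rx_ring[i].buf_handle,
            .length     = rx_ring[i].length,
            .port       = 0,
            .flags      = 0,
        };
        /* Here the descriptor would be enqueued on the ring read by the
         * packet processing cores; this sketch just prints it. */
        printf("descriptor: buf=0x%x len=%u\n",
               (unsigned)d.buf_handle, (unsigned)d.length);
    }
    return 0;
}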

To support higher bandwidths (e.g. 10 Gigabit Ethernet), Intel MACs, for example, provide a feature called Receive Side Scaling (RSS), which applies a hashing function built into the MAC to selected fields of the input packet in order to generate, with a uniform distribution, the index of an I/O core out of a pool of cores assigned for this purpose. This way, the load is balanced fairly across the cores so that the high input bandwidth of the network interface can be sustained.
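
The effect of RSS can be pictured with the small program below. The real hardware uses a Toeplitz hash and a driver-configured indirection table; the simple software hash here is only meant to show how the flow identity fields map every packet of a flow to the same core out of the assigned pool, while different flows spread across the pool.

#include <stdint.h>
#include <stdio.h>

#define N_IO_CORES 4

/* Toy flow hash over the usual RSS input fields (IPs and ports). */
static uint32_t flow_hash(uint32_t src_ip, uint32_t dst_ip,
                          uint16_t src_port, uint16_t dst_port)
{
    uint32_t h = src_ip ^ dst_ip ^ (((uint32_t)src_port << 16) | dst_port);
    h ^= h >> 16;                /* mix the bits a little */
    h *= 0x45d9f3b;
    h ^= h >> 16;
    return h;
}

int main(void)
{
    /* Two packets of the same flow land on the same core; a different
     * flow may land on another core. */
    uint32_t a = flow_hash(0x0a000001, 0x0a000002, 1234, 80) % N_IO_CORES;
    uint32_t b = flow_hash(0x0a000001, 0x0a000002, 1234, 80) % N_IO_CORES;
    uint32_t c = flow_hash(0x0a000003, 0x0a000002, 5555, 80) % N_IO_CORES;

    printf("flow 1 -> core %u, again -> core %u, flow 2 -> core %u\n",
           (unsigned)a, (unsigned)b, (unsigned)c);
    return 0;
}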

Another good example is the implementation of the table lookup operation used during the classification and the forwarding stages. The software algorithms for Longest Prefix Match (LPM) require several memory accesses into tree-like data structures for each table lookup.
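
The sketch below shows why, using a deliberately tiny binary trie with two routes; production implementations typically use multibit tries or compressed table schemes, but the chain of dependent memory reads, one per trie level, is the same.

#include <stdint.h>
#include <stdio.h>

struct trie_node {
    int16_t next_hop;            /* -1 if no route stored at this node */
    int16_t child[2];            /* index of child node, -1 if none    */
};

/* Tiny routing table: 10.0.0.0/8 -> hop 1, 10.128.0.0/9 -> hop 2. */
static const struct trie_node trie[] = {
    { -1, {  1, -1 } },   /* 0: root            */
    { -1, {  2, -1 } },   /* 1: prefix 0        */
    { -1, {  3, -1 } },   /* 2: prefix 00       */
    { -1, {  4, -1 } },   /* 3: prefix 000      */
    { -1, { -1,  5 } },   /* 4: prefix 0000     */
    { -1, {  6, -1 } },   /* 5: prefix 00001    */
    { -1, { -1,  7 } },   /* 6: prefix 000010   */
    { -1, {  8, -1 } },   /* 7: prefix 0000101  */
    {  1, { -1,  9 } },   /* 8: 10.0.0.0/8   -> next hop 1 */
    {  2, { -1, -1 } },   /* 9: 10.128.0.0/9 -> next hop 2 */
};

static int lpm_lookup(uint32_t ip)
{
    int node = 0, best = -1;

    for (int bit = 31; bit >= 0 && node >= 0; bit--) {
        if (trie[node].next_hop >= 0)
            best = trie[node].next_hop;           /* longest match so far        */
        node = trie[node].child[(ip >> bit) & 1]; /* one dependent read per step */
    }
    return best;
}

int main(void)
{
    printf("10.1.2.3    -> hop %d\n", lpm_lookup(0x0A010203)); /* matches /8 */
    printf("10.130.0.1  -> hop %d\n", lpm_lookup(0x0A820001)); /* matches /9 */
    printf("192.168.0.1 -> hop %d\n", lpm_lookup(0xC0A80001)); /* no match   */
    return 0;
}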

When the I/O core implementing it receives a request to find the LPM match for a specific key (IP address plus prefix), it performs the sequence of memory reads, blocking after each read until it completes. When the final result is available, it is returned as the response to the packet processing cores. Now it is time to look at the efficiency of the various flavors of the cluster model.
