Making packet processing more efficient with network-optimized multicore designs: Part 2

Apart from the common operations performed by any network-processing-intensive application (see Part 1), packet processing involves specific operations that cannot be implemented efficiently with a general-purpose instruction set. Typical examples include:
Table lookup using algorithms like Longest Prefix Match (LPM) or range matching, required during the classification or forwarding stages of IPv4 routing applications.
Pattern matching, used by deep packet inspection applications that filter the stream of incoming packets by applying regular expressions from the rule database to the packet payload.
Cryptography, used by security applications to decrypt/encrypt the payload of incoming/outgoing packets.
CRC or checksum calculation, used by many networking protocols to verify the integrity of the packets received from the network.
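As an illustration of the last item, the one's-complement checksum used by the IPv4 header can be sketched in software as follows. This is only a minimal sketch; on a network processor this work is typically offloaded to dedicated hardware:

```c
#include <stdint.h>
#include <stddef.h>

/* 16-bit one's-complement checksum over an IPv4 header.
   Bytes are combined in network (big-endian) order; the final
   carry folding implements end-around carry. */
uint16_t ipv4_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    if (len & 1)                        /* odd trailing byte, if any */
        sum += (uint32_t)hdr[len - 1] << 8;
    while (sum >> 16)                   /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```

A receiver recomputes the sum over the header as received (checksum field included); a result of zero indicates an intact header.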
On-chip memory, when present at all, is usually small. As a result, some operations require a significant number of accesses to external memory.
Although several techniques are available to reduce the number of these accesses (e.g. on-chip cache memories) or to optimize them (e.g. DMA engines or memory bank interleaving), the penalty they incur cannot be completely eliminated.
Their impact on the packet budget is significant; sometimes the cycles spent on memory accesses for a single packet can even exceed the per-packet budget of processor cycles. Examples of such operations are:
1) Reading the packet descriptor into local memory and writing it back to external memory (header caching); and,
2) Software implementations of table lookup algorithms.
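To see why a software table lookup is memory-bound, consider a naive longest-prefix-match sketch (the route table and names below are illustrative, not from the article): every probe of the table is a potential external-memory access, so even a modest table can consume the entire per-packet cycle budget.

```c
#include <stdint.h>

struct route {
    uint32_t prefix;   /* network-order address bits of the prefix */
    uint8_t  len;      /* prefix length, 0..32 */
    int      next_hop; /* illustrative next-hop identifier */
};

/* Linear-scan LPM: return the next hop of the longest matching
   prefix, or -1 if no route matches. Each iteration touches the
   table in (potentially external) memory. */
int lpm_lookup(const struct route *tbl, int n, uint32_t addr)
{
    int best_hop = -1, best_len = -1;
    for (int i = 0; i < n; i++) {
        uint32_t mask = tbl[i].len ? ~0u << (32 - tbl[i].len) : 0;
        if ((addr & mask) == tbl[i].prefix && tbl[i].len > best_len) {
            best_len = tbl[i].len;
            best_hop = tbl[i].next_hop;
        }
    }
    return best_hop;
}
```

Production implementations replace the linear scan with multi-level tries or TCAM-assisted lookups precisely to bound the number of memory accesses per packet.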
To work around these two problems, several measures can be considered. Some of these are described below.
Use parallel processing to meet the packet budget
When using the pipeline model, each stage of the pipeline must still meet the packet budget, but the processing that has to fit within that number of clock cycles is only the work of the current stage, rather than the full packet processing performed by the entire system.
Because each pipeline stage processes a different packet, the packet budget is effectively multiplied by the number of pipeline stages.
When using the cluster model, the cluster is still required to process one packet per packet-budget interval, but because it contains several cores running in parallel, each processing a different packet, the packet budget is effectively multiplied by the number of cores within the cluster.
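The budget arithmetic behind both models can be sketched as follows; all of the numbers here are illustrative assumptions, not figures from the article:

```c
/* Back-of-envelope packet-budget arithmetic. For example, a 1000 MHz
   core receiving 10 million packets per second has a budget of 100
   cycles per packet; 4 pipeline stages (or 4 cluster cores), each
   working on a different packet, raise the system-wide budget to
   400 cycles per packet. */
unsigned effective_budget(unsigned core_mhz, unsigned mpps, unsigned units)
{
    unsigned per_packet = core_mhz / mpps; /* single-core cycle budget */
    return per_packet * units;             /* units = stages or cores  */
}
```

The same formula covers both models because in each case `units` independent packets are in flight at once.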
Hiding the latency of accesses to external memory
Multi-threading by itself does not increase processing power. Because the threads run in time-sharing mode, all threads running on the same core share that core's processing cycles.
However, multi-threading can be an important instrument for minimizing wasted core cycles. In a cooperative multi-threading system, the current thread can relinquish control of the core to other threads immediately after issuing a request to the external memory controller.
The memory access takes place while the thread that launched it is dormant and other threads use the core to perform useful work on other packets. The dormant thread does not resume processing its packet until the memory access has completed, so the core never busy-waits on memory and wastes no processing cycles.
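A toy steady-state model makes the effect concrete; the `WORK` and `LATENCY` values are assumed example figures, not measurements:

```c
/* Each packet needs WORK compute cycles plus one memory access of
   LATENCY cycles. With T cooperative threads per core, the core can
   retire a packet every max(WORK, (WORK + LATENCY) / T) cycles:
   once enough threads exist, the memory latency is fully hidden and
   throughput is limited by computation alone. */
enum { WORK = 50, LATENCY = 150 };

unsigned cycles_per_packet(unsigned threads)
{
    unsigned total = WORK + LATENCY;                  /* one packet, wall clock  */
    unsigned share = (total + threads - 1) / threads; /* core time per packet    */
    return share > WORK ? share : WORK;
}
```

With these numbers a single thread spends 200 cycles per packet, while four or more threads bring the cost down to the 50 compute cycles, the memory latency being entirely overlapped with work on other packets.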
Therefore, since it adds no processing power, multi-threading cannot reduce the latency of complex operations, but it is an effective mechanism for hiding that latency from the cores and thus increasing their overall efficiency.