Back to the basics: Using a general purpose CPU for network control and data plane operations


In most embedded networking and communications designs, the primary operations the main processor must provide involve conversion and forwarding of serial traffic (streams, cells, or packets) from one communications interface (port) to another.

This requires that the processor be able to handle the requirements of both the flow-optimized data plane and the less time-critical control plane. Data plane functions are those critical to the traffic flow rate; they include traffic management, data transformation, flow classification, data parsing, media access control and physical layer operations.

Control plane functions are maintenance functions that are not directly required to forward traffic through the system. These include topology management, signaling, network management and policy applications such as provisioning, billing and security. Control plane functions, therefore, can usually be performed at a lower priority than data plane functions for a given system performance.

In most cases, the most flexible approach is to use a general-purpose processor (GPP) design. In this approach, the data traffic is streamed into the processor complex's memory and the GPP performs the data plane functions listed previously. Commonly, a GPP-based approach comes in the form of a communications processor that integrates a GPP with some processor complex functionality (memory controller, interrupt, GPIO) along with communications-specific hardware (MAC, data streaming interfaces, protocol-specific acceleration).

In most other approaches, the data plane and control plane functionality is divided in the hardware, even if the control processor happens to be in the same chip.

With the GPP/communications processor approach, the primary processor typically performs both control and data plane functionality, providing a single programming model (Figure One below) for both types of functionality. Because the GPP can use the full system memory for code and data, the level of potential functionality is virtually unlimited. Adapting functionality as requirements change is also easiest with this approach. The primary design concern with this approach is achieving the desired performance.
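As a rough illustration of the single programming model, the sketch below shows one pass of a main loop that services data plane work first and gives the control plane only what is left. The names (`run_pass`, `work_fn`, `take_one`) and the budget scheme are assumptions made for this example, not part of any particular communications processor's software.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical work source: returns true if it performed one unit of work. */
typedef bool (*work_fn)(void *ctx);

struct plane_stats {
    unsigned data_units;    /* data plane units handled (e.g. packets) */
    unsigned control_units; /* control plane units handled (e.g. tasks) */
};

/*
 * One pass of the main loop: handle up to `budget` data plane units,
 * stopping early if the data plane goes idle, then run at most one
 * control plane unit. The budget keeps the control plane from starving.
 */
static void run_pass(work_fn data_plane, void *data_ctx,
                     work_fn control_plane, void *control_ctx,
                     unsigned budget, struct plane_stats *stats)
{
    assert(data_plane && control_plane && stats);

    for (unsigned i = 0; i < budget; i++) {
        if (!data_plane(data_ctx))
            break;                  /* no traffic waiting */
        stats->data_units++;
    }
    if (control_plane(control_ctx))
        stats->control_units++;
}

/* Toy work source for illustration: a counter of pending work items. */
static bool take_one(void *ctx)
{
    unsigned *pending = ctx;
    if (*pending == 0)
        return false;
    (*pending)--;
    return true;
}
```

With five pending packets, two pending control tasks and a budget of three, the first pass handles three packets and one control task; the second pass handles the remaining two packets and the last control task.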

Extracting Maximum Data Plane Performance
To extract the maximum data plane performance from a general-purpose communications processor, attention must be paid to several critical design parameters involving cache size, packet processing requirements, cache coherency, memory configuration, and cache allocation.

All modern general-purpose processors have at least one level of caching. Cache allows the code and data that are critical for data path forwarding to be accessed locally by the core, reducing the time the core spends processing the traffic. Larger caches allow more of that code and data to stay local to the core. Also, some cache architectures allow certain areas of memory to be locked into cache. Typically this feature is used by running the “normal” path of data plane code to load the cache and then applying the lock. More locking options and smaller granularity of lockable sections translate to more flexibility for the data plane code designer.

Some communications processors include a RISC processor-based packet processor. At a minimum, this will typically perform MAC layer functionality, as a general-purpose processor cannot efficiently perform this task. The MAC functional interface within a communications processor is usually provided through easy-to-use, efficient ring buffer management of the transmit and receive buffers for traffic flow, plus a configurable set of parameters for initialization.

This implies that minimal general-purpose processor overhead is required to transfer the traffic between the media interface and processor memory space. A high-performance communications processor approach dictates that some “packet specific” programmable resource be available to allow the option of performing the complete list of data plane functions without general-purpose processor assistance.
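The ring buffer interface described above can be sketched in a few lines of C. This is a simplified model under stated assumptions: the descriptor layout, the `ready` ownership flag, the ring size and the function names are all hypothetical, and `mac_post` merely stands in for what the MAC hardware would do. A real driver would also need volatile accesses or memory barriers, which are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u   /* assumed value; must be a power of two */

/* Hypothetical descriptor layout; real MACs define their own fields. */
struct buf_desc {
    uint8_t *buf;      /* packet buffer address */
    uint16_t len;      /* bytes written by the MAC */
    bool     ready;    /* set by the MAC, cleared by the core */
};

struct rx_ring {
    struct buf_desc desc[RING_SIZE];
    uint32_t head;     /* next descriptor the MAC will fill */
    uint32_t tail;     /* next descriptor the core will service */
};

/* Stand-in for the MAC hardware: post one received frame to the ring. */
static bool mac_post(struct rx_ring *r, uint8_t *buf, uint16_t len)
{
    assert(r && buf);
    struct buf_desc *d = &r->desc[r->head & (RING_SIZE - 1u)];
    if (d->ready)
        return false;          /* ring full: the core has not caught up */
    d->buf = buf;
    d->len = len;
    d->ready = true;
    r->head++;
    return true;
}

/* Core side: service one descriptor. Returns frame length, or -1 if idle. */
static int core_service(struct rx_ring *r, uint8_t **buf_out)
{
    assert(r && buf_out);
    struct buf_desc *d = &r->desc[r->tail & (RING_SIZE - 1u)];
    if (!d->ready)
        return -1;             /* nothing received since the last call */
    *buf_out = d->buf;
    int len = d->len;
    d->ready = false;          /* hand the descriptor back to the MAC */
    r->tail++;
    return len;
}
```

The core's per-packet cost is a flag check, two field reads and an index increment, which is the sense in which the transfer between the media interface and memory needs minimal general-purpose processor overhead.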

A caching scheme is required to extract the necessary performance from a general-purpose processor that is to be used for data plane processing. In addition, multiple master devices are required to offload at least the MAC layer function and possibly other data plane functions (in a high-performance communications processor).

The combination of a cache and multiple masters implies that data coherency must be considered. Some communications processors include hardware dedicated to maintaining the coherency, including all levels of cache and all possible memory spaces. This feature is superior to requiring either software-enforced coherency (consumes cycles that could be used for data plane functionality) or memory allocation schemes (places constraints on memory usage to attempt to avoid coherency issues).

High performance memory considerations
Some communications processors include a moderate amount (up to 1 Mbyte) of built-in low-latency memory. Data plane functions can leverage this feature in a couple of different ways to increase system performance.

As discussed previously, many communication processors include MAC layer functionality that is invoked utilizing a ring of buffer descriptors. Servicing these buffer descriptors can consume processing time critical to data plane performance.

Placing these descriptors in a low-latency memory, especially one local to the processing core, can significantly reduce the time spent managing them.

Another task that can be sped up by a low-latency memory is classification, which usually requires several accesses to a table residing in memory to perform a lookup algorithm. By creating and managing the table in local memory, the amount of time spent on this task is reduced.
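To make the several-accesses-per-lookup point concrete, here is a minimal sketch of an exact-match flow table using open addressing with linear probing. It is only one of many possible lookup algorithms; the key fields, table size, hash constants and function names are assumptions for illustration. Each probe is one memory access, so placing `table` in low-latency memory shortens every lookup.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SLOTS 256u   /* assumed size; power of two, small enough for local memory */

struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

struct flow_entry {
    struct flow_key key;
    uint16_t out_port;     /* classification result */
    uint8_t  in_use;
};

static struct flow_entry table[TABLE_SLOTS];

static int key_equal(const struct flow_key *a, const struct flow_key *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->proto == b->proto;
}

/* Simple multiplicative mix for illustration; not a production hash. */
static uint32_t key_hash(const struct flow_key *k)
{
    uint32_t h = k->src_ip * 2654435761u;
    h ^= k->dst_ip * 2246822519u;
    h ^= (((uint32_t)k->src_port << 16) | k->dst_port) * 3266489917u;
    h ^= k->proto;
    return h ^ (h >> 15);
}

/* Control plane side: install or update a flow. Returns 0, or -1 if full. */
static int flow_insert(const struct flow_key *k, uint16_t out_port)
{
    assert(k);
    uint32_t idx = key_hash(k) & (TABLE_SLOTS - 1u);
    for (uint32_t n = 0; n < TABLE_SLOTS; n++) {
        struct flow_entry *e = &table[(idx + n) & (TABLE_SLOTS - 1u)];
        if (!e->in_use || key_equal(&e->key, k)) {
            e->key = *k;
            e->out_port = out_port;
            e->in_use = 1;
            return 0;
        }
    }
    return -1;
}

/* Data plane side: returns the output port, or -1 on a miss. */
static int flow_lookup(const struct flow_key *k)
{
    assert(k);
    uint32_t idx = key_hash(k) & (TABLE_SLOTS - 1u);
    for (uint32_t n = 0; n < TABLE_SLOTS; n++) {
        const struct flow_entry *e = &table[(idx + n) & (TABLE_SLOTS - 1u)];
        if (!e->in_use)
            return -1;         /* an empty slot ends the probe sequence */
        if (key_equal(&e->key, k))
            return e->out_port;
    }
    return -1;
}
```

In this sketch the control plane code (`flow_insert`) and the data plane code (`flow_lookup`) operate on the same table in the same address space, so no second copy has to be kept in sync.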

Note that in many network processor implementations the classification task may consume so much of the packet processing resource that an external lookup engine is required. Also note that managing and maintaining the lookup table is typically considered a control task; a network processor approach will likely require replicating the lookup table in both the control processor and data processor memory areas, and may require additional overhead to ensure that the tables in both areas are properly synchronized.

Pre-emptive cache allocation
Most of the data plane processing activity is performed on a small portion of the protocol data unit (PDU), typically just the media and protocol header. Configuring the I/O traffic flow to allocate the header into cache as the PDU is received can result in significant performance improvement. In some communications processor architectures, this technique is referred to as stash caching.

The user defines the size and location of the header area of the PDU during I/O port initialization. As a PDU is received, this feature ensures that the header portion of the PDU is copied into cache, even if the memory location has not been previously allocated; the allocation is thus forced. The GPP core then has direct, low-latency access to the information it requires to perform the remainder of the data plane functions (parsing, classification, transformation, etc.).
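When choosing the size and location of the stashed header area, it helps to know how many cache lines each received PDU will claim. The sketch below computes that figure; the `stash_cfg` structure and its field names are hypothetical and do not correspond to any specific device's registers, and the receive buffer is assumed to start on a cache line boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-port stash settings; names are illustrative only. */
struct stash_cfg {
    uint16_t offset;     /* byte offset of the header within the buffer */
    uint16_t length;     /* number of header bytes to place in cache */
    uint16_t line_size;  /* cache line size in bytes, a power of two */
};

/*
 * Number of cache lines the stash touches per PDU, assuming the receive
 * buffer itself starts on a cache line boundary.
 */
static unsigned stash_lines(const struct stash_cfg *c)
{
    assert(c && c->line_size != 0 &&
           (c->line_size & (c->line_size - 1)) == 0);
    if (c->length == 0)
        return 0;
    unsigned first = c->offset / c->line_size;
    unsigned last  = (c->offset + c->length - 1u) / c->line_size;
    return last - first + 1;
}
```

For example, a 54-byte header (14 bytes of Ethernet, 20 of IPv4 and 20 of TCP, without options) at offset 0 with 32-byte lines occupies two cache lines per PDU, while a 64-byte stash starting at offset 2 spills into a third.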

Useful processor performance benchmarks
It is important that the general-purpose processor core have enough performance not only to perform the control plane functions of the system, but also to contribute to the data plane portion. Emerging communications processors have general-purpose cores with operating frequencies that approach 1.5 GHz.

However, the system designer should look beyond operating frequency when evaluating the performance of the general-purpose processor within a system. Typically, a superscalar processor (one with multiple instruction units) that offers features such as branch prediction, a sophisticated memory management unit (MMU), and out-of-order instruction execution will outperform simpler general-purpose processors.

Useful benchmarks for general-purpose processor performance include Dhrystone, which can provide an idea of the number of instructions per clock possible, and the EEMBC networking and telecommunications application benchmark. The silicon vendor should be able to provide benchmarking data for processor performance.
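Dhrystone results are usually normalized so that cores running at different clock rates can be compared. The conventional reference is the VAX 11/780, treated as a 1 MIPS machine, which scores 1757 Dhrystones per second; dividing by that figure gives DMIPS, and dividing again by the clock rate gives DMIPS/MHz. The function names below are assumptions for this sketch, and DMIPS/MHz is only a rough proxy for work done per clock, not a literal instruction count.

```c
#include <assert.h>

/* The VAX 11/780 reference machine scores 1757 Dhrystones per second. */
#define DHRYSTONES_PER_SEC_VAX 1757.0

/* Convert a raw Dhrystone score into DMIPS. */
static double dmips(double dhrystones_per_sec)
{
    return dhrystones_per_sec / DHRYSTONES_PER_SEC_VAX;
}

/* Normalize by clock rate so cores at different frequencies compare fairly. */
static double dmips_per_mhz(double dhrystones_per_sec, double core_mhz)
{
    assert(core_mhz > 0.0);
    return dmips(dhrystones_per_sec) / core_mhz;
}
```

A core that scores 3,514,000 Dhrystones per second at 1000 MHz works out to 2000 DMIPS, or 2.0 DMIPS/MHz.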

Other data plane acceleration issues
High performance communications processors may also have specific data plane hardware acceleration features. These hardware acceleration blocks target specific functionality and are typically used with simple parameter initialization and/or ring buffer management techniques (as in the MAC example).

In contrast, dedicated network processors require specific (re)programming to achieve similar functionality. Examples of these acceleration features include encryption/authentication, CRC validation/generation, traffic management/queuing support and classification co-processing.

David Smith is a senior field applications engineer at Freescale Semiconductor and is based in Raleigh, NC.
