Network processors are designed for speed, but programming them is often a challenge. Here, you'll find the information you need to write software that makes the most of them.
Network processors are specialized CPUs optimized to support the implementation of network protocols at the highest possible speed. The overarching emphasis on speed results in unconventional hardware architectures that create new challenges for the software engineer.
This article discusses the forces that influence the design of software running on and communicating with network processors. It also explores why network processors are needed and examines the common multiprocessor architectures underlying the variety of network processors. In it, I will discuss software architectures that are effective for network processors, and look at functionality implemented in network processor software, as well as network processor programming languages. I will address the challenge of portability and the role of industry standards in meeting that challenge. Finally, I will highlight the impact of network processors on the architecture of surrounding software such as protocol stacks.
Where are network processors used?
A network processor is used in a network traffic manager, which occupies the space between a network interface and a switch fabric in a switch/router. The traffic manager decides where, when, and how incoming and outgoing data will be sent next. It strips, adds, and modifies packet headers. It also makes routing and scheduling decisions. The traffic manager has interfaces to the network and to the switch fabric. In Figure 1, these are labeled PHY (physical interface) and CSIX (common switch interface) respectively.
Figure 1: Traffic manager context
Early traffic managers were built around a general purpose processor (GPP). The GPP was supported by a direct memory access controller (DMAC) and simple I/O devices. Traffic was transferred in protocol data units (PDUs) between memory and the switch fabric or network interface. The GPP accessed each PDU and programmed the peripheral devices to dispose of it, as shown in Figure 2.
Figure 2: Early traffic manager architecture
This architecture changed as network speed outpaced processor and bus speed. The switch fabric interface and network interface were integrated into a single application-specific integrated circuit (ASIC) to allow PDUs to be transferred without passing over the system bus. This is illustrated in Figure 3.
Figure 3: ASIC-based traffic manager
This new architecture meant that control of individual PDUs was delegated to the ASIC. The ASIC ran hard-wired network protocols. It passed the majority of traffic through, transferring to the GPP only those PDUs involved in control or signaling, or those that required unusual processing. Network processors are designed to replace the fixed-function ASIC, adding software programmability to wire speed processing.
Why do we need network processors?
Network processors are powerful devices and a challenge for embedded software engineers. They come in a bewildering variety of architectures, but they share a defining characteristic: they sit in high-speed data paths and they manipulate network data at sustained speeds of gigabits per second in software. That's a tall order and it raises two questions.
First, why is speed suddenly a problem when silicon is always getting faster? Second, why does the functionality of a network processor have to be implemented in software? It is established practice to improve speed by migrating functionality from software to silicon. Network processors go against this trend.
Speed is a problem because the bandwidth of optical fiber is growing at an even faster rate than the speed of silicon. The relentless improvement of semiconductor performance is legendary and it is surprising to learn that other technologies can eclipse it. Figure 4 shows the trends of network and silicon speed over recent years. In the nine years shown, CPU clock speed increased by a factor of 12, while network speed increased by a factor of 240. The exponential rate of network bandwidth growth is expected to continue because it is a long way from fundamental barriers.
Figure 4: Network bandwidth vs. silicon speed
Another reason why speed is a problem is the amount of processing that has to be done on data. Consider IP packet processing. Traditionally, an IP packet needed very little processing: decrement the time-to-live counter, recalculate the header checksum, and choose a route based on the destination address. That was before real-time data, quality of service (QoS), and security (IPsec). Routing decisions that consider only network topology must be replaced with a complex evaluation of latency, jitter, congestion, bandwidth guarantees, and more.
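As a rough illustration of the traditional per-packet work described above, the following Python sketch decrements the time-to-live field of an IPv4 header and recomputes the header checksum. The function names and the sample header bytes are illustrative only, route selection is omitted, and a real network processor would do this work in microcode or hardware rather than in Python.

```python
import struct
from typing import Optional


def ipv4_header_checksum(header: bytes) -> int:
    """Return the 16-bit one's-complement checksum over an IPv4 header.

    The checksum field (bytes 10-11) must be zero when computing a new
    checksum. Over a header that already carries a valid checksum, the
    result is 0.
    """
    words = struct.unpack(f"!{len(header) // 2}H", header)
    total = sum(words)
    # Fold any carries above 16 bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def decrement_ttl(header: bytes) -> Optional[bytes]:
    """Decrement the TTL (byte 8) and refresh the checksum.

    Returns None when the TTL would reach zero, meaning the packet
    should be discarded rather than forwarded.
    """
    hdr = bytearray(header)
    if hdr[8] <= 1:
        return None
    hdr[8] -= 1
    hdr[10:12] = b"\x00\x00"  # clear the old checksum before recomputing
    hdr[10:12] = struct.pack("!H", ipv4_header_checksum(bytes(hdr)))
    return bytes(hdr)


# Sample 20-byte IPv4 header (no options) with a valid checksum.
SAMPLE_HEADER = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
forwarded = decrement_ttl(SAMPLE_HEADER)
```

Even this minimal step touches every 16-bit word of the header, which hints at why richer per-packet processing quickly consumes the available time.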
And that's just routing. Consider the possibilities of performing firewalling, spam detection, virus detection, and similar functions on the wire. Network processors can fight virus epidemics by detecting and eliminating viruses in transit. All in all, an IP packet going through the Internet in a few years' time will receive a lot more individual attention than its predecessors did.
The complexity of this individual packet processing is the reason why it needs to be done in software. The new functionality is not only complex, but evolving, and subject to change in the field.
ISPs and airlines
Contrast the service pricing offered by an Internet service provider (ISP) to that offered by a vendor in a more mature market. ISPs offer a basic product with few variants. They provide a connection with a specified maximum bandwidth and no guarantees of bandwidth availability or quality of service.
Airlines, on the other hand, maximize their revenue by selling the same basic product at a lot of different prices. Airlines offer discounts on advance booking, Saturday night stopovers, return flights, and so on. They make special offers at short notice in response to competitive pressure.
To make an ISP's business model more like an airline's will require access to packets on the wire. Traffic must be categorized and measured. Pricing and other policies must be enforced. ISPs need network processors to perform these functions at wire speeds.
What do network processors do?
Network processors manipulate PDUs at wire speed to implement a variety of functions, including QoS, encryption, and firewalling. Let us consider what that means in terms of implementation. These features are specified as network protocols, so they are implemented in protocol stacks. But network processors do not run entire protocol stacks. Protocol stacks are designed to run on GPPs, and GPPs are designed (among other things) to run protocol stacks. The role of the network processor is to implement only those parts of a protocol that require direct access to the data stream. Complex behavior is left to the GPP.
To put it another way, assume that a traffic manager spends 90% of its time running 10% of a network protocol. It makes sense to partition the system with a network processor running the 10% of a protocol that takes 90% of the time, leaving a GPP to run the 90% of the protocol that takes 10% of the time. The network processor's workload boils down to logically simple functionality, such as detecting PDUs that match specified patterns, counting PDUs, and enqueueing PDUs.
So many PDUs, so little time
Because wire speed data arrives at the network processor quickly and must be dispatched just as speedily, the network processor has very little time to operate on a PDU. For example, a 133MHz network processor processing 40-byte IP packets at OC48 speeds (2.4Gbps) sees a packet arrive every 21.7 clock cycles. We will see later how the network processor makes the most of this time slot, but, needless to say, it does not have time to execute many instructions per PDU, so it cannot perform complex processing.
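The cycle budget quoted above can be reproduced with a few lines of arithmetic. This sketch assumes the nominal OC-48 line rate of 2.48832Gbps (which the text rounds to 2.4) and the 406-bit PDU size that the article gives later for a 40-byte IP packet over PPP over SONET; the function name is illustrative.

```python
OC48_LINE_RATE_BPS = 2.48832e9  # assumed nominal OC-48 line rate
PDU_BITS = 406                  # 40-byte IP packet over PPP over SONET
CLOCK_HZ = 133e6                # network processor clock from the example


def cycles_per_pdu(clock_hz: float, line_rate_bps: float, pdu_bits: int) -> float:
    """Clock cycles that elapse between back-to-back PDU arrivals."""
    arrival_interval_s = pdu_bits / line_rate_bps
    return clock_hz * arrival_interval_s


budget = cycles_per_pdu(CLOCK_HZ, OC48_LINE_RATE_BPS, PDU_BITS)
print(f"{budget:.1f} clock cycles per PDU")  # prints "21.7 clock cycles per PDU"
```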
We can make some generalizations about what type of processing is done on a PDU between when it is received and when it is retransmitted. First, it is examined to determine what further processing will be done on it. This examination consists of looking at the PDU contents to see which patterns of interest it contains. This process is referred to as classification and it is used in routing, firewalling, quality of service implementation, and policy enforcement.
For example, a network processor might be enforcing a policy that prioritizes an enterprise's internal communications over external web traffic. The first step in this process is to distinguish between the two traffic types.
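A minimal sketch of that first step might look like the following. The enterprise address range, the port list, and the class names are assumptions made up for illustration; a real classifier would typically match on raw header bits using dedicated hardware.

```python
import ipaddress

# Hypothetical enterprise address range, chosen only for illustration.
ENTERPRISE_NET = ipaddress.ip_network("10.0.0.0/8")
# Ports treated as web traffic in this sketch.
WEB_PORTS = {80, 443}


def classify_pdu(src: str, dst: str, dst_port: int) -> str:
    """Assign a traffic class from the source, destination, and port."""
    src_addr = ipaddress.ip_address(src)
    dst_addr = ipaddress.ip_address(dst)
    # Both ends inside the enterprise: internal communication.
    if src_addr in ENTERPRISE_NET and dst_addr in ENTERPRISE_NET:
        return "internal"
    # Otherwise, traffic to a web port counts as external web traffic.
    if dst_port in WEB_PORTS:
        return "external-web"
    return "other"
```

The class returned here would then select the queue or priority applied to the PDU in later processing steps.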
A PDU may be modified. For example, an IP packet will have its time-to-live counter reduced. In label-switched traffic, an incoming label will be replaced with an outgoing label. Headers may be added or removed. Modification usually entails recalculation of a CRC or checksum.
Retransmission of a PDU is not generally straightforward. Some PDUs may be prioritized over others. Some may be discarded. Multiple queues may exist with different priorities.
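One simple way to picture this is a strict-priority scheduler with a bounded queue per priority level, sketched below. The class name, the tail-drop policy, and the strict-priority service order are assumptions chosen for brevity; real devices offer a range of queueing and discard disciplines.

```python
from collections import deque


class PriorityScheduler:
    """Strict-priority transmit scheduler with tail drop (illustrative)."""

    def __init__(self, levels: int, depth: int) -> None:
        # Index 0 is the highest priority.
        self.queues = [deque() for _ in range(levels)]
        self.depth = depth
        self.dropped = 0

    def enqueue(self, pdu, priority: int) -> bool:
        """Queue a PDU, or discard it if its queue is already full."""
        queue = self.queues[priority]
        if len(queue) >= self.depth:
            self.dropped += 1
            return False
        queue.append(pdu)
        return True

    def dequeue(self):
        """Return the next PDU to transmit, highest priority first."""
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None
```

With two levels, a PDU queued at priority 0 is always transmitted before anything waiting at priority 1, and a full queue silently discards new arrivals.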
Classification, modification, queueing, and buffer management are some of the wire speed operations that a network processor may perform. Others include security (encryption, decryption, and authentication), policing, compression, and traffic metrics.
How do network processors work?
With network bandwidth relentlessly outpacing silicon speed and with per-packet processing going through the roof, what kind of processor architecture can address the challenge? A network processor architecture must extract a high level of performance per gate of silicon. In pursuit of this, it compromises on issues such as ease of programming.
Consider how much silicon a modern GPP such as a Pentium or a PowerPC devotes not just to achieving speed, but to preserving ease of programming. The GPP takes a sequence of instructions, pipelines it, analyzes and dissects it, executes multiple instructions in parallel, manages multiple copies of registers, caches memory, and produces an illusion that the instructions were executed one after the other. Network processors eschew these complexities and use the saved silicon for other purposes.
Although a wide variety of network processor architectures exist, they share some common themes. One theme is multiprocessing. A network processor contains not one, but many individual processors, which range in complexity from quite limited to C++ programmable RISC processors. They use different strategies to divide up processing and they have different internal data flows. The terms that vendors use for them also vary, from pico-processor to RISC core. In this article, I will use the term processing element (PE) to refer to the individual processing units of a network processor.
A multiprocessor architecture has the potential to multiply the amount of processing time that a network processor can devote to a PDU by the number of PEs. The network processor designer trades off the size and power of a PE against the number of PEs that can be included in the device, whereas the GPP designer uses all available silicon to make a single processor as fast as possible. A multiprocessor composed of many less powerful PEs can deliver more total processing power than a single processor built from the same silicon resources.
The sophistication and ease of programming of PEs vary greatly among different devices. The Motorola (formerly C-Port) C5 network processor has 16 RISC cores and is supported by a C++ compiler. IBM's PowerNP device, on the other hand, has eight protocol processing units, each containing two picocode engines. A picocode engine can run two threads with zero context switch time, but it must be programmed in assembly language using a detailed knowledge of the processor architecture.
In addition to multiple PEs, network processors contain narrow focus coprocessors for tasks such as pattern matching, table lookup, and checksum generation.
Figure 5 illustrates the architecture of a typical network processor in a traffic manager. It represents PEs and co-processors as small boxes inside the network processor.
Figure 5: Network processor architecture
Figure 6: Pipelined processing element architecture
Figure 7: Parallel processing element architecture
Network processors are multiprocessors and we will now address the question of how to divide the processing among the different PEs. The answer depends strongly on the hardware architecture, particularly how the PEs are connected together and how data is passed between them.
The workload on a network processor is determined by the arrival rate of PDUs. The device must process PDUs as fast as they arrive (at a specified arrival rate). The behavior of the network processor tends to depend strongly on the content of the PDU. This is evidenced by the fact that classification is the first step in many network processor operations.
A basis for division of processing
If implementing the algorithms in parallel were easy, then we could apply N PEs to processing a PDU and have it processed N times faster. But we have to implement many algorithms and most are not easy to parallelize. They do, however, share one property that can be exploited by a multiprocessor: they operate on one PDU at a time. If a PE can be dedicated to processing one PDU at a time, a multiprocessor and its software can be designed to process N PDUs in the time it takes to process one, rather than try to process one PDU N times faster.
One way to achieve this is to arrange PEs in parallel, with a PE processing a PDU from start to finish. The other way is to arrange the PEs in a pipeline, with each stage performing partial processing on an individual PDU. We will discuss these configurations in detail later, but first we must consider the fact that they both result in higher latency. It is generally much easier to increase the throughput of a system if it is possible to compromise on latency. Before we can adopt this simplifying strategy, we must be satisfied that the impact on latency is acceptable. The question arises as to whether it is acceptable to increase the processing time of a PDU by a factor of N, where N is the number of PEs in the network processor.
To answer this, we must judge what latency is acceptable. This depends on the type of traffic. For video and audio traffic, latency is relatively unimportant; the acceptable limit is set by human perception. For a telephone conversation, one-way latency up to 150 milliseconds is imperceptible. For one-way video or audio traffic, latency of several seconds is acceptable, since it is only perceived at the start of the stream. Latency requirements for computer data vary. Web traffic can have higher latency than telephone traffic and FTP traffic can have much higher latency. Other data traffic, however, is sensitive to latency. The runtime of a distributed program may be dominated by latency. In the worst case, network processor latency will be acceptable if it does not contribute significantly to the total end-to-end latency of the PDU.
Electrical signals on a wire, optical signals in a fiber, and radio signals are all examples of electromagnetic radiation; all propagate at the speed of light in the medium. The speed of light in the medium ranges from about 2/3c to c, where c is the speed of light in a vacuum (about 3×10^8 meters per second). Propagation delay in a local network is on the order of 1µs, while traffic between Internet nodes 6,000km apart will experience a propagation delay of 20ms to 30ms.
Other contributions to latency include buffering at routers and switches, and software overhead at the source and destination. These delays typically combine to exceed 10ms. You can run the ping utility on the LAN and on the Internet to confirm this.
A network processor propagation delay that is small compared to 10ms would be acceptable except for one thing. There may be several network processors in a data path. Assuming that no more than ten network processors are in a network path (a large figure, since network processors are found at network edges) and assuming that a combined contribution of 1% to propagation delay is acceptable, we can accept a propagation delay of 10µs per network processor.
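The latency figures in this discussion follow from simple arithmetic, sketched here in Python. The function names are illustrative; the inputs (6,000km, a signal speed between 2/3c and c, a 10ms end-to-end delay, a 1% share, and ten devices) are the ones used in the text.

```python
C_VACUUM_M_PER_S = 3.0e8  # speed of light in a vacuum, approximately


def propagation_delay_s(distance_m: float, velocity_factor: float) -> float:
    """One-way propagation delay over a medium carrying signals at
    velocity_factor times the speed of light in a vacuum."""
    return distance_m / (velocity_factor * C_VACUUM_M_PER_S)


def per_device_budget_s(end_to_end_s: float, acceptable_share: float,
                        device_count: int) -> float:
    """Latency each network processor may add if all devices together
    may add only acceptable_share of the end-to-end delay."""
    return end_to_end_s * acceptable_share / device_count


fastest = propagation_delay_s(6.0e6, 1.0)         # about 20ms at c
slowest = propagation_delay_s(6.0e6, 2.0 / 3.0)   # about 30ms at 2/3c
budget = per_device_budget_s(10e-3, 0.01, 10)     # about 10 microseconds
```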
How many PEs can we use in a network processor?
The latency budget determines how many PDUs we can process simultaneously. We can process simultaneously the number of PDUs that arrive in a 10µs interval, allowing a processing time of 10µs for each. If we assign one PDU to one PE at a time, this is equal to the number of PEs that we can use in the network processor.
The arrival rate of PDUs is equal to the network bandwidth divided by the number of bits in the PDU, including overhead. The earlier example of a 40-byte IP packet over PPP over SONET gives a PDU size of 406 bits. For a given PDU size, the number of PDUs we can process simultaneously is proportional to network bandwidth because acceptable latency remains the same. This means that a network processor architecture based on PEs that each handle one PDU at a time can, in principle, handle increasing bandwidth just by increasing the number of PEs. Table 1 shows how many PEs can be usefully included in a network processor at various speeds with a minimum PDU size of 40 bytes and a network processor latency of 10µs.
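The relationship between bandwidth, PDU size, and the number of PDUs in flight can be sketched as follows. This is not a reconstruction of Table 1; it assumes the nominal OC-48 line rate of 2.48832Gbps and uses the 406-bit PDU size and 10µs latency from the text. The function names are illustrative.

```python
import math

PDU_BITS = 406       # 40-byte IP packet over PPP over SONET
LATENCY_S = 10e-6    # acceptable latency per network processor


def pdu_arrival_rate(line_rate_bps: float, pdu_bits: int) -> float:
    """PDUs per second: bandwidth divided by bits per PDU."""
    return line_rate_bps / pdu_bits


def pdus_in_flight(line_rate_bps: float, pdu_bits: int, latency_s: float) -> int:
    """Whole PDUs that arrive within the latency window. With one PDU
    per PE, this is also the number of PEs that can be kept busy."""
    return math.floor(pdu_arrival_rate(line_rate_bps, pdu_bits) * latency_s)


oc48_pes = pdus_in_flight(2.48832e9, PDU_BITS, LATENCY_S)
print(oc48_pes)  # prints 61
```

Doubling the line rate doubles the result (apart from rounding), which is the proportionality the paragraph describes.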
Table 1: Multiprocessor scalability
|Standard|Data rate (Gbps)|40-byte PDUs per 10µs|
This is somewhat optimistic, however, since it assumes that no overhead is incurred as the number of PEs increases. More realistically, factors limiting multiprocessor performance become dominant as the number of PEs increases. These factors include contention for shared resources and increased divergence of PDU processing times. Because of these limitations, network processor vendors plan architectural changes for future generations of network processors, in addition to increasing the number of PEs.
For reasons we have discussed, PEs are organized within network processors around the processing of PDUs. A PE receives a PDU, operates on it, and passes it on. The device architecture determines the flow of PDUs through it.
There are two basic ways to deliver PDUs to PEs. One is a pipeline where the PEs are arranged serially. The PDU is delivered to the first PE in the pipeline. It does partial processing on the PDU, passes it on to the next PE, and starts work on the next PDU. There is a PDU in each pipeline stage, so N PDUs are being processed simultaneously. The pipeline architecture is difficult to load balance because the rate of progress of the entire pipeline is determined by the slowest stage. On the other hand, the pipeline organization means that each PE is doing a different part of the processing, so it can be optimized for that task. Anticipating that classification (which determines what the network processor does with a PDU) occurs early in processing, a network processor designer may optimize the first pipeline stage accordingly.
The other basic way to organize PEs is in parallel. A PE receives a PDU, does all processing on it, and dispatches it. This architecture is better for load balancing than the pipelined architecture because each PE has the same work to do. Processing time can still vary, of course, because it depends on PDU contents. Although the architecture can accommodate PEs finishing out of order, this leads to complications with queueing and internal event ordering.
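The load-balancing difference between the two organizations can be seen in a deliberately idealized model. The sketch below ignores contention, hand-off cost, and variation between PDUs, and the stage times are made-up numbers; it only illustrates why an unbalanced pipeline is limited by its slowest stage.

```python
from typing import Sequence


def pipeline_throughput(stage_times_us: Sequence[float]) -> float:
    """PDUs per microsecond for one PE per stage: the slowest stage
    sets the pace of the whole pipeline."""
    return 1.0 / max(stage_times_us)


def parallel_throughput(stage_times_us: Sequence[float], num_pes: int) -> float:
    """PDUs per microsecond when each of num_pes PEs performs every
    processing step on its own PDU."""
    return num_pes / sum(stage_times_us)


def latency_us(stage_times_us: Sequence[float]) -> float:
    """Per-PDU processing latency, the same in both idealized models."""
    return sum(stage_times_us)


unbalanced = [1.0, 2.0, 1.0]  # hypothetical per-stage times in microseconds
print(pipeline_throughput(unbalanced))     # prints 0.5
print(parallel_throughput(unbalanced, 3))  # prints 0.75
```

With perfectly balanced stages the two organizations deliver the same throughput in this model; the gap opens only as stage times diverge.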
Other issues distinguishing pipelined and parallel architectures include contention for shared resources and ease of programming.
Data and control planes, fast and slow paths
Much of the behavior of a network processor is subject to control and configuration. A classifier function must be told what patterns to detect and a queue management function must have its queues specified. Routing tables need to be updated. Control and configuration parameters originate either in policy decisions or in network protocols, but they are usually conveyed to the network processor by the GPP. The network processor is said to operate in the data plane and the GPP is said to operate in the control plane.
Information also flows from the data plane to the control plane. The network processor may deliver signaling PDUs to the GPP. It may gather statistics that are returned to the GPP. It may notify the GPP of error conditions.
Although signaling PDUs travel into and out of the traffic manager through the network processor along with other PDUs, they are different in that they are usually handled by the GPP. PDUs that are handled by the GPP are said to travel the slow path, while the majority that enter and exit without being seen by the GPP are said to travel the fast path.
Some non-signaling PDUs also travel the slow path. A network processor may delegate PDUs with unusually complex processing to the GPP, to reduce complexity and size of the network processor code. This tactic also prevents difficult PDUs from reducing the ability of the network processor to handle its normal workload.
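The division of labor between the two planes can be sketched as a small model. Everything here is hypothetical: the class, the method names, and the dictionary representation of a PDU are invented for illustration, and the only slow-path triggers modeled are signaling PDUs and destinations with no installed route.

```python
from collections import Counter, deque
from typing import Optional


class DataPlane:
    """Toy model of a network processor's view of the two paths."""

    def __init__(self) -> None:
        self.routes = {}          # installed by the control plane (GPP)
        self.stats = Counter()    # gathered here, read back by the GPP
        self.slow_path = deque()  # PDUs handed to the GPP

    def install_route(self, dest: str, port: int) -> None:
        """Control plane to data plane: update the routing table."""
        self.routes[dest] = port

    def handle(self, pdu: dict) -> Optional[int]:
        """Return the egress port for fast-path PDUs; hand signaling
        or unroutable PDUs to the GPP and return None."""
        if pdu.get("signaling") or pdu["dest"] not in self.routes:
            self.stats["slow"] += 1
            self.slow_path.append(pdu)
            return None
        self.stats["fast"] += 1
        return self.routes[pdu["dest"]]
```

The GPP would drain `slow_path` and read `stats` at its own pace, while the fast path continues without waiting for it.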
Portability and the CPIX API
Network processors impact the design and implementation of applications such as network protocols and network management software. These applications are traditionally implemented in C or C++ with a fairly simple model of the processor on which they are running, which makes them portable across different processors. To make use of a network processor, the software must be divided into a control plane running on the GPP and a data plane running on the network processor. An interface is also needed between the control plane and the data plane.
The data plane software depends heavily on the network processor architecture and tools. The network processor is fast and dumb compared to the GPP. It knows nothing of routing algorithms or policies. It only understands simple things such as patterns, tables, and queues.
At first sight, portability might seem unachievable. But consider that the control plane software, which can still be portable, constitutes the vast majority of the software. The data plane software consists of small nuggets in the protocol or network management. The Network Processing Forum is developing an API, called the CPIX API, between the control plane and the data plane, which is to be supported by the majority of network processors. It aims to model a diverse range of network processors with a common abstraction. So the prospect exists of writing portable GPP software that abstracts a network processor behind an industry standard API. The software underneath the API would be supplied by the network processor vendor and can be well tuned to the network processor architecture. Software above the API can exploit a network processor while retaining portability.
If the CPIX API embodies the right abstractions, it can hide the architecture of the network processor. This allows GPP software (such as a network protocol implementation) to be independent of the network processor in the system and it also gives the network processor vendor more freedom to alter the network processor architecture.
The CPIX API is divided into functional blocks. Each block abstracts an aspect of network processor functionality. The list of blocks is not yet definitive, but it is expected to include ingress port, parsing and searching (classification), policing and shaping, modification, and egress port. (See www.networkprocessingforum.org for the latest information.)
CPIX is at an early stage and the API is only partially defined. It remains to be seen whether CPIX will be powerful and flexible enough to leverage network processors for still undefined protocols and services.
However, one thing is clear. Network protocol implementations and other applications that might use network processors need to be designed appropriately. The software architecture needs to distinguish between the control plane and the data plane, so that the data plane processing can easily be isolated for implementation on a network processor.
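The kind of separation argued for here can be sketched with an abstract interface. This is not the CPIX API, whose definition is incomplete; the class and method names below are invented purely to illustrate how control-plane code can stay portable while a vendor-supplied layer hides the device.

```python
from abc import ABC, abstractmethod
from collections import Counter


class ClassifierBlock(ABC):
    """Hypothetical functional block as seen from the control plane."""

    @abstractmethod
    def add_pattern(self, name: str, prefix: bytes) -> None:
        """Ask the data plane to detect PDUs starting with prefix."""

    @abstractmethod
    def hit_count(self, name: str) -> int:
        """Read back how many PDUs matched the named pattern."""


class SimulatedClassifier(ClassifierBlock):
    """Stand-in for vendor code that would program a real device."""

    def __init__(self) -> None:
        self._patterns = {}
        self._hits = Counter()

    def add_pattern(self, name: str, prefix: bytes) -> None:
        self._patterns[name] = prefix

    def hit_count(self, name: str) -> int:
        return self._hits[name]

    def process(self, pdu: bytes) -> None:
        """Data-plane side: count the PDU against matching patterns."""
        for name, prefix in self._patterns.items():
            if pdu.startswith(prefix):
                self._hits[name] += 1


def configure_policy(block: ClassifierBlock) -> None:
    """Portable control-plane code: depends only on the interface."""
    block.add_pattern("ipv4-no-options", b"\x45")
```

Because `configure_policy` sees only `ClassifierBlock`, swapping `SimulatedClassifier` for a different vendor's implementation would not change the control-plane code.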
How are processing elements programmed?
The majority of embedded system software is now developed in C or C++. From the discussion so far, the reader may suspect that the picture for network processors is a bit different. Many PEs have an architecture that is not a good target for C and C++. One, for example, lacks support for pointers in its instruction set. The architectural mismatch reflects a design philosophy that is quite different from the design philosophy of a GPP. While a GPP is designed to run C/C++ as fast as possible, a PE is designed to process PDUs as fast as possible for a given amount of silicon resources.
A role for assembly
Some network processor vendors consider C++ language support unimportant given the small quantity of code that runs on the PE and the architectural sacrifices that would have to be made to conform to a C++ programmer's model. Most network processors run only a few kilobytes of code, a quantity that can be written fairly easily in assembly language.
Other vendors, however, have a C++-friendly architecture and offer a C++ compiler. These vendors take the view that ease of programming is important to achieving quick time to market (and time to market is, after all, one of the reasons why network processors are in demand in the first place).
There is another option apart from C++ and assembly. Classification is an important part of network processor software and it is a part that is not well suited to writing in C++. C++, C, Java, Pascal, and Fortran are all examples of what are termed imperative programming languages. In an imperative programming language the program tells the processor what to do and in what order to do it. Statements are executed one after the other, except where otherwise specified. The order of execution of statements is unambiguous and defined by the programmer.
Another style of programming is functional programming. In functional programming, the program specifies what happens under specified conditions. The order in which statements are executed is determined by the order of occurrence of these conditions. Functional programming languages have been devised for general-purpose software development and have been studied for many years in academia. Because a functional program does not specify order of execution, it can be simpler and less error-prone than an imperative program.
A widely used functional programming language that may not be recognized as such is the language accepted by lex, the lexical analyzer generator that is part of most Unix distributions. Lex and its companion yacc form the basis of the front end of many compilers and interpreters, including C++ compilers. Lex supports a functional language that specifies patterns to be detected in the input file and a fragment of C to be executed when the pattern is encountered. Lex generates a C file that incorporates logic for pattern detection with the C fragments inlined. This file is compiled to form the lexical analyzer (scanner) for the compiler or interpreter.
Figure 8 compares the use of lex with the use of a pattern description language (PDL). The upper half shows a .l file being processed by lex and a C compiler to a .o file. The lower half shows a PDL file being used by a PDL compiler to produce a configuration file for a pattern-matching engine. The pattern-matching engine takes a stream of PDUs and emits a stream of actions, just as the scanner takes a stream of source code and emits a stream of tokens.
Figure 8: Comparison of lex and PDL
Lex has been an accepted tool for this role for many years now because it is easy to use and it results in software that is typically more efficient, as well as easier to write, than equivalent functionality written in an imperative language.
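The pattern-and-action style that lex supports can be imitated in a few lines of Python, here applied to PDU bytes rather than source text. This is only a sketch in the spirit of lex, not its matching algorithm (lex prefers the longest match, while this sketch takes the first rule that matches), and the rule table and action names are invented for illustration.

```python
import re

# Each rule pairs a byte pattern with an action name. The patterns
# assume an IPv4 header without options: byte 0 is 0x45 and the
# protocol field is byte 9. DOTALL lets "." match any byte value.
RULES = [
    (re.compile(rb"\x45.{8}\x06", re.DOTALL), "tcp"),
    (re.compile(rb"\x45.{8}\x11", re.DOTALL), "udp"),
    (re.compile(rb"\x45", re.DOTALL), "other-ipv4"),
]


def match_rules(pdu: bytes) -> str:
    """Return the action of the first rule whose pattern matches the
    start of the PDU, or "unknown" if none does."""
    for pattern, action in RULES:
        if pattern.match(pdu):
            return action
    return "unknown"


# Sample IPv4 header whose protocol field is 0x11 (UDP).
SAMPLE_HEADER = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
print(match_rules(SAMPLE_HEADER))  # prints "udp"
```

The rule table says what to detect and what to do about it, but nothing about how the matching is carried out, which is what leaves a tool free to compile it into an efficient form.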
The significance of this for network processors is in classification. Classification is also a pattern-matching task. The precedent of lex shows that