Internet Appliance Design
The first generation of network processors is finally here. But what are they good for and how do they work?
Major semiconductor manufacturers are starting to sell a new type of integrated circuit, the network processor. Network processors are programmable chips like general purpose microprocessors, but are optimized for the packet processing required in network devices.
Network devices are a growing class of embedded systems. They include traditional Internet equipment such as routers, switches, and firewalls; newer devices such as Voice over IP (VoIP) bridges, virtual private network (VPN) gateways, and quality of service (QoS) enforcers; and web-specific devices such as caching engines, load balancers, and SSL accelerators.
In this article, I will describe the processing requirements of network devices, how traditional designs meet those requirements, how network processors aim to meet them, and the architecture of a few network processors in detail.
Network processing requirements, part 1
Not all network devices have the same processing requirements, but they have a lot in common. As an example, I will roughly describe the packet processing duties of a router and a web switch. These core, time-critical duties are also called data plane tasks.
Routers are the workhorses of the Internet. A router accepts packets from one of several network interfaces, and either drops them or sends them out through one or more of its other interfaces. Packets may traverse a dozen or more routers as they make their way across the Internet. Here is a simplified version of the IP routing algorithm:
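The per-packet work a router does can be sketched in C. This is a minimal illustration under stated assumptions, not a production forwarding routine. It handles IPv4 only, uses a hypothetical trimmed-down header struct, and searches a small routing table linearly for the longest matching prefix. Checksum updates, ICMP errors, and IP options are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down IPv4 header: only the fields this sketch uses. */
struct ipv4_hdr {
    uint8_t  version;  /* must be 4 */
    uint8_t  ttl;      /* time to live, decremented at each hop */
    uint32_t dst;      /* destination address, host byte order */
};

/* One routing table entry: destination prefix, prefix length, output port. */
struct route {
    uint32_t prefix;
    uint8_t  len;      /* 0..32 bits */
    int      port;
};

#define DROP (-1)

/* Netmask for a prefix length; shifting a 32-bit value by 32 is undefined in C. */
static uint32_t prefix_mask(uint8_t len)
{
    assert(len <= 32);
    return len == 0 ? 0 : 0xFFFFFFFFu << (32 - len);
}

/* Returns the output port for the packet, or DROP. Decrements the TTL of forwarded packets. */
int route_packet(struct ipv4_hdr *h, const struct route *table, size_t n)
{
    /* 1. Sanity-check the header; a TTL of 0 or 1 would expire at this hop. */
    if (h->version != 4 || h->ttl <= 1)
        return DROP;   /* a real router would also send an ICMP error */

    /* 2. Longest-prefix match by linear scan of the routing table. */
    int best_port = DROP;
    int best_len = -1;
    for (size_t i = 0; i < n; i++) {
        uint32_t m = prefix_mask(table[i].len);
        if ((h->dst & m) == (table[i].prefix & m) && table[i].len > best_len) {
            best_len = table[i].len;
            best_port = table[i].port;
        }
    }

    /* 3. Update the header and hand the packet to the chosen interface. */
    if (best_port != DROP)
        h->ttl--;      /* a real router must also update the header checksum */
    return best_port;
}
```

Consider a table with two routes: 10.0.0.0/8 to port 1 and 10.1.0.0/16 to port 2. A packet for 10.1.2.3 leaves on port 2, and one for 10.2.3.4 leaves on port 1.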
Web switches, by contrast, are a new type of network device. They address the problem of making a popular Web site more responsive by spreading its load across more than one web server. A web switch can direct incoming HTTP requests to different servers based on a variety of networking parameters, including the URL itself. For instance, all secure HTTP requests could be forwarded to a special web server with cryptographic hardware to accelerate them. Here is a simplified web switch algorithm:
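The following rough C sketch shows how such a switch might make its dispatch decision. Everything in it is an assumption for illustration: the three hypothetical server pools, the rule that sends `/cgi-bin/` URLs to a dynamic-content pool, and the fixed-size connection table. The sketch also assumes that the switch has already completed the TCP handshake with the client and buffered the first request line, because the URL does not arrive until after the connection is set up. Secure HTTP on port 443 is dispatched by port, since its URL is encrypted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical server pools behind the switch. */
enum pool { POOL_STATIC, POOL_DYNAMIC, POOL_SECURE };

/* Per-connection state, which a router does not need to keep. */
struct conn {
    uint32_t client_ip;
    uint16_t client_port;
    uint16_t service_port;  /* 80 for HTTP, 443 for secure HTTP */
    int      in_use;
    int      pool;          /* -1 until the first request picks a pool */
};

#define MAX_CONNS 64
static struct conn conns[MAX_CONNS];

/* Finds the connection, or claims a free slot for a new one (NULL if full). */
static struct conn *conn_lookup(uint32_t ip, uint16_t cport, uint16_t sport)
{
    struct conn *free_slot = NULL;
    for (size_t i = 0; i < MAX_CONNS; i++) {
        struct conn *c = &conns[i];
        if (c->in_use && c->client_ip == ip &&
            c->client_port == cport && c->service_port == sport)
            return c;
        if (!c->in_use && free_slot == NULL)
            free_slot = c;
    }
    if (free_slot != NULL)
        *free_slot = (struct conn){ ip, cport, sport, 1, -1 };
    return free_slot;
}

/* Dispatch rule: encrypted traffic by port (its URL is not visible), the rest by URL. */
static int choose_pool(uint16_t sport, const char *request_line)
{
    assert(request_line != NULL);
    if (sport == 443)
        return POOL_SECURE;
    if (strncmp(request_line, "GET /cgi-bin/", 13) == 0)
        return POOL_DYNAMIC;
    return POOL_STATIC;
}

/* The first request on a connection chooses the pool; later data follows it.
   Returns -1 if the connection table is full. Connection teardown is omitted. */
int web_switch(uint32_t ip, uint16_t cport, uint16_t sport, const char *request_line)
{
    struct conn *c = conn_lookup(ip, cport, sport);
    if (c == NULL)
        return -1;
    if (c->pool < 0)
        c->pool = choose_pool(sport, request_line);
    return c->pool;
}
```

The connection table is the important part. Once a connection is bound to a pool, every later packet on that connection must go to the same place.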
For a given bandwidth, a web switch needs much more processing, and much more state, than a router. The difference arises because a router processes individual packets, while a web switch processes connections. It must track each TCP connection to find the URL, and it must then keep sending that connection's packets to the same server.
Network processing requirements, part 2
A variety of less time-critical tasks fall outside the core processing or forwarding requirements of a network device. These are called control plane tasks. For a router, these tasks include routing protocols like OSPF and BGP, and management interfaces like serial ports, telnet, and SNMP. For a web switch, these tasks include receiving updates about the status of web servers and providing a web interface for configuration and management. For both devices, error handling and logging are important control plane tasks.
Another way to distinguish data plane tasks from control plane tasks is to look at each packet's path. Packets handled by data plane tasks usually travel through the device, while packets handled by control plane tasks usually originate or terminate at the device.
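That rule of thumb can be written as a small C classifier. This sketch assumes the device knows its own interface addresses, held in host byte order. It treats the link-local multicast block 224.0.0.0/24, where OSPF sends its packets, as addressed to the device. Broadcast and other special cases are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum plane { DATA_PLANE, CONTROL_PLANE };

/* Decides which side of the device handles a packet, by destination address.
   own_addrs lists the device's own interface addresses (host byte order). */
enum plane classify(uint32_t dst, const uint32_t *own_addrs, size_t n)
{
    assert(own_addrs != NULL || n == 0);

    /* 224.0.0.0/24 is link-local control multicast (OSPF uses 224.0.0.5 and
       224.0.0.6); routers never forward it, so it terminates here. */
    if ((dst & 0xFFFFFF00u) == 0xE0000000u)
        return CONTROL_PLANE;

    /* Traffic addressed to the device itself: routing, management, etc. */
    for (size_t i = 0; i < n; i++)
        if (dst == own_addrs[i])
            return CONTROL_PLANE;

    /* Everything else passes through the device: data plane traffic. */
    return DATA_PLANE;
}
```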
Data plane vs. control plane tasks
Using a router as an example, the difference between data plane and control plane tasks can be considered from two vantage points: code size and processing requirements. The data plane tasks of a router were described briefly in an earlier section, and a detailed description would not be much longer. Clearly, the data plane tasks could be handled without a lot of code.
The control plane tasks were also described, but the description was not nearly as precise. Even in a traditional network device like a router, control task implementations vary. All routers will have code to handle routing protocols like OSPF and BGP, and they will almost certainly have a serial port for configuration. But they may be managed via a web browser, Java application, SNMP, or all three. This can add up to a lot of code. If you're still not convinced, look at the size of Cisco's books on how to configure its routers.
Now consider the packets entering the router. Nearly all of them are addressed to somewhere else and need to be examined and forwarded there very quickly. For example, to run at wire speed on a 155Mbps OC-3 link, a router must forward a 64-byte packet in about three microseconds. Not much may need to be done to each packet, but whatever is done must be done quickly.
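The three-microsecond figure is the packet size in bits divided by the link rate. Here is a one-function C sketch that assumes the 155.52Mbps OC-3 line rate and ignores SONET and link-layer framing overhead.

```c
#include <assert.h>

/* Time available per packet, in nanoseconds, to keep up with a link running
   at line_bps when every packet is pkt_bytes long (framing overhead ignored). */
double per_packet_budget_ns(double line_bps, unsigned pkt_bytes)
{
    assert(line_bps > 0.0);
    return (pkt_bytes * 8.0) / line_bps * 1e9;
}
```

`per_packet_budget_ns(155.52e6, 64)` comes to about 3,292ns. At 1Gbps, the same packet allows only 512ns.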
This requires tight code and a lot of processing power. By contrast, the occasional OSPF packet that causes the routing tables to be updated, or an HTTP request to make a configuration change, might require a fair bit of code to be handled properly but has little impact on overall processing requirements.
Fast path, slow path
Dividing the processing into a fast path for time-critical data plane tasks and a slow path for control plane tasks provides substantial implementation flexibility. The slow path will almost certainly be implemented with a CPU, but the fast path can be implemented with an FPGA, ASIC, co-processor, or perhaps just another CPU. This architecture is particularly strong because it lets you implement simple time-critical algorithms in hardware and complex algorithms in software.
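Here is a minimal C sketch of that split. It assumes a hypothetical packet descriptor with two flags. The fast path forwards ordinary transit packets. It punts anything addressed to the device, or carrying IP options, onto a ring that the slow path drains. As written, the ring is single-threaded. If the fast path and slow path run on different processors, the head and tail updates would need atomic operations or memory barriers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical packet descriptor; a real one would point at a packet buffer. */
struct pkt {
    int  id;
    bool has_ip_options;  /* rare case this fast path declines to handle */
    bool to_device;       /* addressed to the device itself, e.g. OSPF or SNMP */
};

enum verdict { FORWARDED, PUNTED, DROPPED };

/* Fixed-size ring that carries packets from the fast path to the slow path.
   head and tail only ever increase; PUNT_SLOTS is a power of two so the
   modulo indexing stays correct when they wrap. Single-threaded as written. */
#define PUNT_SLOTS 8
struct punt_ring {
    struct pkt slots[PUNT_SLOTS];
    size_t head;  /* next slot the slow path will read */
    size_t tail;  /* next slot the fast path will write */
};

static bool punt(struct punt_ring *r, const struct pkt *p)
{
    assert(r != NULL && p != NULL);
    if (r->tail - r->head == PUNT_SLOTS)
        return false;  /* full: drop instead of stalling the fast path */
    r->slots[r->tail % PUNT_SLOTS] = *p;
    r->tail++;
    return true;
}

/* Fast path: handle the common case in a few steps, hand off everything else. */
enum verdict fast_path(struct punt_ring *r, const struct pkt *p)
{
    if (p->to_device || p->has_ip_options)
        return punt(r, p) ? PUNTED : DROPPED;
    /* route lookup and transmit would go here */
    return FORWARDED;
}

/* Slow path: the general-purpose CPU drains the ring when it has time. */
bool slow_path_next(struct punt_ring *r, struct pkt *out)
{
    if (r->head == r->tail)
        return false;
    *out = r->slots[r->head % PUNT_SLOTS];
    r->head++;
    return true;
}
```

In this sketch, a full ring drops slow path packets rather than stalling the fast path. Other designs might make a different choice.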
Now that we have a handle on network processing requirements, let's start looking at network processors.
ASICs, the original network processors
Over the last 10 years, demand for higher-bandwidth networks has driven the evolution of network equipment design. The first designs used CPUs exclusively. However, general purpose CPUs are not ideal for network processing. Their programmability is important, but their floating-point units go unused, they have too much data cache, and they have too little memory bandwidth. Further, demand for bandwidth is increasing faster than CPU speeds, and network equipment designers cannot afford to wait for the next generation of CPUs to increase the speed of their devices. Even fast path/slow path designs still face a problem: how do you make the fast path fast enough?
The conventional answer is to design an ASIC. Well-designed ASICs can be much faster than CPUs, but they are difficult and expensive to develop; the cost of the tools alone makes them unaffordable for many companies. Moreover, ASICs usually have limited programmability and must be redesigned as protocols and interfaces change.
Network processor companies hope to bridge the divide between ASICs and CPUs by providing a device that is as programmable as a CPU but as fast as an ASIC.
Network processor architectures
Figure 2 is a block diagram of a generic network processor. It does not represent a specific network processor, but includes traits common to most. These traits are:
Programming a network processor
In many ways, network processor architectures look like the parallel processing architectures of a decade ago. Programmers have tried to harness the power of parallel processing architectures for a long time, but with little luck. Vector-processing supercomputers are used for special purpose applications like weather simulation, but programmers have not been successful in using them for general purpose applications.
Is there any reason to think network processors will fare better? Yes. Network processors are not trying to speed up general purpose processing, and network processing has characteristics very different from general purpose processing: it involves less code but more data, and there are fewer interdependencies within the data. Consider a router again. If a router receives n packets, for a small number n, it can process those packets independently. Another way of saying this is that processing these packets doesn't change the router's state. The exceptions are configuration packets and routing protocol packets, and even these interdependencies are rather loose. If a router receives a packet indicating that it should update its routing tables, there is no reason it can't finish processing a few more packets before it does the update.
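A double-buffered table is one way to defer routing updates. In this C illustration, all names are hypothetical and the table is a toy indexed by the first octet of the destination address. The control plane writes changes into an inactive copy. The data plane switches to that copy only between batches, so every packet in a batch sees one consistent table. A multiprocessor version would need an atomic switch and some way to know that all cores have finished the old batch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy routing table: output port indexed by the first octet of the
   destination address; 0 means "no route". A hypothetical simplification. */
struct table { int port[256]; };

static struct table tables[2];   /* the data plane reads tables[active] */
static int active = 0;
static int update_pending = 0;

/* Control plane: stage a change in the inactive copy. */
void stage_route(uint8_t first_octet, int port)
{
    assert(port >= 0);
    int shadow = 1 - active;
    if (!update_pending) {
        tables[shadow] = tables[active];   /* start from the current routes */
        update_pending = 1;
    }
    tables[shadow].port[first_octet] = port;
}

/* Data plane: packets in a batch are independent, so they all use the same
   table; a pending update is applied only between batches. */
void process_batch(const uint32_t *dst, int *port_out, size_t count)
{
    const struct table *t = &tables[active];
    for (size_t i = 0; i < count; i++)
        port_out[i] = t->port[dst[i] >> 24];
    if (update_pending) {
        active = 1 - active;
        update_pending = 0;
    }
}
```

Suppose a route is staged while a batch is being processed. That batch still uses the old table, and the new route takes effect from the next batch onward.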
If you are evaluating a network processor, you should carefully consider what kind of interpacket dependencies you have, and how each network processor handles them. Network processors designed for very high speed traffic often have no provision for interpacket dependencies and thus would not be appropriate for network devices doing application-level processing.
Speeds and feeds
From reading the marketing literature of network processor vendors, you might believe that all network processors are designed for gigabit speeds, and the faster the better. However, depending on your application, a slower network processor might be a better choice. Network processors designed for the fastest speeds are much more I/O driven and have weaker support for pattern matching, sorting out interpacket dependencies, and other features desirable for application-level processing.
Multiprocessing and multithreading
For multi-core network processors and multi-threaded cores, an important question is: who handles scheduling? Consider Figure 3, where six packets are destined for our four-core network processor.
Which packet will be processed by which core? In some network processors, the hardware decides; in others, software does. For some applications and algorithms, the ability to control which packets go to which cores may be an important requirement; for others, the speed of hardware scheduling may be essential.
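Flow affinity is one common software answer to this question. Each packet's source and destination addresses, ports, and protocol are hashed, and the hash picks a core. Every packet of a flow then lands on the same core and stays in order. The C sketch below contrasts this with round-robin assignment. The 5-tuple struct is hypothetical, and FNV-1a is just one reasonable choice of hash.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flow key: the usual IPv4 5-tuple. */
struct flow {
    uint32_t src, dst;
    uint16_t sport, dport;
    uint8_t  proto;
};

/* FNV-1a over the tuple fields, byte by byte (avoids hashing struct padding). */
static uint32_t flow_hash(const struct flow *f)
{
    uint32_t h = 2166136261u;
    uint32_t words[4] = { f->src, f->dst,
                          ((uint32_t)f->sport << 16) | f->dport, f->proto };
    for (int i = 0; i < 4; i++)
        for (int b = 0; b < 4; b++) {
            h ^= (words[i] >> (8 * b)) & 0xFFu;
            h *= 16777619u;
        }
    return h;
}

/* Flow affinity: every packet of a flow goes to the same core, so per-flow
   state and packet order stay on one core. */
unsigned core_for_packet(const struct flow *f, unsigned ncores)
{
    assert(ncores > 0);
    return flow_hash(f) % ncores;
}

/* Round robin: spreads load evenly but scatters a flow across cores. */
unsigned core_round_robin(unsigned *next, unsigned ncores)
{
    assert(ncores > 0);
    unsigned c = *next % ncores;
    *next = c + 1;
    return c;
}
```

Figure 3's situation shows the trade-off. Round robin spreads six packets evenly across four cores. Flow affinity may put several packets of one busy flow on the same core and leave other cores idle.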
The hot news in the network processor market has been acquisitions and standards. Between September 1999 and June 2000, major semiconductor manufacturers went on a buying spree, each acquiring a network processor or acceleration company. During that time, Intel acquired NetBoost, Conexant acquired Maker, Lucent acquired Agere, Motorola acquired C-Port, and Vitesse acquired Sitera.
On the standards front, companies in the switch fabric and network processor business have formed two standards bodies. The Common Switch Interface Consortium (CSIX) was formed to standardize a hardware interface between switch fabric chips and processing chips.
The Common Programming Interface Forum (CPIX) was formed to standardize software interfaces for network processors. These two groups include in their membership almost every company that has anything to do with network processing, except Intel.
CPIX's aims are particularly interesting: to develop software standards that make network processor software portable across different network processors. While this would benefit many network equipment manufacturers, the vastly different network processor architectures make that prospect unlikely, at least without large performance sacrifices. Until CPIX releases its standard, it looks more like an anti-Intel coalition than a standards body.
Network processor descriptions
The C-5 DCP has enough processing power to implement both data and control plane operations itself, or it can communicate with a host CPU across a PCI bus interface.
Programming the C-5 DCP is not a small task. With up to 16 different C/C++ programs for 16 processors, microcode for the serial data processor(s), and system-level code to tie everything together, harnessing the C-5's power takes a lot of effort. C-Port's core development tools are based on the popular GNU gcc compiler and gdb debugger, modified by C-Port to work with its RISC cores. To program the RISC cores, you write from one to 16 different programs in C or C++. You can then debug all of your programs at once using the included C-5 DCP simulator, or load your programs onto the C-5 DCP itself and use gdb to debug them one CPU at a time. C-Port rounds out its development toolset with a traffic generator and performance analyzer.
C-Port provides a library of routines, called C-Ware, to maintain software compatibility with future generations of DCPs. These routines cover features of both the RISC cores and the co-processors, including tables, queues, buffers, protocols, switch fabrics, kernel services, and diagnostics. The C-Ware reference library includes C-5 implementations of a Gigabit Ethernet switch, a packet over SONET (POS) switch, and an ATM switch.
Intel IXP1200
Intel has become a leader in marketing network processors as part of its Internet Exchange Architecture. Most network processor companies are currently extremely secretive about their products; Intel is the exception. Of the four network processors described in this article, Intel's IXP1200 is the only one whose datasheet you can download directly from the Web.
The IXP1200, shown in Figure 5, consists of a StrongARM processor, six RISC micro-engines, and interfaces to SRAM/SDRAM memory, the PCI bus, and Intel's proprietary IX Bus. The IXP1200 has been designed to do fast path and slow path processing in one chip. The StrongARM portion of the processor can be programmed for the slow path with conventional C/C++ tools. The six micro-engines are designed for fast path processing. Each micro-engine has four hardware contexts and can context switch in a single instruction. The micro-engines are limited to 1K instructions of program space, which is actually quite a bit, since they are programmed in microcode.
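A simple utilization model shows why four hardware contexts help. This C sketch is an assumption-laden model, not IXP1200 timing data. Each packet is assumed to need `compute` cycles of work followed by one memory reference that takes `latency` cycles. A context switch is treated as free, in the spirit of the single-instruction switch described above.

```c
#include <assert.h>

/* Simplified latency-hiding model (an assumption, not measured data):
   while one context waits `latency` cycles for memory, another context can
   use the engine. Returns the fraction of cycles spent on useful work. */
double engine_utilization(unsigned contexts, unsigned compute, unsigned latency)
{
    assert(contexts > 0 && compute > 0);
    double busy  = (double)contexts * compute;  /* work available per round */
    double round = (double)compute + latency;   /* one context's compute + wait */
    return busy >= round ? 1.0 : busy / round;
}
```

Take 20 compute cycles and a 60-cycle memory latency. In this model, one context keeps the engine busy 25% of the time, two contexts 50%, and four contexts keep it fully busy.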
Intel provides assembly tools for the microcode as well as a simulator for debugging the non-StrongARM parts of the IXP1200. Intel ships the IXP1200 development environment with example code for Layer 2 and Layer 3 bridging and routing.
The idea behind the Fast Pattern Processor (FPP) is that a large class of network processing functions requires some sort of pattern matching, including parsing packets and searching through routing tables. The Routing Switch Processor (RSP) handles all actions for a particular packet, including packet modifications like routing and traffic management functions like queueing. The Agere System Interface (ASI) sends and receives slow path packets to and from a general purpose CPU.
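Routing-table search is a pattern-matching problem: the task is to find the longest stored prefix that matches the destination address. A binary trie is one classic software technique for it. The C sketch below is a generic illustration and makes no claim about how the FPP performs its searches.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Binary trie keyed on address bits, most significant bit first. */
struct node {
    struct node *child[2];
    int port;              /* -1 if no route ends at this node */
};

static struct node *node_new(void)
{
    struct node *n = calloc(1, sizeof *n);
    if (n == NULL)
        abort();           /* out of memory */
    n->port = -1;
    return n;
}

/* Adds prefix/len -> port; len 0 stores a default route at the root. */
void trie_insert(struct node *root, uint32_t prefix, unsigned len, int port)
{
    assert(len <= 32);
    struct node *n = root;
    for (unsigned i = 0; i < len; i++) {
        unsigned bit = (prefix >> (31 - i)) & 1u;
        if (n->child[bit] == NULL)
            n->child[bit] = node_new();
        n = n->child[bit];
    }
    n->port = port;
}

/* Walks down the trie, remembering the last node that carried a route:
   that node holds the longest matching prefix. Returns -1 if none matches. */
int trie_lookup(const struct node *root, uint32_t addr)
{
    int best = root->port;
    const struct node *n = root;
    for (unsigned i = 0; i < 32 && n != NULL; i++) {
        n = n->child[(addr >> (31 - i)) & 1u];
        if (n != NULL && n->port >= 0)
            best = n->port;
    }
    return best;
}
```

A linear scan of the table grows with the number of routes. A trie lookup, by contrast, takes at most 32 steps no matter how many routes are stored.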
Development kits are available that implement the Lucent network processor in five Xilinx Virtex FPGAs. Clocked at 33MHz, they support full-duplex OC-12 interfaces. The tools are not the standard C/C++ development environment common to other network processors. The development kit contains:
The Application Code Library includes IP switching and routing over ATM AAL5, over Ethernet, and over Frame Relay.
The Prism's RISC cores use a modified version of the MIPS instruction set and have four hardware contexts. Packet scheduling is handled in hardware, with the order management co-processor responsible for resolving packet interdependencies. Sitera offers three variations of the Prism IQ2000, each with the same core but different network interfaces.
Sitera's Developer's Workbench is based on the GNU C/C++ compiler, but also includes a simulator and traffic generator. Its reference application code supports Layer 2 and Layer 3 bridging and routing.
Mark Kohler writes networking software in southern California. His interests include computer networking and software engineering. He can be reached at .