Network “plumbing” like routers and switches place unusual demands on processors; hence the rise of the network processor. Here's a snapshot of the state of the NPU art, outlining what makes a network processor and what features are likely to become most popular.
Network processors have gone from being overhyped harbingers of a networked world, to an ignored and dying technology, to a solid and thriving business. As the communications recovery continues apace, network processors have found a place as an important tool for any networking-equipment designer. With shipments and design wins rising, it's a good time to take a look at this new type of processor.
What is a network processor?
Simply put, a network processor is a programmable microprocessor optimized for processing network data packets. Specifically, it's designed to handle the tasks commonly associated with the upper layers of the seven-layer OSI networking model shown in Table 1: header parsing, pattern matching, bit-field manipulation, table look-ups, packet modification, and data movement. Many independent packets will be available, providing opportunities for parallel processing. Data rates for network processors range from 1.2Gbps (dual OC-12 data rate) to 40Gbps.
Table 1: OSI and TCP/IP layers
|OSI Layer|Layer Name|Common Protocols|
|---|---|---|
|Layer 7|Application|HTTP, SMTP, FTP|
|Layer 6|Presentation|MIME, SSL|
|Layer 5|Session|NetBIOS, RPC|
|Layer 4|Transport|TCP, UDP|
|Layer 3|Network|IP|
|Layer 2|Data Link|Ethernet MAC|
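To make the header-parsing and bit-field tasks concrete, here's a minimal Python sketch (not NPU code; the field offsets follow the standard IPv4 header layout, and the function name is illustrative):

```python
import struct

def parse_ipv4_header(pkt: bytes):
    """Extract a few IPv4 header fields (RFC 791 layout; options not parsed)."""
    ver_ihl, tos, total_len = struct.unpack_from("!BBH", pkt, 0)
    ttl, proto, checksum = struct.unpack_from("!BBH", pkt, 8)
    src, dst = struct.unpack_from("!II", pkt, 12)
    return {
        "version": ver_ihl >> 4,    # high nibble of byte 0
        "ihl": ver_ihl & 0x0F,      # header length in 32-bit words
        "total_len": total_len,
        "ttl": ttl,
        "proto": proto,
        "src": src,
        "dst": dst,
    }

# A minimal 20-byte IPv4 header: version 4, IHL 5, TTL 64, protocol 6 (TCP)
hdr = bytes([0x45, 0x00, 0x00, 0x28,   # ver/ihl, TOS, total length 40
             0x00, 0x01, 0x00, 0x00,   # identification, flags/fragment
             0x40, 0x06, 0x00, 0x00,   # TTL 64, proto 6, checksum
             10, 0, 0, 1,              # source 10.0.0.1
             10, 0, 0, 2])             # destination 10.0.0.2
fields = parse_ipv4_header(hdr)
```

A packet engine does this same extraction in hardware-assisted fashion, without the general-purpose overhead of a full CPU.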
Software programmability is an important characteristic of network processors, because it provides flexibility across a range of applications. Even though all network processors are programmable, by definition, not all of them can be programmed by the user. Some vendors restrict access to the underlying instruction set and architecture of their network processing unit (NPU), preferring instead to do all the programming in-house.
Not a network processor
Many chips are communications processors but not network processors. Communications processors, such as Freescale's PowerQUICC chips, are closely related to network processors but serve applications with lower data rates. Data rates for communications processors range from a few megabits per second to 1Gbps (for instance a single gigabit Ethernet channel). Although this dividing line may seem arbitrary and will certainly change over time, there are some other important, if subtle, differences between these two types of processors.
Communications processors cost less. Despite their lower prices, they typically offer more integration than most network processors. For example, communications processors typically contain a RISC processor core that runs a standard MIPS, PowerPC, or ARM instruction set. By contrast, most NPUs don't include such a processor. In a communications processor, it's common for Layer 3 processing and above to be handled by this RISC processor, whereas NPUs commonly handle Layers 3 and above with proprietary packet engines. Many communications processors integrate Layer 1 and Layer 2 processing; most NPUs don't. These differences in price and performance between communications processors and NPUs mean systems designers typically specify them for widely different applications.
A third class of networking chip, custom ASICs, aren't programmable at all. Chip companies sometimes call these packet processors or forwarding engines. These hardwired ASICs compete with NPUs, mostly at the high end of the NPU performance range.
Finally, coprocessor chips, such as classification engines, search engines, and traffic managers, aren't really NPUs because they handle only a portion of the entire packet-processing task. In addition, these devices are typically not programmable, although they're often configurable to some degree. An NPU might contain coprocessors or even rely on external coprocessors, but these coprocessors are not NPUs themselves.
A single network data stream contains a large number of individual packets, each of which can be processed fairly independently. In fact, Internet protocol (IP) allows individual packets within a single data stream to be processed in any order; the receiver must be able to put the packets back in order again. This is quite different from, say, Java bytecode, which consists of a series of instructions that must be processed sequentially.
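A toy illustration of this independence: as long as packets carry sequence numbers, the receiver can restore order no matter how (or in what order) each packet was processed upstream. This Python sketch is purely illustrative:

```python
def reassemble(packets):
    """Restore packet order by sequence number. Because arrival order is
    irrelevant, each packet can be processed independently and in parallel
    somewhere upstream."""
    return [payload for _, payload in sorted(packets)]

# Packets arriving out of order are trivially re-sequenced by the receiver
arrived = [(3, b"C"), (1, b"A"), (2, b"B")]
in_order = reassemble(arrived)
```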
Because of this independence, packet processing is an ideal application for an array of processors. By dividing up the task, one chip can deliver high performance using several processing units of modest speed. These units needn't squeeze out the last bit of performance, using techniques such as superscalar issue or instruction reordering, which require a great many transistors and a corresponding increase in power consumption. Packet processors can thus be small and efficient.
Instead of combining standard CISC or RISC processors, however, NPU vendors have slimmed down their processors still further. Processing packets is a fairly simple task, consisting mainly of extracting data from a bit stream and doing some pattern matching or table lookups, so a packet processor doesn't need complex arithmetic, fancy addressing modes, or memory-translation units. Bulky circuits such as floating-point units (FPU) and memory-management units (MMU) are generally unnecessary. Instruction caches can be smaller, and data caches are generally eliminated completely, since most network data doesn't recur and is not reused.
We'll call these optimized network processors packet engines, although NPU vendors themselves use a variety of terms, such as microengines and channel processors. By eliminating general-purpose CPU features and focusing on just the basics, NPU designers can fit a single packet engine into only a few square millimeters of silicon. They can then liberally sprinkle these tiny engines across a standard silicon chip measuring just 100mm² or so. Some NPUs combine 64 or more packet engines on a single chip about the size of a Pentium 4.
To take further advantage of the large number of available packets, packet engines are typically multithreaded. In this approach, each engine has one or more packets “on hold” while it processes the current packet. If the current packet stalls for some reason, such as a lengthy memory access, the engine quickly switches to one of the packets on hold. This way, the engine doesn't waste time waiting on memory; instead, it can operate at near-peak efficiency.
A multithreaded processor will usually have extra copies of its programmers' register set, each holding the state of a different packet. Switching packets (or threads) means just pointing to another set of registers and is usually done in a single cycle. Some NPUs make the programmer (or compiler) insert thread-switch instructions; others switch automatically any time there's a memory access.
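The thread-switching behavior described above can be modeled in a few lines. This is a toy simulation, not vendor code; the names and the stall model (every compute step issues a fixed-latency memory access) are invented for illustration:

```python
# Toy model of a multithreaded packet engine: each hardware thread holds one
# packet's state in its own register set, and on a memory stall the engine
# switches to another ready thread instead of idling.

class HwThread:
    def __init__(self, pkt_id, work_units):
        self.pkt_id = pkt_id
        self.remaining = work_units   # compute steps left for this packet
        self.stalled_until = 0        # cycle when its memory access completes

def run_engine(threads, mem_latency=3):
    cycle, schedule = 0, []
    while any(t.remaining > 0 for t in threads):
        ready = [t for t in threads
                 if t.remaining > 0 and t.stalled_until <= cycle]
        if ready:                      # single-cycle "switch": pick a ready thread
            t = ready[0]
            t.remaining -= 1
            schedule.append((cycle, t.pkt_id))
            if t.remaining > 0:        # this step issued a memory access
                t.stalled_until = cycle + mem_latency
        cycle += 1
    return schedule

# Two interleaved packets hide part of each other's memory latency
schedule = run_engine([HwThread(0, 2), HwThread(1, 2)])
```

With one thread the engine would sit idle for every memory access; with two, the stall cycles of one packet are partly filled by work on the other.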
Although packet engines are often stripped-down RISC processors, they may also have some added features to improve packet performance. Bit-manipulation instructions are one common example. Depending on the particular network protocol, packet headers might have fields that aren't byte-aligned or that consist of just a single bit or a few bits. Standard RISC processors operate only on 32-bit aligned words, so these special instructions can make header analysis and manipulation much easier.
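For instance, extracting a field that isn't byte-aligned takes shifting and masking on a conventional processor, which is exactly the work these bit-manipulation instructions accelerate. Here's a generic bit-field extraction sketch in Python (the helper is hypothetical; the MPLS label-entry layout used as the example is the standard one: 20-bit label, 3-bit traffic class, 1-bit bottom-of-stack, 8-bit TTL):

```python
def get_bits(data: bytes, bit_offset: int, bit_len: int) -> int:
    """Extract an arbitrary, possibly unaligned bit field from a byte string,
    treating bit 0 as the most significant bit of byte 0 (network order)."""
    value = int.from_bytes(data, "big")
    shift = len(data) * 8 - bit_offset - bit_len
    return (value >> shift) & ((1 << bit_len) - 1)

# Build a 32-bit MPLS label entry: label 0x12345, TC 0b101, S=1, TTL 64
entry = (0x12345 << 12 | 0b101 << 9 | 1 << 8 | 64).to_bytes(4, "big")
label = get_bits(entry, 0, 20)   # 20-bit label
tc    = get_bits(entry, 20, 3)   # 3-bit traffic class
s     = get_bits(entry, 23, 1)   # bottom-of-stack bit
ttl   = get_bits(entry, 24, 8)   # 8-bit TTL
```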
Many NPUs include coprocessors, such as hash engines, search engines, classification engines, or policy engines. These coprocessors are usually not programmable so they don't execute any instructions. Although their functions might be configurable in some way, they're fixed-function additions to the chip.
Finally, some NPUs include a general-purpose processor in addition to the packet engines. We'll call this the control processor, as it typically manages the chip, provides an interface to the control plane, and performs some exception processing. An external host processor is almost always employed to execute the bulk of the control-plane code.
One challenge facing NPU designers is how to organize all this compute power. Although state-of-the-art silicon manufacturing can squeeze dozens of small packet engines on a single chip, it's difficult to connect more than 16 engines to a single memory using an on-chip bus or crossbar. Adding too many engines to a bus can cause contention, delays, and electrical problems, slowing down the entire chip.
One solution is to limit the number of packet engines to 16 while increasing the performance of each engine. Most early packet engines were simple scalar (one instruction at a time) RISC processors. Some new designs have shifted to VLIW (very long instruction word) packet engines to get more performance per packet engine. The superscalar techniques you see in PC and server processors like Opteron, Pentium 4, or SPARC are less efficient than VLIW and aren't needed unless software compatibility is important.
Another approach is to pipeline packet engines in such a way that each group of engines performs a specialized task. EZchip's NP-1 and Agere's APP750, to name two examples, have one group of engines that connect to lookup-table memory, while another group connects to the packet queues. The number of connections to any particular on-chip resource is thus reduced. One downside is that this limits potential applications to those that map well to the chosen pipeline design. For example, these pipelined chips are well designed for processing IP packets, but it's more difficult for them to perform higher-layer functions such as iSCSI and TCP termination.
Other techniques are possible. Bay Microsystems combines pipelining with VLIW packet engines. Xelerated Technologies pipelines VLIW engines in a dataflow fashion that increases efficiency. ClearSpeed uses a single instruction, multiple data (SIMD) technique to organize hundreds of stripped-down packet engines.
Although software compatibility is much less important than it is in the PC business, some NPU architectures now have a significant body of software, including the vendor's own library code, third-party offerings, and customer-created software. As a result, vendors with established NPU architectures are motivated to maintain software compatibility in future products. Vendors can retain compatibility by altering the clock speed and, in architectures with a single-image programming model described later, the number of packet engines. The desire for compatibility, however, prevents any radical architectural changes, such as moving from RISC to VLIW or from a parallel architecture to a pipelined one, that might push performance beyond 10Gbps. New NPU entrants, on the other hand, don't have these compatibility constraints, so they're more likely to pick a more innovative or aggressive architecture that improves efficiency and scalability.
Fixed or programmable
Another factor you have to consider when choosing an NPU is the use of fixed-function coprocessors to supplement the performance of packet engines. AMCC, for example, uses only six scalar packet engines in its simplex 10Gbps NPU, so there's room for the company to add more engines in future chips. AMCC can get away with so few programmable engines because its fixed-function policy engines, search engines, and traffic managers perform much of the packet-processing task.
Fixed-function logic is generally more efficient for any given task than programmable packet engines, so using coprocessors can increase performance. The obvious downside is reduced flexibility; applications have to fit the capabilities of the fixed-function blocks.
A more subtle issue is relative design difficulty. Fixed-function logic typically implements complex state machines; this complexity balloons as the system designers add more features and options. On the other hand, you can replicate a single packet engine many times and can program it for a variety of tasks.
You should use fixed-function logic judiciously. Consider a device that implements IPv4 and IPv6 routing as well as multiprotocol label switching. In such a case, a single programmable engine can handle all three protocols more efficiently than three separate state machines can. It makes sense, however, to hardwire a function, such as header parsing, that can be easily configured for all three protocols.
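The dispatch idea can be sketched in a few lines: one programmable engine selects a handler by protocol, where a hardwired design would need a separate state machine per protocol. The handler bodies are hypothetical placeholders; the EtherType values are the standard ones:

```python
# One programmable engine dispatching by protocol type.
def handle_ipv4(pkt): return "routed-ipv4"
def handle_ipv6(pkt): return "routed-ipv6"
def handle_mpls(pkt): return "label-switched"

HANDLERS = {
    0x0800: handle_ipv4,   # IPv4
    0x86DD: handle_ipv6,   # IPv6
    0x8847: handle_mpls,   # MPLS unicast
}

def process(ethertype, pkt):
    """Route the packet to the proper protocol handler."""
    return HANDLERS[ethertype](pkt)
```

Adding a fourth protocol here means adding one handler and one table entry, not respinning silicon.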
Finally, NPU vendors must consider that many networking customers have unique needs that can be handled only through programmability. A flexible, programmable device can more easily support whatever algorithms, protocols, or services a customer might want to implement. A highly programmable NPU will serve the broadest possible market. On the other hand, a well-designed NPU with appropriate use of fixed-function blocks is likely to be more efficient for mainstream applications.
Because of these different design choices, network processors offer various levels of programmability, as Figure 1 shows. At one end of the spectrum is the entirely fixed-function logic of a “net ASIC.” As programmable processor elements are added, the design moves to the right. A fully programmable NPU with few, if any, fixed-function blocks sits at the far right.
Figure 1: Spectrum of programmability among NPUs
Intel's IXP architecture is a good example of a fully programmable design, using packet-engine software to do almost all the work. EZchip has a highly programmable NPU chip, but its traffic-manager chip is not programmable, so the total product is less programmable than Intel's. AMCC's nP chips also use a fixed-function traffic manager, and even the NPU combines limited packet-engine horsepower with plenty of coprocessors. Net ASICs, such as Marvell's Prestera-MX, have no packet engines at all; the entire data path is hardwired.
Networking customers that have unusual or proprietary requirements might need programmability. These customers will look for products on the right side of this spectrum. You might argue that these chips are just as well suited to other customers, who could use as much or as little of the programmability as they need, but this approach has problems.
One downside to programmability is the burden it places on the customer. Intel's IXP family of chips requires eight to ten times more software than AMCC's does to perform the same tasks. Tasks that are done in AMCC's hardware must be written in software for the IXP. Even if the NPU vendor supplies plenty of reference code, the IXP customer will still have more code to assemble, test, and debug than the AMCC customer. In the extreme case, a hardwired ASIC completely eliminates the need for NPU software, although some host-processor software is still required.
Owing to the efficiency of fixed-function logic, hardwired ASICs can also provide cost, power, and integration advantages. For example, the Prestera-MX delivers full-duplex 10Gbps throughput, including media-access controllers (MACs), search engines, and egress traffic management, in a single chip costing about $600 and consuming only 7W. With the exception of Xelerated's unique PISC (Packet Instruction Set Computer) architecture, all the available programmable solutions require at least twice as many chips, with more than twice the cost and twice the power.
Most customers are uncomfortable with a hardwired ASIC's lack of flexibility, however. Programmability enables customers to easily differentiate their equipment from what another vendor might offer using the same NPU; with a hardwired ASIC, opportunities for differentiation are more limited. Programmable NPUs also provide a kind of insurance to customers, who can't predict what new features might be needed in two years' time. Without programmability, equipment in the field (or in long development cycles) may become obsolete.
Although some OEMs will choose a hardwired ASIC, most are likely to seek a balance between programmability and performance that meets the needs of their designs. NPU vendors offer a range of alternatives; most customers will find at least one that strikes the right balance.
A typical NPU has four basic external interfaces, as Figure 2 shows. The first is the line interface, which often connects to external MAC or framer chips. Some NPUs include on-chip MACs or framers (or both), in which case the line interface may connect directly to external PHY (physical-layer) devices. The bandwidth of this interface is critical because it limits the maximum amount of data the NPU can accept. In addition, the protocols and flexibility of this interface determine the types of chips the interface can connect to and the protocols (for example, Gigabit Ethernet, OC-48) it can support. Note that most NPUs will support additional protocols, using external glue logic (for example, an ASIC or FPGA), but this increases system cost and, more importantly, design time.
Figure 2: Block diagram of typical NPU
The second interface is the fabric interface, which connects the NPU to the external switch fabric or, in some cases, directly to another NPU. This interface is important for building a line card but might not be used in other designs. The fabric interface should have at least as much bandwidth as the line interface, because nearly all packets in a line card must move across both interfaces. In fact, it should have enough extra bandwidth, typically at least 25%, to support fabric headers, in-band communication, and other fabric overhead.
The memory interface often consists of several separate physical connections. One or more typically connect to packet memory, where packet headers and payloads are stored during processing. Packet memory also holds packets queued for speed-matching reasons or, in quality of service applications, to support multiple priority levels. As a rule of thumb, the packet memory should be 256MB for OC-48, 1GB for OC-192 (or 10Gb Ethernet), and 4GB for OC-768 data rates. For these large arrays, NPUs typically use low-cost DRAM rather than fast but expensive SRAM. Sustained bandwidth to this memory must be at least double the line bandwidth, because each packet must be written to the packet memory and later read back.
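These two rules of thumb (roughly 25% fabric-interface headroom, and packet-memory bandwidth of at least twice the line rate) are simple to capture. The helper names here are illustrative:

```python
def fabric_bandwidth_gbps(line_gbps, overhead=0.25):
    """Fabric interface should exceed the line rate by ~25% to cover fabric
    headers, in-band communication, and other fabric overhead."""
    return line_gbps * (1 + overhead)

def packet_memory_bandwidth_gbps(line_gbps):
    """Every packet is written to packet memory and later read back, so
    sustained memory bandwidth must be at least double the line rate."""
    return 2 * line_gbps

# Sizing for an OC-192 (~10Gbps) line card:
fabric = fabric_bandwidth_gbps(10)         # 12.5 Gbps
memory = packet_memory_bandwidth_gbps(10)  # 20 Gbps
```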
The forwarding table is usually stored in a separate table memory, typically implemented with high-speed SRAM. The routing table can range in size from a few hundred kilobytes to several megabytes, but it's generally much smaller than the packet memory, to keep cost down. Layer 3 routers might use 32-bit table entries for IPv4, but more-complex applications can use entries of 300 bits or more, especially when supporting IPv6. Each packet requires one to three table accesses, so the table memory must supply enough bandwidth to support this rate at "wire speed" (the raw speed of the network cable or fiber).
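Forwarding-table lookup is a longest-prefix match. This linear-scan Python sketch shows only the semantics; real NPUs use tries, hash engines, or TCAMs to reach wire speed, and the class name here is invented:

```python
import ipaddress

class PrefixTable:
    """Toy longest-prefix-match forwarding table (linear scan)."""
    def __init__(self):
        self.entries = []  # list of (network, next_hop) pairs

    def add(self, prefix, next_hop):
        self.entries.append((ipaddress.ip_network(prefix), next_hop))

    def lookup(self, addr):
        """Return the next hop of the most specific matching prefix."""
        a = ipaddress.ip_address(addr)
        best = None
        for net, hop in self.entries:
            if a in net and (best is None or net.prefixlen > best[0].prefixlen):
                best = (net, hop)
        return best[1] if best else None

table = PrefixTable()
table.add("10.0.0.0/8", "core")
table.add("10.1.0.0/16", "edge")
```

Each `lookup` here corresponds to one of the one-to-three table accesses per packet that the table memory must sustain at wire speed.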
Most NPUs have one or more memories for storing instructions (called the control store) that the various packet engines and control processors will execute. In most cases, the packet engines' control store is inside the NPU chip, since fast-path code is typically only a few thousand instructions long. Many NPUs also have a ROM that contains the boot code.
Finally, most NPUs have a host interface that connects to an external host microprocessor. This interface is typically PCI, since many embedded processors connect to PCI either directly or through a standard chip set. To increase bandwidth, some NPUs offer 66MHz PCI in addition to the standard 33MHz version. PCI Express could eventually displace PCI, especially in high-end NPUs.
For NPUs, the bandwidth requirement of the host-processor interface varies widely by application. Some line cards won't need any local host processor, relying instead on a centralized control-plane processor residing elsewhere in the system. Products that perform TCP termination might use the host-processor interface as a means to transfer packet data. There's no simple rule of thumb for the ratio of host-processor bandwidth to packet-data bandwidth.
An NPU's programming model is also important. Most NPUs offer a run-to-completion model, in which a single packet engine completely processes a packet. In this model, packets can be assigned to packet engines by software or hardware.
Other NPUs use a pipelined model, in which a packet is passed from packet engine to packet engine as it's processed. This approach simplifies the hardware design, but it requires dividing the packet-processing software into roughly equal-length stages. This can be challenging, particularly when multiple network protocols are in use. A hardware pipeline is also more difficult to adapt to unusual or unexpected applications.
Most NPUs offer a single-image programming model that hides hardware complexity from the programmer. These NPUs require all packet engines and threads to run the same code, simplifying software development. This is most common in run-to-completion designs. Some pipelined NPUs have multiple packet engines per stage and use a single code image per stage.
Other NPUs use a symmetric multiprocessor (SMP) model, which allows each packet engine to execute its own software. The SMP model requires software to assign packets to the multiple packet engines and otherwise coordinate them. This model offers maximum flexibility, allowing the packet engines to be used either in parallel or in series (in other words, a "software pipeline"). Programmers can also assign some packet engines to one task while others handle a different task. But the SMP model results in more-complex software designs that typically require performance tuning.
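The two main models can be contrasted in miniature. The stage functions below are placeholders, not real data-plane code:

```python
# Toy contrast of the run-to-completion and pipelined programming models.

def parse(pkt):     pkt["parsed"] = True;          return pkt
def classify(pkt):  pkt["class"] = "best-effort";  return pkt
def forward(pkt):   pkt["out_port"] = 1;           return pkt

STAGES = [parse, classify, forward]

def run_to_completion(pkt):
    """One engine runs every stage on its own packet, start to finish."""
    for stage in STAGES:
        pkt = stage(pkt)
    return pkt

def pipelined(packets):
    """Each stage (a group of engines) handles all packets, then passes
    them on; the stages must be roughly equal in length to avoid bubbles."""
    for stage in STAGES:
        packets = [stage(p) for p in packets]
    return packets
```

Both produce the same result per packet; they differ in how the work is mapped onto engines, which is exactly the hardware trade-off described above.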
Given the number of NPUs on the market, ease of programming is a significant factor in choosing a product. Some vendors believe it's best to write data-plane code directly in assembly, as the code's performance inevitably must be optimized by hand. Others believe that writing in a high-level language improves software reusability and, more importantly, speeds time to market.
The advantages of assembly code are performance and compactness. For most current applications, the data-plane code has only a few thousand lines, much of which the NPU vendor may supply in a library. As more protocols and services are added, however, the data-plane code will become more complex.
Most NPU vendors offer compilers for their packet engines, while a few vendors require customers to program in assembly language. Most compilers implement a subset of C, or at least use C-like syntax, but they don't support standard C libraries. Furthermore, packet-engine control stores are very small, ranging from 1,000 to 16,000 instructions. Consequently, existing C code can't easily be ported to an NPU. The compilers do at least offer programmers a familiar language for faster startup and prototyping. Nevertheless, even with a compiler, some assembly-level optimization is often required.
Keep in mind that coding in a high-level language is no panacea. For most NPUs, even high-level code must directly access on-chip coprocessors and be optimized to fit the chip's internal architecture (for example, pipelined or parallel). These issues reduce productivity; they also prevent code from being easily ported from one NPU to another. Although a lot of programmers might be more comfortable dealing with C-like syntax than with assembly-language mnemonics, they still need to learn the microarchitecture of the target NPU and code directly to it.
Most successful NPU vendors offer customers a choice, supplying both a compiler and an assembler, along with optimized code libraries for common data-plane functions. Some customers can then code directly in assembly language, while others may use the compiler.
NPU vendors are also providing prebuilt code to help their customers get to market faster. By supplying the basic data-plane functions, they allow customers to concentrate on higher-level features that can provide differentiation. Offerings vary widely, from basic IP-forwarding reference code to production-ready code for DiffServ, MPLS, and ATM. The amount and quality of a vendor's library code can be an important factor when choosing an NPU.
The main competitors to NPUs are general-purpose microprocessors and custom ASICs. In previous networking systems, vendors used microprocessors to perform routing functions in low-end devices because of their low cost, general availability, and ease of programming. Microprocessors don't have enough performance for high-bandwidth devices, however, so these boxes used custom ASICs.
ASICs provide ultimate control over the design. Using ASICs, a networking vendor can create highly differentiated products. On the other hand, ASICs have long design cycles, long debug cycles, and high development costs. ASIC development will generally be the critical path for these systems. New ASIC designs take 9 to 18 months and cost millions of dollars. Any bugs require chip revisions (spins) that add months to the schedule. As a result, ASIC development is the riskiest portion of system development.
Like standard microprocessors, network processors are programmable and available off the shelf, yet they can match the performance of ASICs in demanding networking applications. By buying a third-party NPU, an equipment maker bypasses the entire ASIC design cycle and its associated risk. Instead, the NPU vendor bears this cost.
NPUs replace fixed-function ASICs with a programmable design, providing additional advantages. A programmable device shortens the design cycle and is more easily modified to support new or evolving standards. Programmability not only accelerates time to market, it can even enable an NPU-based router to be field-upgraded with a new protocol, something you can't do with a hardwired solution.
Field upgrades to add new protocols need an NPU with processing headroom. If the initial product deployment uses all the processing power of the NPU, feature and protocol additions can't work using software alone. The designer should leave enough headroom for the desired lifespan of the product. Headroom might also improve time to market for the initial design. If a compiler is available, software can be written in C (producing less-efficient code) and optimized later, when processing cycles are needed for new features.
A network processor isn't always the perfect solution, however. Although NPUs can shorten hardware design cycles, they increase software development effort. Some customers have plentiful ASIC-design resources but relatively limited software-development staff. Large-scale use of NPUs will mean rebalancing these resources. For these and other reasons, ASICs will never completely disappear from networking equipment.
Unlike embedded processors, which are widely supported by third-party development tools, NPUs have to be programmed using vendor-proprietary tools. For initial designs, learning an NPU architecture and tool set can substantially lengthen the overall system-development cycle. Once an equipment vendor has an existing code base, however, subsequent design cycles will be much shorter. Now that several NPU vendors offer a range of compatible chips, an equipment maker can reuse its software across low-end, mid-range, and high-performance products.
Linley Gwennap is the principal analyst of The Linley Group (www.linleygroup.com), a technology analysis firm focusing on networking semiconductors.