
The Case For A Standalone Classification Processor

ABOUT THE AUTHOR

Feliks J. Welfeld earned a Master's Degree in Electronic Engineering from Warsaw University of Technology. Welfeld has over 20 years of experience in computer science and software engineering in various design and project management positions. Before co-founding Solidum, he worked as Senior Programmer for Cadlink Technology and as Director of Software Engineering for Aurigor Telecom Systems. Welfeld's expertise focuses on systems and software design, software development process and methodology, and real-time and low-level programming.

Service providers are searching for new ways to generate revenue and deliver enhanced communications services to their customers. This has increased the pressure on equipment manufacturers, who must deliver the networking hardware that will help service providers achieve their market objectives. Over the past three years, this pressure has translated into a quest by manufacturers for the Holy Grail of fast, flexible, and efficient packet processing. The quest began with highly integrated system-on-a-chip (SoC) options, but has migrated towards a central processor and a collection of specialized, narrow-function co-processors.

Today, the packet processing architectures that designers develop for the next generation of networking equipment, such as switches, gateways, and routers, are built around a network processor unit (NPU) and four primary functional blocks: classification, modification, encryption, and traffic management (Figure 1). The NPU acts as a workload coordinator for the more specialized, narrow-function Network Processing Elements (NPEs), and of these functions, classification is considered the most important.

Figure 1:  Network processing framework

A recent survey conducted by Key3Media and published in May 2001 showed that “56% of NPU users intend to use a classification co-processor to augment the capabilities of their NPU.” By offloading the processing-intensive classification task to a specialized processor, system designers believe they can optimize the processing power of their NPU and build more functionality into their systems. But what are the quantifiable benefits a classification processor delivers in a network-processing architecture?

The Network Processing Forum has defined the term NPE as a programmable device capable of wire-speed processing of network data. From this definition, we see two major parameters that define the effectiveness of a network-processing solution:

  • Processing speed
  • Programming flexibility.

Solidum recently completed an extensive analysis of the benefits classification processors provide in a network-processing architecture and has quantified those benefits in these key areas.

Speed and Programmability

For all NPUs and processing technologies (including RISC, CISC, VLIW, and others), speed and programmability translate into:

  • Number of instructions to be executed per unit of processing
  • Size of the code storage area.

Usually, in order to achieve the required speed, instructions have to be stored close to the processing core. This places a limitation on the size of the code being stored. In today's architectures, the code is usually limited to 8000 instructions or less.

By comparing two implementations—one based on classification being done completely by the NPU and the other based on an NPU supported by a classification processor—we can quantify the performance benefits of using a classification processor with a RISC-based integrated NPU.

In a very simple DiffServ application with Multi-Field (MF) classification (see RFC 2475) we can show that the benefits of using a classification processor are quite substantial. The results are even more significant for more complex applications that require content inspection, which is beyond the capability of most NPUs.

Speed of Processing Savings

To look at the resources required to perform classification on a traditional RISC CPU-based NPU architecture, we need to make certain assumptions. For example, assume that the policy for our classifier has the following parameters:

  • m—total number of entries in the rule table (m has no impact on the final NPU usage)
  • n—bits of interest
  • k—total bits in inspected headers.

Each rule specifies conditions imposed on the n bits of interest within the headers, which extend as deep as k bits into the PDU from its beginning.
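As an illustration only, here is a minimal C sketch of how one such rule-table entry might be laid out under the parameters above; the structure and field names are assumptions made for this sketch, not part of the original design.

#include <stdint.h>

#define K_BITS   400                    /* k: total header bits inspected  */
#define K_WORDS  ((K_BITS + 31) / 32)   /* header words examined per PDU   */

/* One of the m rule-table entries: a mask selecting the n bits of interest
 * within the first k header bits, the values those bits must take for the
 * rule to match, and the action applied when it does. */
struct rule_entry {
    uint32_t mask[K_WORDS];     /* which header bits matter (n bits set)   */
    uint32_t value[K_WORDS];    /* required values of the masked bits      */
    uint32_t action;            /* e.g. DiffServ per-hop behavior or queue */
};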

Let's assume that the RISC NPU has the following parameters:

  • p—processor word size
  • q—processor clock rate.

We will also assume that:

  • The software running on an NPU
    1. Reads k bits from a PDU
    2. Parses out an n-bit key
    3. Uses a hash table to look up this key.
  • The hash function operates on 64-bit values
  • The hash table, as well as PDU data, is stored in SSRAM
  • I/O for a read operation with the SSRAM requires four clock cycles.
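
To make the path being costed easier to follow, here is a rough C sketch of these three steps; the function names, the hash function, and the table layout are illustrative assumptions for this sketch, not actual NPU microcode.

#include <stdint.h>
#include <string.h>

#define K_WORDS      13      /* ceil(400 header bits / 32)             */
#define N_WORDS       4      /* 128-bit key as four 32-bit words       */
#define TABLE_SIZE 4096      /* hypothetical hash-table size           */

struct hash_entry {
    uint32_t key[N_WORDS];   /* the n-bit key stored in SSRAM          */
    uint32_t result;         /* classification result for this entry   */
};

static struct hash_entry hash_table[TABLE_SIZE];

/* Illustrative stand-in for the 64-bit hash function assumed above. */
static uint32_t hash64(const uint32_t *w)
{
    return (w[0] * 2654435761u) ^ (w[1] * 40503u);
}

/* Sketch of the per-PDU classification path: extract the n-bit key from
 * k bits of headers, hash it 64 bits at a time, and compare the key with
 * the table entry the hash points to. */
uint32_t classify(const uint32_t *pdu_headers,   /* k bits of headers      */
                  const uint32_t *parse_rules)   /* one rule word per word */
{
    uint32_t key[N_WORDS] = {0};
    uint32_t h = 0;

    /* 1. One SSRAM read of packet data and one of rule data per word,
     *    plus two cycles to apply the rule (the 4 + 4 + 2 in t1 below). */
    for (int i = 0, j = 0; i < K_WORDS; i++) {
        uint32_t data = pdu_headers[i];
        uint32_t rule = parse_rules[i];
        if (rule != 0 && j < N_WORDS)
            key[j++] = data & rule;
    }

    /* 2. Hash the key one 64-bit field at a time (t2 below). */
    for (int f = 0; f < N_WORDS; f += 2)
        h ^= hash64(&key[f]);

    /* 3. Read the entry the hash points to and compare keys (t3 below);
     *    collisions and misses are ignored, as in the analysis.         */
    struct hash_entry *e = &hash_table[h % TABLE_SIZE];
    if (memcmp(e->key, key, sizeof key) == 0)
        return e->result;
    return 0;
}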

For our calculations:

bits of interest = n = 128 bits
total bits in header inspected = k = 400 bits.

For the CPU:

processor word size = p = 32 bits
processor clock rate = q = 232 MHz.

Processing starts with reading the packet's protocol headers to parse out the key. One word of parsing rule data is read from SSRAM for every word of packet data read. Each SSRAM read operation, as mentioned earlier, takes four clock cycles, and applying the rule to the packet data takes another two clock cycles. Therefore, the time to extract an n-bit key out of k bits of headers is:

t1 = (k/p) reads * (4 + 4 + 2) cycles / q = 0.54 µs

Assuming 12 cycles to calculate a hash value out of 64 bits of input key data, the time to calculate a total hash value is:

t2 = (n/64 fields * 12 cycles) / q = 0.103 µs

After calculation of the hash function, the n-bit key used to calculate the hash value needs to be compared with the hash table data pointed to by the hash value. This requires an additional SSRAM read.

To be generous in our assumptions, we do not consider the possibility of a hash collision or the time required to resolve such a collision. We also do not factor in the possibility of the key not being found in the table.

The time for comparison with the hash table entry, assuming four clock cycles to read a 64-bit word from SSRAM and one cycle to compare, is:

t3 = (n/64 fields * (4 + 1) cycles) / q = 0.12 µs

Therefore, the total time that will be spent on classification can be estimated as:

total time for classification = t = t1 + t2 + t3 = 0.763 µs

At 1 Gbps, assuming a minimum PDU size of 512 bits (64 bytes) and eight execution threads (eight processing cores running in parallel), the total time available to process a PDU is:

tPDU = (512 bits x 8 engines) / 1 Gbps = 4 µs

before the next PDU arrives.

This processing involves not only classification, but also modification, traffic shaping, and queuing of the PDU. Thus, the total percentage devoted to classification is:

t / tPDU = 0.763 µs / 4 µs ≈ 19%
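
For readers who want to check the budget arithmetic, here is a small C sketch using the figures above; the 0.763 µs classification total is taken directly from the text.

#include <stdio.h>

int main(void)
{
    /* Figures taken from the analysis above. */
    const double t_classify_us = 0.763;   /* t1 + t2 + t3                */
    const double pdu_bits      = 512.0;   /* minimum 64-byte PDU         */
    const double engines       = 8.0;     /* parallel execution threads  */
    const double line_rate_bps = 1e9;     /* 1 Gbps                      */

    /* Time available to process one PDU across the eight engines. */
    double t_pdu_us = pdu_bits * engines / line_rate_bps * 1e6;

    /* Share of the per-PDU budget consumed by classification alone. */
    printf("tPDU = %.2f us, classification share = %.0f%%\n",
           t_pdu_us, 100.0 * t_classify_us / t_pdu_us);   /* about 19% */
    return 0;
}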

Even in this simple case of classification, with all the generous assumptions, a designer could add almost 20% to the PDU processing time budget by outsourcing classification processing to a specialized classification processor.

Obviously, this 20% could grow to 50% if complex encapsulations or variable-size headers are involved.

Code Space-Savings Analysis

Coding the previously described application in a RISC instruction set results in a code size of approximately 3000 instructions. Classification represents approximately 400 of those instructions.

An external classification processor produces the classification result in the form of a tag, a 32-bit value that the NPU simply reads rather than performing the classification task itself. It takes fewer than 30 instructions to read and process the tag, resulting in a saving of:

400 – 30 = 370 instructions

By using a specialized classification processor, a designer can achieve over 10% savings in the number of instructions used for PDU processing.
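
As a rough illustration of how little NPU-side code remains once classification is offloaded, here is a hedged C sketch of reading and dispatching on the tag; the register address, tag layout, and downstream handler are assumptions for this sketch, not any particular vendor's interface.

#include <stdint.h>

/* Hypothetical memory-mapped result register of the classification
 * processor; the address is an assumption for this sketch. */
#define CLASSIFIER_TAG_REG  ((volatile uint32_t *)0x40001000u)

/* Assumed downstream handler provided by the rest of the NPU code. */
extern void enqueue_pdu(uint32_t queue, uint32_t dscp);

/* Assumed tag layout: low byte selects the output queue, the next byte
 * carries the DiffServ codepoint; the rest is flow/flag information. */
static inline void handle_tag(void)
{
    uint32_t tag   = *CLASSIFIER_TAG_REG;  /* one read replaces the whole */
                                           /* classification pass         */
    uint32_t queue = tag & 0xFFu;
    uint32_t dscp  = (tag >> 8) & 0xFFu;

    enqueue_pdu(queue, dscp);
}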

Content Inspection Analysis

As with the processing-speed savings, the code-space savings shown for this simple application will be even greater for more complex applications that require packet content inspection. In fact, the classification requirements can easily exceed the capabilities of most RISC-based NPUs.

Content inspection is performed by devices that must manage Layer 7 packet classification, such as URL switches, Web load balancers, virus scanners, and firewalls. The process involves an operation known as string searching, in which the packet (the 'string') is searched for a match against one or more of many potential policy rules (the 'patterns').

Several optimized algorithms exist for string searching. For the purpose of this exercise, however, we will work with a brute-force algorithm for the following reasons:

  • It consumes the least amount of code and data space, both of which are precious resources in the NPU world. More efficient algorithms may not fit in the code space of modern NPUs, which are often limited to 2000 to 8000 instructions and must handle more than classification processing within the code space allotted.
  • It has zero setup time. Other algorithms work faster with static policies, but if the policy is dynamic, the extra processing that is required dilutes the lower setup time benefit they offer.
  • It delivers very good best-case performance, which we will use for our example.

For this analysis, we will express the string-search algorithm in RISC CPU metacode. We will also assume that the data is coming from the wire at a rate of 1 Gbps and that the CPU is running at 1 GHz, executing one instruction per clock cycle. Based on these assumptions, the CPU must process n bits with fewer than n+1 instructions in order to keep up with the flow of data.

In our implementation, the algorithm searches for a match of one of the N patterns in a string that has a length of TLen (Figure 2).

Figure 2:  Data structures

The algorithm loads a 32-bit word of the string into CPU register R0 and a 32-bit word of one of the patterns into register R1. It then compares the two registers byte-by-byte, since the pattern may not be 32-bit aligned within the searched string.

We assume that the data in our string will not match any of our patterns and that the mismatch will occur at the first byte of a pattern. There are several loops in the algorithm:

  • An outermost loop that goes over all the 32-bit words of a string
  • Another loop that goes over all N patterns
  • A third loop that goes over 4 bytes of each word.

Figure 3 shows the metacode that implements this algorithm.

Figure 3:  Brute force algorithm
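
The original figure is not reproduced here, so the following C sketch captures the loop structure it describes: an outer loop over the 32-bit words of the string, a loop over the patterns, and an inner loop over the four byte offsets within each word. The names and the early-mismatch shortcut are illustrative assumptions.

#include <stdint.h>
#include <string.h>

struct pattern {
    const uint8_t *bytes;   /* pattern data            */
    int            len;     /* pattern length in bytes */
};

/* Brute-force search: returns the index of the first pattern found in a
 * string of tlen bytes, or -1 if none matches. */
int brute_force_search(const uint8_t *str, int tlen,
                       const struct pattern pats[], int npats)
{
    for (int w = 0; w < tlen; w += 4) {          /* words of the string (R0) */
        for (int p = 0; p < npats; p++) {        /* loop over the N patterns */
            for (int b = 0; b < 4; b++) {        /* 4 byte offsets in a word */
                int pos = w + b;
                if (pos + pats[p].len > tlen)
                    break;
                /* In the case analyzed here the very first byte mismatches,
                 * so the full comparison below is rarely reached. */
                if (str[pos] == pats[p].bytes[0] &&
                    memcmp(str + pos, pats[p].bytes, pats[p].len) == 0)
                    return p;
            }
        }
    }
    return -1;
}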

The loop over the N patterns is not shown in full in the metacode; it is included only to justify placing the load of pattern data into R1 inside the enclosing loop.

Based on this example, we can tally the instructions required to process one 32-bit word. This particular example requires at least 40 instructions. We say "at least" because we do not count the iterations over the N patterns.

The loop over the 4 bytes of a word can be unrolled, which would save the decrement and branch instructions. We did not include this option, in order to maintain brevity and clarity in our sample code. In any case, unrolling does not change the outcome: more than 32 instructions are needed to process 32 bits of data.

Based on this example, in order to handle the simplest case of content inspection in a RISC-based NPU, the processor must operate at an aggregate speed well above 1 GHz. Given that PSM-based technology can handle the same task at a clock speed of 100 MHz, RISC-based implementations are neither economical nor practical.

Making the Case

If service providers are going to build application-aware IP networks that can meet the ever-increasing demand for enhanced services such as converged voice and data, streaming video, point-and-click calling, secure virtual private networks, and differentiated Quality of Service (QoS) levels, they will need fast, flexible, and efficient packet processing.

You can achieve such efficiency only with powerful network edge equipment that effortlessly integrates QoS features for a variety of traffic at blinding speeds. This equipment must have the flexibility to scale to many ports, adapt to emerging standards, and permit policy changes on the fly. You need equipment that can intelligently manage a myriad of ever-changing protocols, up to and including the application layer, without the need for forklift upgrades whenever those protocols change.

The equipment that can do all this must be built with network-processing architectures configured around a central network processor and highly specialized co-processors.

One of the most important co-processors in these architectures is the dedicated classification processor. Used properly, a classification processor can optimize network processing to deliver enhanced next-generation services with true QoS and enable service providers to create new revenue streams.
