Achieving higher performance in a multicore-based packet processing engine design

A new class of processor has begun to appear in a variety of storage, security, wireless base station, and networking applications. These processors replace the very expensive proprietary Application Specific Integrated Circuits (ASICs), which also carry long lead times, developed by OEM system solution providers as well as those designed by industry giants such as LSI Logic and IBM.

This new class of multi-core processor is made up of eight, sixteen, even sixty-four individual processor cores with integrated memory controllers, various I/O interfaces, and separate acceleration engines.

Though this new class of processor has made great strides in overcoming the limitations of earlier generation processors, not all of the “new class” of multi-core processors are created equal. Some companies that develop these processors add threading capability to overcome memory latency, and also include native 10Gbps interfaces, while others include security engines and even regular expression engines that support very special applications.

Rather than examining all the features across a number of multi-core processors and comparing them bit by bit, this paper will focus on one critical architectural element, the memory subsystem. The memory subsystem is critical because this is a major factor in determining the scalability and upper limits of performance that a processor can achieve.

The memory architectures compared here are based on two leading multi-core processors in the market today:

1. Single channel, wide cache line (Single / Wide)
2. Dual channel, narrow cache line (Dual / Narrow)

The question to be addressed is: Which architecture is superior in providing the performance necessary to keep up with the ever growing voice, video, and data traffic that the market is requiring today?

Single Channel, Wide Cache Line (Single / Wide)
The single channel, wide cache line approach uses a single memory channel as the interface between the processor and DDR2 memory. The channel is 128 bits wide and adds 16 bits of ECC, for a total of 144 bits. In this “Single / Wide” approach, cache lines of 128 bytes are used and every access to memory is a burst-of-8 read or write.

The result of this approach is that every burst to memory fills or empties a single cache line. With support for DDR2-800 memory, the Single / Wide approach has a memory bandwidth of 12.8GBps, which corresponds to a potential 100 million transactions per second, where a transaction is either a read or a write of a 128-byte cache line.

Dual Channel, Narrow Cache Line (Dual / Narrow)
The dual channel, narrow cache line architecture uses a different approach for maximizing memory performance. The “Dual / Narrow” architecture utilizes two memory channels as the interface between the processor and DDR2 memory where each channel is 64-bits wide with 8-bits of ECC.

The cache lines in this architecture are 32-bytes and every access to memory is a burst-of-4 reads or writes. This architecture similarly fills or empties an entire cache line with a single transaction. The Dual / Narrow architecture achieves the same 12.8GBps raw memory bandwidth, but reaches this figure through 400 million possible transactions per second.
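The bandwidth and transaction figures for both architectures follow from the channel width, channel count, and cache line size. The sketch below reproduces that arithmetic; the class and field names are illustrative and not taken from either processor's documentation, and the model ignores every DRAM overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryArch:
    name: str
    channels: int          # independent memory channels
    data_bits: int         # data width per channel, ECC bits excluded
    cache_line_bytes: int  # bytes moved by one transaction
    burst_length: int      # data transfers per burst

    def bytes_per_burst(self) -> int:
        # One burst on one channel moves burst_length transfers of data_bits each.
        return self.burst_length * self.data_bits // 8

    def peak_bandwidth(self, mega_transfers: float) -> float:
        # Raw bytes per second over all channels; DDR2-800 means 800 MT/s.
        return self.channels * mega_transfers * 1e6 * self.data_bits / 8

    def peak_transactions(self, mega_transfers: float) -> float:
        # Cache line reads or writes per second, ignoring all DRAM overheads.
        return self.peak_bandwidth(mega_transfers) / self.cache_line_bytes


SINGLE_WIDE = MemoryArch("Single / Wide", channels=1, data_bits=128,
                         cache_line_bytes=128, burst_length=8)
DUAL_NARROW = MemoryArch("Dual / Narrow", channels=2, data_bits=64,
                         cache_line_bytes=32, burst_length=4)

for arch in (SINGLE_WIDE, DUAL_NARROW):
    gbytes = arch.peak_bandwidth(800) / 1e9
    mtrans = arch.peak_transactions(800) / 1e6
    print(f"{arch.name}: {gbytes:.1f} GB/s, {mtrans:.0f} M transactions/s")
```

In both cases one burst moves exactly one cache line (8 x 16 bytes and 4 x 8 bytes), which is why a single transaction fills or empties a line. The same methods can be called with other DDR2 data rates.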

From a theoretical perspective, at DDR2-667 speeds, the Single / Wide memory interface performance is 83 million cache line operations per second, while the Dual / Narrow approach is 334 million cache line operations per second. However, DDR2 memory is far from ideal and has a number of factors that reduce the theoretical performance, including:

1) Refresh times
2) Bus turnaround times
3) Bank access time limitations

Simulations were developed to compare the two architectural approaches. For a typical configuration of 4GB of DDR2-667 memory and a packet classification workload as described below, the Single / Wide architecture yields 64 million cache line operations per second, while the Dual / Narrow architecture yields 204 million cache line operations per second.

It is important to note that although the Single / Wide architecture has an efficiency of 77% [64MOps actual / 83MOps potential], compared to 61% for the Dual / Narrow architecture [204MOps actual / 334MOps potential], the Dual / Narrow architecture provides more than three times the number of transactions per second. As discussed below, this plays a significant role in packet throughput in real applications.
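The efficiency and throughput comparison can be checked with a few lines of arithmetic. The sketch below uses only the theoretical and simulated figures quoted in the text; the dictionary and function names are illustrative.

```python
# Figures quoted in the text, in millions of cache line operations
# per second, for DDR2-667 memory.
THEORETICAL = {"Single / Wide": 83, "Dual / Narrow": 334}
SIMULATED = {"Single / Wide": 64, "Dual / Narrow": 204}


def efficiency(name: str) -> float:
    # Fraction of the theoretical transaction rate reached in simulation.
    return SIMULATED[name] / THEORETICAL[name]


def throughput_ratio() -> float:
    # How many times more transactions Dual / Narrow completes.
    return SIMULATED["Dual / Narrow"] / SIMULATED["Single / Wide"]


for name in SIMULATED:
    print(f"{name}: {efficiency(name):.0%} efficient")
print(f"Dual / Narrow completes {throughput_ratio():.1f}x the transactions")
```

The less efficient interface still wins on absolute transaction count, which is the quantity that matters for small lookups.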

A Common Application: Load Balancing / Packet Distribution
AdvancedTCA (ATCA) packet processor blades are often called upon to act as a front-end for an entire chassis of blades. In these applications, the packet processor connects to the network on one side and to a set of application blades on the other side.

Furthermore, the packet processor blade acts as a load balancer and allows the entire collection of application blades to appear as a single IP address, which is critical for hiding the internal complexities of the system from the network.

To understand the challenge a solution faces in performing 10Gbps of load balancing and network address translation (NAT), consider a system specified to run at 10Gbps with minimum sized 64-byte packets. This is 16.4 million packets per second in each direction, or 32.9 million packets per second through the packet processor.
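The packet rate depends on how many bytes each minimum-sized packet occupies on the wire. The article does not state the per-packet overhead it assumes; the sketch below shows that 12 bytes of overhead (an inference, not a figure from the text) reproduces the quoted 16.4 million packets per second. For comparison, counting the full 8-byte preamble and 12-byte inter-frame gap of standard Ethernet gives the more commonly cited 14.88 million packets per second.

```python
def packets_per_second(line_rate_bps: float, packet_bytes: int,
                       overhead_bytes: int) -> float:
    # Each packet occupies (payload + overhead) * 8 bit times on the wire.
    return line_rate_bps / ((packet_bytes + overhead_bytes) * 8)


LINE_RATE_BPS = 10e9

# Assumption: 12 bytes of overhead, chosen because it matches the article.
article_pps = packets_per_second(LINE_RATE_BPS, 64, 12)

# Standard Ethernet framing: 8-byte preamble plus 12-byte inter-frame gap.
ethernet_pps = packets_per_second(LINE_RATE_BPS, 64, 20)

print(f"per direction (12 B overhead): {article_pps / 1e6:.2f} Mpps")
print(f"both directions (12 B overhead): {2 * article_pps / 1e6:.2f} Mpps")
print(f"per direction (20 B overhead): {ethernet_pps / 1e6:.2f} Mpps")
```

The remainder of the analysis uses the article's figure of 32.9 million packets per second for both directions combined.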

An optimized load balancer / NAT engine will execute the following steps for each packet:

1. Receive packet and place into cache memory
2. Perform a flow lookup
3. Modify the packet header per the flow
4. Increment statistics about the packet / flow
5. Send the packet from cache to the next process

Note that this represents the best case: the packet is never stored to DRAM, only to cache memory, so the number of memory accesses is kept to a minimum.

Flow Lookup Algorithms
As packets are received into the system, they must be categorized as to whether or not they match an existing flow or are part of a new flow. This is normally done using a 5-tuple match, where the five fields that define the flow are matched against a database of existing flows:

1) Source IP Address
2) Source Port
3) Destination IP Address
4) Destination Port
5) Protocol

The most common lookup function to check a database of existing flows is a hash lookup. In a hash lookup, a key is computed from the 5-tuple and used as an index into a list of keys.

The keys point to records that define each flow and records may be chained together in case multiple 5-tuples hash to the same value. Each lookup requires a minimum of two memory lookups, one to search the list of keys and a second to retrieve the flow record. If multiple flows hash to the same key, additional memory accesses will be required to follow the list of chained records.
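A chained hash lookup of this kind can be sketched as follows. This is a minimal illustration rather than any vendor's implementation: the class names are hypothetical, Python's built-in `hash` stands in for a real packet hash, and the access counter simply models one memory access for the bucket pointer and one per flow record visited.

```python
from typing import List, NamedTuple, Optional, Tuple


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int


class FlowRecord:
    def __init__(self, key: FiveTuple, next_record: "Optional[FlowRecord]" = None):
        self.key = key
        self.next = next_record  # next record chained in the same bucket


class FlowTable:
    def __init__(self, num_buckets: int):
        self.buckets: List[Optional[FlowRecord]] = [None] * num_buckets

    def _index(self, key: FiveTuple) -> int:
        return hash(key) % len(self.buckets)

    def insert(self, key: FiveTuple) -> FlowRecord:
        # New records are pushed onto the front of the bucket's chain.
        i = self._index(key)
        self.buckets[i] = FlowRecord(key, self.buckets[i])
        return self.buckets[i]

    def lookup(self, key: FiveTuple) -> Tuple[Optional[FlowRecord], int]:
        accesses = 1  # read the bucket pointer
        record = self.buckets[self._index(key)]
        while record is not None:
            accesses += 1  # read one flow record
            if record.key == key:
                return record, accesses
            record = record.next  # collision: follow the chain
        return None, accesses


table = FlowTable(num_buckets=8)
flow = FiveTuple("10.0.0.1", 40000, "192.0.2.10", 80, 6)
table.insert(flow)
record, accesses = table.lookup(flow)
print(accesses)  # 2: one bucket read plus one record read
```

Each additional record sharing the bucket adds one access for lookups that have to walk past it.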

In order to minimize the number of collisions, the number of hash buckets is normally chosen to be at least 2x larger than the number of expected flows, and even with 2x buckets, 2.24 memory accesses will be required on average. With 10x more buckets than flows, this drops to 2.05 memory accesses per packet.
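These averages are close to what the textbook model of a chained hash table predicts. Assuming uniform hashing, a successful search examines about 1 + α/2 records, where α is the ratio of flows to buckets, and one more access reads the bucket itself. The article does not say how its figures were derived, so the sketch below is only a plausibility check, with a hypothetical function name.

```python
def expected_accesses(flows: int, buckets: int) -> float:
    # One access for the bucket pointer, plus the expected number of
    # records examined on a successful search under uniform hashing.
    load_factor = flows / buckets
    return 1 + (1 + load_factor / 2)


for multiple in (2, 10):
    avg = expected_accesses(flows=500_000, buckets=500_000 * multiple)
    print(f"{multiple}x buckets: {avg:.2f} accesses per lookup")
```

The model gives 2.25 accesses with 2x buckets and 2.05 with 10x buckets, in line with the 2.24 and 2.05 quoted above.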

Statistics. Once the flow has been located, statistics about the flow must be updated. In the highest performing NAT engines, these statistics are stored in the same cache line as the flow record, meaning that the statistics are already in memory once the flow has been located. Once the statistics are incremented, the cache line must be written back to main memory, requiring one further memory access.
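One way to picture statistics sharing a cache line with the flow record is a fixed 32-byte layout that holds both. The field layout below is purely hypothetical (the article does not describe one): a translated IPv4 address and port, two padding bytes, a packet counter, a byte counter, and a pointer to the next chained record.

```python
import struct

# Hypothetical 32-byte record: 4-byte address, 2-byte port, 2 pad bytes,
# then three 8-byte fields (packet count, byte count, next-record pointer).
FLOW_RECORD = struct.Struct("<4sHxxQQQ")
assert FLOW_RECORD.size == 32


def new_record(nat_ip: bytes, nat_port: int, next_ptr: int = 0) -> bytes:
    # Counters start at zero for a newly created flow.
    return FLOW_RECORD.pack(nat_ip, nat_port, 0, 0, next_ptr)


def count_packet(record: bytes, packet_len: int) -> bytes:
    # The counters live in the same 32 bytes that the lookup already
    # fetched, so updating them needs no extra read, only a write-back.
    nat_ip, nat_port, packets, octets, next_ptr = FLOW_RECORD.unpack(record)
    return FLOW_RECORD.pack(nat_ip, nat_port, packets + 1,
                            octets + packet_len, next_ptr)


rec = new_record(bytes([10, 0, 0, 5]), 8080)
rec = count_packet(rec, 64)
print(len(rec), FLOW_RECORD.unpack(rec)[2:4])
```

With a layout like this, the whole record and its statistics fit in one 32-byte cache line, so the only additional memory traffic is the write-back described above.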

Cache Performance. These flow lookups and statistics update operations make the cache memory perform poorly because the number of packet flows tends to be much larger than the number of cache lines, meaning that a given flow is unlikely to be in main cache at any given time.

Example: Assume 500K flows, with 4M hash buckets. If each hash bucket is an 8-byte pointer, and each flow record is 32 bytes, then the hash table is 32MB (4M * 8 bytes), and the flow table is 16MB (500K * 32 bytes). With a 2MB cache, the chance that a given flow will already be in cache is only 4% (2 / 48). With the 3.05 memory accesses required per packet (about 2.05 for the lookup plus one for the statistics write-back), the cache only has a small impact and drops the average memory accesses per packet to 2.93.
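The example can be reproduced with a simple model in which every access has the same chance of hitting the cache, equal to the cache size divided by the combined size of the two tables. This uniform-access model is a simplification and the function names are illustrative.

```python
def table_bytes(entries: int, entry_bytes: int) -> int:
    return entries * entry_bytes


def cache_hit_probability(cache_bytes: int, working_set_bytes: int) -> float:
    # Uniform-access model: any line is equally likely to be resident.
    return min(1.0, cache_bytes / working_set_bytes)


def dram_accesses_per_packet(accesses: float, hit_probability: float) -> float:
    # Only cache misses reach DRAM.
    return accesses * (1 - hit_probability)


hash_table = table_bytes(4_000_000, 8)   # 32 MB of 8-byte bucket pointers
flow_table = table_bytes(500_000, 32)    # 16 MB of 32-byte flow records
hit = cache_hit_probability(2_000_000, hash_table + flow_table)

print(f"hit probability: {hit:.1%}")
print(f"DRAM accesses per packet: {dram_accesses_per_packet(3.05, hit):.3f}")
print(f"with hit rounded to 4%: {dram_accesses_per_packet(3.05, 0.04):.3f}")
```

The unrounded hit probability of 2 / 48 gives roughly 2.92 accesses per packet; the 2.93 quoted above follows from rounding the hit probability to 4% before multiplying.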

Table 1. Comparison of memory architectures

Required Memory Performance
A highly optimized load-balancing engine / NAT engine can be created requiring on average 2.93 memory accesses per packet. Given the memory throughput for the Single / Wide and Dual / Narrow architectures discussed previously, the maximum packet rate and throughput for the two architectures can be calculated as shown in Table 1 above.

This table highlights the impact of the memory architecture differences between the Single / Wide and Dual / Narrow approaches. The Single / Wide approach is only at 66% of line rate with DDR2-667 and cannot reach 10G full-duplex even with DDR2-800 memory.

On the other hand, the Dual / Narrow architecture easily reaches 10G even with the slowest DDR2-400 memory, and with standard DDR2-667 memory the architecture delivers more than twice the memory performance required for full duplex 10GbE, providing significant headroom for additional lookups and advanced functions.
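The DDR2-667 comparison can be recomputed from figures given in the text: the simulated cache line operations per second, the 2.93 accesses per packet, and the 32.9 million packets per second needed for 10G full duplex. The sketch below covers only DDR2-667, since the simulated rates for other speed grades are not given in the text; the names are illustrative.

```python
ACCESSES_PER_PACKET = 2.93   # average for the load balancer / NAT example
LINE_RATE_MPPS = 32.9        # 10Gbps full duplex with 64-byte packets

# Simulated millions of cache line operations per second at DDR2-667.
SIMULATED_MOPS = {"Single / Wide": 64, "Dual / Narrow": 204}


def max_packet_rate_mpps(mops: float) -> float:
    # Packets per second the memory system can sustain, in millions.
    return mops / ACCESSES_PER_PACKET


def fraction_of_line_rate(mops: float) -> float:
    return max_packet_rate_mpps(mops) / LINE_RATE_MPPS


for name, mops in SIMULATED_MOPS.items():
    print(f"{name}: {max_packet_rate_mpps(mops):.1f} Mpps, "
          f"{fraction_of_line_rate(mops):.0%} of line rate")
```

This yields about 66% of line rate for Single / Wide and a little over 200% for Dual / Narrow, matching the figures quoted for DDR2-667.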

The reason for the large difference between the two architectures can be found in the cache line differences. The Single / Wide approach is designed with unusually large 128-byte cache lines, but typical network and packet processing applications require only 8- and 32-byte lookups.

As a result, most of each cache line is wasted. The Dual / Narrow architecture, on the other hand, has a cache line size of 32-bytes which more closely matches what is required in typical network and packet processing applications and results in higher performance.

Memory Access Budget. A second way to look at the problem is to calculate the number of DDR memory accesses allowed per packet at 10G full-duplex. With 32.9 million packets per second, the Single / Wide architecture allows 1.9 DDR memory accesses per packet, while the Dual / Narrow architecture permits 6 DDR memory accesses per packet. Again, the Dual / Narrow architecture provides much higher performance.
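The budget is the simulated transaction rate divided by the packet rate. A short sketch using the DDR2-667 simulation figures from earlier in the article (the function name is illustrative):

```python
def access_budget(mops: float, packets_mpps: float = 32.9) -> float:
    # DDR accesses available per packet at the given packet rate.
    return mops / packets_mpps


REQUIRED = 2.93  # average accesses per packet for the NAT example

for name, mops in (("Single / Wide", 64), ("Dual / Narrow", 204)):
    budget = access_budget(mops)
    verdict = "meets" if budget >= REQUIRED else "falls short of"
    print(f"{name}: {budget:.1f} accesses per packet, {verdict} {REQUIRED}")
```

Single / Wide falls short of the 2.93 accesses the optimized engine needs, while Dual / Narrow has roughly twice that budget.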

When evaluated against a simple load balancing / NAT application, even when highly optimized to require fewer than 3 memory accesses per packet, the Single / Wide approach cannot deliver 10Gb line rate full duplex performance, while the Dual / Narrow architecture provides twice the necessary lookup bandwidth.

Most packet processing applications are considerably more complex than this simple load balancer / NAT application and do require more lookups and statistics updates.

In addition, this analysis did not include any overhead for slow-path processing, fast-path management, or security processing, which suggests that the true performance of the Single / Wide approach will be even lower than analyzed here. Ultimately, the Dual / Narrow architecture is required to achieve 10Gbps line rates and above in network and packet processing applications.

Michael Coward is the CTO and co-founder of Continuous Computing. Mr. Coward specializes in system architecture and the design of highly available redundant platforms, including the creation of the company's Ethernet-HA architecture, which replaces the PCI bus with redundant Ethernet links and allows for the creation of highly scalable, distributed systems with superior redundancy and resiliency. Mr. Coward has long experience in the telecommunications industry. He holds an M.S. in electrical engineering from the California Institute of Technology.
