Achieving higher performance in a multicore-based packet processing engine design
Flow Lookup Algorithms
As packets are received into the system, each must be classified as
belonging to an existing flow or starting a new one.
This is normally done using a 5-tuple match, where the five fields that
define the flow are matched against a database of existing flows:
1) Source IP Address
2) Source Port
3) Destination IP Address
4) Destination Port
5) Protocol
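As a sketch, the 5-tuple can be modeled as a single hashable key (the field names and example values here are illustrative, not taken from any particular engine):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FiveTuple:
    src_ip: int     # source IPv4 address as a 32-bit integer
    src_port: int
    dst_ip: int
    dst_port: int
    proto: int      # IP protocol number, e.g. 6 = TCP, 17 = UDP

# frozen=True makes the dataclass hashable, so two packets from the
# same flow produce equal keys with equal hashes.
a = FiveTuple(0x0A000001, 12345, 0x0A000002, 80, 6)
b = FiveTuple(0x0A000001, 12345, 0x0A000002, 80, 6)
assert a == b and hash(a) == hash(b)
```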
The most common lookup function for checking a database of existing flows is a hash lookup: a key is computed from the 5-tuple and used to index into a list of matching keys.
The keys point to records that define each flow and records may be chained together in case multiple 5-tuples hash to the same value. Each lookup requires a minimum of two memory lookups, one to search the list of keys and a second to retrieve the flow record. If multiple flows hash to the same key, additional memory accesses will be required to follow the list of chained records.
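A minimal sketch of such a chained hash table (the bucket count and record contents are illustrative): the first access indexes the bucket array, the second retrieves the flow record, and each additional chained entry costs a further access.

```python
class FlowTable:
    def __init__(self, num_buckets):
        # Each bucket holds the head of a chain of (key, record, next) nodes.
        self.buckets = [None] * num_buckets
        self.num_buckets = num_buckets

    def lookup(self, key):
        # Access 1: read the bucket indexed by the hashed 5-tuple key.
        node = self.buckets[hash(key) % self.num_buckets]
        while node is not None:
            k, record, nxt = node
            if k == key:
                return record   # access 2: fetch the flow record itself
            node = nxt          # extra access per chained collision
        return None             # no match: this packet starts a new flow

    def insert(self, key, record):
        idx = hash(key) % self.num_buckets
        # Prepend to the chain so colliding flows stay reachable.
        self.buckets[idx] = (key, record, self.buckets[idx])
```

In a real engine the buckets and records live in DDR memory as fixed-size structures; the Python chains here only illustrate the access pattern.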
To minimize the number of collisions, the number of hash buckets is normally chosen to be at least 2x larger than the number of expected flows. Even with 2x buckets, 2.24 memory accesses are required on average; with 10x more buckets than flows, this drops to 2.05 memory accesses per packet.
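These averages are consistent with the standard expected-chain-length result for chained hashing: roughly 2 + α/2 accesses per successful lookup, where α is the flows-to-buckets load factor. (This approximation gives 2.25 rather than the 2.24 quoted above, which presumably comes from a slightly more detailed collision model.)

```python
def avg_accesses(flows, buckets):
    # One access for the bucket pointer, one for the flow record,
    # plus about half the load factor for extra chained nodes.
    alpha = flows / buckets
    return 2 + alpha / 2

print(avg_accesses(1, 2))    # 2x buckets  -> 2.25
print(avg_accesses(1, 10))   # 10x buckets -> 2.05
```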
Statistics. Once the flow has been located, statistics about the flow must be updated. In the highest performing NAT engines, these statistics are stored in the same cache line as the flow record, meaning that the statistics are already in memory once the flow has been located. Once the statistics are incremented, the cache line must be written back to main memory, requiring one further memory access.
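One way to get this locality is to pack the flow identity and its counters into a single cache-line-sized record; a sketch with an illustrative 32-byte layout:

```python
import ctypes

class FlowRecord(ctypes.Structure):
    # Illustrative 32-byte layout: flow identity and statistics share
    # one record, so the lookup that finds the flow also brings its
    # counters into cache.
    _fields_ = [
        ("src_ip",   ctypes.c_uint32),
        ("dst_ip",   ctypes.c_uint32),
        ("src_port", ctypes.c_uint16),
        ("dst_port", ctypes.c_uint16),
        ("proto",    ctypes.c_uint8),
        ("_pad",     ctypes.c_uint8 * 3),
        ("packets",  ctypes.c_uint64),  # statistics in the same line
        ("octets",   ctypes.c_uint64),
    ]

assert ctypes.sizeof(FlowRecord) == 32

rec = FlowRecord()
rec.packets += 1      # update stats already fetched by the lookup
rec.octets += 1514    # then the line is written back to main memory
```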
Cache Performance. These flow lookups and statistics updates cause the cache to perform poorly, because the number of packet flows tends to be much larger than the number of cache lines, so a given flow is unlikely to be in the cache at any given time.
Example: Assume 500K flows with 4M hash buckets. If each hash bucket is an 8-byte pointer and each flow record is 32 bytes, then the hash table is 32MB (4M * 8 bytes) and the flow table is 16MB (500K * 32 bytes). With a 2MB cache, the chance that a given flow will already be in cache is only 4% (2 / 48). With the 3.05 memory accesses required per packet (2.05 for the lookup plus one for the statistics write-back), the cache has only a small impact, dropping the average memory accesses per packet to 2.93.
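The arithmetic in the example can be reproduced directly (using the rounded 4% hit rate, and treating a hit as avoiding that packet's DDR accesses, which matches the quoted 2.93 figure):

```python
hash_table_mb = 4_000_000 * 8 / 1_000_000        # 32 MB of hash buckets
flow_table_mb = 500_000 * 32 / 1_000_000         # 16 MB of flow records
hit_rate = 2 / (hash_table_mb + flow_table_mb)   # 2 MB cache over 48 MB
print(round(hit_rate, 2))                        # 0.04, i.e. about 4%

avg = 3.05 * (1 - 0.04)    # accesses avoided on the ~4% of cache hits
print(round(avg, 2))       # 2.93 memory accesses per packet
```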
Table 1. Comparison of memory architectures
Required Memory Performance
A highly optimized load-balancing engine / NAT engine can be created
requiring on average 2.93 memory accesses per packet. Given the memory
throughput for the Single / Wide and Dual / Narrow architectures
discussed previously, the maximum packet rate and throughput for the
two architectures can be calculated as shown in Table 1 above.
This table highlights the impact of the memory architecture differences between the Single / Wide and Dual / Narrow approaches. The Single / Wide approach is only at 66% of line rate with DDR2-667 and cannot reach 10G full-duplex even with DDR2-800 memory.
On the other hand, the Dual / Narrow architecture easily reaches 10G even with the slowest DDR2-400 memory, and with standard DDR2-667 memory the architecture delivers more than twice the memory performance required for full duplex 10GbE; thus, providing significant headroom for additional lookups and advanced functions.
The reason for the large difference between the two architectures lies in the cache line size. The Single / Wide approach is designed with unusually large 128-byte cache lines, but typical network and packet processing applications require only 8- and 32-byte lookups, so most of each cache line is wasted. The Dual / Narrow architecture, on the other hand, has a 32-byte cache line, which more closely matches what these applications require and results in higher performance.
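The waste can be quantified directly: a 32-byte lookup in a 128-byte line leaves three quarters of the fetched data unused, while the same lookup in a 32-byte line wastes nothing.

```python
def wasted_fraction(lookup_bytes, line_bytes):
    # Fraction of each fetched cache line that the lookup never touches.
    return 1 - lookup_bytes / line_bytes

print(wasted_fraction(32, 128))   # 0.75   - Single / Wide, 32-byte record
print(wasted_fraction(8, 128))    # 0.9375 - Single / Wide, 8-byte pointer
print(wasted_fraction(32, 32))    # 0.0    - Dual / Narrow, 32-byte record
```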
Memory Access Budget. A second way to look at the problem is to calculate the number of DDR memory accesses allowed per packet at 10G full-duplex. With 32.9 million packets per second, the Single / Wide architecture allows 1.9 DDR memory accesses per packet, while the Dual / Narrow architecture permits 6 DDR memory accesses per packet. Again, the Dual / Narrow architecture provides much higher performance.
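Comparing these per-packet budgets with the 2.93 accesses the NAT engine needs shows where the headroom claim comes from (all figures are taken from the text above):

```python
# Per-packet DDR access budgets at 10G full duplex (32.9 Mpps),
# versus the 2.93 accesses per packet the NAT engine requires.
single_wide_budget = 1.9
dual_narrow_budget = 6.0
needed = 2.93

print(round(dual_narrow_budget / needed, 2))  # ~2.05: twice the needed bandwidth
print(round(single_wide_budget / needed, 2))  # ~0.65: well short of line rate
```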
Summary
When evaluated against a simple load balancing / NAT application, even
when highly optimized to require less than 3 memory accesses per
packet, the Single / Wide approach cannot deliver 10Gb line rate full
duplex performance, while the Dual / Narrow architecture provides twice
the necessary lookup bandwidth.
Most packet processing applications are considerably more complex than this simple load balancer / NAT application and require more lookups and statistics updates.
In addition, this analysis did not include any overhead for slow-path processing, fast-path management, or security processing, which suggests that the true performance of the Single / Wide approach will be even lower than analyzed here. Ultimately, the Dual / Narrow architecture is required to achieve 10Gbps line rates and above in network and packet processing applications.
Michael Coward is the CTO and co-founder of Continuous
Computing. Mr. Coward specializes in system architecture and the
design of highly available redundant platforms, including the creation
of the company's Ethernet-HA architecture which replaces the PCI bus
with redundant Ethernet links and allows for the creation of highly
scalable, distributed systems with superior redundancy and resiliency.
Mr. Coward has long experience in the telecommunications industry. He
holds an M.S. in electrical engineering from the California Institute
of Technology.