Achieving higher performance in a multicore-based packet processing engine design

Michael Coward, CTO and co-founder, Continuous Computing

January 1, 2008

A new class of processor has begun to appear in a variety of storage, security, wireless base station, and networking applications, replacing the expensive, long-lead-time proprietary Application Specific Integrated Circuits (ASICs) developed by OEM system solution providers as well as those designed by industry giants such as LSI Logic and IBM.

This new class of multi-core processor combines eight, sixteen, or even sixty-four individual processor cores with integrated memory controllers, various I/O interfaces, and dedicated acceleration engines.

Though this new class of processor has made great strides in overcoming the limitations of earlier-generation processors, not all of these multi-core processors are created equal. Some vendors add threading capability to hide memory latency and include native 10Gbps interfaces, while others include security engines and even regular expression engines that support specialized applications.

Rather than examining all the features across a number of multi-core processors and comparing them feature by feature, this paper will focus on one critical architectural element: the memory subsystem. The memory subsystem is critical because it is a major factor in determining the scalability and upper limits of performance that a processor can achieve.

The memory architectures compared here are based on two leading multi-core processors in the market today:

1. Single channel, wide cache line (Single / Wide)
2. Dual channel, narrow cache line (Dual / Narrow)

The question to be addressed is: which architecture is superior in providing the performance necessary to keep up with the ever-growing voice, video, and data traffic the market demands today?

Single Channel, Wide Cache Line (Single / Wide)
The single channel, wide cache line approach uses a single memory channel as the interface between the processor and DDR2 memory. The channel is 128 bits wide, plus 16 bits of ECC, for a total of 144 bits. In this "Single / Wide" approach, cache lines are 128 bytes and every access to memory is a burst-of-8 read or write.

The result of this approach is that every burst to memory fills or empties a single cache line. With support for DDR2-800 memory, the Single / Wide approach has a raw memory bandwidth of 12.8GBps, which corresponds to a peak of 100 million transactions per second, where a transaction is either a read or a write of a 128-byte cache line.
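The bandwidth arithmetic above can be checked with a few lines of Python; this is only a sketch, and the constants come straight from the figures in the text:

```python
# Peak figures for the Single / Wide design: a 128-bit data channel
# (the ECC bits carry no payload), 128-byte cache lines, DDR2-800 memory.
channel_bytes = 128 // 8            # 16 bytes of data per transfer
transfers_per_sec = 800_000_000     # DDR2-800 performs 800M transfers/s
cache_line_bytes = 128

bandwidth = channel_bytes * transfers_per_sec    # bytes per second
transactions = bandwidth // cache_line_bytes     # cache lines per second

print(bandwidth / 1e9, "GBps")      # 12.8 GBps
print(transactions / 1e6, "M/s")    # 100.0 million transactions/s
```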

Dual Channel, Narrow Cache Line (Dual / Narrow)
The dual channel, narrow cache line architecture uses a different approach to maximizing memory performance. The "Dual / Narrow" architecture utilizes two memory channels as the interface between the processor and DDR2 memory, where each channel is 64 bits wide with 8 bits of ECC.

The cache lines in this architecture are 32-bytes and every access to memory is a burst-of-4 reads or writes. This architecture similarly fills or empties an entire cache line with a single transaction. The Dual / Narrow architecture achieves the same 12.8GBps raw memory bandwidth, but reaches this figure through 400 million possible transactions per second.
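The same arithmetic for the Dual / Narrow design, again as a short Python sketch using only the figures quoted above:

```python
# Peak figures for the Dual / Narrow design: two 64-bit data channels,
# 32-byte cache lines, DDR2-800 memory.
channels = 2
channel_bytes = 64 // 8             # 8 bytes of data per transfer
transfers_per_sec = 800_000_000     # DDR2-800
cache_line_bytes = 32

bandwidth = channels * channel_bytes * transfers_per_sec
transactions = bandwidth // cache_line_bytes

print(bandwidth / 1e9, "GBps")      # 12.8 GBps
print(transactions / 1e6, "M/s")    # 400.0 million transactions/s
```

The raw bandwidth is identical to the Single / Wide case; only the granularity of each transaction changes.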

From a theoretical perspective, at DDR2-667 speeds, the Single / Wide memory interface performance is 83 million cache line operations per second, while the Dual / Narrow approach is 334 million cache line operations per second. However, DDR2 memory is far from ideal and has a number of factors that reduce the theoretical performance, including:

1. Refresh times
2. Bus turnaround times
3. Bank access time limitations

Simulations were developed to compare the two architectural approaches. For a typical configuration of 4GB of DDR2-667 memory and a packet classification workload as described below, the Single / Wide architecture yields 64 million cache line operations per second, while the Dual / Narrow architecture yields 204 million cache line operations per second.

It is important to note that although the Single / Wide architecture has a higher efficiency of 77% (64 Mops actual / 83 Mops theoretical) than the Dual / Narrow architecture's 61% (204 Mops actual / 334 Mops theoretical), the Dual / Narrow architecture still provides more than three times the number of transactions per second. As discussed below, this plays a significant role in packet throughput in real applications.
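These efficiency and speedup figures follow directly from the simulated and theoretical rates; a quick check in Python:

```python
# Simulated vs. theoretical cache line operations per second at DDR2-667,
# taken directly from the figures in the text.
single_actual, single_peak = 64e6, 83e6
dual_actual, dual_peak = 204e6, 334e6

single_eff = single_actual / single_peak    # ~0.77 efficiency
dual_eff = dual_actual / dual_peak          # ~0.61 efficiency
advantage = dual_actual / single_actual     # >3x more transactions/s

print(round(single_eff, 2), round(dual_eff, 2), round(advantage, 2))
```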

A Common Application: Load Balancing / Packet Distribution
AdvancedTCA (ATCA) packet processor blades are often called upon to act as a front-end for an entire chassis of blades. In these applications, the packet processor connects to the network on one side and to a set of application blades on the other side.

Furthermore, the packet processor blade acts as a load balancer and allows the entire collection of application blades to appear as a single IP address, which is critical to hiding the internal complexities of the system from the network.

To appreciate the challenge of performing 10Gbps of load balancing and network address translation (NAT), consider a system specified to run at 10Gbps with minimum-sized 64-byte packets: that is 16.4 million packets per second in each direction, or 32.9 million packets per second through the packet processor.
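The 16.4 million packets-per-second figure can be reproduced if each 64-byte frame is assumed to occupy an extra 12 bytes of inter-frame gap on the wire; this framing assumption is mine, and counting the 8-byte preamble as well would instead give the commonly cited 14.88 Mpps:

```python
link_bps = 10_000_000_000   # 10 Gbps line rate
# Assumption: 64-byte minimum frame plus a 12-byte inter-frame gap,
# which reproduces the article's 16.4 Mpps figure.
wire_bytes = 64 + 12

pps_per_direction = link_bps / (wire_bytes * 8)
pps_bidirectional = 2 * pps_per_direction

print(round(pps_per_direction / 1e6, 1))    # 16.4 Mpps per direction
print(round(pps_bidirectional / 1e6, 1))    # 32.9 Mpps total
```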

An optimized load balancer / NAT engine will execute the following steps for each packet:

1. Receive packet and place into cache memory
2. Perform a flow lookup
3. Modify the packet header per the flow
4. Increment statistics about the packet / flow
5. Send the packet from cache to the next process

Note that this represents the best case: the packet is never stored to DRAM, only to cache memory, so the number of memory accesses is kept to a minimum.
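The five steps above can be sketched in Python. Everything here, the flow table layout, the field names, and the backend-selection hash, is hypothetical and only illustrates the sequence of operations, not any particular product's data path:

```python
# Hypothetical per-packet path for a load balancer / NAT engine.
flow_table = {}   # 5-tuple -> per-flow state (backend choice, counters)

def process_packet(pkt):
    # Step 1: the packet is assumed to already sit in cache memory
    # (placed there by the receive path), so no DRAM store is modeled.
    key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])
    # Step 2: flow lookup, creating state on the first packet of a flow.
    flow = flow_table.setdefault(
        key, {"backend": hash(key) % 4, "pkts": 0, "bytes": 0})
    # Step 3: rewrite the header per the flow (NAT to the chosen blade;
    # the backend addresses are illustrative).
    pkt["dst"] = "10.0.0.%d" % flow["backend"]
    # Step 4: increment per-flow statistics.
    flow["pkts"] += 1
    flow["bytes"] += pkt["len"]
    # Step 5: hand the packet to the next stage (here, just return it).
    return pkt
```

In a real data-plane engine these steps run on dedicated cores and the flow table is sized so lookups stay cache-resident; the point of the sketch is only the order of operations and where the memory touches occur.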
