Achieving higher performance in a multicore-based packet processing engine design
A new class of processor has begun to appear in a variety of storage,
security, wireless base station, and networking applications. It
replaces the proprietary Application-Specific Integrated Circuits
(ASICs), which are very expensive and carry long lead times, developed
by OEM system solution providers as well as those designed by industry
giants such as LSI Logic and IBM.
This new class of multi-core processor is made up of eight, sixteen,
or even sixty-four individual processor cores with integrated memory
controllers, various I/O interfaces, and separate acceleration engines.
Though this new class of processor has made great strides in
overcoming the limitations of earlier-generation processors, not all
of these multi-core processors are created equal. Some companies that
develop these processors add threading capability to overcome memory
latency and include native 10Gbps interfaces, while others include
security engines and even regular expression engines that support
specialized applications.
Rather than examining all the features across a number of multi-core
processors and comparing them bit by bit, this paper will focus on one
critical architectural element: the memory subsystem. The memory
subsystem is critical because it is a major factor in determining the
scalability and upper limits of performance that a processor can
achieve.
The memory architectures compared here are based on two leading
multi-core processors in the market today:
1. Single channel, wide cache line (Single / Wide)
2. Dual channel, narrow cache line (Dual / Narrow)
The question to be addressed is: Which architecture is superior in
providing the performance necessary to keep up with the ever-growing
voice, video, and data traffic that the market demands today?
Single Channel, Wide Cache Line (Single / Wide)
The single channel, wide cache line approach uses a single memory
channel as the interface between the processor and DDR2 memory. The
channel is 128 bits wide and adds 16 bits of ECC, for a total of 144
bits. In this "Single / Wide" approach, cache lines are 128 bytes and
every access to memory is a burst of 8 reads or writes, so every burst
fills or empties a single cache line. With support for DDR2-800 memory,
the Single / Wide approach has a raw memory bandwidth of 12.8GBps, which
corresponds to a potential 100 million transactions per second, where a
transaction is either a read or a write of a 128-byte cache line.
Dual Channel, Narrow Cache Line (Dual / Narrow)
The dual channel, narrow cache line architecture takes a different
approach to maximizing memory performance. The "Dual / Narrow"
architecture uses two memory channels as the interface between the
processor and DDR2 memory, where each channel is 64 bits wide with
8 bits of ECC.
The cache lines in this architecture are 32 bytes and every access
to memory is a burst of 4 reads or writes, so this architecture also
fills or empties an entire cache line with a single transaction. With
DDR2-800 memory, the Dual / Narrow architecture achieves the same
12.8GBps raw memory bandwidth, but reaches this figure through 400
million possible transactions per second.
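The peak figures quoted for both architectures follow from simple arithmetic on transfer rate, channel width, and burst length. The Python sketch below reproduces them. The function and parameter names are illustrative only, and the sketch assumes that DDR2-800 delivers 800 million transfers per second per channel and that ECC bits are excluded from the data width.

```python
def peak_bandwidth_bytes(transfers_per_s, data_bits, channels):
    """Raw bandwidth in bytes/s: transfers x bytes per transfer x channels."""
    return transfers_per_s * (data_bits // 8) * channels


def peak_line_ops(transfers_per_s, data_bits, channels, burst_length):
    """Peak cache line transactions/s when one burst moves one cache line."""
    line_bytes = (data_bits // 8) * burst_length
    bandwidth = peak_bandwidth_bytes(transfers_per_s, data_bits, channels)
    return bandwidth // line_bytes


DDR2_800 = 800_000_000  # assumed transfers per second per channel

# Single / Wide: one 128-bit channel, burst of 8 -> 128-byte cache line
single_bw = peak_bandwidth_bytes(DDR2_800, 128, 1)
single_ops = peak_line_ops(DDR2_800, 128, 1, 8)

# Dual / Narrow: two 64-bit channels, burst of 4 -> 32-byte cache line
dual_bw = peak_bandwidth_bytes(DDR2_800, 64, 2)
dual_ops = peak_line_ops(DDR2_800, 64, 2, 4)

print(f"Single / Wide: {single_bw / 1e9:.1f} GBps, {single_ops / 1e6:.0f}M line ops/s")
print(f"Dual / Narrow: {dual_bw / 1e9:.1f} GBps, {dual_ops / 1e6:.0f}M line ops/s")
```

Both configurations move the same number of bytes per second. The difference in transaction rate comes entirely from the smaller cache line.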
From a theoretical perspective, at DDR2-667 speeds, the Single /
Wide memory interface can perform 83 million cache line operations
per second, while the Dual / Narrow approach can perform 334 million.
However, DDR2 memory is far from ideal, and a number of factors reduce
achievable performance below the theoretical figure, including:
1) Refresh times
2) Bus turnaround times
3) Bank access time limitations
Simulations were developed to compare the two architectural
approaches. For a typical configuration of 4GB of DDR2-667 memory and a
packet classification workload as described below, the Single / Wide
architecture yields 64 million cache line operations per second, while
the Dual / Narrow architecture yields 204 million cache line operations
per second.
It is important to note that although the Single / Wide architecture
is more efficient, at 77% [64MOps actual / 83MOps potential] compared
to 61% for the Dual / Narrow architecture [204MOps actual / 334MOps
potential], the Dual / Narrow architecture still provides more than
three times the number of transactions per second. As discussed below,
this plays a significant role in packet throughput in real applications.
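The efficiency comparison can be checked directly from the simulated and theoretical figures quoted above. The short Python sketch below does so. The dictionary layout and function name are illustrative, and the inputs are the article's rounded values in millions of cache line operations per second.

```python
def efficiency(actual_mops, potential_mops):
    """Fraction of the theoretical transaction rate actually achieved."""
    return actual_mops / potential_mops


# Figures quoted in the text for DDR2-667, in millions of operations/s
single_wide = {"actual": 64, "potential": 83}
dual_narrow = {"actual": 204, "potential": 334}

single_eff = efficiency(single_wide["actual"], single_wide["potential"])
dual_eff = efficiency(dual_narrow["actual"], dual_narrow["potential"])

# Ratio of achieved transactions per second, Dual / Narrow vs Single / Wide
throughput_ratio = dual_narrow["actual"] / single_wide["actual"]

print(f"Single / Wide efficiency: {single_eff:.0%}")
print(f"Dual / Narrow efficiency: {dual_eff:.0%}")
print(f"Dual / Narrow delivers {throughput_ratio:.1f}x the transactions")
```

Efficiency measures how close each design comes to its own ceiling. The throughput ratio compares what each design actually delivers, which is the more relevant figure for packet workloads.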
A Common Application: Load Balancing / Packet Distribution
AdvancedTCA (ATCA) packet processor blades are often called upon to act
as a front-end for an entire chassis of blades. In these applications,
the packet processor connects to the network on one side and to a set
of application blades on the other side.
Furthermore, the packet processor blade acts as a load balancer and
allows the entire collection of application blades to appear as a
single IP address, which is critical for hiding the internal
complexities of the system from the network.
To understand the challenge of performing 10Gbps of load balancing
and network address translation (NAT), consider a system specified to
run at 10Gbps with minimum-sized 64-byte packets. That is 16.4 million
packets per second in each direction, or 32.9 million packets per
second through the packet processor.
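The packet rate is the line rate divided by the number of bits each packet occupies on the wire. The Python sketch below shows the calculation. The per-packet overhead is not stated in the text. The 12-byte value used here is an assumption inferred from the arithmetic because it reproduces the quoted figures. With the full 20 bytes of Ethernet preamble and inter-frame gap, the same formula gives roughly 14.88 million packets per second in each direction.

```python
def packets_per_second(line_rate_bps, packet_bytes, overhead_bytes):
    """Packets/s at line rate, given payload size and per-packet wire overhead."""
    bits_on_wire = (packet_bytes + overhead_bytes) * 8
    return line_rate_bps / bits_on_wire


LINE_RATE_BPS = 10_000_000_000  # 10Gbps
MIN_PACKET_BYTES = 64

# Assumed overhead of 12 bytes per packet (inferred, not stated in the text)
per_direction = packets_per_second(LINE_RATE_BPS, MIN_PACKET_BYTES, 12)
both_directions = 2 * per_direction

# For comparison: 8-byte preamble plus 12-byte inter-frame gap on Ethernet
ethernet_rate = packets_per_second(LINE_RATE_BPS, MIN_PACKET_BYTES, 20)

print(f"Per direction:   {per_direction / 1e6:.1f}M packets/s")
print(f"Both directions: {both_directions / 1e6:.1f}M packets/s")
print(f"With 20-byte overhead: {ethernet_rate / 1e6:.2f}M packets/s")
```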
An optimized load balancer / NAT engine will execute the following
steps for each packet:
1. Receive packet and place into cache memory
2. Perform a flow lookup
3. Modify the packet header per the flow
4. Increment statistics about the packet / flow
5. Send the packet from cache to the next process
Note that this represents the best case: the packet is never stored
to DRAM, only to cache memory, so the number of memory accesses is
kept to a minimum.
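The five steps above can be sketched in Python as a single per-packet function. This is an illustration under stated assumptions, not the implementation used by any particular processor. The FlowKey, FlowEntry, and Packet types, the round-robin assignment of new flows, and the use of a dictionary as the flow table are all hypothetical choices made for clarity.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FlowKey:
    """Hypothetical 5-tuple identifying a flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int


@dataclass
class FlowEntry:
    """Per-flow state: chosen application blade plus statistics."""
    backend_ip: str
    packets: int = 0
    byte_count: int = 0


@dataclass
class Packet:
    key: FlowKey
    length: int


def process_packet(packet, flow_table, backends):
    # Step 1 (receive into cache) is represented by the packet argument.
    # Step 2: flow lookup; a new flow is assigned a blade round-robin
    # (an assumed policy, chosen only to keep the sketch deterministic).
    entry = flow_table.get(packet.key)
    if entry is None:
        entry = FlowEntry(backend_ip=backends[len(flow_table) % len(backends)])
        flow_table[packet.key] = entry
    # Step 3: modify the header per the flow (destination NAT).
    rewritten = Packet(replace(packet.key, dst_ip=entry.backend_ip), packet.length)
    # Step 4: increment statistics about the packet / flow.
    entry.packets += 1
    entry.byte_count += packet.length
    # Step 5: hand the rewritten packet to the next process.
    return rewritten


flow_table = {}
blades = ["10.0.0.1", "10.0.0.2"]
key = FlowKey("192.0.2.1", "203.0.113.10", 1234, 80, 6)
out = process_packet(Packet(key, 64), flow_table, blades)
print(out.key.dst_ip, flow_table[key].packets)
```

In this sketch, steps 2 and 4 both touch the flow table entry. In a real system these are the accesses most likely to miss in cache and reach DRAM.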