Multi-core mania has definitely hit the embedded networking market, but as the dust begins to settle it has become clear that many important architectural details need to be examined closely before decisions are made about how to partition applications across multiple cores.
The multi-core processors used today for networking equipment commonly target enterprise-level access routers, but the routers being marketed today offer much more than just layer-2 and layer-3 routing. Many higher-layer services, such as those at layers 4 through 7, are being added by networking companies trying to differentiate their products while absorbing into their equipment some of the specialized appliances that have been added to networks in the past, thus reducing the operating expenses incurred by diversified networking solutions. Many of the systems currently on the market run their control and data planes, as well as all additional services, on single-core processors. However, single-core processors are hitting a frequency ceiling set by system power budgets, a problem that simply can't be solved by transistor technology. Thus, the need for more performance and more widely differentiated services in next-generation systems makes these systems ideal candidates for multi-core devices, which offer system vendors a way to increase system performance and add new services while staying within the power budgets so often driven by system locations and end-users' pocketbooks. Multi-core processors offer outstanding performance per watt by simply dialing back core frequencies to use less aggressive transistor technology, thus saving die area and reducing static and dynamic power consumption.
Partitioning the control and data planes across multiple cores
When designing a complete solution, one of the first application decisions that must be made is how to partition the control plane, data plane, and services across multiple cores. Available multi-core devices have cores running at 600 MHz to 2 GHz, in counts ranging from 2 to 16; some niche processors opt for even higher counts. It must be determined whether a given application's control plane can run on more than one core, and, if so, how many cores are needed to provide the necessary control plane performance. If the control plane is so multi-core-unfriendly as to be unable to run on more than one core, the speed of the cores will be determined by the level of performance needed by the application's control plane, a criterion that quickly eliminates some of the lower-speed multi-core processors. Furthermore, if the control plane software can be spread across multiple cores, can it do so without an underlying OS abstracting the number of cores from the application itself? Because many control plane tasks cannot be spread across multiple cores running in Asymmetric Multiprocessing (AMP) mode, this abstraction is usually done by a Symmetric Multiprocessing (SMP)-capable operating system. However, it has been demonstrated that the performance gain when more than two cores are used is less than compelling: valuable CPU cycles are wasted just for SMP coordination tasks, and inefficiency worsens as the number of cores increases. Often, control plane processing needs push an application into a multi-core device with higher-speed cores, allowing the control plane to stay on a single core, or (at most) on two cores, while meeting the performance requirements of the application's control plane.
[Figure: Partitioning software for an access router on a multi-core processor exhibits flexibility for an application supporting both control and data plane code on the same device.]
The data plane's code is packet-processing code and is much easier to run on multiple cores operating in parallel. This code can be more easily partitioned across multiple cores, because packets can be processed in parallel as multiple cores run identical instances of the packet-processing data path code. Two primary usage models exist for processing packets across multiple cores. In one model, strict flow affinity is used to ensure limited or even zero synchronization between cores, because the cores are handling packets that have no direct relation to each other, and thus no shared state. This scheme requires flow determination at a fairly low level, ahead of the cores, so that multiple cores can be used at the same time, as well as a preprocessor to do the math. In the second model, a simpler scheduling algorithm is used, which means that more than one core can be processing packets on the same flow at the same time. Packet ordering and locking of shared data structures must then be handled at either the hardware or the software level to ensure that packet ordering is preserved and that the shared context is updated at least atomically (and sometimes in the correct order as well). Because waiting on shared data structures can create idle core cycles, optimizing this code (and also using some types of hardware offloads) can help minimize the performance loss when updating shared data structures. Targeted hardware offload of order restoration can also free up core cycles by allowing the core to move to the next packet while the hardware maintains the ordered queue.
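The flow-affinity model described above can be illustrated with a minimal sketch. It assumes that a flow is identified by a 5-tuple and that the "math" is a CRC-32 hash taken modulo the core count; the names FlowKey, core_for_flow, and distribute are hypothetical and not taken from any particular device or SDK. In a real device this step would typically run in a hardware preprocessor rather than in software.

```python
import zlib
from collections import namedtuple

# Hypothetical 5-tuple identifying a flow; field names are illustrative only.
FlowKey = namedtuple("FlowKey", "src_ip dst_ip src_port dst_port proto")


def core_for_flow(key, num_cores):
    """Map a flow to a core index by hashing its 5-tuple (assumed scheme)."""
    data = "|".join(str(field) for field in key).encode()
    return zlib.crc32(data) % num_cores


def distribute(packets, num_cores):
    """Append each (key, payload) packet to the queue of the core that owns its flow."""
    queues = [[] for _ in range(num_cores)]
    for key, payload in packets:
        queues[core_for_flow(key, num_cores)].append((key, payload))
    return queues


if __name__ == "__main__":
    a = FlowKey("10.0.0.1", "10.0.0.2", 1234, 80, 6)
    b = FlowKey("10.0.0.3", "10.0.0.2", 5678, 443, 6)
    packets = [(a, 0), (b, 0), (a, 1), (b, 1), (a, 2)]
    for core, queue in enumerate(distribute(packets, 4)):
        print(core, [payload for _, payload in queue])
```

Because every packet of a flow lands on the same queue in arrival order, per-flow state never needs to be shared between cores. The cost is that one heavy flow cannot be spread over several cores, which is what the second model trades ordering and locking work to achieve.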
Multi-core partitioning and I/O connectivity
These new multi-core devices present an interesting challenge in terms of I/O connectivity. In many instances, it is not acceptable to statically tie a particular I/O interface to a particular core. Application partitioning can be influenced by the need for resource sharing across multiple cores. In an access router application, many I/Os will be receiving packets for processing, and there are just as many ways to distribute those packets intelligently among the cores; on egress, numerous cores will want to send data out a common I/O. Many of the current solutions include some hardware-accelerated packet parsing and classification on ingress that helps them distribute packets, but an even simpler method uses one or more cores to run code that determines where to send incoming packets. When a core is used as the distribution mechanism, software can sometimes be more portable between processors; but the core can become a performance bottleneck not only for the device but for the entire system (imagine a single core attempting to service a 10 Gb/s Ethernet link just to be able to classify and distribute data to other cores!). As mentioned earlier, core frequency will continue to fall behind processing demands (a situation compounded by increasingly speedy I/O); so it is important to ensure that packet distribution mechanisms are not bottlenecks and that they are flexible enough to determine each packet's initial destination. Egress cores need to be able to drop packets to the I/O and move on without having to coordinate the usage of the I/O in a multiplexed packet division scheme. This requires a level of I/O virtualization that must be implemented in hardware.
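To put the single-core distributor scenario in numbers, the short calculation below estimates the cycle budget per packet. It assumes minimum-size 64-byte Ethernet frames (plus the standard 20 bytes of preamble, start-of-frame delimiter, and inter-frame gap on the wire) and a 1.5 GHz core; the core frequency is an illustrative assumption chosen from within the 600 MHz to 2 GHz range mentioned earlier, and the function names are hypothetical.

```python
# Per-frame wire overhead on Ethernet: 7-byte preamble + 1-byte SFD + 12-byte inter-frame gap.
ETHERNET_OVERHEAD_BYTES = 20


def packets_per_second(link_bps, frame_bytes):
    """Maximum frame rate on a link for a given frame size."""
    return link_bps / ((frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8)


def cycles_per_packet(core_hz, link_bps, frame_bytes):
    """Core cycles available per frame if one core must touch every frame."""
    return core_hz / packets_per_second(link_bps, frame_bytes)


pps = packets_per_second(10e9, 64)            # 10 Gb/s link, 64-byte frames
budget = cycles_per_packet(1.5e9, 10e9, 64)   # assumed 1.5 GHz core
print(f"{pps / 1e6:.2f} Mpps, {budget:.1f} cycles per packet")
# prints "14.88 Mpps, 100.8 cycles per packet"
```

Roughly one hundred cycles per packet has to cover receiving, parsing, classifying, and enqueuing the packet, and a single cache miss to DRAM can consume most of that budget. This is the arithmetic behind the bottleneck concern above.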
[Figure: A multi-core platform has many shared resources within the platform including I/Os, memory, and accelerators.]
Sharing resources in multi-core
The partitioning of the application across multiple cores also means partitioning any shared resources, such as off-chip memory and on-chip caches. As the number of cores on a single die increases, off-chip memory bandwidth and on-chip L2 cache capacity have not kept pace, constrained by pin count and die size. Because control plane, data plane and services code all have different-sized instruction footprints, process states and shared data, a multi-core processor must be able to partition memory resources effectively. The more flexible the cache allocation schemes and cache hierarchy on the device are, the better performance will be. When multiple cores share resources such as caches, cache thrashing becomes very common, because the applications on each core cast out data in use by applications on other cores, causing a performance hit across the SoC and, in the worst case, causing the cores to run strictly out of external DRAM. In most applications, the ideal solution combines moderate-sized dedicated caches for the cores (for non-shared state) with larger, shared caches (for shared state and non-shared overflow) that can be flexibly allocated in such a way that processes and applications on one core cannot affect the data cached for use by applications running on other cores. This is important not only for performance reasons, but also because hardware mechanisms must be in place for access protection. Cases will undoubtedly occur in which software running on some cores should be isolated from software running on other cores. Such isolation allows trusted and untrusted software to be included on an integrated platform without fear of rogue code impacting the execution of critical code, and it helps localize the impact of software bugs to only those cores involved in a specific function.
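The effect of one core casting out another core's cached data can be seen in a toy simulation. The sketch below is a deliberately simplified model, not a description of any real cache: it assumes a fully associative cache with LRU replacement, one core ("A") that reuses a small working set, and a second core ("B") that streams through data it never touches again. The class and function names, cache sizes, and access pattern are all illustrative assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line, evicting the LRU line if full."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)
        self.lines[addr] = True
        return False


def hit_rate_for_core_a(cache_a, cache_b, rounds=100, working_set=6):
    """Interleave core A (reused working set) with core B (streaming); return A's hit rate."""
    hits = 0
    for i in range(rounds):
        if cache_a.access(("A", i % working_set)):
            hits += 1
        cache_b.access(("B", i))  # B never reuses a line
    return hits / rounds


shared = LRUCache(8)
shared_rate = hit_rate_for_core_a(shared, shared)                    # one 8-line cache for both
partitioned_rate = hit_rate_for_core_a(LRUCache(6), LRUCache(2))     # same 8 lines, split 6 + 2
print(shared_rate, partitioned_rate)  # prints "0.0 0.94"
```

With the same total capacity, core A goes from never hitting in the shared cache to missing only on its six cold accesses once its lines are protected from core B. Real devices partition set-associative caches by other means, but the isolation principle is the same.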
Another issue that must be examined is how the cores use shared resources such as hardware accelerators (which include pattern matching and encrypt/decrypt engines, among other things). These hardware resources are necessary to support higher-layer features like intrusion prevention, virtual private networking, and stateful firewall at the higher performance levels required by the enterprise market today. How are these resources managed to make sure each core gets its allocated share of the available resources? Hardware acceleration can be added to each core in the form of specialized instructions or added as a look-aside accelerator shared by many cores. Packet processing code must be specially designed to take best advantage of hardware acceleration. If acceleration is added in the core, it is available equally to all cores, something that is easy to program and to use but rarely best for performance. If any core is running code that doesn't need a particular acceleration resource, that resource is unavailable to other cores, a common inefficiency in this type of distributed acceleration. Yet, when partitioning software across cores, it is rarely possible to do so in such a way that all cores require the same acceleration.
Another deficiency of this approach is that it provides acceleration only by optimizing instructions for a particular task, which offers no offload. At any moment the core is either doing its normal packet processing or using its execution unit to perform the accelerated task; it cannot move on to process another packet while it is executing the optimized instructions. Packet-processing code using look-aside acceleration, by contrast, can process a packet, pass it to the accelerator for processing and begin processing another packet, acting in a much more "pipelined" fashion. The accelerator returns the packet to a core for further processing after completing its task. This type of acceleration allows the cores to make full use of the accelerator's performance. Although it creates some overhead in the form of commands sent to and received from the accelerator, overall it increases the core's instructions per cycle by freeing the core to do other processing while a packet is being processed by the accelerator (in other words, acceleration-and-offload). It also demands a hardware virtualization service such that cores can send work to, and receive it from, the accelerator without relying on software to coordinate its use.
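The difference between acceleration alone and acceleration-and-offload can be expressed as a simple throughput model. The sketch below is a back-of-the-envelope model under stated assumptions, not a measurement: every cycle count and rate in it is hypothetical, and the function names are invented for illustration. It assumes the look-aside accelerator runs fully in parallel with the core, so the core pays only the command overhead.

```python
def in_core_pps(core_hz, sw_cycles, accel_cycles_on_core):
    """Per-core packet rate when accelerated instructions execute on the core itself.

    The core is busy for both the ordinary packet-processing cycles and the
    cycles spent in the specialized instructions.
    """
    return core_hz / (sw_cycles + accel_cycles_on_core)


def look_aside_pps(core_hz, sw_cycles, cmd_overhead_cycles, accel_pps):
    """Per-core packet rate with a look-aside accelerator working in parallel.

    The core spends only its software cycles plus the command overhead per
    packet; throughput is capped by whichever of core or accelerator is slower.
    """
    core_limit = core_hz / (sw_cycles + cmd_overhead_cycles)
    return min(core_limit, accel_pps)


# Illustrative numbers only: 1 GHz core, 500 cycles of software per packet,
# 1500 cycles if the crypto runs on the core, 100 cycles of command overhead,
# and an accelerator able to sustain 2 million packets per second.
print(in_core_pps(1e9, 500, 1500))              # 500000.0
print(round(look_aside_pps(1e9, 500, 100, 2e6)))  # 1666667
```

Under these assumed numbers the core more than triples its packet rate despite the added command overhead, because the accelerator's work no longer occupies the core. The min() also shows the other side of sharing: once enough cores feed one accelerator, the accelerator itself becomes the limit.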
Single-core devices can no longer keep up with the performance needs of networking equipment customers. As network bandwidth and demand for high-level services increase, multi-core offerings are the only way to reach the levels of packet-processing needed today. Today's multi-core processors include shared resources such as I/Os, caches and external memory; and accelerators are becoming increasingly vital in attaining the performance levels necessary for highly advanced networking. These factors help make partitioning control plane and data plane code across multiple cores a particularly important issue for the software engineers and systems designers who create advanced networking equipment.
David Kramer (david.kramer@freescale.com) is Chief Architect, IP Development, and Steve Cole (steve.cole@freescale.com) is Senior System Architect, for Freescale Semiconductor's Networking & Multimedia Group.