Transitioning to multicore processing

Hesitating to make the shift from single- to multiple-core processing in your design? Here's a guide to making the transition.

The transition to multicore processing requires changing the software programming model, scheduling, partitioning, and optimization strategies. Software often requires modifications to divide the workload among cores and accelerators, to use all available processing resources in the system and maximize performance. Here's how you and your team can make the switch.

Networking systems, for example, normally include control-plane and data-plane software (shown in Figure 1). The control plane is responsible for managing and maintaining protocols (such as OSPF, SNMP, IPSec/IKE) and other special functions such as high-availability processing, hot plug and play, hot swap, and status backup. Control-plane functions include management, configuration, protocol handshaking, security, and exceptions. These functions are reliability sensitive but not extremely time sensitive. Normally, control-plane data packets/frames occupy only ~5% of the overall system load.

[Figure 1]

Data-plane functions focus on high-throughput data processing and forwarding. Once the required connections and links are established by the control plane, most traffic is data-plane packets. Normally ~95% of the overall system load will be for data-plane packets and frames. Therefore, overall system throughput and performance primarily depends on data-plane processing capacity, and any optimization in this area can significantly increase system performance. The data plane's software complexity is lower, primarily focusing on packet header analysis, table lookups, encapsulation/decapsulation, counting and statistics, quality of service (QoS), and scheduling, among others.

Example migration
A network router is a good example of the migration from single-core to multicore processing. The software architecture for these products has evolved over the last several years:

  • Unit routers– All software runs on a single-core CPU, including all the control-plane and data-plane modules. These modules are standalone tasks/processes/threads running on a real-time operating system (RTOS). Software integrators must carefully adjust the priorities of each task to achieve improved system performance. Certain high-performance functions such as table lookups (FIB, 5-tuple classify, NAT) are performed in software, often with the help of offload assistant engines (encryption/decryption/authentication) running on an FPGA, ASIC, or other acceleration device connected to the CPU for IPSec-related applications. This architecture is for low-end and ultra-low-end unit routers. System performance is lower due to centralized processing on the CPU core.
  • Chassis routers– These products have a more distributed system architecture without significant support from an ASIC. The main processing unit (MPU) cards manage control-plane jobs; line processing unit (LPU) cards manage data-plane jobs. Each MPU and LPU card contains one single-core CPU, and these CPUs are connected to each other through the backplane (normally an FE/GE switched fabric). All user-facing interfaces are provided on the LPU cards; the MPU cards provide only management and heartbeat/backup interfaces. The LPU cards may have optional acceleration engines (encrypt/decrypt/authenticate FPGAs/ASICs) sitting beside the CPU. The master MPU discovers the routing topology and distributes the FIB (forwarding information base) entries to each LPU. The LPUs do the data-plane jobs (forwarding and so forth) for user data packets. Both the MPUs and LPUs run multiple tasks on top of an RTOS. Overall system performance is much better than in unit routers due to the distributed processing and LPU scalability.
  • Chassis high-end routers– These routers use a distributed system architecture with an ASIC or network processor (NP). Each LPU card contains additional acceleration (ASIC or NP) powerful enough to perform the data-plane jobs at high speed. Normally, the backplane connecting all the ASICs/NPs is built from a dedicated crossbar or switch fabric, and the general-purpose CPU on each LPU card handles the IPC (inter-processor communication) jobs and configures the ASIC/NP tables. Some differences between the ASIC architecture and the NP architecture exist: the ASIC can provide higher and steadier data-processing rates than the NP, while the NP provides more flexible functionality. The MPUs and LPUs run multiple tasks over an RTOS.

For all three architectures described, the software running on each CPU is still a logical standalone system–the programming model is still single-core. Even for distributed systems, the key system resources are still managed by each CPU, with limited IPC between the CPUs.

Making the switch
When porting to a multicore system, you and your team will be addressing:

  • The overall system partition (mainly cores, memory, and port resources).
  • The operating system (control-plane OS section and migration, data-plane bare-board or light-weight run-time environment).
  • The working architecture of data-plane cores (functionality bound to each core/core-group).
  • A mutex mechanism.
  • The system sharing of data-plane tables among all data-plane cores (what shared memory mechanism to use).
  • The intercore communication mechanism.
  • Whether to use system and CPU global variables.
  • How to migrate the Rx/Tx driver.
  • The architecture-specific accelerators.
  • Communications between control-plane and data-plane partitions.

Partitioning software system
The software system must be partitioned into two parts–control plane and data plane. First decide how many cores to assign for control-plane use and how many for data-plane use. You can use engineering estimates of the software's performance requirements to determine the number of cores.

Migrating control-plane software
The control-plane partition will normally run an OS such as Linux or even an RTOS, to provide a multitasking environment for the user software components. Migrating the OS is fairly straightforward, and most legacy control-plane software components will not require large changes for this migration. But a few key points need attention:

  • For the single-core architecture system, control-plane software shares all the data-plane tables within the same CPU memory space. Updating these tables requires a direct “write” with semaphore-like mutex protection. On a multicore platform, the table-update actions are different–table updates are performed either by sending self-defined messages to the data-plane cores or via a direct write to the shared table (memory shared between partitions/cores) with spinlock/RCU mutex protection.
  • When using more than one core in the control-plane partition, the most common configuration is the symmetric multiprocessing (SMP) mode. You should check the legacy multitasking software to make sure it will run correctly and efficiently in an SMP environment, especially the inter-task communication (mutex or synchronization) mechanisms.
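To make the direct-write option concrete, here is a minimal sketch of a shared FIB protected by a lock. All names (fib_update(), fib_next_hop(), the table layout) are illustrative, and a toy spinlock built on C11 atomic_flag stands in for the platform's spinlock (or RCU) primitive:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy spinlock: stands in for the platform spinlock primitive. */
static atomic_flag fib_lock = ATOMIC_FLAG_INIT;
static void fib_lock_take(void)    { while (atomic_flag_test_and_set(&fib_lock)) ; }
static void fib_lock_release(void) { atomic_flag_clear(&fib_lock); }

/* Hypothetical FIB entry in memory shared between partitions/cores. */
struct fib_entry {
    uint32_t prefix;
    uint32_t next_hop;
};

#define FIB_SIZE 256
static struct fib_entry fib_table[FIB_SIZE];

/* Control-plane side: direct write to the shared table under the lock. */
void fib_update(uint32_t idx, uint32_t prefix, uint32_t next_hop)
{
    fib_lock_take();
    fib_table[idx].prefix   = prefix;
    fib_table[idx].next_hop = next_hop;
    fib_lock_release();
}

/* Data-plane side: lookup under the same lock. */
uint32_t fib_next_hop(uint32_t idx)
{
    fib_lock_take();
    uint32_t nh = fib_table[idx].next_hop;
    fib_lock_release();
    return nh;
}
```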

Migrating data-plane software
Migrating data-plane software to multicore is more difficult. The data-plane partition will typically perform:

  • Data-packet processing.
  • Data communication with the control-plane partition.
  • Management proxy processing.

The legacy data-plane software typically runs on an RTOS that supports a multitasking environment. Data-packet processing follows a run-to-completion execution model, executing in one single task/process/kernel-thread. For example, in VxWorks, data-packet processing is done in the tNetTask context; in Linux, it's done in the NET_RX_SOFTIRQ software-interrupt context. Whether using tNetTask or a softirq, the priority must be high to prevent preemption during processing and to keep overall system performance as high as possible.
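As a sketch of the run-to-completion model (a tiny in-memory ring stands in for the real Rx hardware; all names here are illustrative, not an RTOS API):

```c
#include <assert.h>

/* A tiny in-memory "Rx ring" stands in for the hardware. */
struct pkt { unsigned char ttl; };

#define RING_SZ 8
static struct pkt rx_ring[RING_SZ];
static int rx_head, rx_tail;
static int tx_count;

/* Test helper: place a packet on the fake Rx ring. */
void seed_rx(unsigned char ttl) { rx_ring[rx_tail++].ttl = ttl; }

static int rxpkt_from_hw(struct pkt *p)
{
    if (rx_head == rx_tail)
        return -1;              /* nothing pending */
    *p = rx_ring[rx_head++];
    return 0;
}

static void txpkt_to_hw(const struct pkt *p)
{
    (void)p;
    tx_count++;                 /* "transmit" = count it */
}

/* Run-to-completion: each packet is received, processed, and transmitted
   in one pass, with no task switch in between. */
int dataplane_poll(void)
{
    struct pkt p;
    int forwarded = 0;
    while (rxpkt_from_hw(&p) == 0) {
        if (p.ttl <= 1)
            continue;           /* drop expired packets */
        p.ttl--;                /* header rewrite */
        txpkt_to_hw(&p);
        forwarded++;
    }
    return forwarded;
}
```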

The management-proxy component in legacy software is typically composed of one or more tasks running in parallel with the data-packet-processing task. The proxy component waits for management or configuration instructions from the control-plane modules to update the data tables or to perform other high-priority tasks. These tasks must have a priority as high as or even higher than the data-packet-processing task. Since the management-proxy task doesn't execute often, the data-packet-processing task will not be preempted often, and impact on system performance will be minor.

When migrating to a multicore environment, the most efficient way to configure the data-plane partition is to run in a “bare-metal” mode or a similar lightweight executive (LWE) mode. These are run-to-completion environments and are more efficient than a multitasking environment.

At first glance, it may seem relatively straightforward to migrate legacy data-packet-processing task code to a multicore environment, since these tasks are run-to-completion code written in standard C. That's true from a functional perspective. But on the data plane, performance is king and the number-one concern, and achieving the highest possible performance requires additional optimization.

Parallel processing of data packets
Consider the execution flow of the data-packet-processing routing function in Figure 2. This is a simple routing process. The code can be easily ported to the data-plane cores to run in parallel, as shown in Figure 3.

[Figure 2]

[Figure 3]

Multicore processors allow cores to share traffic from a common user port. For example, Freescale's P4080 multicore device uses the Frame Manager (FMan) acceleration block, which can load-balance traffic from one port across a group of Frame Queues (FQs). Any of the data-plane cores can receive packets to process from any of these FQs, while packet order within each flow is preserved.
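The per-flow ordering guarantee comes from hashing each flow to one queue. A software sketch of that idea follows; the hash and queue count are illustrative, and FMan does this matching in hardware:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_FQS 4

/* Hash a (reduced) flow tuple to a frame queue. Every packet of a given
   flow hashes to the same FQ, so per-flow order is preserved even though
   different cores service different FQs. */
unsigned pick_fq(uint32_t saddr, uint32_t daddr, uint16_t sport, uint16_t dport)
{
    uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
    h ^= h >> 16;
    return h % NUM_FQS;
}
```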

In the pseudo code in Figure 2, rxpkt_from_hw() and txpkt_to_hw() are architecture-specific driver code. On the P4080 platform, this code draws packets from FQs and feeds packets to FQs with the help of a Queue Manager (QMan). This is different from single-core devices, where the Ethernet controller's RxBD and TxBD rings are memory-mapped into CPU memory space for direct access, which prevents the Ethernet port from being shared by more than one core.

The classify(), ip_table_lookup(), and arp_lookup() functions differ in one key way between multicore platforms and single-core devices: the lookup tables are shared among all CPU cores. As shown in Figure 4, three memory types are used on modern multicore platforms:

  • Core private memory.
  • Partition global memory shared among cores.
  • Global memory shared among partitions.

[Figure 4]

On a single-core system, lookup tables are protected by semaphores to provide mutual exclusion. On multicore systems, lookup tables are often protected by spinlocks. But for lookup tables that are read by the data-plane cores far more often than they're written by the control-plane cores, a better choice is RCU (read-copy-update).
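The appeal of RCU is that readers pay no locking cost. Here is a much-simplified single-writer illustration of the idea, using a C11 atomic pointer swap; real RCU also handles grace periods and memory reclamation, which this sketch omits, and all names are illustrative:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct route_table { uint32_t next_hop[16]; };

/* Two static buffers double-buffer the table; real code would allocate. */
static struct route_table tbl_a, tbl_b;
static _Atomic(struct route_table *) live_table = &tbl_a;

/* Data-plane reader: no lock, just follow the published pointer. */
uint32_t lookup_next_hop(unsigned idx)
{
    struct route_table *t =
        atomic_load_explicit(&live_table, memory_order_acquire);
    return t->next_hop[idx];
}

/* Control-plane writer: copy the live table, modify the copy, publish it
   with one atomic store. */
void update_next_hop(unsigned idx, uint32_t nh)
{
    struct route_table *cur = atomic_load(&live_table);
    struct route_table *next = (cur == &tbl_a) ? &tbl_b : &tbl_a;
    *next = *cur;                                 /* copy */
    next->next_hop[idx] = nh;                     /* update */
    atomic_store_explicit(&live_table, next,
                          memory_order_release);  /* publish */
}
```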

Different memory areas have different allocation/free APIs for software use. For example, in the P4080's LWE environment, a core's private-memory block is allocated with tlmalloc() while a partition's global-memory block is allocated with malloc(). This also needs to be considered when migrating.

In addition to memory blocks, global variables are divided into multiple types:

  • percpu global-variables.
  • Partition global variables among cores.
  • Global variables among partitions.

As with memory blocks, the different types of global variables are used differently. For example, the PERCPU macro is used to define percpu global variables.
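One way to picture a percpu facility is an array with one slot per core, indexed by the current core id, so each core touches only its own slot and needs no locking. This stand-in is only illustrative; the real LWE PERCPU macro is implemented differently:

```c
#include <assert.h>

#define MAX_CORES 8

/* Faked "current core" for illustration; set by the runtime on hardware. */
static int current_core;
void set_core(int c) { current_core = c; }

/* Each core sees only its own slot, so no locking is needed. */
#define PERCPU_DEF(type, name) static type name[MAX_CORES]
#define PERCPU_GET(name)       (name[current_core])

PERCPU_DEF(unsigned long, pkt_count);   /* per-core packet counter */

unsigned long bump_pkt_count(void) { return ++PERCPU_GET(pkt_count); }
unsigned long read_pkt_count(void) { return PERCPU_GET(pkt_count); }
```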

In some applications, routing functions are pipelined: one core does the classify() operations and delivers packets to downstream cores to do the ip_table_lookup() and other functions. However, a pipelined approach doesn't take advantage of warm caches. The parallel approach described earlier should provide better overall system performance, because each core's cache stays warm across the whole per-packet flow.

Hybrid approaches
Figure 5 shows a typical QoS-routing process. Data packets will not be sent out directly but queued into a set of software queues. An additional scheduling task will de-queue these packets from the software queues and send them out in a given sequence.

[Figure 5]

On legacy single-core systems, this queue/de-queue operation requires two tasks–one for packet processing and the other for scheduling. The software queues are shared between the two tasks and protected by semaphores, as shown in the pseudo code in Listing 1.

[Listing 1]
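In lieu of the original listing, here is a hedged reconstruction of the idea: a software queue shared between the packet-processing task and the scheduling task, guarded by a lock (a semaphore on the legacy RTOS). The sem_take()/sem_give() stubs stand in for the real semaphore calls, and all names are illustrative:

```c
#include <assert.h>

#define QDEPTH 16
static int softq[QDEPTH];
static int q_head, q_tail;

/* Stubs standing in for the RTOS semaphore take/give calls. */
static void sem_take(void) { /* block until the queue lock is free */ }
static void sem_give(void) { /* release the queue lock */ }

/* Packet-processing task: queue the packet instead of sending it out. */
int enqueue_pkt(int pkt)
{
    int ok = 0;
    sem_take();
    if ((q_tail + 1) % QDEPTH != q_head) {   /* queue not full */
        softq[q_tail] = pkt;
        q_tail = (q_tail + 1) % QDEPTH;
        ok = 1;
    }
    sem_give();
    return ok;
}

/* Scheduling task: de-queue in order and transmit. */
int dequeue_pkt(int *pkt)
{
    int ok = 0;
    sem_take();
    if (q_head != q_tail) {                  /* queue not empty */
        *pkt = softq[q_head];
        q_head = (q_head + 1) % QDEPTH;
        ok = 1;
    }
    sem_give();
    return ok;
}
```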

In multicore systems, it's more difficult to allocate these operations to data-plane cores in parallel because of the restriction of having only one QoS scheduler in the system. The ingress pipe (yellow blocks in Figure 5) can run in parallel on multiple cores, while the egress pipe (orange blocks) must run on a single core.

Figure 6 shows the partitioning of ingress and egress processing on a multicore device. The data-plane cores are configured in two groups–one for the ingress pipe, the other for the egress. In this case, the egress pipe's core group has only one core. The ingress pipe's cores do the data-packet-processing tasks. The egress pipe's core does the scheduling tasks. The shared soft-queues are protected with spinlocks.

[Figure 6]

Data communication with control-plane partition

On legacy single-core systems, control-plane packets (such as management and protocol-handshaking packets) are branched to the corresponding control tasks at the IP-stack level, according to the destination IP address (local host, multicast, broadcast) or IP-protocol value (OSPF, BGP, IGMP).

Multicore systems often incorporate hardware assist mechanisms (in the P4080, the Parse-Classify-Distribute (PCD) hardware block) on the ingress side to exact-match control-plane packets. For matched packets, the control plane's receive frame queues (Rx-FQs) are selected; on an exact-match miss, packets are enqueued to the data plane's Rx-FQs by default. Thus, it's still possible for data-plane cores to receive packets that belong to the control plane. Occasionally, packets that cannot be forwarded (due to no route) must also be delivered to the control plane for ICMP error replies.
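In software terms, the selection logic amounts to something like the following. The IP protocol numbers are real IANA values, but the queue ids and function name are illustrative, and the P4080's PCD performs this matching in hardware:

```c
#include <assert.h>
#include <stdint.h>

#define PROTO_IGMP  2   /* IANA IP protocol numbers */
#define PROTO_OSPF 89

#define RXQ_CONTROL 0   /* illustrative queue ids */
#define RXQ_DATA    1

/* Exact-match control-plane traffic to the control Rx queue; everything
   else defaults to the data plane's Rx-FQs. */
int select_rxq(uint8_t ip_proto, int dest_is_local)
{
    if (dest_is_local &&
        (ip_proto == PROTO_OSPF || ip_proto == PROTO_IGMP))
        return RXQ_CONTROL;     /* exact-match hit */
    return RXQ_DATA;            /* exact-match miss: data plane by default */
}
```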

To port this software from a single-core system to a multicore system, a data channel between the control-plane and data-plane partitions must be established. For example, in the P4080, queue manager frame queues (QMan FQs) can be used as the data channel. This approach eliminates the need for a spinlock, provides a uniform interface, and is more efficient than the common shared-memory (software message queue) approach.

An internal message communication system is also needed for efficient processing. In addition to the user-data-packet information, other control information (such as reason, actions, and src_port) is needed by the control-plane partition for robust processing. For example, the scatter/gather buffer structure on the P4080 makes it efficient to prepend an additional message to the original data packet before delivering it to the control plane.
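A sketch of such an internal message: a small header carrying the control information, prepended to the packet before it is handed to the control plane. The field names and the flat copy are illustrative; on the P4080, a scatter/gather descriptor would link the header to the packet without copying:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative internal-message header; field names are hypothetical. */
struct dp_msg_hdr {
    uint16_t reason;        /* e.g. no route: needs ICMP reply */
    uint16_t action;
    uint16_t src_port;
    uint16_t pkt_len;
};

#define REASON_NO_ROUTE 1

/* Prepend the header to the packet, producing the message delivered to
   the control-plane partition. */
size_t build_cp_message(uint8_t *out, const uint8_t *pkt, uint16_t pkt_len,
                        uint16_t reason, uint16_t src_port)
{
    struct dp_msg_hdr h = { reason, 0, src_port, pkt_len };
    memcpy(out, &h, sizeof h);
    memcpy(out + sizeof h, pkt, pkt_len);
    return sizeof h + pkt_len;
}
```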

Management proxy
Management proxies also vary between single-core and multicore systems. Multicore systems targeted at network processing have many management and configuration instructions supporting communication from the control plane to the data plane. Examples include table updates, IP-address/MAC-address configuration, statistics collection, core-state changes, and core regrouping. Some of these management and configuration operations can be performed directly through global shared memory; others cannot. A control channel and internal message system are needed to implement those that global shared memory doesn't support.

Rx/Tx drivers

Different device architectures have different receive/transmit (Rx/Tx) drivers. Multicore devices often must share Ethernet ports, which requires a significantly different Ethernet driver implementation from the legacy memory-mapped BD-ring approach used on most single-core devices.

For example, on the P4080, the Ethernet ports are virtualized, which raises new congestion-avoidance concerns.

Here's example legacy Ethernet driver pseudo code for transmitting a packet to hardware (txpkt_to_hw()):

int txpkt_to_hw(void)
{
    ....
    if (enque_pkt_to_txbd() == OK) {
        /* TxPkt successful */
    } else {
        /* dport in congestion */
        do_congestion_avoidance();
    }
    return 0;
}

In this example, the enque_pkt_to_txbd() function returns the congestion state directly in a synchronous operation. The legacy congestion-avoidance code can be executed immediately after the TxPkt call.

On a multicore device like the P4080 running the LWE, txpkt_to_hw() is implemented differently:

int txpkt_to_hw(void)
{
    ....
again:
    if (qman_enqueue() == OK) {
        /* EQCR successful, but the frame may still be rejected by FQs */
    } else {
        /* EQCR in congestion, but can't indicate which FQs are in congestion */
        /* can't do_congestion_avoidance(); */
        goto again;
    }
    return 0;
}

TxBDs in this case are virtualized by the queue manager frame queues. One QMan portal can access many QMan FQs, so a full EQCR status cannot indicate which FQ is in congestion (it's no longer the legacy “synchronous” mode). The congestion-avoidance software must be changed to an asynchronous approach, which may or may not be desirable.
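One asynchronous pattern is to track per-FQ congestion state from a notification callback and consult it before enqueueing, rather than reacting to a synchronous enqueue failure. This is only a sketch of the pattern; the names are illustrative and not the QMan congestion-notification API:

```c
#include <assert.h>

#define NUM_FQS 4
static int fq_congested[NUM_FQS];

/* Called from the (hypothetical) congestion-state-change notification. */
void on_congestion_change(int fq, int state)
{
    fq_congested[fq] = state;
}

/* Transmit path: do congestion avoidance up front, based on the latest
   asynchronous notification, instead of on a synchronous enqueue result. */
int txpkt_to_fq(int fq)
{
    if (fq_congested[fq])
        return -1;              /* avoid: drop or defer this packet */
    /* qman_enqueue(...) would go here */
    return 0;
}
```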

Rob Oshana is director of Global Software R&D for Networking and Multimedia at Freescale Semiconductor. He is also an adjunct professor at Southern Methodist University where he teaches graduate software engineering courses.

Shuai Wang is the team leader of the CDC Multicore team in Chengdu, China. He has a master's degree in computer science and 12 years of experience in embedded and real-time software systems, primarily in the networking and security fields.

This article was provided courtesy of Embedded Systems Design magazine, where this material was first printed.
Copyright © 2011 UBM. All rights reserved.
