Freescale’s Robert Oshana walks the embedded software developer through a multicore “decision tree” for selecting software components best suited to your application, such as RTOS, Linux, RT-Linux, or none.
Multicore processing has reached the deployment stage in embedded systems. The transition from single to multicore processors is motivated primarily by the need to increase performance while conserving power. This has placed more responsibility on the shoulders of software engineers to get it right.
The first step is being able to select the right multicore software architecture for the application. There are several ways to benefit from multicore processing. Software migrations will most likely start from serial code bases. Therefore, the target software design needs to identify the solution to meet the migration requirements.
There are several factors that will guide the plan for multicore migration. Factors include the starting point (design) of the original source code, as well as migration goals and constraints. Each method has its own strengths.
This article walks through a multicore “decision tree” for selecting the multicore software architecture best suited to the application space under consideration. The decision tree is shown in Figure 1.
Decision 1: Select the programming model
The first decision is whether the programming model should be Symmetric Multiprocessing (SMP) or Asymmetric Multiprocessing (AMP), keeping in mind that the application can be partitioned to support both.
Choose SMP if one operating system will be run, using all of the cores as equal processing resources, and the applications can be parallelized to benefit from SMP systems.
SMP requires analyzing the application to identify opportunities for parallelism and then rewriting the code to exploit them using multithreading. For CPU-intensive code that is difficult to redesign for parallel processing using SMP and multithreading, asymmetric multiprocessing (AMP) could be a good alternative.
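How much SMP can help depends on how much of the serial code base can actually be parallelized. One rough way to estimate this before rewriting anything is Amdahl's law, which is not part of the decision tree itself but is the usual upper bound on multicore speedup. The sketch below assumes a hypothetical profile in which a known fraction of the run time can be spread across cores; the function name and the sample fractions are illustrative only.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup according to Amdahl's law.

    parallel_fraction: share of the serial run time that can be
    spread across cores (0.0 to 1.0).
    cores: number of equal SMP cores available.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical profiles, not measurements from any real system.
    for p in (0.5, 0.9, 0.99):
        for n in (2, 4, 8):
            print(f"p={p:.2f} cores={n}: at most {amdahl_speedup(p, n):.2f}x")
```

If the bound stays low for the core count of the target device, that is one hint that the code is a poor fit for SMP and that AMP deserves a closer look.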
AMP requires no application changes to leverage the benefits of multiple cores. AMP leverages multiple cores by running multiple instances of the OS and application in separate partitions that are dedicated to specific cores, peripherals, and system memory areas.
This can be beneficial when an increased level of security is required on some cores. Devices are assigned to cores and the I/O connections and secure applications are separated by the MMU.
Keep in mind that AMP requires a boot loader that supports AMP, partitions the hardware resources, and assigns an OS and applications to those partitions. The OS must be relocatable, must be able to restrict itself to its assigned memory region, and must operate only on its assigned PCI devices.
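The partitioning rules above amount to a simple invariant: no core, memory range, or device may belong to two partitions. As a minimal sketch, assuming a hypothetical partition description (the `Partition` fields, the addresses, and the device names below are made up for illustration and do not correspond to any particular boot loader format), a layout can be checked like this:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Partition:
    """One AMP partition: an OS instance pinned to its own resources."""
    name: str
    cores: frozenset
    mem_start: int   # first byte of the partition's memory region
    mem_size: int    # region length in bytes
    devices: frozenset = field(default_factory=frozenset)

    @property
    def mem_end(self) -> int:
        return self.mem_start + self.mem_size  # exclusive end address


def find_conflicts(partitions):
    """Return descriptions of every resource claimed by two partitions."""
    conflicts = []
    for i, a in enumerate(partitions):
        for b in partitions[i + 1:]:
            shared_cores = a.cores & b.cores
            if shared_cores:
                conflicts.append(f"{a.name}/{b.name}: cores {sorted(shared_cores)}")
            # Two half-open ranges overlap when each starts before the other ends.
            if a.mem_start < b.mem_end and b.mem_start < a.mem_end:
                conflicts.append(f"{a.name}/{b.name}: memory regions overlap")
            shared_devices = a.devices & b.devices
            if shared_devices:
                conflicts.append(f"{a.name}/{b.name}: devices {sorted(shared_devices)}")
    return conflicts


if __name__ == "__main__":
    layout = [
        Partition("linux", frozenset({0, 1}), 0x0000_0000, 0x4000_0000, frozenset({"eth0"})),
        Partition("rtos", frozenset({2, 3}), 0x4000_0000, 0x1000_0000, frozenset({"eth1"})),
    ]
    print(find_conflicts(layout) or "no conflicts")
```

A check of this kind is cheap to run at build time, before the boot loader ever sees the partition assignments.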
Decision 2: Choose the Operating System Framework
If AMP is chosen, the next step is to determine which Multicore Operating System Framework is required.
Unsupervised AMP runs multiple operating systems on the cores. This has the advantage of being able to keep the system 'up' if one of the OSs crashes. However, establishing the separation of multiple OSs on top of a single processor with shared memory can be problematic.
AMP partitioning can solve the scalability problem of SMP, but only if the algorithm is well parallelizable. This is usually a problem left for the system integrator.
If unsupervised AMP is not the right choice then there are a number of “supervised” AMP options as shown in Figure 1. There are three primary supervised AMP options using virtualization.
Virtualization provides a software management layer that provides software protection between the different partitions as well as core management in order to optimize power efficiency. The CPUs are run as multiple independent partitions running their own OS and application.
For applications designed using multiple components that are independent and CPU bound with little contention to shared resources, this is the way to go.
Legacy software changes are not needed when using virtualization to partition multiple OSs to run within virtual machines (VMs). The Virtual Machine Manager (VMM) manages the assignment and access between the VMs and platform resources. A number of software technologies are available to enable virtualization in embedded systems:
* OS-level virtualization: uses the capabilities of an operating system kernel to enable multiple isolated instances of user-space. Each user space instance has its own private, isolated set of standard operating system resources, and applications run in this isolated “container”.
Linux containers are an example of OS-level virtualization. Containers are used in situations that require application consolidation, sandboxing, or dynamic resource management and all the software domains involved are Linux applications. It is not possible to boot an operating system in a container.
* “Type 1” hypervisors: A “type 1” hypervisor runs directly on system hardware and, unlike a type 2 hypervisor, is not part of a general-purpose operating system. These are generally small, efficient hypervisors that enable the secure partitioning of a system’s resources.
A system’s CPUs, memory, and I/O devices are statically partitioned, with each partition being capable of running a guest operating system. These hypervisors do not use schedulers and simply partition the CPUs spatially.
* “Type 2” hypervisors: Type 2 hypervisors use an operating system as the basis for the virtualization layer. Virtual machines run alongside other OS applications. An example of a type 2 hypervisor is KVM (Kernel-based Virtual Machine).
KVM is an open source software virtualization technology also based on the Linux kernel. KVM enables Linux to act as a virtual machine monitor. KVM is essentially a Linux kernel driver that collaborates with a QEMU user space application to create and run virtual machines.
Virtualization and partitioning in embedded systems enable some benefit to be gained from multicore processors independent of explicit OS support. The ideal situation is to have symmetric multiprocessing and asymmetric multiprocessing, including virtualization, at your disposal. A summary of these primary multicore software configurations is shown in Figure 2.
Decision 3: Determine the Control Plane and Data Plane Model
If we go down the SMP side of the decision tree we must choose whether our SMP configuration will be “data plane” or “control plane” focused.
Data plane configurations are throughput intensive (e.g. packets per second) and usually need a light weight or real time operating system, or another light weight programming model for handling throughput requirements on the data plane side.
For performance-sensitive applications where throughput is important, one approach that is gaining popularity in the multicore space is “user space” application development. This is a framework of Linux user space drivers that allows customers to develop high-performance solutions (Figure 3).
Its high performance stems from doing I/O that bypasses the Linux kernel, so no system calls are needed. Application developers with their own software often like this model. Another advantage is that keeping application software out of the kernel avoids GPL license contamination.
Decisions 4 & 5: Choose the type of OS needed for the Control Plane and Data Plane
Data plane processing, in many cases, does not require an operating system. There is typically no requirement or need to provide services to a user or otherwise restrict access to the underlying hardware through a restrictive set of APIs.
In addition, fast path processing does not require direct intervention by the user as packet processing is done automatically.
Many of the other functions typically handled by an OS are not needed either: process management (there is usually only one task per core), memory management (pre-allocated buffers are used), file management (there is no file system), and device management (low-level access functions and APIs are used). A data plane OS is used to support legacy code or to do some basic scheduling when the need arises. Choose a simple run-to-completion model, or an RTOS if necessary.
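To make the run-to-completion idea concrete, here is a minimal sketch, assuming a hypothetical receive queue and handler. The `BufferPool` class and `run_to_completion` function are illustrative names, not part of any vendor SDK, and a real data plane loop would be written in C against the hardware queues. The point is only the shape: one loop per core, buffers allocated once up front, and each packet processed fully before the next is picked up.

```python
from collections import deque


class BufferPool:
    """Fixed set of buffers allocated once, before packet processing starts."""

    def __init__(self, count: int, size: int):
        self._free = deque(bytearray(size) for _ in range(count))

    def acquire(self):
        # Return None instead of allocating when the pool is exhausted.
        return self._free.popleft() if self._free else None

    def release(self, buf) -> None:
        self._free.append(buf)


def run_to_completion(rx_queue, pool, handle):
    """Process each packet fully before looking at the next one.

    There is no scheduler and no context switch: this loop is the only
    task on its core. Returns (processed, dropped) counts.
    """
    processed = dropped = 0
    while rx_queue:
        packet = rx_queue.popleft()
        buf = pool.acquire()
        if buf is None or len(packet) > len(buf):
            if buf is not None:
                pool.release(buf)
            dropped += 1
            continue
        buf[:len(packet)] = packet             # copy into pre-allocated storage
        handle(memoryview(buf)[:len(packet)])  # handler sees only the payload
        pool.release(buf)                      # buffer is reusable right away
        processed += 1
    return processed, dropped


if __name__ == "__main__":
    rx = deque([b"abc", b"0123456789", b"xy"])
    counts = run_to_completion(rx, BufferPool(count=2, size=8), lambda view: None)
    print(counts)
```

Because nothing is allocated and nothing blocks inside the loop, none of the OS services listed above are exercised on the fast path.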
It is common to have the multicore applications that are allocated to the control plane layer running under the control of an operating system. These applications typically do not have any real time latency or throughput constraints as it relates to packet processing.
The complex processing required on the control plane and the need to reuse existing code bases make interaction with an OS a prerequisite. Linux is a common choice of operating system for control plane processing, as it has added increased support for SMP processing.
Some of the improvements include an adjustment to the way the kernel supports the file systems, a number of routing and device-handling optimizations, removal of the Big Kernel Lock (BKL) which should increase Linux performance on larger SMP-based systems, the ability to throttle input and output, improved power management, and upgrades to the CPU scheduler.
Decision 6: Determine the Type of Acceleration Needed
Multi-core network acceleration is necessary for packet processing. TCP/IP stacks are not designed to work well with multicore systems. Most network packet processing protocols can be broken down into two paths.
* Stateless path, also known as the data path, requires quick and efficient switching/routing of packets. This can be broken down into packet identification (classification) and forwarding.
* Stateful path, also known as the control path, requires more processing and has more inherent latency than the data path. The stateful control path requires 90% of the code and is used 10% of the time. The stateless data path requires just 10% of the code and is used 90% of the time (Figure 4).
Fast Path technology is used to accelerate the 10% of the code in the stateless path to increase packet processing performance.
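A quick calculation shows why optimizing such a small share of the code pays off. The sketch below assumes that, without a fast path, every packet pays the full slow-path cost, and that 90% of packets can take the fast path. The per-packet costs are hypothetical units chosen for illustration, not measurements.

```python
def average_cost(fast_share: float, fast_cost: float, slow_cost: float) -> float:
    """Mean per-packet cost when fast_share of packets take the fast path."""
    return fast_share * fast_cost + (1.0 - fast_share) * slow_cost


def fast_path_gain(fast_share: float, fast_cost: float, slow_cost: float) -> float:
    """Throughput gain relative to sending every packet down the slow path."""
    return slow_cost / average_cost(fast_share, fast_cost, slow_cost)


if __name__ == "__main__":
    # Hypothetical costs: 10 units per packet on the slow path, 1 on the fast path.
    print(f"average cost: {average_cost(0.9, 1.0, 10.0):.2f} units per packet")
    print(f"gain: {fast_path_gain(0.9, 1.0, 10.0):.2f}x")
```

Under these assumed costs the slow path still accounts for roughly half of the remaining average, which is why the share of traffic that qualifies for the fast path matters as much as the fast path's own speed.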
Application Specific Fast Path (ASF) is a software-based solution that stores flows requiring simple, deterministic processing in a cache. ASF recognizes cached flows and processes such packets in a separate, highly optimized context (Figure 5).
ASF accelerates data throughput for networking devices. Implemented in software, it provides an optimized data path that is customized for the platform to achieve higher throughput for specific applications.
ASF leverages functionality provided by hardware, such as hashing, checksum calculation, cryptography, classification, and scheduling, to provide higher throughput. The focus of ASF is to accelerate the processing of many relevant applications. Some examples include:
* IPv4 Forwarding – Create an ASF forwarding cache. When packets match entries in the forwarding cache, the packets get forwarded at the driver level, without going through the Linux networking stack.
* Firewall + NAT – Maintain a 5-tuple based session table. When packets match the session table, the packets can be scanned for vulnerabilities, have address translation performed, and be forwarded.
* IPsec – Maintain a database of associations from flows to SAs (Security Associations). When packets match the database, the packets are encrypted or decrypted and routed appropriately.
* IP Termination – Accelerate the pre-configured locally terminated or originated flows. It can work in conjunction with the PMAL user-space Zero Copy Mechanism.
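The common pattern in these examples is a flow cache keyed on the packet's 5-tuple: a hit is handled immediately, and a miss falls back to the full stack, which may then install an entry for later packets of the same flow. The following is a minimal sketch of that pattern only. `FlowKey`, `FlowCache`, and the string actions are hypothetical names for illustration and are not the ASF API.

```python
from typing import Callable, Dict, NamedTuple, Optional


class FlowKey(NamedTuple):
    """The classic 5-tuple that identifies a flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int


class FlowCache:
    """Fast-path lookup table that is filled in by the slow path."""

    def __init__(self, slow_path: Callable[[FlowKey], Optional[str]]):
        self._slow_path = slow_path
        self._table: Dict[FlowKey, str] = {}
        self.hits = 0
        self.misses = 0

    def forward(self, key: FlowKey) -> Optional[str]:
        action = self._table.get(key)
        if action is not None:
            self.hits += 1              # fast path: no stack traversal
            return action
        self.misses += 1
        action = self._slow_path(key)   # full, stateful processing
        if action is not None:
            self._table[key] = action   # later packets of this flow hit the cache
        return action


if __name__ == "__main__":
    def full_stack(key: FlowKey) -> Optional[str]:
        # Stand-in for the networking stack: forward TCP (6), drop the rest.
        return "eth1" if key.protocol == 6 else None

    cache = FlowCache(full_stack)
    flow = FlowKey("10.0.0.1", "10.0.0.2", 1234, 80, 6)
    for _ in range(5):
        cache.forward(flow)
    print(f"hits={cache.hits} misses={cache.misses}")
```

Only the first packet of a flow pays the slow-path cost in this sketch. A real implementation would also need entry aging and invalidation when routes or sessions change, which are left out here.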
Linux/Linux RT: A multicore fast-path alternative
An alternative fast path acceleration technique is to use Linux as a real-time operating system. A real-time system is one in which the correctness of the computations depends not only upon the logical correctness of the computation but also upon the time at which the result is produced.
If the timing constraints of the system are not met, the system will fail. Many embedded systems now have both real-time and non real-time tasks running on the OS (Figure 6).
It is very difficult to design a system that provides both low latency and high performance. However, real-world systems (such as Media, eNodeB, etc.) have both latency and throughput requirements.
For example, an eNodeB base station has a 1 ms hard real-time processing deadline for each Transmission Time Interval (TTI) as well as a throughput requirement of 100 Mbps downlink (DL) and 50 Mbps uplink (UL). This requires tuning the system for the right balance of latency and performance (Figure 6).
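It helps to translate those eNodeB figures into the amount of data that has to be handled inside each deadline. The sketch below assumes 1 Mbps means 10^6 bits per second and ignores protocol overhead; the function name is illustrative.

```python
def bytes_per_interval(rate_bps: int, interval_us: int) -> int:
    """Bytes that must be moved in one interval to sustain rate_bps."""
    bits = rate_bps * interval_us // 1_000_000
    return bits // 8


if __name__ == "__main__":
    tti_us = 1_000  # 1 ms Transmission Time Interval
    print("downlink bytes per TTI:", bytes_per_interval(100_000_000, tti_us))
    print("uplink bytes per TTI:", bytes_per_interval(50_000_000, tti_us))
```

Under these assumptions the downlink works out to 12,500 bytes and the uplink to 6,250 bytes per TTI, all of which must be processed before the 1 ms deadline expires.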
Linux now provides soft real-time performance through a simple kernel configuration to make the kernel fully preemptable. In the standard Linux kernel, when a user space process makes a call into the kernel using a system call, it cannot be preempted.
This means that if a low-priority process makes a system call, a high-priority process must wait until that call is complete before it can gain access to the CPU.
The Linux configuration option CONFIG_PREEMPT changes this behavior of the kernel and allows Linux processes to be preempted if high-priority work is available, even if the process is in the middle of a system call.
RT-Linux makes the system more preemptive still: it uses more granular spin-locks, makes interrupt handlers kernel threads by default, and allows higher priority tasks, even those in user space, to preempt lower priority tasks at any point in time.
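When tuning a system, it is useful to confirm which preemption model a given kernel was actually built with. The sketch below scans the text of a kernel configuration file for the preemption options. `CONFIG_PREEMPT` is the option named above; the other option names are assumptions that hold for recent mainline kernels but vary across kernel versions and RT patch sets, so adjust the list for the kernel in use.

```python
# Most preemptive first. Names other than CONFIG_PREEMPT are assumptions
# that depend on the kernel version and RT patch set.
PREEMPTION_OPTIONS = (
    "CONFIG_PREEMPT_RT",
    "CONFIG_PREEMPT",
    "CONFIG_PREEMPT_VOLUNTARY",
    "CONFIG_PREEMPT_NONE",
)


def preemption_model(config_text: str):
    """Return the most preemptive option set to 'y', or None if none is set."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skips comments such as '# CONFIG_FOO is not set'
        name, _, value = line.partition("=")
        if value == "y":
            enabled.add(name)
    for option in PREEMPTION_OPTIONS:
        if option in enabled:
            return option
    return None


if __name__ == "__main__":
    sample = (
        "# CONFIG_PREEMPT_NONE is not set\n"
        "# CONFIG_PREEMPT_VOLUNTARY is not set\n"
        "CONFIG_PREEMPT=y\n"
    )
    print(preemption_model(sample))
```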
After working our way through the multicore decision tree, we end up at a set of leaf nodes that describe the reference software architecture needed to implement the high-level system software requirements (the path through the decision tree) for the multicore system. Some of these examples are shown in Figure 7.
Figure 7a: Multicore Software Reference Architecture requiring an open source virtualization solution (KVM), a Linux OS on the control plane with high performance user space I/O requirements and a fast path capability for required data flows.
Figure 7b: Multicore Software Reference Architecture requiring a lightweight embedded virtualization solution, two Linux OS partitions on the control plane with high performance user space I/O requirements and a fast path capability for required data flows.
Figure 7c: Multicore Software Reference Architecture requiring a Linux SMP solution on the control plane with a light weight data plane environment and a fast path capability for required data flows.
We’ve discussed a number of multicore software architecture components as we assembled our multicore reference architectures. In fact, we can view these architectural components as a set of building blocks, or Lego bricks, that can be put together in different ways to create the multicore software reference architecture (Figure 8).
This approach can produce a scalable solution customized to the needs of the developer for a variety of multicore software solutions.
Rob Oshana is director of software R&D, Networking Systems Group, Freescale Semiconductor.