Imagine: you are the chief software architect of a new embedded design, a decade from now. You are contemplating the microprocessor selected by the hardware guys and bean counters and wondering how in the world you're going to make best use of its firepower to build the most ambitious product your company has ever conceived.
This SoC has dozens of general-purpose processing cores; hundreds of Gbps of memory bandwidth across multiple memory controllers; 64-bit addressing; multiple high-speed packet interfaces capable of maxing out numerous 10 Gigabit Ethernet interfaces simultaneously; a RAID accelerator; a packet deduplicator; a compression engine; three levels of cache; and a dizzying array of peripherals (USB, UART, SD card, and more).
In addition, it often has a regular expression pattern-matching engine; a packet scheduling and routing infrastructure; hypervisor acceleration; a sophisticated security engine with support for symmetric, public key, and hashing algorithms; and an amazing suite of on-chip debugging features. The chip reference documentation is many thousands of pages long.
Frightened yet? Well, I'm about to make it worse. Remember the part about being 10 years away? Just kidding. I just described features found on today's high-end multicore network processors from Cavium (OCTEON II), Freescale (QorIQ), LSI Corp (Axxia), and NetLogic (XLP). A quick look at a block diagram for one of these, the Freescale P4080, is enough to make you swoon (Figure 1, below).
Figure 1. Freescale QorIQ P4080 multicore processor
This is just the high-level view. Each of the subsystems is extremely complex. Again using the Freescale example, the P4080 has an awesome complement of debugging features (Figure 2, below), if you can solve the halting problem to use them all: on-chip and off-chip instruction and data trace, performance counters, inter-core cross triggers, user-programmable performance- and engine-monitoring events that feed the trace logic, and more.
Figure 2. Freescale QorIQ On-Chip Debugging Architecture
These bad boys don't program themselves. The only hope of maximizing the potential of these processing behemoths is some ridiculous software smarts. We're not going to solve the entire problem in this article, but we will talk about the software layer that controls the platform: the operating systems and hypervisors upon which everything else rides. At the very least, the chief software architect needs to understand the major options and some of the key tradeoffs between them.
In all of the following options, we assume that the design includes some non-real-time software (such as management and health monitoring, control plane routing protocols like OSPF, and human-machine interfaces) as well as real-time processing (such as high-speed data processing and low-latency device drivers).
Option 1: Linux SMP. Linux is a popular choice for the control plane processing in networking equipment. However, Linux is not a real-time operating system and hence cannot make hard guarantees about worst-case response times. Furthermore, its millions of lines of code are simply overkill and overweight for data processing applications that beg for minimalist run-times coupled with finely tuned use of all those fancy hardware accelerators.
Nevertheless, developers can attempt to shoehorn Linux into handling both control and data plane processing by using SMP core binding to tie down data processing threads to cores. The Linux SMP architecture is shown in Figure 3 below.
Figure 3. Linux SMP
Option 2: RTOS SMP. Because of their real-time, low-latency capabilities, real-time operating systems are a popular choice for networking equipment and many other multicore-powered embedded systems. While a non-real-time OS can't perform real-time tasks, a real-time OS can obviously perform non-real-time tasks. Therefore, an SMP RTOS such as Green Hills Software's INTEGRITY or Wind River's VxWorks (Figure 4, below) is a natural replacement for the Linux of Option 1:
Figure 4. RTOS SMP
Option 3: Lightweight Executive. SMP starts to become a performance challenge as we move into the many-core arena. OS services require protective kernel locking, which can increase latency when many cores try to access those services concurrently.
While some architects will look at SMP as a simple and elegant solution to the problem of managing lots of cores, other architects will want to divide the cores amongst distinct software teams, each focusing on their respective areas of the project.
Allowing each team to run its own independent operating system tailored to its particular workload can actually improve project development efficiency. Finally, as mentioned earlier, a lightweight run-time environment may be preferable to a full-blown operating system for data processing tasks.
All of the aforementioned processor vendors provide a lightweight executive (LWE) for this purpose. In some cases, the LWE is nothing more than some initialization code, device drivers, and a superloop without threads. The processor vendors also provide highly optimized libraries, already tuned for the LWEs, to manage the various accelerators.
When LWEs are employed, the control plane can be handled either by Linux or an RTOS, resulting in an asymmetric multiprocessing (AMP) environment. If the control plane OS is running symmetric multiprocessing (SMP) over a subset of the cores, you actually have a combination of SMP and AMP (Figure 5, below).
Figure 5. Partitioning of control OS and LWE
In place of a chip vendor-supplied LWE, a simple real-time microkernel, such as Green Hills Software’s µ-velOSity or Express Logic’s ThreadX, provides similar functionality from an independent software vendor.
Option 4: Linux and RTOS. It used to be that embedded designers needed to choose between Linux and an RTOS. However, multicore processors now commonly host both. Similar to Option 3, Linux runs in SMP mode on the control plane while the RTOS manages the real-time workloads, in either an SMP or AMP fashion (Figure 6, below):
Figure 6. Linux and RTOS hybrid
An important advantage Option 2 has over Options 3 and 4 is the ability of a single high-reliability operating system to completely control the hardware. With Options 3 and 4, the control plane OS has no direct control over the real-time cores, and vice versa. Thus, control and data plane workloads could interfere with each other.
For example, an errant DMA transfer programmed by the LWE could corrupt the control plane OS. Another example is cryptography and key management: malware in Linux could access critical algorithms and parameters controlled by a security subsystem running on other cores. In other words, the system has a division of resources but lacks strict isolation and access control over those resources.
The good news is that hardware hypervisor support is being integrated into these multicore processors. For example, the Freescale P4080 supports the hypervisor mode extensions in Power Architecture ISA 2.06, enabling full virtualization of guest operating systems with minimal overhead.
Virtualization software and hardware can transform Option 4 into a strictly partitioned system that still retains the flexibility of running different operating systems for control and data plane workloads. In fact, some real-time operating systems have virtualization built in, obviating the need for a separate hypervisor layer (Figure 7, below).
Figure 7. Linux and RTOS partitioning using virtualization
In the above diagram, the multivisor runs the high-performance, low-latency real-time threads directly while executing Linux SMP and its control plane software in a virtual machine.
A hypervisor-managed system has other important advantages over the traditional AMP division of labor. Virtualization provides the flexibility of changing the allocation of control and data plane operating systems to cores.
For example, in a normal mode of operation, the architect may only want to use a single core for control activities and all other cores for data processing. However, the system can be placed into a maintenance mode in which Linux is allowed to use four cores (SMP) while the data processing is temporarily throttled back. The virtualization layer can handle the reallocation of cores seamlessly under the hood, something that a static AMP system cannot support.
Interprocess communication (IPC) is another key advantage of hypervisors. Inevitably, the control and real-time subsystems will need to communicate. For example, control plane routing changes will need to be communicated to the hardware-accelerated forwarding engines in the data plane.
With an AMP system, a custom IPC mechanism is typically implemented with new backplane device drivers for both the control and data plane operating systems. In contrast, a hypervisor provides built-in IPC mechanisms that allow virtualized guests to use its preexisting interfaces (such as an Ethernet driver) without any custom changes.
Linux thinks it is sending data over a NIC, but the NIC is virtualized, and its data transfers are converted under the hood into the backplane messaging built into the hypervisor.
If you want to deal with a single run-time software platform and fewer vendors, a single SMP operating system (Option 1 or 2) may be a good choice. If your system does not have hard real-time or high-assurance security-critical requirements, and future product generations will never add them, then Linux SMP may be sufficient for the job. If the system has, or may someday have, those kinds of stringent performance and reliability needs, then an RTOS is a safe, future-proof choice.
If the design requires the software ecosystem of Linux and the capabilities of an RTOS (or LWE), a hybrid AMP model (Option 3 or 4) may be a good choice. However, if you prefer this hybrid architecture and the processor provides a hypervisor mode, then it would be silly not to take advantage of it with virtualization for improved robustness, core management flexibility, and overall ease of use.
Many other factors may impact your architectural decision. The quality of the software development tools integrated with the applications, operating systems, hypervisors, and microprocessor is often overlooked by architects.
The multicore hardware debugging features (such as the aforementioned features of the P4080) must be harnessed by system analysis tools that convert the data into useful information for finding bugs, identifying performance bottlenecks, and visualizing sophisticated system behavior. These tools may well make the difference between getting to market on time and drowning in a sea of complexity.
Stability and trustworthiness of systems software vendors is another key consideration. In today’s era of supplier consolidation and economic strife, it is increasingly difficult to make a safe long-term decision about vendors upon which you must bet the success of current and future projects.
In addition to the obvious financial and corporate stability concerns, architects must consider the availability of integrated and third party middleware stacks and drivers and the level of working relationship between the processor vendor and the software supplier.
Finally, a vendor’s adherence to open API standards and ability to provide long-term support for all leading multicore processor families enables an architect to retain flexibility of choice over time.
Modern multicore SoCs present software architects with incredible power that must be properly managed and controlled. The first step is to understand the major platform architecture choices and tradeoffs. Good decisions here can lead to successful designs for many product generations to come.
David Kleidermacher is chief technology officer at Green Hills Software where he has designed compilers, software development environments, and real-time operating systems. He frequently publishes articles and presents papers on topics relating to embedded systems.