The goal of engineering is to do things as efficiently as possible, and this is especially true in embedded systems. As engineers, we are tasked with making the impossible possible: doing things with less power, in less time, at higher bandwidth, or with greater security. These are the driving forces that keep our industry moving forward.
Multicore, an architecture that is disruptive yet a logical extension of single-core systems, is no different. There are key benefits to moving to multicore, where we can use several cores to parallelize the application. But there are cases where multicore, if done improperly, can actually slow down processing and make the move worthless.
Here we will get into the architectural choices, from the hardware capabilities to the software choices, including multiple OSes, and how to make them efficient for your design. We will look at the hardware and software mechanisms that make the challenges easier to understand. We will look at virtualization technologies and the OS choices needed for both Symmetric and Asymmetric operating environments. Lastly, we will get into tuning a multicore system so we can assess how well it performs under load.
What’s in the hardware?
The SoC (System on a Chip) has been growing in complexity because of two main factors. First, the cost of the logic needed to put more transistors in an SoC has decreased. SoCs balance the needs of size, weight, power, and speed to become the modern workhorse in embedded systems. It is no longer customary to integrate discrete components for the most common interfaces.
SoCs have absorbed many different peripherals. Additionally, modern SoCs overload their pins, providing many more peripherals than there are pins, so the system must initialize all the interconnects to bring the right peripherals out to the right pins and route the interrupts to the right place. By necessity, this makes the software more complex.
There are two main trends in multicore systems: Symmetric and Asymmetric Multiprocessing, or SMP and AMP for short. SMP has the advantage that, with all the cores the same, code written to run on one core will run on all the SMP-capable cores.
AMP, a much older technology that is still relevant today, allows for dedicated processors, such as a Digital Signal Processor (DSP) core for signal processing alongside a general-purpose processor for the more nominal features of an SoC. Both approaches are valid and coexist quite nicely. We will look at how to take advantage of both, even on SMP hardware.
Shown in Figure 1 below is an SoC with a quad-core Cortex-A9. As you can see, the complexity of the SoC is more than just the multiple CPUs. There are many IP blocks for dedicated purposes such as connectivity, system control, graphics acceleration, power management, and security.
We will focus on how to use the CPUs in the multicore/multiOS paradigm. With 4 cores at our disposal, we can think of using them as a single OS domain. One scheduler will prioritize the tasks and interrupts onto all 4 cores.
The OS takes care of the details, adding just a bit of complexity, in the form of a spinlock, to handle multiple threads being scheduled simultaneously. The OS itself starts the system on core 0 and brings the system up.
Once the system is fully operational, the scheduler engages each core to start processing. Each core runs the scheduler and picks out a thread to run. The OS itself can execute on all cores, scheduling from the same free thread list, with a spinlock added to protect it.
A spinlock, like a semaphore in the single-CPU scenario, protects a critical region so that multiple tasks cannot access the same block of data at the same time. The difference is in how waiters behave. A semaphore grants one task access to the critical region and blocks the other threads that try to enter it, whereas a spinlock keeps the waiting thread executing on its CPU core in a tight loop until the resource is free.
For the scheduler, the spinlock protects the ready list of threads. If all 4 cores try to access the list at the same time, one is granted access and the others spin in a very tight loop, checking for the resource to become free. When the holder releases the spinlock, the next waiter acquires it while the others keep spinning until each has completed its work. This works extremely well for very short critical regions. Application code uses semaphores for synchronization, and deep underneath the semaphore object is a spinlock that keeps things synchronized across all CPUs assigned to the OS domain.
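The spinlock-protected ready list described above can be sketched in C11 as follows. This is only an illustrative sketch, not how any particular RTOS implements its scheduler: the names `spinlock_t`, `tcb_t`, `ready_push`, and `ready_pop` are hypothetical, and the ready list is a plain last-in, first-out list rather than a priority-ordered one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical spinlock built on a C11 atomic_flag. */
typedef struct {
    atomic_flag locked;
} spinlock_t;

void spin_lock(spinlock_t *l)
{
    /* test_and_set returns the previous value, so keep looping
     * for as long as another core already holds the lock. */
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire)) {
        /* Tight loop: the waiting core does no other work here. */
    }
}

void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Hypothetical thread control block and ready list (assumed LIFO). */
typedef struct tcb {
    int id;
    struct tcb *next;
} tcb_t;

spinlock_t ready_lock = { ATOMIC_FLAG_INIT };
tcb_t *ready_head = NULL;

/* Make a thread runnable. The critical region is only two stores. */
void ready_push(tcb_t *t)
{
    assert(t != NULL);
    spin_lock(&ready_lock);
    t->next = ready_head;
    ready_head = t;
    spin_unlock(&ready_lock);
}

/* Called by each core's scheduler to pick a thread; NULL if none. */
tcb_t *ready_pop(void)
{
    spin_lock(&ready_lock);
    tcb_t *t = ready_head;
    if (t != NULL) {
        ready_head = t->next;
    }
    spin_unlock(&ready_lock);
    return t;
}
```

The critical regions here are only a few instructions long, which is the case where spinning is cheaper than suspending and resuming a thread.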
Taking code from a single-core environment to SMP multicore means that several threads and/or interrupts can and will execute at the same time. Truly simultaneous execution never happens on a single-core system, but on multicore it is a certainty that potential race conditions exist, and they must be taken care of. But where are the race conditions?
One way to assess the race conditions, even on a single core, is to inspect the code for regions of code and data objects that should be protected but are not. In a single-core environment with a preemptive, priority-based scheduler, the highest-priority ready task runs. Interrupts may fire, but only when the system is not in a critical region.
The system has a priority-based hierarchy: interrupts are handled first, from highest to lowest priority, followed by threads from highest priority to lowest. This gives the system an inherent pecking order. One way to test for and potentially uncover race conditions is to invert the pecking order as much as feasible and check that the application still functions.
While it may not be as efficient, the system should still function. Another method is to time-slice all the threads at the same level, with a very short slice. This is no guarantee that you will find them, but it may cause the race condition to occur on a single core system. Inspection is still the place to start.
If inspection identifies global structures, meaning anything modified outside a task’s local stack, that will be accessed by more than one task, then placing semaphores around those structures is a must. It is also important to use different semaphores to protect different data structures instead of using a single semaphore for multiple data structures.
While a single shared semaphore will function, it will unnecessarily gate the system and block threads that do not depend on the same structure. This is a common oversight in complex systems where there is no clear understanding of who accesses what and where. Refactoring the code may be necessary to streamline for performance. This will enhance the system’s ability to fully utilize all the cores at its disposal.
AMP scheduling
An SoC that consists of different types of CPUs, such as a DSP (Digital Signal Processor) and a general-purpose processor, necessitates different OS domains, because the CPUs will not run the same code. But it is also possible to divide SMP-capable CPUs into different OS domains. These domains keep the scheduling entities separate and eliminate the problems inherent in moving to SMP multicore. In each OS domain, the system is only responsible for scheduling across its own set of CPUs. If that set is a single CPU, you have eliminated the multicore race conditions.
There are advantages to using multiple OS domains across a set of SMP-capable CPU cores. Code that is not ready for SMP will run safely on a single core. Code that needs to be secure or has safety-critical aspects can exist in its own separate OS domain, minimizing the potential for corruption. This has the added advantage of minimizing the code that needs to be certified.
Multiple OS domains don’t exist in a vacuum. They must interact. Depending on how much data they exchange and how closely they are coupled, they can interact through shared memory using an IPC (Inter-Process Communication) model. This method shares data structures across operating system domains in order to pass data and commands between them. While the shared structures need to be protected, using a common IPC mechanism contains and constrains the system, making it dependable and predictable. It does add some latency, as data structures may need to be copied into and out of shared memory regions.
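As a rough illustration of the shared-memory IPC idea, the sketch below shows a single-producer, single-consumer message ring in C11. It is a sketch under assumptions, not a specific vendor IPC API: `ipc_ring_t`, `ipc_send`, `ipc_recv`, and the slot sizes are hypothetical names and values, and the code assumes the ring lives in memory that both OS domains can see coherently. A real system would also have to place the ring at an agreed address and configure caching for that region.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define IPC_SLOTS     8u   /* hypothetical; must be a power of two */
#define IPC_MSG_BYTES 32u  /* hypothetical fixed message size */

/* One ring per direction: one domain only sends, the other only receives. */
typedef struct {
    _Atomic uint32_t head;  /* advanced by the sending domain only */
    _Atomic uint32_t tail;  /* advanced by the receiving domain only */
    uint8_t slots[IPC_SLOTS][IPC_MSG_BYTES];
} ipc_ring_t;

/* Copy a message into shared memory; returns false if the ring is full. */
bool ipc_send(ipc_ring_t *r, const void *msg, size_t len)
{
    assert(len <= IPC_MSG_BYTES);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == IPC_SLOTS) {
        return false;
    }
    memcpy(r->slots[head % IPC_SLOTS], msg, len);
    /* Release: the payload must be visible before the new head is. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}

/* Copy a message out of shared memory; returns false if the ring is empty. */
bool ipc_recv(ipc_ring_t *r, void *msg, size_t len)
{
    assert(len <= IPC_MSG_BYTES);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail) {
        return false;
    }
    memcpy(msg, r->slots[tail % IPC_SLOTS], len);
    /* Release: finish reading the slot before handing it back. */
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}
```

The two `memcpy` calls are the copy-in and copy-out latency mentioned above. Because each index has exactly one writer, this particular arrangement needs no lock between the domains.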
As shown in Figure 2 below, combining the SMP and AMP styles is a compelling way to split the load, taking advantage of all the cores while optimizing for data access and separation. The cores can be divided among the applications in ways that meet the needs of the system, keeping some cores together to utilize SMP and breaking off other cores for AMP.
Even if the entire system can be made SMP-safe, cache coherency issues lead to diminishing returns as the number of CPUs grows. Take an extreme example of 64 cores all dependent on the same OS domain and underlying cache. If each thread of execution is accessing similar data structures, and the threads are interdependent, then performance can degrade instead of increasing as expected.
Dividing the system into functional blocks allows the application to be partitioned by OS domain and yet spread across the available cores in that domain. The aim is to minimize interdependencies between OS domains while allowing information to flow as necessary.
A hypervisor can be deployed to enforce the boundaries between the OS domains, ensuring that one OS domain cannot negatively impact another. This adds a level of complexity, but where there is hardware assist, the impact is minimal and each OS can be unaware that it is on shared hardware. If an OS attempts to access memory that it does not own, the hypervisor will block the access, leaving the offending OS with an exception and the other OS domain unharmed.
If the SoC architecture does not have the necessary hypervisor support, this can still be done, but the OS itself must be “para-virtualized” in order to gain the necessary access between the hardware and the OS.
Example: Subsystem integration
Let us assume that we have 4 functional subsystems in the previous version of the product, each with its own CPU(s) managing that particular subsystem. When combining all the processing onto a single SoC with 4 cores, we can consider both SMP and AMP for the solution.
Option 1: Each subsystem gets its own OS domain and each domain contains a single CPU. The system will employ an IPC mechanism to communicate between domains. In this scenario the code will run as it did in the subsystem before, but only one SoC is necessary to process all the subsystems. Since they are all single-core domains, the code does not need to be made multicore safe.
Option 2: Combine all the functionality onto a single OS domain that spans all the available cores. In this case, we can utilize all the cores for the highest priorities across all the subsystems. The code will need to be made multicore safe, but it will use the cores more efficiently than dedicating them as in Option 1. This assumes that the system is not limited by cache access.
Option 3: Combine 3 of the subsystems into one OS domain for efficiency, leaving a single subsystem to be implemented on a single-core OS domain. This allows the system to balance the needs of the 3 combined subsystems while keeping a dedicated core for the one domain where the tasks are mission critical, cannot be split up, or need to be certified.
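One way to make such a partitioning explicit is a small table of OS domains with a core bitmask each, plus a sanity check that no core is claimed twice and none is left unassigned. The sketch below is purely illustrative: `os_domain_t`, `partition_is_valid`, and the domain names are hypothetical, and the split of cores 0 to 2 for the SMP domain and core 3 for the dedicated domain is an assumption about how Option 3 might be laid out on the 4-core SoC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one OS domain. */
typedef struct {
    const char *name;
    uint8_t core_mask;  /* bit n set means core n belongs to this domain */
} os_domain_t;

/* Option 3 on a 4-core SoC (assumed layout). */
const os_domain_t option3[] = {
    { "smp_domain_3_subsystems",      0x07u },  /* cores 0, 1, 2 */
    { "dedicated_domain_1_subsystem", 0x08u },  /* core 3 */
};

/* True if every core in all_cores belongs to exactly one domain. */
bool partition_is_valid(const os_domain_t *d, size_t n, uint8_t all_cores)
{
    assert(d != NULL);
    uint8_t used = 0;
    for (size_t i = 0; i < n; i++) {
        if (d[i].core_mask == 0) {
            return false;  /* a domain with no core cannot run */
        }
        if ((used & d[i].core_mask) != 0) {
            return false;  /* two domains claim the same core */
        }
        used = (uint8_t)(used | d[i].core_mask);
    }
    return used == all_cores;
}
```

Under the same assumptions, Option 1 would be four entries with one bit each, and Option 2 a single entry with mask `0x0F`.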
While I have detailed three options, it is really the system architecture and the planned future implementations of the system that will suggest a path forward. There may be some transitional elements in the architecture as subsystems are made multicore safe.
We identified the different cases of SMP, AMP, and the hybrid approach available in modern multicore SoCs. Systems can utilize SMP if they follow the rules for identifying any race conditions that may occur and then use operating system constructs such as semaphores and spinlocks to eradicate them.
If you cannot combine the subsystem domains onto several cores, then AMP is still a viable option, even on SMP cores. The ability to mix and match OS domains gives the system dexterity and eases the migration from AMP in distributed systems to an appropriate system architecture.
Adding a hypervisor has benefits and may be almost invisible to the OS itself. There may be some reasons to para-virtualize the OS in order to make it as efficient as possible. But using a hypervisor can enhance the system’s safety and security.
In the end the tools matter, because finding bottlenecks and race conditions is not trivial. Printf’s don’t always cut it, and using tracing tools may be the only way you can detect problems and inefficiencies.
Stephen Olsen is currently a Product Line Manager for VxWorks at Wind River. Prior to Wind River, Stephen was involved with Mentor Graphics as a consultant, system architect, and RTOS engineering manager. He co-chaired VSIA’s Hardware dependent Software (HdS) design working group, worked on the MRAPI specification for the Multi-core Association and authored several papers on system design, USB, multicore/multi-OS design and power management. He was awarded a patent on debugging hardware accelerated operating systems.
This paper was part of a class (ESC-203) that he taught at the Spring Embedded Systems Conference at DESIGN West.