Making the Most of Multi-Core Processors: Part 1

The speed race is over. Faced with the growing energy consumption and excessive operating temperatures caused by high CPU clock speeds, microprocessor vendors have adopted a new approach to boosting system performance: integrating multiple independent processor cores on a single chip.
Intel, for example, has proclaimed that all of its new CPUs will use multi-core architectures and has recently produced a roadmap that details processors based on two, four, and eight cores.
Until now, multi-core processors for the desktop and server markets have garnered the lion’s share of media attention. But multi-core is also taking root in the embedded industry, with the introduction of processors such as the dual-core Freescale MPC8641D, the dual-core Broadcom BCM1255, the quad-core Broadcom BCM1455, and the dual-core PMC-Sierra RM9000x2.
Multi-core processors like these are poised to bring new levels of performance and scalability to networking equipment, control systems, videogame platforms, and a host of other embedded applications.
Compared to conventional uniprocessor chips, multi-core processors deliver significantly greater compute power through concurrency, offer greater system density, and run at lower clock speeds, thereby reducing thermal dissipation and power consumption issues.
There is one problem, however: most software designers and engineers have little or no expertise in the programming models and techniques used for multi-core chips. Instead of relying on increasing clock speeds to achieve greater performance, they must now learn how to achieve the highest possible utilization of every available core.
Take, for instance, the challenge of managing shared resources in a multi-core chip. In most cases, the cores have separate level 1 caches, but share a level 2 cache, memory subsystem, interrupt subsystem, and peripherals (see Figure 1, below). As a result, the system designer may need to give each core exclusive access to certain resources (for instance, specific peripheral devices) and ensure that applications running on one core don’t access resources dedicated to another core.
|Figure 1 — Anatomy of a typical multi-core processor.|
The presence of multiple cores can also introduce greater design complexity. For instance, to cooperate with one another, applications running on different cores may require efficient interprocess communication (IPC) mechanisms, a shared-memory data infrastructure, and appropriate synchronization primitives to protect shared resources.
Code migration is also an issue. Most embedded OEMs have already invested in a massive code base created primarily for uniprocessor architectures. These companies need OS technology and development tools that help their code achieve the greatest possible resource utilization on multi-core hardware, with minimal porting effort.
The OS chosen for a multi-core design can significantly reduce, or increase, the effort required to address these challenges. It all depends on how the OS supports the various multiprocessing modes that a multi-core chip may offer. These modes come in three basic flavors: asymmetric, symmetric and bound.
Asymmetric multiprocessing, or AMP, provides an execution environment similar to that of conventional uniprocessor systems, which most developers already know and understand. Consequently, it offers a relatively straightforward path for porting legacy code. It also provides a direct mechanism for controlling how each CPU core is used and, in most cases, allows developers to work with standard debugging tools and techniques.
AMP can be either homogeneous, where each core runs the same type and version of OS, or heterogeneous, where each core runs either a different OS or a different version of the same OS. In a homogeneous environment, developers can make best use of the multiple cores by choosing an OS, such as the QNX Neutrino® RTOS, that natively supports a distributed programming model.
Properly implemented, the model will allow applications running on one core to communicate transparently with applications and system services (device drivers, protocol stacks, etc.) on other cores, but without the high CPU utilization imposed by traditional forms of interprocessor communication.
A heterogeneous environment has somewhat different requirements. In this case, the developer must either implement a proprietary communications scheme or choose two OSs that share a common infrastructure (likely Internet Protocol-based) for interprocessor communications. To help avoid resource conflicts, the OSs should also provide standardized mechanisms for accessing shared hardware components.
In virtually all cases, OS support for a lean and easy-to-use communications protocol will greatly enhance core-to-core operation. In particular, an OS built with the distributed programming paradigm in mind can take greater advantage of the parallelism provided by the multiple cores.
Inter-core IPC. Because AMP uses multiple OS instantiations, it typically requires a complete networking infrastructure to support communications between applications running on different cores. To implement the lowest level of buffer transfer, an AMP system may use I/O peripherals (for instance, Ethernet ports) assigned to each processor or use a shared memory/interrupt-based scheme, depending upon the system’s hardware capabilities. For heterogeneous operation, an AMP system will require standard protocols such as TCP/IP or possibly a clustering protocol like the Transparent Inter-Process Communication (TIPC) protocol.
In addition to standard protocols, a homogeneous AMP implementation may use proprietary or OS-specific protocols. For instance, by using the QNX Transparent Distributed Processing (TDP) protocol, developers can extend the native message-passing interface of the QNX Neutrino RTOS to implement remote communications over any lower-level interconnect.
With this approach, local and remote communications become one and the same: an application can use the same code to communicate with another application, regardless of whether the other application is on the local CPU core or on another core. Likewise, an application can access a device driver on a non-local processor as if that driver were running locally — this location transparency provides applications with easy access to hardware resources “owned” by other cores or processors.
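The client-side sketch below illustrates the idea, assuming the QNX Neutrino name_open()/MsgSend() interface; the service name ("sensor_service"), the flags value, and the message layout are placeholders for illustration rather than details from the original paper.

/* Client-side sketch: send a request to a named service and wait for the
 * reply. With transparent distributed processing, the same call works
 * whether the service runs on the local core or on a remote one.
 * The service name and message contents are hypothetical. */
#include <stdio.h>
#include <sys/dispatch.h>   /* name_open(), name_close() */
#include <sys/neutrino.h>   /* MsgSend() */

int main(void)
{
    char request[] = "read";
    char reply[64];

    /* Look up the service by name. A flags value of 0 is shown for
     * simplicity; locating a service on another node or core may
     * require a global-lookup flag. */
    int coid = name_open("sensor_service", 0);
    if (coid == -1) {
        perror("name_open");
        return 1;
    }

    /* Synchronous send/receive/reply message pass. */
    if (MsgSend(coid, request, sizeof(request), reply, sizeof(reply)) == -1) {
        perror("MsgSend");
        name_close(coid);
        return 1;
    }

    printf("reply: %s\n", reply);
    name_close(coid);
    return 0;
}

Whether the named service is resolved locally or across the TDP transport, the client code stays the same.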
Allocating resources. With AMP, the application designer has the power to decide how the shared hardware resources used by applications are divided up between the cores. Normally, this resource allocation occurs statically during boot time and includes physical memory allocation, peripheral usage, and interrupt handling. While the system could allocate the resources dynamically, doing so entails complex coordination between the cores.
In an AMP system, a process will always be scheduled to run on the same core, even when that core is experiencing maximum CPU usage and other cores are running idle. As a result, one or more cores can end up being over- or underutilized. To address the problem, the system could allow applications to migrate dynamically from one core to another.
Doing so, however, can involve complex checkpointing of state information or a possible service interruption as the application is stopped on one core and restarted on another. Also, such migration is difficult, if not impossible, if the cores run different OSs.
Allocating resources in a multi-core design can be difficult, especially when multiple software components are unaware of how other components are employing those resources.
Symmetric multiprocessing (SMP) addresses the issue by running only one copy of an OS on all of the chip’s cores. Because the OS has insight into all system elements at all times, it can allocate resources on the multiple cores with little or no input from the application designer. Moreover, the OS can provide built-in standardized primitives, such as pthread_mutex_lock, pthread_mutex_unlock, pthread_spin_lock, and pthread_spin_unlock, that let multiple applications share these resources safely and easily.
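As a minimal, OS-agnostic sketch of how these primitives are used, the following code lets two threads, which the SMP scheduler may place on different cores, update a shared counter safely:

/* Minimal sketch: two threads, possibly running on different cores under
 * SMP, update a shared counter protected by a POSIX mutex. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* serialize access across cores */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* always 200000 */
    return 0;
}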
By running only one copy of the OS, SMP can dynamically allocate resources to specific applications rather than to CPU cores, thereby enabling greater utilization of available hardware. It also lets system tracing tools gather operating statistics and application interactions for the multi-core chip as a whole, giving developers valuable insight into how to optimize and debug applications.
For instance, the system profiler in the Momentics development suite can track thread migration from one core to another, as well as OS primitive usage, scheduling events, core-to-core messaging, and other events, all with high-resolution timestamping. Application synchronization also becomes much easier since developers use standard OS primitives rather than complex IPC mechanisms.
Properly implemented, an SMP-enabled OS offers these benefits without forcing the developer to use specialized APIs or programming languages. In fact, developers have successfully used the POSIX standard (specifically, the pthreads API) for many years in high-end SMP environments.
Not only is POSIX widely used and well-documented, but developers can write POSIX-based code that will run on both uniprocessor and multi-core chips. In fact, some OSs allow the same binaries to run on both processor types.
A well-designed SMP OS allows the threads of execution within an application to run concurrently on any core. This concurrency makes the entire compute power of the chip available to applications at all times. If the OS provides appropriate preemption and thread-prioritization capabilities, it can also help the application designer ensure that CPU cycles go to the application that needs them the most.
Inter-core IPC. Because a single OS controls every core in an SMP system, all IPC is considered “local.” This approach can improve performance immensely, as the system no longer needs a complex networking protocol to provide communications between applications running on different cores.
Communications and synchronization can take the form of simple POSIX primitives (such as semaphores) or a native local transport capability, both of which have much higher performance than networking protocols. This approach has the added benefit of a reduced footprint if a networking protocol isn’t required for other reasons.
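A simple sketch of such lightweight synchronization uses a POSIX semaphore to signal between two threads that may be running on different cores; the data item is illustrative only:

/* Minimal sketch: one thread produces an item, another consumes it; a
 * POSIX semaphore provides the signalling between them under SMP. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static int shared_item;
static sem_t item_ready;

static void *producer(void *arg)
{
    shared_item = 42;          /* write the shared data  */
    sem_post(&item_ready);     /* signal the consumer    */
    return NULL;
}

static void *consumer(void *arg)
{
    sem_wait(&item_ready);     /* block until the producer posts */
    printf("consumed %d\n", shared_item);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    sem_init(&item_ready, 0, 0);   /* thread-shared, initial count 0 */
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    sem_destroy(&item_ready);
    return 0;
}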
Bound Multiprocessing (BMP)
BMP, a new approach introduced by QNX Software Systems, offers the benefits of SMP’s transparent resource management, but gives designers the ability to lock software tasks to specific cores.
As with SMP, a single copy of the OS maintains an overall view of all system resources, allowing them to be dynamically allocated and shared among applications. During application initialization, however, a setting determined by the system designer forces all of an application’s threads to execute only on a specified core.
Compared to full, floating SMP operation, this approach offers several advantages. It eliminates the cache thrashing that can reduce performance in an SMP system by allowing applications that share the same data set to run exclusively on the same core.
It also offers simpler application debugging than SMP since all execution threads within an application run on a single core. And it helps legacy applications written for uniprocessor environments to run correctly, again by letting them run on a single core.
With BMP, an application locked to one core can’t leverage other cores, even if they’re idle. That said, the OS vendor can offer tools that analyze resource utilization (including CPU usage) on a per-application basis and that suggest the best way to distribute applications across the cores for maximum performance.
If the OS also provides hooks to change the designated CPU core dynamically, then the user gains the freedom to switch an application from one core to another without having to worry about checkpointing or stopping and restarting the application.
SMP has long had the capability of tying a particular thread to a single processor — this is known as thread affinity. BMP extends this thread affinity to the process level by incorporating the concept of runmask inheritance.
The runmask is a thread-level entity that determines which processors a thread can run on. In normal execution, threads are created with a runmask that allows them to execute on all processors. In BMP mode, however, all threads are created to inherit the runmask from the parent thread. This has the effect of “binding” all of the process’s resources and threads to the same processing core (or set of processing cores), giving the designer complete control over how a particular core will be used by the overall application.
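The sketch below shows roughly what binding looks like at the code level. It assumes the QNX Neutrino ThreadCtl() call with the _NTO_TCTL_RUNMASK command; the mask value is illustrative, and other OSs expose comparable affinity calls (for example, sched_setaffinity() on Linux).

/* Minimal sketch: restrict the calling thread to CPU core 1 by setting its
 * runmask. Assumes QNX Neutrino's ThreadCtl() interface; the mask value
 * (bit 1 set) is illustrative. */
#include <stdint.h>
#include <stdio.h>
#include <sys/neutrino.h>

int main(void)
{
    unsigned runmask = 0x2;   /* bit 1 set: allow execution on core 1 only */

    if (ThreadCtl(_NTO_TCTL_RUNMASK, (void *)(uintptr_t)runmask) == -1) {
        perror("ThreadCtl");
        return 1;
    }

    /* From this point on, this thread is scheduled only on core 1.
     * Under BMP, child threads can be made to inherit this mask. */
    return 0;
}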
A viable migration strategy
As a midway point between AMP and SMP, BMP offers a viable migration strategy for users who wish to move towards full SMP, but are concerned that their existing code may operate incorrectly in a truly concurrent execution model. Users can port legacy code to the multi-core processor and initially bind it to a single core to ensure correct operation.
By judiciously binding applications (and possibly single threads) to specific cores, designers can also isolate potential concurrency issues down to the application and thread level. Resolving these issues will allow the application to run fully concurrently, thereby maximizing the performance gains provided by the multi-core processor.
Scaling Software on a Multi-Core Processor
To scale software effectively on a multi-core system, applications must be designed with concurrent operation in mind. Applications written for most modern OSs already conform to a multithreaded, process-based model.
With this model, applications are broken down into processes that act as containers for resources, such as memory, virtual address space, stack, and so on. Within each process, the application is further broken down into threads.
A thread is the entity within the process that can be scheduled to run. A thread has configurable elements such as thread priority, which tells the OS how to run the thread in relation to other threads in the same or other processes.
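For example, a thread can be created with an explicit scheduling policy and priority through the standard pthreads attribute calls. This is a generic POSIX sketch; the priority value shown is illustrative and valid ranges vary by OS.

/* Minimal sketch: create a worker thread with an explicit real-time
 * scheduling policy and priority using the POSIX pthreads API. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    /* time-critical work goes here */
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    struct sched_param param = { .sched_priority = 20 };  /* illustrative */
    pthread_t tid;

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);   /* priority-based, preemptive */
    pthread_attr_setschedparam(&attr, &param);

    if (pthread_create(&tid, &attr, worker, NULL) != 0) {
        fprintf(stderr, "pthread_create failed\n");
        return 1;
    }
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}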
In AMP mode, a process and all its threads are strictly tied to a single processor core. To distribute the application, developers must create and execute the process on each core and employ an appropriate communications protocol (for instance, TIPC or QNX TDP) to distribute, aggregate, and synchronize elements across the system.
Because of this additional complexity, developers may have to recode or re-architect their software when migrating to processors with a larger number of cores.
Because each OS scheduler in an AMP system contains thread information only for its own CPU core, threads are scheduled only in relation to other threads on that core. Consequently, a process that consumes 100% of a core can’t make use of CPU cycles from another core, even if the other core is completely idle.
Since multiple cores cannot share resources contained within a process, migrating the process from one core to another involves checkpointing, stopping the process, and restarting the process on the other core. This assumes that the application can even run on the other core, which may not be the case in a heterogeneous AMP system.
In SMP and BMP, individual threads are scheduled to run on all CPUs, and resources allocated to a process (and all of its threads) are available on all CPUs. This approach greatly simplifies the distribution, aggregation, and synchronization of an application — threads are created to run on each CPU as required, and communications between threads are handled by simple POSIX primitives (semaphores, mutexes, etc.) rather than by complex networking protocols.
Once designed, the process can run equally well on a single-core, dual-core, or N-core system, the only potential change being the number of threads that need to be created to maximize performance. In full SMP mode, an RTOS will schedule the highest-priority ready thread to execute on the first available CPU core. As a result, application threads can utilize the full extent of available CPU power rather than being restricted to a single CPU.
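The sketch below shows one common way to achieve this scaling: query the number of online cores at run time and size the worker pool accordingly. It assumes sysconf(_SC_NPROCESSORS_ONLN) is available, as it is on most POSIX systems.

/* Minimal sketch: create one worker thread per online CPU core, so the
 * same binary scales from a single-core to an N-core system. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("worker %ld running\n", id);
    return NULL;
}

int main(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);  /* cores visible to the OS */
    if (ncpus < 1)
        ncpus = 1;

    pthread_t *tids = malloc(ncpus * sizeof(*tids));
    for (long i = 0; i < ncpus; i++)
        pthread_create(&tids[i], NULL, worker, (void *)i);
    for (long i = 0; i < ncpus; i++)
        pthread_join(tids[i], NULL);

    free(tids);
    return 0;
}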
Evaluating typical multi-core designs
In the homogeneous example shown in Figure 2, below, one core handles ingress traffic from a hardware interface while the other handles the egress traffic.
Because the traffic exists as two independent streams, the two cores don’t need to communicate or share data with each other. As a result, the OS doesn’t have to provide core-to-core IPC. It must, however, provide the realtime performance needed to manage the traffic flows.
|Figure 2 — Using homogeneous AMP to handle both ingress and egress traffic.|
Figure 3 below shows another homogeneous example, but this time the two cores implement a distributed control plane, with each core handling different aspects of a data plane. To control the data plane correctly, applications running on the multiple cores must function in a coordinated fashion. To enable this coordination, the OS should provide strong IPC support, such as a shared memory infrastructure for routing table information.
|Figure 3 — Using homogeneous AMP to implement a distributed control plane.|
In the heterogeneous example shown in Figure 4 below, one core implements the control plane, while the other handles all the data plane traffic, which has realtime performance requirements.
In this case, the OSs running on the two cores need to support a common IPC mechanism, such as the Transparent Inter-Process Communication (TIPC) protocol, that allows the cores to communicate efficiently, possibly through shared data structures.
|Figure 4 — Using heterogeneous AMP for both the control plane and data plane.|
In the control plane scenario in Figure 5 below, SMP allows all of the threads in the various processes to run on any core. For instance, the command-line interface (CLI) process can run on one core while the routing application performs a compute-intensive calculation on another core.
|Figure 5 — Using SMP in the control plane.|
As shown in Figure 6, below, a medical system is running in BMP mode on a quad-core processor, where one core handles data acquisition, another graphics rendering, another the human-machine interface, and the other database and data-processing operations.
As in SMP, the OS is fully aware of what all the cores are doing, making operational and performance information for the system as a whole readily available. This approach spares developers the onerous task of having to gather information from each of the cores separately and then somehow combining that information for analysis.
|Figure 6 — Using BMP in a quad-core medical system.|
In the control plane/data plane example shown in Figure 7, below, the control plane applications (command-line interface; operations, administration, and maintenance; data plane management) run on core 0, while the data plane ingress and egress applications run on core 1. Developers can easily implement the IPC for this scenario, using either local OS mechanisms or synchronized protected shared memory structures.
|Figure 7 — Using BMP for both control plane and data plane operations.|
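For the shared-memory option mentioned above, the following minimal sketch maps a named POSIX shared-memory object and protects it with a process-shared mutex; the object name ("/dp_stats") and the statistics layout are illustrative, not taken from the original paper.

/* Minimal sketch: a shared-memory region protected by a process-shared
 * mutex, usable by control plane and data plane processes bound to
 * different cores under BMP. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

struct dp_stats {
    pthread_mutex_t lock;      /* process-shared mutex */
    unsigned long   packets_in;
    unsigned long   packets_out;
};

int main(void)
{
    /* Create (or open) the named shared-memory object and size it. */
    int fd = shm_open("/dp_stats", O_CREAT | O_RDWR, 0666);
    if (fd == -1) { perror("shm_open"); return 1; }
    if (ftruncate(fd, sizeof(struct dp_stats)) == -1) { perror("ftruncate"); return 1; }

    struct dp_stats *stats = mmap(NULL, sizeof(*stats),
                                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (stats == MAP_FAILED) { perror("mmap"); return 1; }

    /* The first process to run initializes the process-shared mutex; in a
     * real design this initialization would be coordinated between the
     * control plane and data plane processes. */
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&stats->lock, &attr);

    /* Update the shared counters under the lock. */
    pthread_mutex_lock(&stats->lock);
    stats->packets_in += 1;
    pthread_mutex_unlock(&stats->lock);

    munmap(stats, sizeof(*stats));
    close(fd);
    return 0;
}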
A Matter of Choice
Should a developer choose AMP, SMP, or BMP? The answer depends, of course, on the problem the developer is trying to solve. It’s important, therefore, that an OS offers robust support for each model, giving developers the flexibility to choose the best form of multiprocessing for the job at hand.
AMP works well with legacy applications, but has limited scalability beyond two cores. SMP offers transparent resource management, but may not work with software designed for uniprocessor systems.
BMP offers many of the same benefits as SMP, but allows uniprocessor applications to behave correctly, greatly simplifying the migration of legacy software. The flexibility to choose from any of these models enables developers to strike the optimal balance between performance, scalability, and ease of migration.
In Part 2, the authors will discuss the development tools needed to design effectively for complex multi-core and multiprocessing environments.
Robert Craig, PhD, is senior software developer, Dennis Keefe is development tools manager and Paul Leroux is technology analyst at QNX Software Systems.
This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Silicon Valley 2006. Used with permission of the Embedded Systems Conference. Please visit www.embedded.com/esc/sv.