Putting Multicore Processing in Context: Part 2

Todd Brian

March 07, 2006


Is it a given that utilizing multicores will result in a speedup of an application? Amdahl’s Law is not the only thing that plays a role in the speedup of an application.

In general, if speedup is the sole objective when adding multiprocessors, the following must hold true: (1) the processor is overloaded and is not processing the work available in a satisfactory time frame; (2) the workload contains elements that can be divided and worked on in parallel; and, (3) a suitably faster processor cannot provide the processing power needed to handle the workload in a satisfactory time.

Part 1 in this series examined the “classic” reasons why one does not get a proportional increase in performance by adding additional processors to a computing machine. Most, if not all of them, were based in some form or fashion on Amdahl’s Law.

Basically, Amdahl’s Law states that the upper limit on the speedup gained by adding processors is determined by the amount of serial code the application contains. Code may be serial because it is explicitly written that way, or it may become serialized because it shares resources, including shared data: only one processor or core can access shared data at a time.

The next step in the exploration of multicore processing and whether or not it will be of benefit in your application is the hardware. Most embedded designs use shared memory (all cores are able to access some or all of the memory on the chip) and they have the capability to communicate with each other in some fashion. For most applications, the addition of more cores does not lead to a proportional increase in performance.

How the cores are combined, how they share memory and other resources, and how they communicate are what differentiate the various classes of multicore architectures. From a hardware perspective, there are several standard approaches to designing multicore hardware.

I use the term “standard approaches” because there have been many different ways that cores, memory, resource access and communication mechanisms have been combined. Luckily, for the embedded space, most of that experimentation was addressed long before embedded designs incorporated multiple cores.

This article looks at two of the most commonly used multicore designs for the embedded space: Symmetric Multi-Processing (SMP) and Asymmetric Multi-Processing (AMP).

SMP Hardware Architectures
SMPs are characterized by the symmetrical nature of their organization and layout. They utilize two or more identical (homogeneous) cores that have access to a common shared memory. Another attribute of SMP architectures is that not only is each core identical, but each core has identical access to all the resources of the system: memory, disk, UARTs, Ethernet controllers, etc. Analog Devices’ Blackfin 561 and ARM’s MPCore are just two examples of SMPs.

SMPs are a cost-effective way of increasing performance because they replicate only the most heavily used component of a processor, the core, rather than the entire system: core, memory, I/O and other resources. Relative to core activity, the other resources of a processor sit idle much of the time, since they cannot all be utilized concurrently. As a result, the odds are good that different cores can operate in parallel by utilizing different resources and data streams.

Figure 1: Symmetric Multiprocessing Hardware

Asymmetric Hardware Architectures
AMPs are characterized by the non-symmetrical nature of their architecture. In other words, AMP architectures are not bound by the SMP rules: the cores need not be identical, and each core need not have equal access to shared resources. Like SMP-based architectures, AMPs seek additional performance gains by adding multiple cores.

Unlike SMP architectures, AMP architectures seek additional performance gains by utilizing different cores or hardware configurations that are optimized to do very specific activities. Texas Instruments’ OMAP and Freescale’s i.MX families of application processors are two examples of a heterogeneous AMP architecture. The OMAP and i.MX families combine a general purpose MCU with a digital signal processing (DSP) core.

DSPs are capable of doing highly specialized mathematical operations very efficiently when compared to a general purpose MCU. What may take a general purpose MCU hundreds of cycles, a DSP can accomplish in only a few. By combining cores in this fashion, designers not only provide for the division of labor among cores but also, like Adam Smith’s pin manufacturing example given in the first article, practice the concept of labor specialization. Due to the specialized nature of an AMP architecture, its area of application is narrower than that of a more general purpose SMP architecture.

Because of the complex nature of the devices for which multicores are an appealing solution, many multicore devices use a real-time operating system (RTOS) to provide operating system services such as scheduling, communication and task management. RTOSs suitable for multi-core architectures are as varied and specialized as the architectures they are geared toward. In general, they can be segmented into two camps, those suitable for SMP architectures, and those suitable for AMP-based architectures.

Figure 2: Asymmetric Multiprocessing Hardware

Real-time operating systems are regarded as AMP or SMP because they exploit the different hardware attributes of AMP and SMP architectures. Since all the cores or processors that make up the SMP architecture are exactly the same, SMP RTOSs are characterized by a single instance of an RTOS image that runs across all the cores at the same time. SMP RTOSs perform what is called “load balancing”: parceling out ready-to-run tasks to available (idle) processors.

Spin locks and SMP: a review
Since SMP architectures are based on equal access to resources, resource protection is a fundamental requirement in SMP systems. The primary mechanism for maintaining coherency in an operating system, its applications and their data is the spin lock. As the next few examples will demonstrate, spin lock granularity has a profound effect on how efficiently an SMP system operates and is worth examining further.

Spin locks are a logical abstraction that acts as a gatekeeper to resources such as shared data and tables, controllers, and kernel services such as the scheduler. They are typically implemented as “atomic” test-and-set locks: the first task to reach the spin lock acquires it and holds it until it is finished. Hardware arbitrates between two or more tasks that attempt to acquire the lock simultaneously, and tasks that do not get access will “spin” until the resource becomes available.
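As a rough illustration of that atomic test-and-set behavior, here is a minimal spin lock sketch using C11 atomics (assuming a toolchain and target that support them; a production RTOS lock would also deal with priorities, backoff and interrupt state):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A minimal test-and-set spin lock. The atomic exchange is the
 * hardware-arbitrated "test and set": exactly one contender sees the
 * old value 'false' and wins; the rest spin until release.           */
typedef struct { atomic_bool locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    /* Spin until the exchange returns 'false' (lock was free). */
    while (atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
        ;  /* busy-wait: the "spin" in spin lock */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

The acquire/release ordering is what makes data protected by the lock visible to the next core that takes it; a plain flag variable without atomics would not be safe here.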

System and application designers have to balance two conflicting goals when determining the granularity of spin locks. From one perspective, the more spin locks a designer uses to protect resources, the more of those resources can be utilized in parallel. The downside is that as the locking scheme gets finer-grained, the overhead of maintaining the locks grows, and so does the opportunity for deadlock. Anti-deadlock algorithms add still more overhead.

Since lock contention tends to serialize execution, it is intuitively obvious that the shorter the time a lock is held, the less the potential for serialization. The following techniques are used in both RTOSs and multicore programming to reduce the time that a lock is held.
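One common hold-time-reduction technique is to snapshot shared data inside a short critical section and do the expensive work on a private copy. The sketch below assumes a POSIX-style mutex standing in for the RTOS lock; the record layout is purely illustrative:

```c
#include <string.h>
#include <pthread.h>   /* pthread mutex stands in for the RTOS lock */

/* Shared sensor record protected by a lock. */
typedef struct { double samples[64]; } record_t;

static record_t shared_rec;
static pthread_mutex_t rec_lock = PTHREAD_MUTEX_INITIALIZER;

/* Long-running analysis, deliberately performed on a private copy. */
static double analyze(const record_t *r)
{
    double sum = 0.0;
    for (int i = 0; i < 64; i++)
        sum += r->samples[i];
    return sum;
}

/* Hold the lock only long enough to snapshot the data; the expensive
 * computation then runs on the local copy with the lock released.    */
double process_record(void)
{
    record_t local;
    pthread_mutex_lock(&rec_lock);
    memcpy(&local, &shared_rec, sizeof local);   /* short critical section */
    pthread_mutex_unlock(&rec_lock);
    return analyze(&local);                      /* long work, no lock held */
}
```

The trade-off is the cost of the copy and the possibility of working on slightly stale data, which many embedded workloads tolerate.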

Table 1

Symmetric OSes: master/slave and the alternatives
Primitive SMP RTOS implementations usually designate a single processor as the master, and activities like interrupt handling, resource arbitration and scheduling are performed on it. The master processor is statically defined or determined at boot. In more advanced SMP implementations, interrupt handling and scheduling are load-balanced, and responsibility floats among the various processors. Load balancing RTOS-related tasks provides an ideal platform for “high availability” or “hot swap” capabilities.

SMP RTOSs that use the master/slave approach are the easiest to implement but provide the least performance increase of all the SMP implementations. One reason is that the master acts as a bottleneck on kernel services. As with any other resource in an SMP RTOS, the kernel services must be protected from simultaneous access by competing processes.

One solution involves running the kernel in spin-lock mode, where only one kernel service at a time is serviced (for example, creating a memory block). Other processors seeking concurrent access to kernel services (for example, to the file system) have to wait until the first process is finished. Although the system services are protected from competing processes, the system operates inefficiently.

An improvement on the basic master/slave approach is to allocate different kernel services to different spin locks. For example, the scheduler, interrupt controller and file system would each have their own spin lock. By making the kernel’s protection data structures finer-grained, the kernel can service more tasks concurrently. Now, only processes that compete for the same kernel service will spin, waiting on a lock to be released.
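That finer-grained scheme might look like the sketch below; the service names are hypothetical, not a real kernel API. A task spinning on the file-system lock no longer blocks one dispatching through the scheduler:

```c
#include <stdatomic.h>

/* One lock per kernel service instead of one kernel-wide lock.
 * Tasks now serialize only when they want the *same* service.   */
typedef struct {
    atomic_flag sched_lock;  /* protects the ready queue          */
    atomic_flag intr_lock;   /* protects interrupt dispatch state */
    atomic_flag fs_lock;     /* protects file-system tables       */
} kernel_locks_t;

static void service_lock(atomic_flag *f)
{
    while (atomic_flag_test_and_set_explicit(f, memory_order_acquire))
        ;  /* spin only against users of this one service */
}

static void service_unlock(atomic_flag *f)
{
    atomic_flag_clear_explicit(f, memory_order_release);
}
```

Each extra lock is a few bytes of state plus bookkeeping, which is the maintenance overhead the granularity trade-off described earlier refers to.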

A superior solution to the two implementations presented above is an SMP RTOS that, in addition to load balancing tasks, also load balances kernel services. By threading the kernel services and allowing them to run in parallel on different processors, the possibility of a kernel service being unavailable decreases dramatically.

Table 2

There are other, less obvious advantages to using SMP. Design teams typically focus on a single problem domain (task) at a time. Segmenting the problem domain into tasks is not only a natural approach from the human problem-solving perspective, it is the same one used in the uni-processor world. Since this is how most companies’ design teams work, there is no need to reorganize an engineering team. Designing a threaded program is also a familiar technique from the uni-processor world. No new paradigms are needed to utilize SMP effectively.

Compiler technology can also be applied to problems that benefit from decomposing the problem into smaller discrete parts that can be processed independently from each other. An example of this would be array processing where the array can be decomposed so that parts may be attacked independently of each other. It is unlikely that the embedded arena will benefit from this type of solution anytime soon as it is typically applied to architectures that have dozens if not hundreds of processors.

Table 3

RTOS strategies for asymmetric multiprocessing
Unlike SMP RTOSs, RTOSs for AMP architectures place no symmetry requirement on the hardware at all. The primary characteristic of an AMP RTOS is that it runs on a single core or a single processor. Put simply, an AMP solution requires one RTOS image for each core in the design. The AMP RTOS is the one everyone is familiar with and has used in the embedded space for decades, whether “roll your own” or commercial. Either way, the RTOS image is compiled for its core and sees only the resources that the designer dictates.

The fact that AMP-targeted RTOSs don’t require a symmetrical hardware architecture, and that each core may or may not have access to different resources, makes the approach very flexible in terms of how the total system can be put together. This characteristic makes it a very good candidate for use in a number of situations.

For heterogeneous architectures like the OMAP platform, where one core is a DSP and the other is an ARM microcontroller unit (MCU) core, the AMP solution is the only one possible. One RTOS image is compiled for and run on the DSP, and the other is compiled for and run on the MCU. Just as hardware designers provide cores with specific functionality, software developers can deploy a different mix of RTOS functionality on each core.

As is the case on the OMAP platform, the DSP cannot see most of the system resources like network connections and storage devices; it exists only to crunch numbers. Therefore, developers of DSP applications rarely require anything beyond rudimentary RTOS services such as a scheduler and the ability to create and delete tasks.

The MCU developer, on the other hand, may need support for a man-machine interface, GUI, file system, memory management, and networking and communication protocols. To minimize the combined footprint of the RTOSes, each one may be scaled so that it provides only the functionality needed by the application software running on its core.

There is no requirement that AMP-based RTOSs run on heterogeneous architectures. There are a number of situations where a developer chooses an AMP operating system over an SMP implementation.

One example would be in the case where deterministic real-time deadlines have to be met. Unlike SMP-based solutions where the scheduler and interrupt handling mechanism may be shared amongst multiple cores, an AMP implementation, with its own scheduler and interrupt handling mechanism on each core, can respond to interrupts and deadlines without waiting to acquire a lock for a resource.

The developer can mix and match RTOSs and hardware to provide an optimal blend of performance and features. For example, the developer can choose an optimized, deterministic real-time RTOS for the cores that respond to the application’s real-time needs, and RTOSs that offer a rich palette of services for the remaining cores. Developers can also choose third-party software for the non-real-time aspects of the system.

In an effort to reduce the costs associated with software safety certification, developers can partition the system so that only the bare minimum code pertaining to the safe operation of the device runs on one or more cores, while all the “non-safe” software runs on the other(s). While it may not matter for most applications, for those that must be certified, the cost of developing code to meet FAA, FDA and industrial standards starts at $60 to $100 per line of code.

Partitioning and legacy code issues
Partitioning the system is a viable way of reducing the overall cost of delivering a safety-certified system. Another reason to partition arises when large amounts of legacy code were developed with a specific RTOS in mind. In this case, it is easier to reuse the total application intact than to port it to a new RTOS. Furthermore, it would take a significant investment in resources to determine the different interactions between the reused legacy code and the new code in a load-balancing system.

AMP solutions provide the designer with total control over how their system functions. AMP solutions leave nothing to chance. Each process is tasked to a core and has a priority assigned to it. Since no kernel resources are shared, AMP provides very deterministic behavior. If one word could describe an AMP solution, it would be “predictable.” The designer binds each task to a processor at compile time. If tasks running on different cores need to interact, the designer governs how, when and where they will interact.

One thing that SMP software developers take for granted is the ability to communicate between the different cores. Typically, AMP-based solutions have no built-in mechanism to provide a method for communication and synchronization between the two cores.

Since many of the multicore hardware solutions for the embedded space use shared memory, shared memory can serve as a very efficient inter-processor communication (IPC) mechanism between the two cores. A basic implementation would provide for data marshaling if needed, a table and set of message buffers, a mechanism to protect and synchronize access to the message buffers, and a method to signal that the other core can access a message buffer.
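One plausible shape for such a shared-memory mailbox is the single-producer/single-consumer ring sketched below. All sizes and names are illustrative, and a real design would also need cache management and a cross-core “doorbell” interrupt to signal the other side instead of polling:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define MSG_SIZE  32
#define NUM_MSGS  8   /* power of two so the indices wrap cheaply */

/* Mailbox placed in memory visible to both cores. One core only
 * writes 'head' (producer), the other only writes 'tail' (consumer). */
typedef struct {
    _Atomic uint32_t head;            /* written by producer core */
    _Atomic uint32_t tail;            /* written by consumer core */
    uint8_t buf[NUM_MSGS][MSG_SIZE];  /* the message slots        */
} mailbox_t;

/* Producer core: copy (marshal) a message in, then publish the slot. */
bool mbox_send(mailbox_t *m, const void *msg, size_t len)
{
    uint32_t head = atomic_load_explicit(&m->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&m->tail, memory_order_acquire);
    if (head - tail == NUM_MSGS || len > MSG_SIZE)
        return false;                              /* full or too big */
    memcpy(m->buf[head % NUM_MSGS], msg, len);
    atomic_store_explicit(&m->head, head + 1, memory_order_release);
    return true;        /* a real system would raise an IPI here */
}

/* Consumer core: take the oldest pending message, if any. */
bool mbox_recv(mailbox_t *m, void *msg, size_t len)
{
    uint32_t tail = atomic_load_explicit(&m->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&m->head, memory_order_acquire);
    if (head == tail || len > MSG_SIZE)
        return false;                              /* empty */
    memcpy(msg, m->buf[tail % NUM_MSGS], len);
    atomic_store_explicit(&m->tail, tail + 1, memory_order_release);
    return true;
}
```

Because each index has a single writer, no spin lock is needed at all; the release/acquire pair on head and tail is what orders the message copy against its publication.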

A last word about tools
One final aspect of multicore development is the availability of tools for the RTOS and the hardware target. Debuggers, like multicore hardware and software solutions, vary in sophistication. SMP debuggers need to be able to determine which tasks have been allocated to which cores. The inability to set and trigger breakpoints across multiple cores makes finding and solving some bugs very difficult. Choosing the development environment used for multicore development can be as important to the success of a project as the choice of RTOS and hardware platform. The next article will address the topic of multicore development tools in detail.

To read Part 1 in this series, go to The Pros and Cons of Multicore Architectures

Todd Brian is product marketing manager in the Accelerated Technology group at Mentor Graphics Inc.

To learn more about this general subject, go to More about multicores, multiprocessing and tools.

For further information about upcoming activities in the industry relating to multicore design, go to the Multicore Association Web site.
