Multiprocessor (MP) and multicore System-on-Chip architectures are now being employed in a wide range of embedded consumer and communications systems. They are perceived as a way to enhance performance in applications requiring execution of multiple tasks, without the power demands of a single-processor configuration cranked up to its highest clock rate.
Typical of such applications are a set-top box designed to record several channels while sharing home movies across the Internet, or an in-car computer system performing navigation tasks while simultaneously delivering backseat video gaming, among others.
But MP designs introduce programming complexity that can make them difficult to use and can threaten development schedules. Better software tools are needed to make MP development easier, and among them is the operating system, which, when properly implemented, has the unique capability of enabling an MP system to be programmed much like a single-processor system.
But to make this possible, commercial providers of RTOSes, and the 50 percent or so of all developers who still build their own real-time operating kernels, are faced with significant challenges. As the shift is made to more diverse, heterogeneous multiprocessing, the big questions facing developers include what changes will have to be made to the applications developed for such environments, to the tools, and to the underlying OS structure.
It is clear that as designs with six, seven or more processing elements on a chip emerge, significant changes will have to be made. However, there is currently a large class of embedded applications in consumer electronics that allows the use of the well-known symmetric multiprocessing (SMP) approach. Developed for server clusters and large computing applications, SMP does not require significant changes to the seasoned sequential, procedural single-processor programming model.
When chosen with careful attention to real-time performance, multithreading, and real-time interrupt response, even existing RTOSes can be used in such environments. While it will not be possible to achieve the theoretical maximum performance of an n-CPU device in a multiprocessor configuration (n x 100%), significant incremental improvements are possible, ranging from 20% to as much as 90% under ideal conditions.
Multiprocessors can be configured in a variety of forms, from loosely coupled computing grids that use the Internet for communication, to tightly coupled shared-resource architectures. In a symmetric multiprocessor, or SMP, all processors (generally configured in pairs or groups of four) are identical and share a set of resources.
This model lends itself to certain programming approaches and RTOS facilities that can be exploited to benefit the developer. In particular, since all processors have access to the same physical memory, a process or thread can run on any processor from the same memory location. This is key to adapting an application to an SMP architecture.
ARM’s recently introduced MPCore is an example of such a multiprocessor (Figure 1, above). The MPCore synthesizable multiprocessor, based on the ARMv6 architecture, can be configured to contain between one and four processors delivering up to 2600 Dhrystone MIPS of performance. The MPCore multiprocessor implements the company’s adaptive shutdown and intelligent energy management mechanisms to reduce power consumption by up to 85 percent.
As shown in Figure 2, below, ARM’s MPCore can be configured with up to four processor cores in an SMP arrangement. It would be beneficial for applications to be portable across 1, 2, 3, and 4 processors without requiring tailoring to the specific number of processors to be employed. This would enhance portability from single-processor implementations and increase code re-use, which in turn can greatly benefit time to market.
Picking the right RTOS configuration
The right RTOS configuration can help an embedded multiprocessor application in a number of ways. Traditionally an RTOS is used to handle interrupts and manage the scheduling of multiple threads, sequences of logically complete code that performs a particular role. A real time deterministic application generally is made up of multiple threads, each performing its intended function, scheduled according to priorities or in response to external events (interrupts).
Not all RTOSes, whether commercial or custom roll-your-own, are created equal, however. At a minimum, a multiprocessor-capable RTOS targeted at the broad range of multimedia-intensive embedded consumer and mobile devices should offer an efficient interrupt-handling architecture that delivers sub-microsecond interrupt response times.
It also must be able to manage an unlimited number of threads at the same or different levels of priority, with deterministic scheduling performance. For advanced SOC configurations with limited on-chip memory, such RTOSes must be small enough to fit within on-chip memory for maximum performance, typically 32KB – 128KB of ROM/Flash in size.
In a multiprocessor system, a properly reconfigured traditional RTOS should also provide a number of other key functions:
1) Respond to variations in processing load, enabling all processors to be utilized during periods of maximum demand, without requiring explicit application assignment.
2) Adjust system rates to conserve power during periods of light demand.
3) Eliminate the need for multiple program threads specific to each processor, keeping the application hardware-agnostic, maximizing code re-use across multiple processors.
If the RTOS also has mechanisms for real-time interprocess communication (IPC), this will allow synchronization among the processors in an SMP environment, letting the programmer focus on the algorithm rather than on the architecture or on which core particular code must run.
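The role such an IPC mechanism plays can be illustrated with a minimal Python sketch: a thread-safe FIFO queue stands in for the RTOS's real-time IPC primitive, letting a producer hand work to a consumer without either side knowing which core the other runs on. All names here are illustrative, not a real RTOS API.

```python
import queue
import threading

# queue.Queue is thread-safe, standing in for the RTOS's IPC primitive.
work_q = queue.Queue()

def producer(n_items):
    # e.g. an interrupt-level thread posting buffer descriptors
    for i in range(n_items):
        work_q.put(("buffer", i))
    work_q.put(None)  # sentinel: no more work

def consumer(results):
    # a worker thread that could be scheduled on any available core
    while True:
        item = work_q.get()
        if item is None:
            break
        results.append(item)

results = []
t1 = threading.Thread(target=producer, args=(3,))
t2 = threading.Thread(target=consumer, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()
```

Because the queue serializes the hand-off, the algorithm is expressed once and runs unchanged whether the two threads share a core or occupy different ones.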
A typical application domain that could be addressed by an SMP-enabled RTOS is the handling of one or more simultaneous streams of data where substantial processing is required. This is the case with streaming audio, video, or multimedia data, which requires compression/decompression, encryption/decryption, rendering, filtering, scaling, and other CPU-intensive processing.
In a typical system, a cell phone for example, data might be DMA’d into buffers sized to suit the target display or other system characteristics. Once a buffer is full, an interrupt is generated to alert the application, and that buffer is targeted for processing while data streams into a new buffer. This model describes a continuous data flow of varying intensity, in which real-time response to data arrival is important, overall throughput is the goal, and programming convenience is key.
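The ping-pong buffering scheme just described can be modeled in a few lines of Python. The buffer size and data values are assumed for illustration; the point is the hand-off, in which a buffer-full condition marks one buffer ready for processing while the stream continues into the other.

```python
# Minimal model: "DMA" fills one buffer; on buffer-full, an "interrupt"
# hands that buffer to processing and the stream switches buffers.
BUF_SIZE = 4                # assumed size, for illustration only
buffers = [[], []]
active = 0                  # index of the buffer currently being filled
ready = []                  # indices of full buffers awaiting processing

def on_data(byte):
    """Simulate DMA writing one byte; the ISR fires on buffer-full."""
    global active
    buffers[active].append(byte)
    if len(buffers[active]) == BUF_SIZE:
        ready.append(active)        # ISR marks the processing thread READY
        active ^= 1                 # data streams into the other buffer
        buffers[active] = []

for b in range(8):                  # stream 8 bytes: fills both buffers
    on_data(b)
```

After eight bytes, both buffers have been filled once and queued for processing in arrival order, while the model is again filling buffer 0.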
How an SMP-capable RTOS does load balancing
One of the most intriguing aspects of delivering a performance increase from an SMP is performing load balancing without requiring programmer intervention. This not only makes things easier for the developer; more importantly, it also makes it possible to use legacy application code in an SMP system without modification.
One method for achieving programming transparency in a multiprocessor system is to configure the RTOS so that individual threads can be assigned to run on specific processors based on the availability of the processor.
This way, the processing load can be shared among processors with work automatically assigned to a free processor. The RTOS must determine whether a processor is free and if it is, then a thread can be run on that processor even though the other processors may already be running other threads. This enables a more complete utilization of resources, yet remains transparent to the number of processors, transparent to the programmer, and enables legacy code to be used intact.
In order to utilize such an approach, the developer must: (1) set up multiple, identical threads; (2) allocate threads to process portions of the data stream; and, (3) set priorities equal.
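These three setup steps can be sketched as follows. The `rtos_thread_create` call and the `Thread` record are hypothetical stand-ins for whatever creation service a given RTOS provides; the key points from the text are that the threads are identical (same entry function) and share one priority.

```python
from dataclasses import dataclass

@dataclass
class Thread:
    name: str
    entry: object           # the thread's entry function
    priority: int
    state: str = "SUSPENDED"

threads = []                # the system's thread table

def rtos_thread_create(name, entry, priority):
    # Hypothetical creation service: register a thread, initially suspended.
    t = Thread(name, entry, priority)
    threads.append(t)
    return t

def process_stream_chunk(buf):
    # Identical work for every thread, e.g. decompressing one buffer.
    return [x * 2 for x in buf]

# Steps 1-3: multiple identical threads, one entry point, equal priority.
for i in range(4):
    rtos_thread_create(f"worker-{i}", process_stream_chunk, priority=5)
```

Because every worker is interchangeable, the scheduler is free to run any of them on any idle processor, which is what makes the load balancing transparent.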
Priorities are important to consider, because the RTOS scheduler is designed to maintain priority execution of all threads, such that higher priority threads get executed before lower priority threads. This way, threads can safely assume that while they are executing, no lower priority thread can also be executing. The RTOS must preserve this rule even in the case of an SMP, or the underlying logic upon which a legacy application might be based could falter, and the application may not perform as intended.
Priority-based, preemptive scheduling uses multiple cores to run threads that are ready, with the scheduler automatically assigning threads to available cores. A thread that is READY can run on processor-n if and only if it is of the same priority as the thread(s) already running on processors 1 through n-1.
After initialization, the RTOS scheduler determines the highest priority thread that is READY to run. It sets the context for that thread, and runs the thread on processor-1. The scheduler determines if an additional thread of equal priority also is READY. If so, that thread is run on processor-2, and so on. If no additional threads are ready to run, the scheduler goes idle awaiting an external event or service request, such as:
1) Interrupt causing preemption
2) Thread resume
3) Thread sleep or relinquish
4) Thread exit
Preemption occurs when a thread is made READY to run while a lower priority thread is already running. In this event, the lower priority thread is suspended (context saved), the higher priority thread is started (context restored or initialized), and any lower priority threads on other processors are suspended. This is critical to maintain the priority-order of executing threads.
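The scheduling rule described above, in which only threads of the single highest READY priority may occupy the cores, so a newly READY higher-priority thread displaces lower-priority threads on every core, can be captured in a small Python model. This is a simplification for illustration; it follows the common RTOS convention that lower numbers mean higher priority.

```python
def schedule(ready, n_cores):
    """ready: list of (priority, name) for READY threads.
    Returns the names assigned to cores, preserving the rule that only
    threads of the highest priority present may run concurrently."""
    if not ready:
        return []
    top = min(p for p, _ in ready)                 # highest priority present
    runnable = [name for p, name in ready if p == top]
    return runnable[:n_cores]                      # fill available cores

# Two equal-priority threads share both cores; a lower-priority thread waits.
running = schedule([(5, "A"), (5, "B"), (9, "C")], n_cores=2)

# A higher-priority thread becomes READY: it preempts, and the
# equal-priority pair may no longer run alongside it.
running_after_preempt = schedule([(1, "X"), (5, "A"), (5, "B")], n_cores=2)
```

Note that after the preemption one core sits idle rather than running a priority-5 thread, preserving the invariant that a running thread never coexists with a higher-priority running thread.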
Within this automatic load-balancing approach to managing the resources of an SMP like ARM’s MPCore, additional features are beneficial to overall performance.
One processor can be made responsible for all external interrupt handling (this does not include the inter-processor interrupts needed for synchronization or communication). This leaves the other processor(s) with virtually zero overhead from interrupt handling, enabling them to focus all of their cycles on application processing, even during periods of intense interrupt activity that otherwise might degrade performance.
Putting load balancing to work
As an example, consider a system with two processors (Figure 3, above) that is intended to handle a continuous stream of incoming data, such as streaming video. The data must be decompressed in real-time. Here is a typical data flow and processing model using an RTOS with automatic load-balancing support for an SMP:
(1) Input is set up to fill Buffer-1 in memory, with an interrupt generated upon a buffer-full condition (or based on input of a specified number of bytes). As Buffer-1 reaches a full condition, the following actions occur:
* Buffer-1 FULL generates Interrupt-1
* The ISR handling Interrupt-1 marks Thread-1 READY-TO-RUN
* The scheduler runs Thread-1 on Processor-1
* Data is directed to Buffer-2
* Processor-2 remains idle.
(2) More data arrives while Thread-1 is still active, and Buffer-2 fills up, during which:
* Buffer-2 FULL generates Interrupt-2
* The ISR handling Interrupt-2 marks Thread-2 READY-TO-RUN
* The scheduler runs Thread-2 on Processor-2
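The two-buffer walkthrough above can be traced with a toy event model: each buffer-full interrupt marks its thread READY, and the scheduler places each newly READY equal-priority thread on the first idle processor. The function names and the event log format are illustrative only.

```python
processors = {1: None, 2: None}     # processor -> thread currently running
log = []                            # trace of scheduling events

def scheduler(thread_name):
    # Place an equal-priority READY thread on the first idle processor.
    for cpu in sorted(processors):
        if processors[cpu] is None:
            processors[cpu] = thread_name
            log.append(f"{thread_name} on Processor-{cpu}")
            return

def isr_buffer_full(thread_name):
    # The ISR marks the handling thread READY; the scheduler then runs it.
    log.append(f"{thread_name} READY")
    scheduler(thread_name)

isr_buffer_full("Thread-1")   # Buffer-1 full: Interrupt-1
isr_buffer_full("Thread-2")   # Buffer-2 full while Thread-1 still runs
```

The trace reproduces the sequence in the text: Thread-1 lands on Processor-1, and when the second interrupt arrives while Thread-1 is still busy, Thread-2 lands on the idle Processor-2.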
Where do we go next?
Obviously, multicore and multiprocessing applications are going to become much more complex and will require RTOSes better adapted to their needs. But that does not mean, for SMP environments at least, that we must move to a system architecture much different from the one we depend on now.
There are a number of changes that can be made to the underlying interrupt and thread mechanisms of an RTOS to make it much more useful in a multiprocessing environment, without altering the well-known programming model on which it is based, providing an easy migration path for legacy (single-processor) applications to SMP.
In particular, as the shift is made from relatively simple dual-processor designs, it may be absolutely necessary to redesign the underlying data structure of the RTOS if we are to retain the relatively straightforward SMP programming model.
In the load-balancing example described earlier, it may be necessary to create a dimensionalized data structure in order to keep track of multiple data array objects. In a single-processor model there might be 10 each of mutexes, memory pools, event tags, semaphores, and queues, a not insignificant management chore. In a dual-processor model, it would be necessary to manage 10x2 of each object, 10x3 in a three-processor design, and so on as the number of processors increases.
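The "dimensionalized" bookkeeping can be sketched as the per-type object tables gaining a processor dimension, so the object count scales as 10 x N per type. The counts and type names follow the article's example figures; the layout itself is an assumption for illustration, not a real RTOS data structure.

```python
OBJECT_TYPES = ["mutex", "memory_pool", "event_tag", "semaphore", "queue"]
PER_TYPE = 10                       # the article's example: 10 of each type

def build_object_tables(n_processors):
    # tables[obj_type][cpu] is a list of PER_TYPE slots for that processor:
    # the single-processor table gains an extra (processor) dimension.
    return {
        obj_type: [[None] * PER_TYPE for _ in range(n_processors)]
        for obj_type in OBJECT_TYPES
    }

def total_objects(tables):
    # 10 x N objects per type, summed over the object types.
    return sum(PER_TYPE * len(per_cpu) for per_cpu in tables.values())
```

With one processor this yields 50 managed objects; with three processors it grows to 150, which is the management (and code-space) cost the text is pointing at.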
In this environment, ironically, the game would not go to the most sophisticated and feature-rich RTOS, but to the leanest and the one with the best real time interrupt responses. Depending on how it is implemented, such a dimensionalized data structure could require anywhere from 10 to 50 percent additional code space and would have the same impact on interrupt response.
For an RTOS with interrupt responses in the sub-microsecond range and requiring only 32 kilobytes, even 50 percent is not a big hit in absolute terms. For a Linux design, however, requiring 1-2 Mbytes and with response times in the tens-of-milliseconds range, even a 10 percent increase in code size and a similar reduction in performance would rule it out in many multimedia-intensive embedded consumer and mobile designs using more than one processor.
Thus, while Linux today provides SMP support for some architectures, its size and performance may restrict its use to memory-rich applications requiring less demanding or soft real-time response. For other needs, both small footprint and faster real-time response, a suitably adapted RTOS might be more appropriate.
John Carbone is vice president of marketing at Express Logic, Inc.