Implementing dual OS signal processing using Linux and the DSP/BIOS RTOS
By Steve Preissig, David Beal, and Aurelien Jacquiot
Embedded.com
(05/20/08, 12:42:00 AM EDT)
The classical trade-off between system performance and ease of programming is one of the primary differentiators between general-purpose operating systems (GPOSes) and real-time operating systems.

GPOSes tend to provide a higher degree of resource abstraction. This improves application portability and ease of development, and it increases system robustness through software modularity and isolation of resources.

This makes a GPOS ideal for addressing general-purpose system components such as networking, user interface and display management.

However, this abstraction sacrifices the fine-grained control of system resources required to meet the performance goals of computationally intensive algorithms such as signal processing code. For this level of control, developers typically turn to a real-time operating system (RTOS).

From an embedded signal processing standpoint, there are essentially two types of OSes to consider: Linux, a general-purpose operating system, and DSP/BIOS, a real-time operating system. Linux offers a higher level of abstraction, while DSP/BIOS provides finer control.

In order to leverage the strengths of both alternatives, developers can use a system virtual machine, which allows programmers to run Linux and DSP/BIOS concurrently on the same DSP.

(Editor's note: Unlike process virtual machine environments specific to particular programming languages, such as the Java VM, system virtual machines correspond to actual hardware and can execute complete operating systems in isolation from other similar instantiations in the same computing environment.)

An important question to ask, however, is why not simply use a CPU+DSP combo running Linux and DSP/BIOS separately? CPUs are, after all, more efficient at running control code for user interfaces and similar functions, and separate cores avoid the overhead associated with virtualization. However, putting all functionality onto one chip is attractive for several reasons.

For one, today's high performance DSPs are much more powerful than previous generation DSPs. This frees up more cycles for control processing. In addition, most high-performance DSPs are more general-purpose than they used to be, allowing for more efficient control code processing.

If all functionality can fit on a DSP, the benefits are compelling. One fewer chip translates to lower cost and board area, as well as lower energy consumption because power-hungry interprocessor data transfers are eliminated.

Scheduling
One of the most beneficial and commonly used aspects of any operating system is the ability to concurrently execute multiple tasks or threads. The operating system employs a scheduler to manage the processing core in order to serially order tasks for execution.

A historical concern of embedded programmers when using Linux was the lack of real-time performance. However, recent improvements to the Linux kernel have greatly improved its responsiveness to system events, making it suitable for a broad class of enterprise, consumer and embedded products.

Linux provides both time slicing and priority-based scheduling of threads. The time slicing methodology shares processing cycles between all threads so that none are locked out. This is often useful for user interface functions to guarantee that if the system becomes overloaded, responsiveness may slow, but no user functions are completely lost.

Priority-based thread scheduling, on the other hand, guarantees that the highest priority ready thread in the system executes until it relinquishes control, at which time the next highest priority ready thread begins executing.
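The difference between the two policies can be seen in a toy model. The sketch below is not the Linux scheduler; it assumes every thread is ready at time zero, never blocks, and that no new thread arrives. The function names and the "tick" unit are invented for illustration.

```python
from collections import deque


def run_time_sliced(threads, slice_ticks=1):
    """threads: list of (name, work_ticks). Returns the name that runs on each tick."""
    queue = deque(threads)
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(slice_ticks, remaining)
        timeline.extend([name] * ran)
        if remaining > ran:
            # Slice expired with work left: go to the back of the queue.
            queue.append((name, remaining - ran))
    return timeline


def run_by_priority(threads):
    """threads: list of (name, priority, work_ticks); a larger priority runs first."""
    timeline = []
    # The highest-priority ready thread runs until it gives up the processor.
    for name, _priority, work in sorted(threads, key=lambda t: -t[1]):
        timeline.extend([name] * work)
    return timeline
```

With a two-tick "ui" thread and a three-tick "codec" thread, the time-sliced model interleaves the two, while the priority model runs the higher-priority "codec" thread to completion before "ui" gets any cycles.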

The Linux kernel re-evaluates the priorities of ready threads upon each transition from kernel to user mode. This means that any new kernel-evaluated event, such as data becoming ready on a driver, can trigger an immediate transition into a new thread (within the latency response of the scheduler). Due to the determinism of priority-based threads, they are often useful for signal processing applications where real-time requirements must be met.

Prior to version 2.6 of the Linux kernel, the main detraction to real-time performance was the fact that the Linux kernel would disable interrupts, in some cases for hundreds of milliseconds.

Disabling interrupts allows a more efficient kernel implementation, because code that runs with interrupts disabled does not need to be reentrant, but it adds latency to interrupt response.

Now with version 2.6, a build option is available that inserts much more frequent re-enabling of interrupts throughout the kernel code. This feature is often referred to in the Linux community as the preempt kernel, and while it does degrade performance of the kernel slightly, it greatly improves real-time performance. For many system tasks, when the preemptive Linux 2.6 kernel is used with real-time threads, it will provide sufficient performance to meet real-time needs.
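As a minimal sketch of how an application might request a real-time thread policy, the following uses the scheduling calls in Python's standard os module, which wrap the POSIX scheduler interface on Linux. The helper names are hypothetical, the call normally requires elevated privileges, and the sketch simply reports failure rather than assuming the request succeeds.

```python
import os


def fifo_priority_range():
    """Return (min, max) priority for SCHED_FIFO, or None if the platform lacks it."""
    policy = getattr(os, "SCHED_FIFO", None)
    if policy is None or not hasattr(os, "sched_get_priority_min"):
        return None
    return os.sched_get_priority_min(policy), os.sched_get_priority_max(policy)


def clamp_priority(requested, lo, hi):
    """Keep a requested priority inside the range the kernel accepts."""
    return max(lo, min(hi, requested))


def try_enter_realtime(requested_priority):
    """Ask the kernel to schedule the caller as SCHED_FIFO. Returns True on success."""
    limits = fifo_priority_range()
    if limits is None or not hasattr(os, "sched_setscheduler"):
        return False
    priority = clamp_priority(requested_priority, *limits)
    try:
        # pid 0 means "the caller"; this usually needs root or CAP_SYS_NICE.
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:
        return False
    return True
```

A signal processing thread would typically call something like try_enter_realtime() once at start-up and fall back to normal time-sliced scheduling if the request is refused.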

By contrast, the Texas Instruments DSP/BIOS supports only priority-based scheduling, in the form of Software Interrupts and Tasks. As with the Linux scheduler, these Software Interrupts and Tasks are preemptive. However, DSP/BIOS also provides application programmers with direct access to hardware interrupts, a resource that is only available in kernel mode in Linux.

Direct access to hardware interrupts allows application programmers to achieve the theoretical minimum latency response supported by the underlying hardware. For applications such as control loops where the absolute minimum latency is required, this fine-grained control over hardware interrupts is frequently a valuable feature.

Protected Access to Resources
A fundamental property of Linux and most general-purpose operating systems is the separation of user-space programs from the underlying system resources that they use. Direct access to memory and device peripherals is permitted only when operating in supervisor (i.e. kernel) mode.

When a user program needs access to system resources, it must request them from the kernel through kernel modules called drivers. The application exists in a user memory space and accesses the driver through virtual files. The virtual files then translate the application's requests into the kernel memory space in which the driver executes.

Linux provides an extremely feature-rich driver model that encompasses standard streaming peripherals, block storage devices and file systems, and even networking and network-based file systems.

The separation of these drivers from the user-space application provides robustness. Furthermore, the abstraction to a common driver interface makes it easy to stream data to a serial port, to a flash file system or to a network shared folder, all with little change to the underlying application code.
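The common interface can be illustrated with a short sketch in which one routine writes to whatever path it is given. The function names are hypothetical; the demonstration uses the null device and an ordinary temporary file as stand-ins for the serial port, flash file system and network share mentioned above.

```python
import os
import tempfile


def stream_to(path, blocks):
    """Write each block to an existing path with file-descriptor calls; return bytes written."""
    fd = os.open(path, os.O_WRONLY)
    written = 0
    try:
        for block in blocks:
            view = memoryview(block)
            while len(view) > 0:
                # os.write may accept fewer bytes than offered, so loop until done.
                n = os.write(fd, view)
                view = view[n:]
                written += n
    finally:
        os.close(fd)
    return written


def demo_common_interface():
    """Send the same data to a device node and to a regular file."""
    blocks = [b"abc", b"def"]
    to_device = stream_to(os.devnull, blocks)
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        to_file = stream_to(path, blocks)
        with open(path, "rb") as f:
            contents = f.read()
    finally:
        os.remove(path)
    return to_device, to_file, contents
```

Only the path changes between the two destinations; the application code that produces and writes the blocks is identical.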

This flexibility, however, comes at a price. The strict separation between applications and physical resources adds some degree of overhead. When a user space program accesses a device peripheral, a context switch must be made into kernel mode in order to process the request.

Typically this is not a significant limitation because the data is accessed in blocks as opposed to sample-by-sample, so the context switch into kernel mode needs to be made only once per block access.
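The effect of access granularity can be made concrete by counting system calls. In the sketch below, each os.read call is one entry into the kernel; the helper names, the 4096-byte file and the 1024-byte block size are arbitrary choices for illustration, not figures from any particular driver.

```python
import os
import tempfile


def count_read_calls(path, chunk_size):
    """Read a whole file with os.read; return (bytes_read, number_of_read_calls)."""
    fd = os.open(path, os.O_RDONLY)
    total = 0
    calls = 0
    try:
        while True:
            chunk = os.read(fd, chunk_size)
            calls += 1  # every os.read is one transition into kernel mode
            if not chunk:
                break  # an empty result signals end of file
            total += len(chunk)
    finally:
        os.close(fd)
    return total, calls


def demo_read_granularity(total_bytes=4096, block_size=1024):
    """Compare sample-by-sample access with block access on the same data."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(bytes(total_bytes))
        per_sample = count_read_calls(path, 1)
        per_block = count_read_calls(path, block_size)
    finally:
        os.remove(path)
    return per_sample, per_block
```

Both runs deliver the same bytes, but the byte-at-a-time run enters the kernel thousands of times while the block run needs only a handful of calls.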

There are cases, however, when application code requires a tight coupling with physical hardware. This situation occurs frequently when using high-performance processors such as DSPs where data throughput is a key element to processing without stalls. In these cases, the separation of physical resources in kernel space from the application in user space may be a significant detriment to the system.

Coupling of Application and Hardware
Let us examine a typical situation encountered when performing block video processing using the TMS320DM643x processor architecture, which incorporates a 600 MHz / 4800 MIPS DSP processing core and a wide range of multimedia peripherals, including a feature-rich video port subsystem. A typical application of this hardware would be the compression of an incoming video stream using H.264.

In order to take full advantage of the processing capability of the DSP core, processed data should be accessed from single-cycle internal memory as opposed to slower external memory. Although it would be technically possible to equip the processor with enough fast on-chip memory to store one or more full video frames, this approach would be cost-prohibitive for most target markets. Instead, the processor provides 80 Kbytes of single-cycle on-chip data memory.

While small relative to a full frame, 80 Kbytes has been determined by TI through simulation to give the optimal area/performance tradeoff for H.264 and other video processing algorithms.

To keep this memory fed with data, the DSP uses a Direct Memory Access (DMA) controller, which can also be utilized to efficiently transfer sub-blocks of data between external and internal memory without using cycles from the processing core (Figure 1 below).

Figure 1. DSP Processor utilizes DMA hardware to transfer small sub-blocks of a video frame in external memory into internal memory to be processed by the DSP core.
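The data flow in Figure 1 can be sketched in software. The model below is only an illustration: plain buffer copies stand in for DMA transfers, a small bytearray stands in for on-chip memory, and the function names, tile sizes and the trivial "invert" kernel are assumptions rather than anything taken from the DM643x tool chain.

```python
INTERNAL_BYTES = 80 * 1024  # on-chip data memory size quoted in the article


def invert(buf, count):
    """Example per-tile kernel: invert the first `count` 8-bit samples in place."""
    for i in range(count):
        buf[i] = 255 - buf[i]


def process_frame_in_tiles(frame, width, height, tile_w, tile_h, kernel):
    """Process a row-major 8-bit frame tile by tile; return the number of transfers."""
    if tile_w * tile_h > INTERNAL_BYTES:
        raise ValueError("tile does not fit in internal memory")
    internal = bytearray(tile_w * tile_h)  # stand-in for fast internal memory
    transfers = 0
    for ty in range(0, height, tile_h):
        rows = min(tile_h, height - ty)
        for tx in range(0, width, tile_w):
            cols = min(tile_w, width - tx)
            # "DMA in": gather the tile's rows from external memory.
            for r in range(rows):
                src = (ty + r) * width + tx
                internal[r * cols:(r + 1) * cols] = frame[src:src + cols]
            transfers += 1
            kernel(internal, rows * cols)
            # "DMA out": scatter the processed rows back to external memory.
            for r in range(rows):
                dst = (ty + r) * width + tx
                frame[dst:dst + cols] = internal[r * cols:(r + 1) * cols]
            transfers += 1
    return transfers
```

On real hardware the transfers run in parallel with the core, usually with two internal buffers so that one tile is processed while the next is fetched; this sequential sketch omits that overlap for clarity.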

From a whole-system perspective, this method can provide nearly the same performance as a chip with an entire video buffer but at a fraction of the cost. Achieving this performance, however, requires a very tight coupling between the application, the operating system and the underlying memory and DMA hardware.

First, the application must have a means of distinguishing between fast internal memory and bulk external memory. Second, the application must be able to execute many small, precisely-timed DMA operations. Since all latency incurred when accessing the DMA is magnified by hundreds or possibly thousands of DMA accesses per video frame, efficient performance of these DMA operations within the Linux driver model is difficult, if not impossible, to achieve.
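A rough calculation shows how quickly per-transfer latency accumulates. All of the numbers used with this sketch are hypothetical: a 720 x 480 frame, 16 x 16 tiles, one inbound and one outbound transfer per tile, 5 microseconds of added latency per transfer and 30 frames per second. None of them come from measurements of a Linux driver.

```python
def transfers_per_frame(width, height, tile_w, tile_h, passes=2):
    """Number of DMA transfers for one frame, with `passes` transfers per tile."""
    tiles_across = -(-width // tile_w)  # ceiling division
    tiles_down = -(-height // tile_h)
    return tiles_across * tiles_down * passes


def added_latency_fraction(n_transfers, extra_latency_s, frames_per_second):
    """Fraction of each second consumed by the added per-transfer latency."""
    return n_transfers * extra_latency_s * frames_per_second
```

Under these assumptions, transfers_per_frame(720, 480, 16, 16) gives 2700 transfers per frame, and 5 microseconds of extra latency on each one would consume roughly 40 percent of the time available at 30 frames per second. The point is the multiplication, not the specific figures.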

Practical implementations of this method have been demonstrated utilizing DSP/BIOS, providing native APIs to allow applications to request internal versus external memory. This also allows applications to access DMA registers directly with no context switching penalty.

The Best of Both Worlds
Although many multimedia applications spend the majority of their processor cycles on signal processing, there are many higher-level functions that a consumer-ready product must implement. User interface and display functions, networking and file manipulation are just a few.

Because these features are not time critical, the fine-level control of DSP/BIOS is not required. Here, the resource abstraction provided by the Linux driver model is preferred for the benefits of greater flexibility and reduced development time, not to mention the wealth of open-source application code available in the Linux community.

A solution in which the Linux and DSP/BIOS operating systems run concurrently on the same device involves the use of a virtualizer to provide the system developer or integrator with the advantages of both systems (Figure 2 below).

Figure 2. Linux and DSP/BIOS running concurrently on a DM643x DSP Device

The virtualizer acts as a fast and predictable switch to share DSP resources between Linux and DSP/BIOS operating systems. It guarantees the best possible performance for DSP/BIOS threads by making a speculative switch to the context of the DSP/BIOS operating system whenever an interrupt is received.

If the newly arrived interrupt corresponds to an event recognized within the DSP/BIOS context, it will be handled within the DSP/BIOS context, which is already loaded and ready to run.

While the virtualizer is DSP/BIOS enabled, the application is given direct access to needed system resources without affecting the user and kernel spaces maintained within the (suspended) Linux environment.

Once the application has completed its high performance signal processing calculations within the DSP/BIOS environment, the virtualizer forces a transition back to the Linux environment, which provides access to the higher-level features available there.
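The switching behavior described above can be modeled as a small state machine. This is a toy model, not the virtualizer's actual implementation: the class name, the interrupt names and the trace format are invented, and the handling of an interrupt that DSP/BIOS does not recognize (switching back and delivering it to Linux) is an assumption, since the article does not spell that path out.

```python
class ToyVirtualizer:
    """Toy model of the interrupt path between two guest operating systems."""

    def __init__(self, rtos_irqs):
        self.rtos_irqs = frozenset(rtos_irqs)  # interrupts owned by DSP/BIOS
        self.active = "linux"
        self.trace = []

    def _switch(self, target):
        if self.active != target:
            self.active = target
            self.trace.append("switch->" + target)

    def on_interrupt(self, irq):
        # Speculative switch: load the DSP/BIOS context before checking ownership.
        self._switch("dspbios")
        if irq in self.rtos_irqs:
            self.trace.append("dspbios handles " + irq)
            owner = "dspbios"
        else:
            owner = "linux"
        # Return to Linux once the real-time side has nothing left to do.
        self._switch("linux")
        if owner == "linux":
            self.trace.append("linux handles " + irq)
        return owner
```

In this model an interrupt owned by DSP/BIOS is serviced immediately after the first switch, while a Linux interrupt pays for two switches before it is delivered, which reflects the design choice of favoring the real-time side.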

The virtualizer-mediated sub-10 microsecond switch time between operating systems allows programmers to meet real-time performance requirements with little penalty compared to a native DSP/BIOS-only system. This solution incurs a penalty of only about 1.5 percent processing overhead for a typical multimedia device.
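The two figures can be related with simple arithmetic. The article does not state how the 1.5 percent figure was obtained, so the model below is an assumption: overhead is taken to be the number of OS switches per second multiplied by the time per switch, and the function names are hypothetical.

```python
def switch_overhead_fraction(switches_per_second, switch_time_s):
    """Fraction of processor time spent switching between operating systems."""
    return switches_per_second * switch_time_s


def switches_for_overhead(overhead_fraction, switch_time_s):
    """Switch rate that would produce a given overhead at a given switch time."""
    return overhead_fraction / switch_time_s
```

Under this model, a 10 microsecond switch and a 1.5 percent overhead would correspond to about 1500 switches per second; a shorter switch time allows proportionally more switches for the same overhead.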

Additional Advantages to the Dual-OS System
Perhaps the simplest advantage of extending a Linux-based product to include the DSP/BIOS operating system is the ability to use algorithms from the hundreds of associated third parties with no porting effort. Compliance with the xDAIS standard guarantees seamless integration of these third-party algorithms into a DSP/BIOS environment.

Another advantage of extending a Linux-based system to include DSP/BIOS is that applications executing in the DSP/BIOS environment are free from the constraints of the GNU General Public License (GPL) of the Linux kernel.

When implementing a Linux-based solution, it is not always obvious exactly what licensing requirements apply to unique, developer-produced software intellectual property (IP). By executing IP within the context of the DSP/BIOS OS instead of the Linux OS, it is possible to avoid this legal concern.

Conclusion
Using the technique described in this article, Linux and DSP/BIOS may be run concurrently on a single DSP core. This provides all the functionality of a Linux solution while providing the precision and hardware control available under DSP/BIOS.

Programmers may take advantage of application code written for Linux and signal processing code written for DSP/BIOS without the effort of having to port one into the other environment.

For a designer who requires the features of Linux in a real-time, embedded application, upgrading to include the DSP/BIOS toolset through the use of a virtualizer adds significantly improved signal-processing performance at a small cost in terms of system resources.

Dave Beal is director of product management for VirtualLogix, Inc., Steve Preissig is an instructor in Texas Instruments' Technical Training Organization, and Aurelien Jacquiot is Project Manager at VirtualLogix, France.