Today's embedded applications are rapidly evolving. Traditional DSP applications are adding networking and other control functionality. At the same time, the typical MCU control application will often include streaming media processing and other DSP functions.
An emerging solution for this new class of “hybrid” application is the convergent processor. This design approach combines both DSP and RISC/microcontroller capabilities into a single, unified architecture. The convergent processor can operate solely as a DSP engine, be totally dedicated to a control application, or run at any point in between. This allows designers of everything from portable devices to industrial control to automotive infotainment to take advantage of lower costs and a smaller footprint, since they can often replace two devices, a RISC controller and a DSP, with a single converged dual-mode processor, such as the Blackfin (Figure 1, below).
However, the worlds of DSP and microcontroller (MCU) development once operated independently, and the respective engineers approached their craft very differently. Early DSP engineers did all of their work, processing large amounts of data with an algorithm, in assembly language. MCU engineers, on the other hand, focused their attention on complex interactions between external events and the application code. Now that these two worlds are coming together, we need to understand the hardware and software issues that must be addressed as these disparate disciplines move to a convergent processor.
The DSP engineer looks at the convergent processor and sees a DSP because the underlying processing engine includes one or more multiply accumulate (MAC) computational units and loop control registers. In the past, these capabilities were put to full use by hand programming in assembly. Today, the convergent processor also provides the DSP programmer a model that more closely matches what is done in a prototyping system in C or some simulation language. The DSP engineer, even when C is used, is focused on understanding where cycles are spent, because if it makes sense, key algorithms can be implemented in assembly.
Additionally, the DSP engineer seeks out ways to use a direct memory access (DMA) controller to move data around independently of the core. There are choices to be made about which DMA channels should have priority, based on how “real-time” each data stream is and the amount of data being moved.
Similarly, the priorities of interrupt levels are also programmable and need to be tuned for the application. Interrupt service routines are regarded as the most critical code and are of course treated as such, garnering a place in the fastest memory. Finally, the DSP programmer will look for specialized instructions that accelerate specific processing such as Viterbi decoding or video pixel processing, ideally through the intrinsics that come with the compiler. In short, the common theme on the DSP side is performance.
The microcontroller engineer, on the other hand, is generally developing code at the application level, well abstracted from the complexities of the underlying MCU and its multitude of I/O capabilities. However, the engineering team must still balance a wide variety of competing application demands for the available processor resources, whether that is CPU time or I/O bandwidth.
They will want to verify that there is sufficient 'headroom' within the resources provided by the MCU, even in the worst case when all the most critical interrupts in the I/O stream occur during the processing of the highest-priority application tasks. This is not usually a question of executing within a hard deadline (except for the ISRs themselves, of course), but of whether the system as a whole can cope with the worst-case loading and provide overall response times within the constraints of the most demanding load the target application can generate.
Fine-grained management of system resources such as DMA, memory, and cache is bread and butter to a DSP-centric engineer seeking to ensure that his code block executes as efficiently as possible. An engineer working on an MCU-based application, by contrast, focuses more on the wider context of the system as a whole, balancing the real-time demands of multiple I/O events and application processing. If such detailed resource issues have to be addressed by the microcontroller engineer, it may be a warning that overall system headroom is reaching its critical limit.
Having looked at some of the hardware issues of convergent processing, we now turn to the software side of the equation. When they decide to employ a convergent processor, the DSP and microcontroller engineers next have to agree on a single operating system for it. And it is not an obvious choice. If they select a traditional multi-tasking RTOS, they potentially add an enormous overhead burden to the DSP portion of the application. If they use a simple scheduler, as might be used to support a pure DSP application, it is doubtful that it will be a good fit for the control application. Let's briefly examine these choices to see where the problems lie.
The Traditional RTOS
The traditional multitasking RTOS model requires that each task have its own stack. Each task has a priority that allows it to preempt, or be preempted by, a ready task of higher priority. Because it has an associated stack, a task also has the ability to block (i.e., wait) in order to synchronize with system events.
In the traditional control-system RTOS, the change of context from one task to another can involve a lot of operations. The current task's context has to be saved on its stack, and the new task's context has to be loaded into the physical registers of the CPU. That sounds simple enough, but with today's complex processors a large context, often consisting of several tens of bytes, is the rule rather than the exception. The larger the context, the longer it takes to switch contexts.
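To make that cost concrete, a context switch amounts to copying the CPU register file out to the outgoing task's saved state and back in from the incoming task's. The sketch below is illustrative C, not RTXC code; the register count and the `cpu_context` layout are assumptions for a hypothetical processor, and a real kernel would do this in hand-written assembly.

```c
#include <string.h>

/* Hypothetical register file: the more registers, the longer the copy. */
#define NUM_REGS 40

typedef struct {
    unsigned long regs[NUM_REGS];   /* general, loop, and status registers */
} cpu_context;

/* The "physical registers" of the CPU, modeled here as a global struct. */
cpu_context cpu;

/* Save the running task's registers into its task control block. */
void context_save(cpu_context *outgoing)
{
    memcpy(outgoing, &cpu, sizeof(cpu_context));
}

/* Load the next task's saved registers into the CPU. */
void context_restore(const cpu_context *incoming)
{
    memcpy(&cpu, incoming, sizeof(cpu_context));
}

/* Switch from one task to another: one full save plus one full restore.
 * The time taken scales directly with sizeof(cpu_context). */
void context_switch(cpu_context *from, const cpu_context *to)
{
    context_save(from);
    context_restore(to);
}
```

The point of the model is simply that every byte of context is copied twice per switch, which is why a large register set translates directly into switching overhead.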
This is tried and true operating system knowledge, having been around since the mid-60s. It's a great model for control processing, where tasks need to wait for a synchronizing event, or react to asynchronous events in a timely manner.
The DSP RTOS
DSP applications are typically real-time in nature and follow a block mode or data flow design. In them, a block of data is collected and then passed against an algorithm that loops until it consumes the data, often producing yet another block of data that gets sent to yet another algorithm for still more processing. Due to the real-time nature of the data, the DSP algorithm generally must start within a very tight time window once the input data has been collected. That timeliness means that there needs to be a predictable response time from the point at which the data is ready until the algorithm begins to consume it, and a predictable amount of time to consume the data and to output any data block for downstream algorithms.
To give structure to the processing of DSP algorithms, developers often design a custom executive that is thin on services but very fast and with a small footprint. The minimalist “DSP RTOS” is little more than an ad hoc loop that calls functions that perform the application's algorithms. When there is some attempt to organize the application around a rule-based architecture, it is usually a cooperative scheduler or a cyclic executive model in which all processes, including the operating system elements, share a single stack. Code processes in either of these two architectures generally have three common characteristics: (1) the process cannot block, i.e., wait; (2) once started, the process must run to completion before another process can be scheduled; and (3) the process has little or no context to save and restore.
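A minimal single-stack executive of this kind can be sketched in a few lines of C. Everything here, the table, the names, and the scheduling policy, is illustrative rather than any particular DSP kernel's API: each process is a plain function that runs to completion on the shared stack and never blocks.

```c
/* Run-to-completion processes: plain functions, no private stacks. */
typedef void (*process_fn)(void);

#define MAX_PROCESSES 8

static process_fn ready_table[MAX_PROCESSES];
int ready_count = 0;

/* Mark a process ready; it will run on the next pass of the loop. */
void schedule(process_fn p)
{
    if (ready_count < MAX_PROCESSES)
        ready_table[ready_count++] = p;
}

/* One pass of the executive: run every ready process to completion.
 * No context is saved or restored; all share the single system stack. */
void dispatch_once(void)
{
    int i, n = ready_count;
    ready_count = 0;                 /* consume the ready list */
    for (i = 0; i < n; i++)
        ready_table[i]();            /* cannot block; must return */
}

/* Example processes standing in for DSP algorithm stages. */
int decode_runs = 0, filter_runs = 0;
void decode_block(void)   { decode_runs++; }
void reformat_block(void) { filter_runs++; }
```

The dispatcher is nothing more than a loop of function calls, which is exactly why this model has essentially zero switching overhead, and also why a process that tried to wait in place would stall the whole system.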
The Problem: how to manage threads and tasks
As long as the application designer stays wholly in the DSP/data flow domain or wholly in the control domain, the two RTOS models just described work quite nicely. The problem arises when the developer wants to take advantage of the potential that convergent processor hardware offers, where the application is split between DSP/data flow and control code. Certainly, the traditional RTOS model can be, and long has been, used on some DSP processors. In these applications the developer must alter his DSP application to fit the policies, requirements, and performance of the control RTOS.
In order to meet the real-time requirements of the DSP application, the user of a task-based, traditional RTOS typically runs the DSP tasks at very high priorities, generally higher than that for a control task. Because DSP operations often involve processing data as the result of interrupts, there can be a good deal of time spent switching from control tasks to DSP tasks. That increased time is one form of loading on the system. Another form of loading is the amount of time the task spends making calls to the services of the operating system.
Similarly, the use of an RTOS designed for DSP applications is often problematic when applied to a control application. Because control tasks are usually event driven, they lie dormant until some event causes them to wake up, at which time they do their job and then go back to sleep until the next event wakes them up. In simple terms, they need to block until the event occurs and then resume processing at that point. Since typical DSP RTOS processing operations are not designed to block or wait, the developer must alter his application and restructure his control operations.
Quadros Systems has taken the approach that the operating system needs to be able to adapt to the needs of the application, not vice versa. This is a more natural approach to a workable system. The RTXC Quadros RTOS is a family of four real-time operating systems with a common code base that is designed to provide an optimized environment for any application, whether pure DSP or pure control, or anywhere in between. RTXC provides a traditional multi-stack RTOS for control applications; a single stack RTOS for DSP/Data flow applications; a version for multiprocessing systems; and finally a dual mode version addressing the specific requirements of the convergent processor.
The dual-mode RTOS (RTXC/dm) combines a traditional task-based kernel architecture for real-time control processing with a specialized executive for DSP and dataflow operations. The architecture accommodates the different needs of the individual domains, DSP and control, by separating them. Yet even though they are separated, they are united by a common Application Programming Interface (API). This unified RTOS solution enables both types of application code to run fully optimized on a single processor. Such a dual mode approach deals directly with two important issues often associated with the selection of a convergent processor.
First, the two domains are separated much as they have been with domain-specific processors of the past. DSP engineers can still do DSP programming and control engineers can still do control coding. But with RTXC/dm, now they can also use the same development tools and can communicate easily between the domains, using the object classes and related services of the operating system instead of hand crafted processor-to-processor links.
Second, the DSP processes are lightweight code entities called Threads in RTXC Quadros (Figure 2, below). Not to be confused with threads in Unix, Linux, or Windows, RTXC Threads run at a priority higher than all control tasks, ensuring they get access to the CPU in time to meet their real-time requirements. Their lightweight nature derives from the fact that they have no context, making the switch from thread to thread very fast. Furthermore, Threads run at a priority just below that of interrupt services, a position that tends to reduce Thread startup latency and minimize jitter.
|Figure 2: A schematic layout of RTXC/dm.|
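The scheduling relationship just described can be pictured as a two-level dispatcher: because Threads have no context, dispatching one is a plain function call, and all ready Threads drain before any control Task is considered. This is an illustrative model of the priority ordering, not RTXC's actual internals; all names below are hypothetical.

```c
typedef void (*thread_fn)(void);

#define QUEUE_LEN 8

static thread_fn thread_q[QUEUE_LEN];
static int thread_head = 0, thread_tail = 0;

int run_order[16];     /* records the order work was performed */
int run_count = 0;

/* Schedule a Thread: just enqueue a function pointer; no stack, no context. */
void thread_schedule(thread_fn t)
{
    thread_q[thread_tail++ % QUEUE_LEN] = t;
}

/* Dispatcher: ready Threads always run before any control Task gets the CPU. */
void dispatch(void (*highest_priority_task)(void))
{
    while (thread_head != thread_tail)          /* drain all ready Threads */
        thread_q[thread_head++ % QUEUE_LEN]();
    highest_priority_task();                    /* only then does a Task run */
}

/* Demonstration Threads and Task, recording their execution order. */
void decode_thread(void)   { run_order[run_count++] = 1; }
void reformat_thread(void) { run_order[run_count++] = 2; }
void control_task(void)    { run_order[run_count++] = 3; }
```

Even the highest-priority control Task only runs once the Thread queue is empty, which mirrors the "above all control tasks, just below interrupts" position Threads occupy.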
Comparing the RTOS Models
One very obvious question is: How much difference does all this actually make in a practical application?
In an effort to demonstrate the inherent capabilities of a convergent processor when coupled with a matching RTOS, two test applications were created. These tests perform a side-by-side comparison of a traditional event-driven, multitasking RTOS (RTXC/ms) against the RTXC dual-mode RTOS (RTXC/dm) when running a dataflow, computationally intensive application: audio decoding and playback.
In these tests we placed an increasingly heavy load on each of the two systems and compared the relative processor utilizations. We selected the Blackfin processor, using Analog Devices' BF533 EZ-KIT Lite evaluation package.
For the purpose of this test we generated processor load by manipulating the number of interrupts, kernel services, and context switches that the system must manage. Because of the way the DMA channel works on the Blackfin processor, it is possible to increase system loading by increasing the number of DMA interrupts the system must process. More DMA interrupts occur when the size of the blocks managed by the DMA as it feeds the audio codec, which runs at a fixed frequency of 48 kHz, is decreased. The tests use a fixed-size block of memory from which the DMA buffers are assigned, so the number of buffers carved from that fixed space (16 Kbytes) determines the size of each DMA block.
The audio codec on the BF533 Lite board supports two independent channels, A & B, and each of those channels has a Left and Right pair. Each audio sample requires four bytes for one side of a channel. Thus, it requires 16 bytes to feed the audio codec for one sample time. As shown in Figures 2 and 3, there is a process called Reformat Filter. It is the job of that process to take a single decoded audio sample and to format it into a sample consistent with the needs of the audio codec.
The difference between the two tests lies in the configuration of the RTXC RTOS used by the application and the organization of the application processes. The first test uses a traditional RTOS model in which the application processes are tasks, except for the interrupt service routines; it uses the RTXC MultiStack RTOS (RTXC/ms). The second test uses the convergent processing approach with the RTXC Dual Mode RTOS (RTXC/dm).
|Figure 3: Basic flow diagram of the test application when organized under a traditional control RTOS model using RTXC/ms.|
Starting at the bottom of the diagram, the user selects a song to play. The file name is passed to the music file decoder. The decoder fetches blocks of encoded music (Ogg Vorbis format) from the storage media and decodes them. Decoded blocks of sound samples are passed to a re-format filter task whose responsibility it is to re-form the decoded music data into blocks of data consistent with the needs of the DMA-driven audio codec. The reformat task makes use of RTXC Pipes to move data along these paths in an efficient manner. An RTXC semaphore ensures that the Decoder and the Reformat Filter tasks maintain the proper synchronization.
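The decoder-to-filter handoff is the classic bounded-buffer pattern: a fixed pool of message slots guarded by a counting semaphore. The sketch below is only a schematic of that pattern; RTXC's actual Pipe and Semaphore services have their own APIs, and the slot count, block size, and function names here are assumptions.

```c
#include <string.h>

#define PIPE_SLOTS  4
#define BLOCK_BYTES 64

/* A pipe: fixed-size message slots moved between producer and consumer. */
static unsigned char pipe_buf[PIPE_SLOTS][BLOCK_BYTES];
static int pipe_head = 0, pipe_tail = 0;
static int sem_count = 0;     /* counting semaphore: full slots available */

/* Producer side (Decoder): copy a decoded block in and signal. */
int pipe_put(const unsigned char *block)
{
    if (sem_count == PIPE_SLOTS)
        return 0;                               /* pipe full: producer waits */
    memcpy(pipe_buf[pipe_tail], block, BLOCK_BYTES);
    pipe_tail = (pipe_tail + 1) % PIPE_SLOTS;
    sem_count++;                                /* signal: one more block ready */
    return 1;
}

/* Consumer side (Reformat Filter): check the semaphore, then copy out. */
int pipe_get(unsigned char *block)
{
    if (sem_count == 0)
        return 0;                               /* nothing ready: would block */
    memcpy(block, pipe_buf[pipe_head], BLOCK_BYTES);
    pipe_head = (pipe_head + 1) % PIPE_SLOTS;
    sem_count--;
    return 1;
}
```

In the Task-based model the filter genuinely blocks on the semaphore when `pipe_get` would fail; in the Thread-based model the same pattern is handled by simply not scheduling the filter until a block is ready.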
The test progressively increases system loading by selecting the number of DMA buffers available to the application. Larger buffer sizes reduce the number of interrupts and context switches, allowing more continuous compute time. Larger numbers of buffers mean smaller sized blocks that result in more DMA interrupt processing and more task context switches, both sources of system loading.
|Figure 4: Test application using a dual mode RTOS (RTXC/dm), for a test of the Convergent Processing approach.|
In a test of the convergent processing approach using a dual-mode configuration (Figure 4, above), the same computational elements exist, but this application design shifts the data flow processing of the Decoder and the Reformat Filter to two Threads on the same priority level. Computationally, the Decoder functions in a similar manner but activates the Reformat Filter by scheduling it and then releasing CPU control, allowing the Reformat Filter to run.
That is a context switch, but because Threads have no context, the number of registers saved and restored is nil, thereby reducing the cycles necessary to process a buffer. Loading is further reduced by eliminating the ISR-to-task context switch in favor of switching from the ISR to a Thread. Because no context moves in and out of a Thread, the total number of cycles drops even more.
It takes the same number of cycles to decode and reformat a block of data whether the work is done by a Task or by a Thread, and with the same size buffers in both models, the number of context switches and interrupts is also the same. The improved efficiency comes instead from reducing the overhead of each interrupt service, context switch, and synchronization operation. It should also be noted that kernel services invoked at the Thread level run about three to five times faster than the same services called from a Task.
In separate runs, we exercised the two test applications with the same series of DMA buffer counts (4, 8, 16, 32, 64, 96, 128, 192, and 256). Once playback began, we monitored CPU utilization while the sound clips played. For test purposes, we limited the duration of the sound clips to about 30 seconds.
The results of those test runs are summarized in the graph in Figure 5 below, where the Y-axis is the average CPU utilization value produced by the program. The X-axis shows the number of DMA buffers chosen for each test. Note that the X-axis uses a logarithmic scale.
The performance difference between the two tests appears fairly early, as depicted in Figure 5. With the final choice of 256 DMA buffers, the system loading is heavily influenced by the number of context switches and the interrupt frequency that those 256 buffers impose on the system. That selection results in a buffer size of 64 bytes (16,384/256), which at 16 bytes per audio sample means each buffer carries four samples for the codec. With the codec running at 48 kHz, those four samples represent a time duration of 83.33 microseconds between DMA interrupts, or an interrupt frequency of 12 kHz.
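The arithmetic behind these loading figures is simple enough to capture in a few helper functions. The code below merely restates the numbers from the test setup: a 16,384-byte pool, 16 bytes per audio sample time, and a codec running at 48 kHz.

```c
/* DMA pool and codec parameters from the test setup. */
#define POOL_BYTES        16384
#define BYTES_PER_SAMPLE  16      /* 2 channels x L/R pair x 4 bytes each */
#define CODEC_HZ          48000

/* Bytes in each DMA block for a given buffer count. */
int buffer_bytes(int num_buffers)
{
    return POOL_BYTES / num_buffers;
}

/* Audio samples carried per DMA block. */
int samples_per_buffer(int num_buffers)
{
    return buffer_bytes(num_buffers) / BYTES_PER_SAMPLE;
}

/* DMA interrupt rate in Hz: one interrupt each time a buffer drains. */
int interrupt_hz(int num_buffers)
{
    return CODEC_HZ / samples_per_buffer(num_buffers);
}
```

Doubling the buffer count halves the buffer size and doubles the interrupt rate, which is exactly the loading knob the tests turn.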
By contrast, the Thread-based test consumed roughly 30% fewer CPU cycles (25% vs. 36% utilization), as shown in Figure 5. Throughout the tests, at all choices for the number of DMA buffers, there was no perceptible loss of sound quality, even at the final selection of 256 buffers. Clearly, CPU utilization is noticeably better for the Thread-based approach than for the Task-based approach as the number of DMA buffers increases system loading.
For a PDF version of this article, go to Embedded Systems Europe