Managing intelligent I/O processing on DSP + GPP SoCs


This Product How-To article about TI’s OMAP-L138 C6-Integra DSP + ARM processor details the steps a developer needs to follow in building an application that must balance I/O processing tasks between a general purpose microcontroller and a digital signal processor.

DSP system developers face a difficult set of choices when meeting ever-increasing system performance requirements: up the clock speed, optimize the code, add more processors, or even all of the above. Any of these can work, but developers must partition code intelligently between the general purpose processor and digital signal processors (DSPs) to make the best use of each architecture.

For instance, certain input/output (I/O) tasks can be offloaded to the general purpose processor (GPP) to implement smart I/O processing with features such as predictive caching, buffer parsing, sequencing and more.

Adding a further obstacle, the developer may need to change the functionality of the system down the road and so must have the flexibility to change the role of each core to make sure hard and soft real-time requirements can still be met.

System developers must also decide if an operating system is needed, and if so, how to make sure the system-level I/O throughput comes closer to theoretical raw throughput maximums. I will examine the options a DSP system developer faces when partitioning I/O processing tasks and how to best implement the GPP in certain cases.

Why a DSP-centric model
Since their invention, programmable DSPs have been aimed at specific signal-processing tasks, most of the time executing a body of inter-related DSP algorithms.

Historical examples include the V-series modem, the Global System for Mobile Communications (GSM) baseband vocoder used in cellular phones, the audio processor in stereo receivers, the video encoder in security cameras and the more recent vision processor for automotive or software defined radio (SDR) phones.

Take the V.32 modem, for example. The main purpose of the product was to squeeze as much data as possible through the 3-kHz bandwidth of the Public Switched Telephone Network (PSTN).

It offered 9,600 bps, especially impressive at a time when 1,200 bps was the best data transmission rate most consumers had been able to get for years over single-pair, non-leased lines. The programmable DSP could execute the main algorithm, quadrature amplitude modulation (QAM), and a host of others, including equalization, error correction, echo cancellation and more.

Most of these algorithms are chained together, representing phases of a signal processing pipeline so that the DSP can process a block of data at a time within required time slots.

We differentiate the signal processing tasks in a DSP-based system from the rest of the tasks, commonly referred to as “control” tasks. A PCI-based modem card, for example, may need to respond to driver commands from the host side to abort the transmission, go to lower speed or begin a “power down” sequence.

While the abort or power down control tasks can be simple requests since they involve the termination of the pipeline, the dynamic switching of the data rate may not be as simple since this request may involve real-time switching of algorithms or coefficients, i.e. modifying the signal processing pipeline.

As embedded system designers look for ways to accelerate signal processing performance by adding “DSP” to the system, the DSP system designer can do the reverse, offloading control plane tasks to a GPP. This move will make a lot of sense if the purposes are to:

1) Retain the investment in existing DSP code, including familiarity of tools and instruction set architecture (ISA);
2) Incrementally add “control plane” features to accommodate expanding market requirements; and
3) Lastly but most importantly, not disturb the signal processing pipeline that is tuned to the data rates and the tested use-case, e.g. field trials or certified code.

Since GPPs are mainly designed for control plane processing, it is logical to partition event-driven tasks such as I/O control to this side. With a processor dedicated to this purpose, a number of possibilities open up:

– Add more complicated I/O with high-level stacks to the system;
– Create custom I/O handling module(s) to pre-process the data; or
– Enhance the intelligence of the I/O control process with strategies such as data caching schemes, adaptive rate control and buffer management.

The next big question is, “What are we going to do with the GPP side?” This brings the discussion to the pros and cons of using OSes for I/O management.

Control plane processing
Let’s start out with a simple, modern SoC consisting of a DSP core, a GPP core, a system direct memory access (DMA), several I/O peripherals such as serial ports, general purpose input/output (GPIOs), USB and Ethernet.

The purpose of the system is to apply a series of signal processing functions to a stream of data from a serial port and output the result to another serial port. The operation and characteristic of the processing can be affected locally via a front panel connected to the GPIO pins and remotely via Ethernet. The hypothetical system is illustrated in Figure 1 below.

Figure 1: Hypothetical DSP+GPP-based system
To manage just a few I/O buffering schemes and simple I/O tasks such as scanning buttons, an application with a simple control loop running on the GPP can satisfy the requirements.

But when it comes time to add more sophisticated control threads such as a TCP/IP networking stack to enable even a small HTTP server, our loop becomes hard to maintain.

A small scheduler is needed. Working on a round-robin basis, the scheduler can help modularize the partitioning of “who does what” in the system with a standardized execution begin-end convention for a given task.
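Such a cooperative, round-robin scheduler can be little more than a table of task function pointers, each following the run-to-completion begin-end convention. A sketch, with purely illustrative task names:

```c
/* Each task runs briefly and returns; this is the "begin-end" convention. */
typedef void (*task_fn)(void);

static int button_scans, net_polls;   /* counters stand in for real work */

static void scan_buttons(void) { button_scans++; }  /* poll front panel GPIOs */
static void poll_network(void) { net_polls++; }     /* service network events */

static task_fn tasks[] = { scan_buttons, poll_network };
#define NUM_TASKS (sizeof tasks / sizeof tasks[0])

/* Round-robin: give every task one turn per pass through the table. */
static void scheduler_run(int passes)
{
    for (int p = 0; p < passes; p++)
        for (unsigned t = 0; t < NUM_TASKS; t++)
            tasks[t]();
}
```

`scheduler_run(10)` gives every task ten turns. The weakness is visible in the structure: one task that blocks or loops stalls all the others.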

At this point, we have probably created what’s called a cooperative multitasking scheme, where tasks agree on when to yield to one another. However, as the number of tasks and modules grows, determining who runs when becomes a challenge: tasks no longer wish to cooperate, and there are many events to prioritize. The classic solution is to preempt them by giving each task a timeslice.

Based on a periodic timer tick, the scheduler can now give the threads in the system a controlled, fair share of execution time while maximizing central processing unit (CPU) utilization.

As with any growing pains, rules need to be created to determine who gets to use which resources when. Developing and maintaining this part of your application can become a full-time job as the code base and use cases grow.

Fortunately, there are a lot of choices when it comes to preemptive multitasking OSes suitable for embedded design. Some are even certified for mission-critical applications such as life support devices or for use in aircraft navigation equipment.

Perhaps the richest in terms of availability of application packages, high level stacks and a large pool of programmers is Linux.

Since its creation in 1991, Linux started on the desktop PC and quickly found its way to servers and embedded systems. Today, with the explosion of small form factor devices, a lot of attention is being given to embedded designs, increasing the availability of embedded Linux and application packages ported to various types of GPPs used in SoCs.

From an I/O control standpoint, which is our interest, Linux comes with a wide archive of I/O drivers, stacks and several I/O schedulers. Out of the almost 13.5 million lines of source code in Linux kernel 2.6.35, about half are driver-related code.

Device support is divided into several driver categories: character, block, network and bus. Complex I/Os such as USB, Wi-Fi and Ethernet have complete stacks to support the protocols. The 2.6 kernel so far contains four selectable I/O schedulers: no-op, anticipatory, deadline and complete-fair-queueing (CFQ).

They do “smart” things such as merging requests, anticipating reads, imposing deadlines and more. Targeted mainly at hard disk access, the schedulers have been optimized for use in servers and can be a good area to explore for use in embedded applications.

Getting real (time) about Linux
One much discussed topic regarding Linux, especially for embedded systems, is its ability to respond in real time. We are not just talking about responsiveness, i.e. the ability to quickly give CPU attention to the right higher priority processes and threads during run time.

This has been improved tremendously starting with kernel version 2.6 through using pre-emptive multitasking in the kernel and a better process scheduler. Instead, our focus will be upon how applications can meet a deadline and the mechanism Linux has provided so far to support it.

Deadlines can be classified as soft or hard, or simply, non-fatal or fatal, respectively. The definition of what’s fatal however can be subjective. For example, when a deadline for delivering an audio packet to the transmitter is missed, the effect can result in an audible click or pop.

If this happens on an MP3 player, the user experience could diminish, but that can be considered non-fatal. However, if the system is a professional public address system in a football stadium, the result could be blown speakers, and that could be fatal.

A deadline in an embedded system is usually defined by the ability of the system to respond to a certain event within a pre-determined window. If responding outside of that window still produces an acceptable result, we have a soft real-time system. Otherwise, it is a hard real-time system.

The key parameter in determining whether a system can meet the requirements of a hard real-time system is how predictably, or deterministically, it can meet a deadline generated from an event. This point is illustrated in Figure 2 below.

The event arrives via a generated hardware interrupt, and there’s a fixed deadline in time by which a certain task must complete. To service the event, an interrupt service routine (ISR) will be dispatched. But before that point, the processor needs to be interrupted.

Figure 2: Latency predictability has to include worst-case jitter.
In most systems, an interrupt dispatcher routine actually starts next to manage the nesting of interrupts. The total latency is determined by adding the hardware latency and the interrupt handler latency.

The hardware interrupt latency can vary depending on whether the global interrupt has been disabled temporarily by the processor due to a critical section in its pipeline like an atomic instruction. This variance produces the first jitter on the diagram.

Because of caches, internal bus arbitration and external memory contentions, the actual number of cycles through the handler routine is not at all times the same, but varies within a range, which is the second jitter. We need to really determine what these jitters are to know the worst-case latency.

To get a really accurate impression, it has to be measured on real hardware running real application scenarios over a period of time. For simple situations where the ISR is all that is needed to get the job done, if the total latency plus the ISR response time is less than the time to the deadline, the system can be defined as hard real time.

Sometimes the ISR is just part of the total latency equation. Its role may be just to initiate a bigger function embedded in a thread, task or process. Scheduling the start of this task is the job of the scheduler.

It looks at pending requests and their priorities and determines when the task can start. When that time comes, it performs a context switch and transfers the CPU execution to the thread or task. There are several places where latency can vary as well here.

If the OS is non-preemptive, the wait could take some time because the current kernel thread has to finish before the next task can begin. A preemptive kernel can shorten the wait because each task has only a finite time-slice to execute.

Also, important data structures could be protected with semaphores or mutexes. The OS and drivers manipulate these structures using critical sections to prevent accidental access.

During these sections, interrupts are usually disabled at some intervals for atomic operations. So if the OS is in use, the delay when the ISR posts an event to when a context switch occurs needs to be measured and included in the total response time latency.

To optimize Linux for real-time processing, we need to understand the scheduling mechanisms provided in the kernel, which is too involved for the scope of this article.

To reduce the latency in scheduling, the Linux kernel has to be built with the preemptible feature selected (CONFIG_PREEMPT). To further reduce latencies, the Linux Real-Time (RT) patch can be used. It adds the CONFIG_PREEMPT_RT option to the kernel build.

The Linux RT code does this by reducing the number of critical sections in the kernel, or replacing spinlocks with rtmutexes so that the operation is preemptible.

I/O processing on real hardware
Both the general embedded system designer and DSP system designer should look for an SoC that allows flexible programming models — a GPP-centric setup where it is the master, or a DSP-centric one where the GPP is its coprocessor. Figure 3 below shows a block diagram of the OMAP-L138 C6-Integra DSP + ARM processor.

Figure 3: OMAP-L138 C6-Integra DSP+ARM processor

All peripherals previously accessible by the DSP can also be controlled by the ARM. Externally, the SoC employs a unified memory map that is shared between the two processors within the 4-GB address space. Because there are two cores on the OMAP-L138 C6-Integra DSP + ARM processor, they need to communicate using an inter-processor communication (IPC) hardware mechanism.

An example IPC for the OMAP-L138 C6-Integra DSP + ARM processor is shown in Figure 4. At the SoC level, five CHIPINT bits are allocated in the CHIPSIG memory-mapped register (MMR), located in the SYSCFG system configuration module, for signaling between the DSP and ARM.

Up to four signals can be mapped for the DSP to notify the ARM and two for the ARM to notify the DSP, with an additional signal dedicated to the DSP non-maskable interrupt (NMI) event.

Note that two bits are common to both the DSP and ARM so that they can be both interrupted at the same time. This is a useful feature for debugging purposes. Writing one to the bit will generate the signal or event.
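In software, raising one of these events is a single store to the memory-mapped register. The helper below captures the idea; the actual register address and bit layout must come from the device's SYSCFG documentation, so the function takes the register pointer as a parameter rather than hard-coding an address, which also lets it be exercised against a plain variable:

```c
#include <stdint.h>

/* Raise inter-processor signal 'bit' by writing a one to the CHIPSIG
 * register pointed to by 'chipsig'. 'volatile' keeps the compiler from
 * optimizing the store away. Assumes write-one-to-raise semantics, as
 * described in the text. */
static void chipsig_raise(volatile uint32_t *chipsig, unsigned bit)
{
    *chipsig = 1u << bit;
}
```

On the target, `chipsig` would be the mapped SYSCFG CHIPSIG address; in a unit test it can simply point at a local variable.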

These events are fed to the respective interrupt controller (INTC) to get mapped to the core interrupt inputs. To pass data between cores, any of the 128-KB internal or 512-MB external memory areas can be used as shared memory.

Mutual exclusivity can be controlled using the mutex or semaphore mechanisms provided with the OS. The SoC provides a system-level memory protection unit (MPU) that can protect a memory region from being overwritten by internal bus masters like the ARM or DSP cores or the DMAs. This feature can be useful during development to debug the IPC software mechanism or detect ill-behaved programs or memory leaks.
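When there is exactly one writer and one reader, the shared-memory data path can even avoid mutexes: in a single-producer, single-consumer ring buffer, the head index is owned by the producer and the tail by the consumer, so neither needs a lock. A host-testable sketch (size and element type are arbitrary; on real dual-core hardware the index updates would also need memory barriers and possibly cache maintenance, omitted here):

```c
#include <stdint.h>

#define RING_LEN 8   /* must be a power of two for the index mask */

struct ring {
    volatile uint32_t head;     /* written only by producer (e.g. ARM) */
    volatile uint32_t tail;     /* written only by consumer (e.g. DSP) */
    uint32_t data[RING_LEN];
};

/* Returns 1 on success, 0 if the ring is full. */
static int ring_put(struct ring *r, uint32_t v)
{
    if (r->head - r->tail == RING_LEN)
        return 0;                        /* full */
    r->data[r->head & (RING_LEN - 1)] = v;
    r->head++;                           /* publish after the data write */
    return 1;
}

/* Returns 1 on success, 0 if the ring is empty. */
static int ring_get(struct ring *r, uint32_t *v)
{
    if (r->head == r->tail)
        return 0;                        /* empty */
    *v = r->data[r->tail & (RING_LEN - 1)];
    r->tail++;
    return 1;
}
```

The struct itself would be placed in the shared internal or external memory region so both cores see the same head, tail and data.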

Figure 4: DSP+ARM inter-processor communication (IPC) mechanism

Implementation strategies
Because all on-chip I/O peripherals are accessible from the ARM or DSP on integrated DSP + ARM processors, it is possible to mix and match which peripherals are controlled by which processor to meet the real-time needs of the system. For example, the architecture will support the following scenarios:

– Signal processing on DSP and all I/O processing on ARM; and
– Signal processing and selected hard real-time I/O processing on DSP and other soft real-time I/O processing and background tasks on ARM.

On the DSP side, the DSP/BIOS software kernel foundation can be used as the multitasking preemptive OS, or no OS at all, to maximize the DSP millions of instructions per second (MIPS) available for signal processing pipelines.

On the ARM side, you can also decide whether you want to use a real-time OS such as embedded Linux, Wind River’s VxWorks, Green Hills’ Integrity Secure Virtualization (ISV) or no OS at all.

Perhaps the main advantage of the no-OS approach to handling I/O is that a developer can truly customize the architecture to the data flow to maximize throughput. This means that in some cases, it’s possible to get close to the theoretical raw throughput that the device and system can handle.

Many vendors offer Board Support Libraries (BSLs), a great starting point for experimenting with the bare silicon to achieve optimum I/O transfers. If a lightweight OS is needed, many are available for the ARM9 core.

To implement a bare-bones application on the ARM side for I/O processing, you will need the BSL from the board vendor. It comes in source form and provides functions to set up and control the various on-chip I/Os available on the device.

The BSL either comes with a development board/evaluation module on a CD or can be downloaded from the manufacturer’s support web site. For example, the BSL for the OMAP-L138 C6-Integra DSP + ARM processor discussed above works with TI’s Code Composer Studio integrated development environment and handles the low-level code needed to talk to bare silicon.

The quick path to developing with embedded Linux on many DSP + ARM processors is a Linux software development kit (SDK), which typically contains the Linux kernel and drivers, vendor-specific software components and additional open-source packages.

Silicon vendors commonly include packages such as GStreamer, a multimedia middleware framework, and Qt, a graphical user interface (GUI) framework, to enable sophisticated applications.

There are several options for balancing processing tasks between an ARM core and a DSP to achieve optimal SoC performance. In some cases, the ARM performs best as the master core; in systems with fatal (hard real-time) deadlines or applications that require intensive real-time signal processing, the DSP often performs best as the master core.

Integrated SoCs offer expandability and flexibility to change the role of each processor as the application necessitates by enabling communication between the two processors and giving both processors access to the peripherals.

Developers can also achieve rapid prototyping and efficiently program each core with available development tools and software support, ranging from no-OS setups to embedded real-time OSes such as Linux.

Loc Truong is a technologist and a senior member of Texas Instruments’ C6000 digital signal processor (DSP) technical staff. He is currently leading an effort to identify system solutions for TI’s portfolio of high-performance single-core, multicore and open source C6000 DSPs. Truong has also authored and presented many papers related to embedded systems design, signal processing as well as embedded Linux and multicore programming. He is the holder of several US patents.
