The best way to move multimedia data


With embedded media processors assuming the role of both microcontroller and signal processor, engineers need to understand how various memory management options work on these processors. While cache may be your first choice, the more active approach of DMA may be your best bet.

Many embedded multimedia applications involve interaction between system control (typically a microcontroller's role) and signal processing—normally the role of a digital signal processor (DSP). A single embedded media processor can handle both types of tasks. Though it's tempting for programmers with only microcontroller experience to adopt a “hands-off” approach and simply use caches to manage the flow of code and data, it's best to carefully consider using the high-performance direct memory access (DMA) capabilities of a media processor instead.

When choosing between cache and DMA in a representative multimedia application, some tradeoffs are inherent, as shown in Figure 1. With this article, we aim to raise your awareness of these compromises and provide you with a guide you can use to design more optimized systems.

Figure 1: The trade-offs—DMA vs. cache

Memory architecture
Most media processors have hierarchical memory architectures that strive to balance several levels of memory with differing sizes and performance levels. Typically, the memory closest to the core processor (known as Level 1 or L1 memory) operates at the full clock rate and usually supports instruction execution in a single cycle. L1 memory is often partitioned into instruction and data segments to efficiently use memory-bus bandwidth. This memory is generally configurable as either SRAM or cache. Applications that require the most determinism can access on-chip SRAM in a single core clock cycle. For systems that require larger code sizes, additional on-chip and off-chip memory is available—with increased latency. Figure 2 shows a view of the flow of data through an embedded media processor.

Figure 2: Sample data flows on an embedded media processor

By itself, this memory architecture is only moderately useful; in today's high-speed processors, the core would stall constantly while running larger applications that fit only in slower external memory. To improve performance, programmers can manually move important code into and out of internal SRAM. Also, adding data and instruction caches to the architecture makes external memory much more manageable: the cache reduces the manual movement of instructions and data into the processor core, thereby greatly simplifying the programming model.

Let's examine the two segments of L1 memory—instruction and data—and what type of memory management may work best with them.

Instruction memory
A quick survey of the embedded media processor market reveals core processor speeds at 600MHz and beyond. While this performance can open the door to many new applications, the maximum speed is only realized when code runs from internal L1 memory. Of course, the ideal embedded processor would have an unlimited amount of L1 memory. This, however, is not practical. Therefore, programmers must consider several alternatives to take advantage of the L1 memory that exists in the processor, while optimizing memory and data flows for their particular system. Let's examine some of these scenarios.

The first and most straightforward situation is when the target application code fits entirely into L1 instruction memory. In this case, no special actions are required, other than mapping the application code directly to this memory space. Media processors must therefore excel in code density at the architectural level, which is why some offer a mix of 16-, 32-, and 64-bit opcodes, with the most frequently used instructions just 16 bits long.

In the second scenario, a caching mechanism is used to allow programmers access to larger, less expensive external memories. The cache serves as a way to automatically bring code into L1 instruction memory as needed. The key advantage of this process is that the programmer doesn't have to manage the movement of code into and out of the cache. This method is best when the code being executed is somewhat linear in nature. For nonlinear code, cache lines may be replaced too often to allow any real performance improvement.

The instruction cache really performs two roles. For one, it helps pre-fetch instructions from external memory more efficiently. That is, when a cache miss occurs, a cache-line fill will fetch the desired instruction, along with the other instructions contained within the cache line. This ensures that, by the time the first instruction in the line has been executed, the instructions that immediately follow have also been fetched. In addition, since caches usually operate with some type of “least recently used” algorithm, instructions that run most often tend to be retained in cache. This is a plus, since an instruction in L1 cache can execute in a single core cycle, just as if it were in L1 SRAM. That is, if the code has already been fetched once and hasn't yet been replaced, the code will be ready for execution the next time through the loop.

Most strict real-time programmers tend not to trust cache to obtain the best system performance. They argue that if a set of instructions is not in cache when needed for execution, performance will degrade. Taking advantage of cache-locking mechanisms, however, can offset this issue. Once the critical instructions are loaded into cache, the cache lines can be locked, and thus not replaced. This gives programmers the ability to keep what they need in cache and to let the caching mechanism manage less-critical instructions.

In a final scenario, code can be moved in and out of L1 memory using a DMA channel that's independent of the processor core. While the core is operating on one section of memory, the DMA is bringing in the section to be executed next. This scheme is commonly referred to as an overlay technique.

While overlaying code into L1 instruction memory via DMA provides more determinism than caching it, the tradeoff comes in the form of increased programmer involvement. In other words, the programmer needs to map out an overlay strategy and configure the DMA channels appropriately. Still, the performance payoff for a well-planned approach can be well worth the extra effort.
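To make the overlay idea concrete, here is a minimal C sketch, with memcpy standing in for the DMA transfer; the slot size, region images, and load_overlay() routine are all illustrative assumptions, not a vendor API.

```c
/* Sketch of a code-overlay manager.  memcpy stands in for a DMA
 * transfer; a real system would program a DMA channel here and poll
 * or take an interrupt on completion. */
#include <string.h>

#define OVERLAY_SLOT_SIZE 64

/* "External memory" images of two overlay regions (first byte tagged
 * so we can tell them apart). */
static const unsigned char overlay_a[OVERLAY_SLOT_SIZE] = { 0xA5 };
static const unsigned char overlay_b[OVERLAY_SLOT_SIZE] = { 0x5A };

/* Single L1 slot that the overlays share at run time. */
static unsigned char l1_slot[OVERLAY_SLOT_SIZE];
static const unsigned char *current = 0;

/* Load an overlay only if it is not already resident in the slot. */
static void load_overlay(const unsigned char *image)
{
    if (current != image) {
        memcpy(l1_slot, image, OVERLAY_SLOT_SIZE);  /* DMA stand-in */
        current = image;
    }
}
```

A well-planned strategy starts the next overlay's transfer while the current one is still executing, so the copy latency is hidden behind useful work.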

Data memory
The data memory architecture of an embedded media processor is just as important to the overall system performance as the instruction clock speed. Because multiple data transfers are often taking place at any one time in a multimedia application, the bus structure must support both core and DMA accesses to all areas of internal and external memory. It's critical that the arbitration of the DMA controller and the core is handled automatically, or performance will be greatly reduced. Core-to-DMA interaction should only be required to set up the DMA controller and to respond to interrupts when data is ready to be processed.

A processor usually performs data fetches as part of its basic functionality. While this is typically the least efficient mechanism for transferring data to or from off-chip memory, it provides the simplest programming model. A small, fast scratchpad memory is sometimes available as part of L1 data memory, but for larger, off-chip buffers, access time will suffer if the core must fetch everything from external memory. Not only will it take multiple cycles to fetch the data, but the core will also be busy doing the fetches.

It's important to consider how the core processor handles reads and writes. Efficient media processors possess a multi-slot write buffer that can allow the core to proceed with subsequent instructions before all posted writes have completed. For example, in the following code sample, if P0 points to an address in external memory and P1 points to an address in internal memory, Line 50 will be executed before R0 (from Line 46) is written to external memory:

45: R0 = R1+R2;
46: [P0] = R0;
47: R3 = 0x0;
48: R4 = 0x0;
49: R5 = 0x0;
50: [P1] = R0;

In multimedia applications and other data-intensive operations, where large data stores constantly move into and out of SDRAM, this can create a difficult situation. While core fetches are still necessary at times, large transfers should be done using DMA or cache to preserve performance.

DMA for managing data
To effectively use DMA in a multimedia system, the system must have enough DMA channels to fully support the processor's peripheral set, with more than one pair of memory DMA streams. Multiple streams are important because raw media streams are bound to come into external memory (via high-speed peripherals) at the same time data blocks are moving back and forth between external memory and L1 memory for core processing. What's more, DMA engines that allow direct data transfer between peripherals and external memory, rather than requiring a “stopover” in L1 memory, can save extra data passes in numerically intensive algorithms.

As data rates and performance demands increase, it becomes critical for designers to have “system performance tuning” controls at their disposal. For example, the DMA controller might be optimized to transfer a data word on every clock cycle. When multiple transfers are ongoing in the same direction (for example, all from internal memory to external memory), this is usually the most efficient way to operate the controller because it prevents idle time on the DMA bus.

But in cases involving multiple bidirectional video and audio streams, “burst control” becomes obligatory in order to prevent one stream from usurping the bus entirely. For instance, if the DMA controller always granted the DMA bus to any peripheral that was ready to transfer a data word, overall throughput would degrade when connected to an SDRAM device. In situations where data transfers switch direction on nearly every cycle, the latency associated with turn-around time on the SDRAM bus will lower throughput significantly. As a result, DMA controllers that have a channel-programmable burst size hold a clear advantage over those with a fixed transfer size. Because each DMA channel can connect a peripheral to either internal or external memory, it is also important to be able to automatically service a peripheral that may issue an urgent request for the bus.

Another feature, two-dimensional DMA capability, offers several system-level benefits. For one, it allows data to be placed into memory in a more intuitive processing order. For example, as Figure 3 shows, luma/chroma or RGB data may come in sequentially from an image sensor, but it can be automatically stored in separate memory buffers. The interleaving/deinterleaving functionality of 2D DMA saves additional memory bus transactions prior to processing video and image data. Two-dimensional DMA can also allow the system to minimize data bandwidth by selectively transferring, say, only the desired region of an input image, instead of the entire image.

Figure 3: 2D DMA separates data into buffers on-the-fly
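As a software model of the deinterleaving in Figure 3, the following C sketch walks an interleaved R,G,B stream into three separate buffers, mirroring what a 2D DMA descriptor does in hardware with an X count of 3 and a Y modify equal to the pixel stride (the function name and buffer layout are illustrative):

```c
/* Software model of a 2D-DMA deinterleave: the inner (X) dimension
 * steps through one pixel's three components, while the outer (Y)
 * dimension advances by the 3-byte pixel stride. */
static void deinterleave_rgb(const unsigned char *in,
                             unsigned char *r, unsigned char *g,
                             unsigned char *b, int pixels)
{
    for (int y = 0; y < pixels; y++) {
        r[y] = in[3 * y + 0];   /* red plane   */
        g[y] = in[3 * y + 1];   /* green plane */
        b[y] = in[3 * y + 2];   /* blue plane  */
    }
}
```

In hardware, the DMA controller performs this reordering as the data streams in, so the core never spends cycles (or extra memory-bus passes) shuffling components.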

Other important DMA features include the ability to prioritize DMA channels to meet current peripheral task requirements, as well as the capacity to configure the corresponding DMA interrupts to match these priority levels. These functions help ensure that data buffers do not overflow due to DMA activity on other peripherals, and they provide the programmer with extra degrees of freedom in optimizing the entire system based on the data traffic on each DMA channel.

Because internal memory is typically constructed in sub-banks, simultaneous access by the DMA controller and the core can be accomplished in a single cycle by placing data in separate sub-banks. The core can operate on data in one sub-bank while the DMA fills a new buffer in a second sub-bank. Simultaneous access to the same sub-bank is also possible under some conditions.
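The classic use of this arrangement is ping-pong (double) buffering: the core works on one sub-bank's buffer while DMA fills the other, then the two swap roles. The sketch below models the flow in C, with memcpy standing in for the DMA fill; the buffer names and block size are illustrative.

```c
/* Ping-pong buffering sketch.  In a real system buf[0] and buf[1]
 * would sit in different L1 sub-banks so core and DMA accesses can
 * proceed in the same cycle; memcpy stands in for the DMA fill. */
#include <string.h>

#define BLOCK 8

static int buf[2][BLOCK];

/* Sum an input stream block by block, "DMAing" the next block in
 * while the core works on the current one. */
static int process_stream(const int *src, int blocks)
{
    int sum = 0, fill = 0;
    memcpy(buf[fill], src, sizeof buf[0]);          /* prime first block */
    for (int b = 0; b < blocks; b++) {
        int work = fill;
        fill ^= 1;                                  /* swap ping/pong */
        if (b + 1 < blocks)                         /* DMA stand-in */
            memcpy(buf[fill], src + (b + 1) * BLOCK, sizeof buf[0]);
        for (int i = 0; i < BLOCK; i++)             /* core-side work */
            sum += buf[work][i];
    }
    return sum;
}
```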

Data cache
The flexibility of today's DMA controllers is a double-edged sword. When a large C/C++ application is ported between processors, the programmer is sometimes hesitant to integrate DMA functionality into already-working code. This is where data cache can be very useful. Typically, the data cache brings data into L1 memory for the fastest processing. The data cache is attractive because it acts like a mini-DMA engine, but with minimal involvement on the programmer's part.

Because of the nature of typical cache-line fills, data cache is most useful when the processor is operating on consecutive data locations in external memory. This is because the cache doesn't just store the immediate data currently being processed; instead, it prefetches data in a region contiguous to the current data. In other words, the cache mechanism assumes that there's a good chance that the current data word is part of a block of neighboring data about to be processed. For multimedia streams, this is a reasonable conjecture.

Since data buffers usually originate from external peripherals, operating with data cache is not always as easy as with instruction cache. This is because coherency must be managed manually in non-“snooping” caches. For these caches, the data buffer must be invalidated before making any attempt to access the new data. In the context of a C-based application, this type of data is “volatile.”

In the general case, when the value of a variable stored in cache differs from its value in source memory, the cache line may be "dirty" and still awaiting write-back to memory. This concept doesn't hold for volatile data. Here, the cache line may be "clean," yet the source memory has changed without the core processor's knowledge. Before the core can safely access a volatile variable through data cache, it must therefore invalidate (but not flush!) the affected cache line.

This can usually be done in one of two ways: the cache tag can be written directly, or a "cache invalidate" instruction can be executed against the target memory address. The direct method is more cumbersome, but it is usually the better option when the data buffer is larger than the cache itself, since its work is bounded by the number of cache lines rather than the size of the buffer. The invalidate instruction is preferable when the buffer is smaller than the cache. This holds even when a loop is required, since the invalidate instruction increments the address by the size of a cache line rather than the 1-, 2-, or 4-byte steps of normal addressing modes.
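A hedged sketch of the loop-based approach follows; invalidate_line() is a stand-in for the processor's actual cache-invalidate instruction (here it merely counts calls), and the 32-byte line size is an assumption. Note that the address advances by the line size, not by the element size.

```c
/* Invalidate every cache line covering a DMA-filled buffer before
 * the core reads it.  invalidate_line() is a placeholder for the
 * real invalidate instruction; CACHE_LINE is an assumed line size. */
#include <stdint.h>

#define CACHE_LINE 32

static int lines_invalidated;   /* instrumentation for the sketch */

static void invalidate_line(uintptr_t addr)
{
    (void)addr;                 /* real code: issue the invalidate */
    lines_invalidated++;
}

static void invalidate_buffer(const void *buf, unsigned bytes)
{
    /* Round down to a line boundary, then step one line at a time. */
    uintptr_t addr = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end  = (uintptr_t)buf + bytes;
    for (; addr < end; addr += CACHE_LINE)
        invalidate_line(addr);
}
```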

There is another important point to make about volatile variables, regardless of whether they're cached or not. If they're shared by both the core processor and the DMA controller, the programmer must implement some type of semaphore for safe operation. In sum, it's best to keep volatiles out of data cache altogether.
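A minimal sketch of such a handshake, assuming a single DMA-filled buffer and a volatile ownership flag (the names are hypothetical, and per the advice above a real system would keep the flag itself out of cache):

```c
/* Buffer-ownership flag shared between a DMA-complete interrupt
 * handler and the core.  The flag is volatile so the compiler never
 * caches it in a register across the handshake. */
volatile int buffer_ready;      /* 1: DMA done, core may read buffer */

/* Called from the DMA-complete interrupt handler. */
static void dma_done_isr(void)
{
    buffer_ready = 1;
}

/* Core side: consume the buffer only once ownership is signaled,
 * then hand ownership back to the DMA controller. */
static int try_consume(void)
{
    if (!buffer_ready)
        return 0;               /* buffer still owned by DMA */
    buffer_ready = 0;
    return 1;
}
```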

Choosing cache or DMA
Let's consider three widely used system configurations to determine which approach works best for certain system classifications.

Instruction cache, data DMA
This is perhaps the most popular system model, because media processors are often designed with this usage profile in mind. Caching the code alleviates complex instruction flow management, assuming the application can afford this luxury. This works well where the system has no hard real-time constraints, so that a cache miss would not wreak havoc upon the timing of tightly coupled events (for example, video refresh or audio/video synchronization).

Also, in cases where processor performance far outstrips processing demand, caching instructions is often a safe path to follow, since cache misses are then less likely to cause bottlenecks. Although it might seem unusual to consider that an “oversized” processor would ever be used in practice, consider the case of a portable media player that can decode and play both compressed video and audio. In its audio-only mode, its performance requirements will be only a fraction of its needs during video playback. Therefore, the instruction/data management mechanism could be different in each mode.

Managing data through DMA is the natural choice for most multimedia applications, because these usually involve manipulating large buffers of compressed and uncompressed video, graphics and audio. Except in cases where the data is quasi-static (for instance, a graphics icon constantly displayed on a screen), caching these buffers makes little sense, since the data changes rapidly and constantly. Furthermore, as discussed above, usually multiple data buffers are moving around the chip at one time—unprocessed blocks heading for conditioning, partly conditioned sections heading for temporary storage, and completely processed segments destined for external display or storage. DMA is the logical management tool for these buffers, since it enables the core to operate on them without having to worry about how to move them around.

Instruction cache, data DMA/cache
This approach is similar to the instruction cache/data DMA model, except in this case, part of L1 data memory is partitioned as cache and the rest is left as SRAM that can be used for DMA. This structure is useful for handling algorithms that involve a lot of static coefficients or lookup tables. For example, storing a sine/cosine table in data cache facilitates quick computation of FFTs. As another example, quantization tables could be cached to expedite JPEG encoding or decoding. Keep in mind that this approach involves an inherent tradeoff. While the application gains single-cycle access to commonly used constants and tables, it gives up the equivalent amount of data SRAM, thus limiting the buffer size available for single-cycle access to data. A useful way to evaluate this tradeoff is to try alternate scenarios (data DMA/cache versus only DMA) in a statistical profiler (offered in many development tools suites) to determine the percentage of time spent in code blocks under each circumstance.

Instruction DMA, data DMA
In this scenario, data and code dependencies are so tightly intertwined that the developer must manually schedule when instruction and data segments move through the chip. In such hard real-time systems, determinism is mandatory, and thus cache isn't ideal.

Although this approach requires more planning, the reward is a deterministic system in which code is always present before the data needed to execute it, and no data blocks are lost to buffer overruns. Because DMA processes can be linked together without core involvement, the start of one transfer confirms that the previous one has completed, so the associated data or code movement has definitely taken place. This is the most efficient way to synchronize data and instruction blocks.

The instruction/data DMA combination is also noteworthy for another reason. It provides a convenient way to test code and data flows in a system during emulation and debug, when direct access to cache is not typically available. The programmer can then make adjustments or highlight trouble spots in the system configuration.

An example of a system that might require DMA for both instructions and data is a video encoder/decoder. Certainly, video and its associated audio need to be deterministic for a satisfactory user experience. If the DMA signaled an interrupt to the core after each complete buffer transfer, this could introduce significant latency into the system, since the interrupt needs to compete in priority with other events. What's more, the context switch at the beginning and end of an interrupt service routine consumes several core processor cycles. All of these factors interfere with the primary objective of keeping the system deterministic.

Shades of gray
In short, there's no single answer as to whether cache or DMA should be the mechanism of choice for code and data movement in a given system. Use Figures 4 and 5 as guides in choosing cache or DMA for instructions and data.

Figure 4: Instruction cache vs. DMA decision flow

Figure 5: Data cache vs. DMA decision flow

David Katz is a senior applications engineer at Analog Devices working on Blackfin media processors and has served in Motorola's cable modem and automation groups as a senior design engineer. David holds a BS and MEng in electrical engineering from Cornell University.

Rick Gentile leads the Blackfin applications group at Analog Devices. Previously, he was a member of the technical staff at MIT's Lincoln Laboratory designing DSP systems used in radar sensors. He has a BS from the University of Massachusetts and an MS from Northeastern University, both in electrical and computer engineering.
