Using Direct Memory Access effectively in media-based embedded applications: Part 3

In Part 2 of this series, we discussed the different register-based and descriptor-based DMA modes. In this installment, we'll cover some important system decisions regarding data movement choices in an application. But first, let's revisit DMA modes to cover just a couple more guidelines for choosing when to pick one over another.
For continuous, unidirectional data transfers of the same size, an autobuffer scheme makes the most sense. The DMA configuration registers are set up once and are automatically reloaded at the end of the transfer. If multi-dimensional addressing is available, multiple buffers can be set up, and individual interrupts can be set to trigger at the end of each buffer.
Transfers to an audio codec are perfect candidates for this type of transaction. The number of sub-buffers that you select should be consistent with the type of processing you need to perform. For continuous transfers, just make sure to keep the maximum processing interval for each buffer less than the time it takes to collect a buffer.
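As a rough illustration, the sketch below configures a two-dimensional autobuffer channel to fill two audio sub-buffers and interrupt at the end of each one. The register layout, base address, and bit names here are hypothetical placeholders, not any particular processor's map; consult your hardware reference manual for the real definitions.

```c
#include <stdint.h>

/* Hypothetical memory-mapped DMA channel layout (illustrative only). */
#define AUDIO_DMA_BASE  0xFFC00C00u
typedef struct {
    volatile uint32_t start_addr;   /* buffer start address               */
    volatile uint16_t config;       /* mode, word size, interrupt control */
    volatile uint16_t x_count;      /* inner loop: transfers per row      */
    volatile int16_t  x_modify;     /* inner-loop address stride          */
    volatile uint16_t y_count;      /* outer loop: number of sub-buffers  */
    volatile int16_t  y_modify;     /* outer-loop address stride          */
} dma_channel_t;

#define DMA_EN       0x0001u        /* enable channel                     */
#define DMA_2D       0x0040u        /* two-dimensional addressing         */
#define DMA_DI_EN    0x0080u        /* interrupt at the end of each row   */
#define DMA_AUTOBUF  0x0010u        /* autobuffer (circular) mode         */

#define SUB_BUFFERS  2u
#define FRAMES       256u
static int16_t audio_in[SUB_BUFFERS][FRAMES];

void start_audio_rx(void)
{
    dma_channel_t *ch = (dma_channel_t *)AUDIO_DMA_BASE;

    ch->start_addr = (uint32_t)(uintptr_t)&audio_in[0][0];
    ch->x_count    = FRAMES;            /* samples per sub-buffer         */
    ch->x_modify   = sizeof(int16_t);   /* contiguous 16-bit samples      */
    ch->y_count    = SUB_BUFFERS;       /* two sub-buffers                */
    ch->y_modify   = sizeof(int16_t);   /* keep marching linearly         */
    /* Autobuffer mode: after both sub-buffers fill, the channel wraps to
     * the start automatically, raising an interrupt per sub-buffer. */
    ch->config = DMA_EN | DMA_2D | DMA_DI_EN | DMA_AUTOBUF;
}
```

The registers are written once; from then on the hardware reloads itself at each wraparound, which is exactly what a continuous codec stream wants.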
If the transfers on a given channel will change in direction or in size, the descriptor mode is best to use. Consider a series of small transfers between internal and external memory. If the block size varies, or if you want to navigate through a buffer in a non-continuous fashion, descriptors can be set up for this purpose.
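A minimal sketch of such a chain is shown below, using a hypothetical descriptor layout (real controllers define their own field order and widths). Two transfers of different sizes are linked, with an interrupt requested only on the final one.

```c
#include <stdint.h>

/* Hypothetical descriptor layout for a descriptor-based DMA channel. */
typedef struct dma_desc {
    struct dma_desc *next;    /* next descriptor, or NULL to stop      */
    void            *src;     /* source address                        */
    void            *dst;     /* destination address                   */
    uint32_t         count;   /* bytes to move in this transfer        */
    uint32_t         flags;   /* e.g. interrupt-on-completion bit      */
} dma_desc_t;

#define DESC_IRQ  0x1u        /* raise an interrupt when done          */

static uint8_t l1_buf[1024];          /* internal (on-chip) memory     */
static uint8_t ext_frame[4096];       /* external (off-chip) memory    */
static dma_desc_t d0, d1;

/* Build a two-element chain: a 1 KB move into internal memory, then a
 * 256-byte move back out, interrupting only at the end of the chain. */
void build_chain(void)
{
    d0.next  = &d1;
    d0.src   = ext_frame;   d0.dst   = l1_buf;
    d0.count = 1024;        d0.flags = 0;

    d1.next  = NULL;
    d1.src   = l1_buf;      d1.dst   = ext_frame + 1024;
    d1.count = 256;         d1.flags = DESC_IRQ;
    /* Hand &d0 to the channel's next-descriptor pointer register
     * (hypothetical) to kick off the chain. */
}
```

Because each descriptor carries its own size and addresses, the blocks need not be uniform or contiguous, which is precisely what register-based modes cannot express.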
Soon, we'll look at some system data movement scenarios involving choices between caching and DMA, but in order to do so, we first need to take a look at the types of data movement that exist within an application.
Because it is the most straightforward to illustrate, let's start with data that is transferred into or out of a system via an on-chip peripheral. Many peripherals offer a choice between using core accesses and using a DMA channel to move data.
In general, given the option, you should use the DMA channel. DMA is the better choice because data usually arrives either too slowly or too quickly for the core to handle it efficiently in real time. Let's consider a few examples:
When we use a slow serial device such as an SPI port or a UART, data is transferred at a rate much lower than the rate at which the processor core runs. Core accesses to these types of peripherals typically involve polling some bit in a memory-mapped register.
Even if the peripheral operating speed is low compared with the processor clock (which means accesses occur less frequently), polling is wasteful. In some cases, the peripheral can signal an interrupt to indicate that data is ready for a core transfer. However, the cost of servicing an interrupt, including the time for context switches, is then incurred after every incremental data transfer.
On the other hand, using the DMA controller to perform the transfer allows fine control over the number of transfers that occur before an interrupt is raised. Moreover, this interrupt can occur at the end of each "block" of data, not just after each byte or word.
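The difference is dramatic: a 48 kHz stereo audio stream delivers 96,000 samples per second, so interrupting on every sample means 96,000 interrupts per second, whereas interrupting at the end of each 256-sample block cuts that to just 375 interrupts per second.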
On the other end of the throughput spectrum, a high-speed parallel peripheral (running at, say, 10-100 MHz) may not have the option of core transfers at all -- for two reasons. First, configured this way, the processor would have to access the peripheral constantly. Second, the processing associated with high-speed peripherals is almost always done on blocks of data.
Whether working through an FFT in a signal processing application or a two-dimensional convolution in an image processing system, the processor can begin its work once the last data sample for that buffer has arrived. Here, the interrupt that signals the end of a block transfer can be spread over hundreds or thousands of transfers.
Regardless of the peripheral transfer type, the DMA channel should be set up with multiple buffers so that the processor can access the current buffer while the next is being filled. If the system is more complicated, it may involve multiple simultaneous block transfers. For example, in addition to accessing the current block and collecting the next block, it may be necessary to send out the last processed block for future use. Similarly, blocks of reference data may be required to process the current frame. This is true for a variety of applications, including most types of video compression.
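One simple way to manage these simultaneous roles is to rotate a set of block pointers each time the DMA signals that a block has completed, as in the sketch below. The buffer names and ISR hook are illustrative, not tied to any particular framework.

```c
#define BLOCK_WORDS 1024
static short bufs[3][BLOCK_WORDS];

/* Three roles rotate every time the "block done" interrupt fires. */
static short *filling = bufs[0];   /* DMA writes incoming data here     */
static short *working = bufs[1];   /* core processes this block         */
static short *sending = bufs[2];   /* DMA streams this block back out   */

void on_block_done(void)           /* called from the DMA ISR           */
{
    short *t = sending;
    sending = working;             /* processed block heads out         */
    working = filling;             /* freshly filled block to the core  */
    filling = t;                   /* recycled block receives new data  */
    /* Re-point the RX and TX DMA channels at 'filling' and 'sending'
     * here before restarting the transfers. */
}
```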
Example: Double-buffered audio
There are a number of ways to get audio data into the processor's core. For example, a foreground program can poll a serial port for new data, but this type of transfer is uncommon in embedded media processors, because it makes inefficient use of the core.
Instead, a processor connected to an audio codec usually uses a DMA engine to transfer the data from the codec link (like a serial port) to some memory space available to the processor. This transfer of data occurs in the background without the core's intervention. The only overhead is in setting up the DMA sequence and handling the interrupts once the data buffer has been received or transmitted.
In a block-based processing system that uses DMA to transfer data to and from the processor core, a "double buffer" must exist to arbitrate between the DMA transfers and the core. This is done so that the processor core and the core-independent DMA engine do not access the same data at the same time, causing a data coherency problem.
To facilitate the processing of a buffer of length N, simply create a buffer of length 2×N. For a bi-directional system, two buffers of length 2×N must be created.
As shown in Figure 1a, below, the core processes the in1 buffer and stores the result in the out1 buffer, while the DMA engine fills in0 and transmits the data from out0.

Figure 1b, below, shows that once the DMA engine is done with the left half of the double buffers, it starts transferring data into in1 and out of out1, while the core processes data from in0 and stores results in out0. This configuration is sometimes called "ping-pong buffering," because the core alternates between processing the left and right halves of the double buffers.
Note that, in real-time systems, the serial port DMA (or another peripheral's DMA tied to the audio sampling rate) dictates the timing budget. For this reason, the block processing algorithm must be optimized in such a way that its execution time is less than or equal to the time it takes the DMA to transfer data to/from one half of a double buffer.
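For instance, with half-buffers of 256 frames sampled at 48 kHz, each half takes 256/48,000 ≈ 5.3 ms to fill, so the block algorithm has at most 5.3 ms to finish with the previous half.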
|Figure 1: Double-buffering scheme for stream processing|
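The skeleton below sketches this ping-pong flow in C. The wait_for_dma_block() and process_block() routines are assumed placeholders for your ISR synchronization and algorithm; the point is that the core only ever touches the half the DMA has just released.

```c
#include <stdint.h>

#define N 256                      /* frames per half-buffer            */
static int16_t in[2 * N];          /* double receive buffer             */
static int16_t out[2 * N];         /* double transmit buffer            */

/* Assumed to exist elsewhere: process_block() is the application's
 * algorithm; wait_for_dma_block() sleeps until the DMA raises its
 * half-complete interrupt and returns which half (0 or 1) just
 * finished. */
extern void process_block(const int16_t *src, int16_t *dst, int n);
extern int  wait_for_dma_block(void);

void stream_loop(void)
{
    for (;;) {
        /* When the DMA releases one half, the core may safely use it
         * while the DMA works on the other half: no coherency hazard. */
        int half = wait_for_dma_block();
        process_block(&in[half * N], &out[half * N], N);
    }
}
```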
System Guidelines for Choosing between DMA and Cache
The more complicated the data flow in an application, the more time you should spend at the beginning of the project laying out a framework. A fundamental challenge in architecting a system is deciding which data buffers should be moved using a DMA channel, which should be accessed via cache, and which should be accessed directly with processor core reads and writes. Let's consider three widely used system configurations to shed some light on which approach works best for different system classifications: (1) instruction cache, data DMA; (2) instruction cache, data DMA/cache; and (3) instruction DMA, data DMA.
Instruction Cache, Data DMA. This is perhaps the most popular system model, because media processors are often architected with this usage profile in mind. Caching the code alleviates complex instruction flow management, assuming the application can afford this luxury. This works well when the system has no hard real-time constraints, so that a cache miss would not wreak havoc on the timing of tightly coupled events (for example, video refresh or audio/video synchronization).
Also, in cases where processor performance far outstrips processing demand, caching instructions is often a safe path to follow, since cache misses are then less likely to cause bottlenecks. Although it might seem unusual to consider that an "oversized" processor would ever be used in practice, consider the case of a portable media player that can decode and play both compressed video and audio. In its audio-only mode, its performance requirements will be only a fraction of its needs during video playback. Therefore, the instruction/data management mechanism could be different in each mode.
Managing data through DMA is the natural choice for most multimedia applications, because these usually involve manipulating large buffers of compressed and uncompressed video, graphics and audio. Except in cases where the data is quasi-static (for instance, a graphics icon constantly displayed on a screen), caching these buffers makes little sense, since the data changes rapidly and constantly.
Furthermore, as discussed above, there are usually multiple data buffers moving around the chip at one time -- unprocessed blocks headed for conditioning, partly conditioned sections headed for temporary storage, and completely processed segments destined for external display or storage. DMA is the logical management tool for these buffers, since it allows the core to operate on them without having to worry about how to move them around.
Instruction Cache, Data DMA/Cache. This approach is similar to the one we just described, except in this case, part of L1 Data Memory is partitioned as cache, and the rest is left as SRAM for DMA access. This structure is very useful for handling algorithms that involve a lot of static coefficients or lookup tables. For example, storing a sine/cosine table in data cache facilitates quick computation of FFTs. Or, quantization tables could be cached to expedite JPEG encoding or decoding.
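A twiddle-factor table like the hypothetical one below shows why: it is written once and then re-read on every FFT, which is exactly the access pattern data cache rewards, while the streaming sample buffers, which change every block, stay in DMA-fed SRAM. How the table lands in the cached region versus SRAM is a toolchain matter (linker directives), not shown here.

```c
#include <math.h>

#define FFT_N   512
#define TWO_PI  6.283185307179586f

/* Read-only twiddle factors: computed once, re-read by every FFT pass.
 * Static, heavily re-read data is a good candidate for the cached
 * portion of L1 data memory. */
static float twiddle_re[FFT_N / 2];
static float twiddle_im[FFT_N / 2];

void init_twiddles(void)
{
    for (int k = 0; k < FFT_N / 2; k++) {
        twiddle_re[k] =  cosf(TWO_PI * (float)k / (float)FFT_N);
        twiddle_im[k] = -sinf(TWO_PI * (float)k / (float)FFT_N);
    }
}
```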
Keep in mind that this approach involves an inherent tradeoff. While the application gains single-cycle access to commonly used constants and tables, it relinquishes the equivalent amount of L1 Data SRAM, thus limiting the buffer size available for single-cycle access to data. A useful way to evaluate this tradeoff is to try alternate scenarios (Data DMA/Cache vs. only DMA) in a Statistical Profiler (offered in many development tool suites) to determine the percentage of time spent in code blocks under each circumstance.
Instruction DMA, Data DMA. In this scenario, data and code dependencies are so tightly intertwined that the developer must manually schedule when instruction and data segments move through the chip. In such hard real-time systems, determinism is mandatory, and thus cache isn't ideal.
Although this approach requires more planning, the reward is a deterministic system where code is always present before the data needed to execute it, and no data blocks are lost via buffer overruns. Because DMA processes can link together without core involvement, the start of a new process guarantees that the last one has finished, so that the data or code movement is verified to have happened. This is the most efficient way to synchronize data and instruction blocks.
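Building on the hypothetical dma_desc_t layout from the descriptor sketch earlier in this article, the fragment below illustrates the idea: linking the code-overlay descriptor ahead of its data descriptor enforces the ordering in hardware. dma_start() and the descriptor names are placeholders, not a real driver API.

```c
extern dma_desc_t code_overlay_desc;   /* brings in the code segment   */
extern dma_desc_t input_data_desc;     /* brings in the data it needs  */
extern void dma_start(dma_desc_t *chain);  /* writes the channel's
                                              next-descriptor register */

void schedule_stage(void)
{
    /* Link code ahead of data: the data transfer cannot begin until
     * the code transfer completes, so the code is guaranteed resident
     * before the data that drives it arrives. */
    code_overlay_desc.next = &input_data_desc;
    input_data_desc.next   = NULL;     /* end of this stage            */
    dma_start(&code_overlay_desc);
}
```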
The Instruction/Data DMA combination is also noteworthy for another reason. It provides a convenient way to test code and data flows in a system during emulation and debug, when direct access to cache is not typically available. The programmer can then make adjustments or highlight "trouble spots" in the system configuration.
|Figure 2: Checklist for choosing between instruction cache and DMA|
An example of a system that might require DMA for both instructions and data is a video encoder/decoder. Certainly, video and its associated audio need to be deterministic for a satisfactory user experience. If the DMA signaled an interrupt to the core after each complete buffer transfer, this could introduce significant latency into the system, since the interrupt would need to compete in priority with other events in the system.
What's more, the context switch at the beginning and end of an interrupt service routine would consume several core processor cycles. All of these factors interfere with the primary objective of keeping the system deterministic.
Figure 2 above and Figure 3 below provide guidance in choosing between cache and DMA for instructions and data, as well as navigating the trade-off between using cache and using SRAM, based on the guidelines we discussed earlier.
|Figure 3: Checklist for choosing between data cache and DMA|
In our fourth and final installment of this DMA series, we will discuss some advanced DMA topics, including priority and arbitration schemes, maximizing the efficiency of memory transfers, and other complex scenarios.

To read Part 2 in this four-part series, go to "Classifying DMA transactions and the constructs associated with setting them up."
This series of four articles is based on material from "Embedded Media Processing," by David Katz and Rick Gentile, published by Newnes/Elsevier.

Rick Gentile and David Katz are senior DSP applications engineers in the Blackfin Applications Group at Analog Devices, Inc.