Using Direct Memory Access effectively in media-based embedded applications: Part 4

In Part 3 of this series, we finished with a discussion of how best to choose between cache and DMA mechanisms for both instruction and data memory.
In this installment, we'll pick up where we left off, by discussing how advanced DMA features help move data effectively. We'll focus on how to optimize data transfers to and from memory, regardless of how you're handling instruction or data memory.
We'll then proceed to give an overview of some practical uses for DMA in multimedia applications.
Advanced DMA Controller
To effectively use DMA in a multimedia system, there must be enough DMA channels to support the processor's peripheral set fully, with more than one pair of Memory DMA (MemDMA) streams. This is an important point, because there are bound to be raw media streams incoming to external memory (via high-speed peripherals), while at the same time data blocks will be moving back and forth between external memory and L1 memory for core processing.
What's more, DMA engines that allow direct data transfer between peripherals and external memory, rather than requiring a "stopover" in L1 memory, can save extra data passes in numerically intensive algorithms.
One common oversight complicates debugging during development: peripherals and their corresponding DMA channels usually provide an optional error interrupt, and programmers often leave it disabled. This interrupt should always be enabled during development; doing so can save hours of debug time. Error interrupts typically indicate either that something has been programmed incorrectly (an easy fix) or that a peripheral has underflowed or overflowed (a more complicated situation).
Many times when programmers set up a framework with the data flow at the beginning of the project, these types of issues only show up later, when the processing component of the application is added.
Other important DMA features include the ability to prioritize DMA channels to meet current peripheral task requirements, as well as the capacity to configure the corresponding DMA interrupts to match these priority levels. These functions help ensure that data buffers do not overflow due to DMA activity on other peripherals, and they provide the programmer with extra degrees of freedom in optimizing the entire system based on the data traffic on each DMA channel.
System Performance Tuning
As data rates and performance demands increase, it becomes critical for designers to have "system performance tuning" controls at their disposal. For example, the DMA controller might be optimized to transfer a data word on every clock cycle. When there are multiple transfers ongoing in the same direction (e.g., all from internal memory to external memory), this is usually the most efficient way to operate the controller because it prevents idle time on the DMA bus.
But in cases involving multiple bidirectional video and audio streams, "direction control" becomes obligatory in order to prevent one stream from usurping the bus entirely. For instance, if the DMA controller always granted the DMA bus to any peripheral that was ready to transfer a data word, overall throughput would degrade when using external DRAM. In situations where data transfers switch direction on nearly every cycle, the latency associated with turn-around time on the external memory bus will lower throughput significantly.
As a result, DMA controllers with a channel-programmable burst size offer higher performance than those with a fixed transfer size. Because each DMA channel can connect a peripheral to either internal or external memory, it is also important to be able to automatically service a peripheral that issues an urgent request for the bus.
For multimedia applications, on-chip memory is almost always insufficient for storing entire video frames. Therefore, the system must usually rely on L3 DRAM to support relatively fast access to large buffers. The processor interface to off-chip memory constitutes a major factor in designing efficient media frameworks, because access patterns to external memory must be well thought out in order to guarantee optimal data throughput.
Using DMA and/or cache helps memory performance because both mechanisms move large blocks of data in the same direction. A DMA transfer typically moves a large data buffer from one location to another, while a cache-line fill moves a set of consecutive memory locations into or out of the device; in both cases, the block transfer proceeds in a single direction.
Aside from using DMA or cache, there are several high level steps that can ensure that data flows smoothly through memory in any system. Two key steps are (1) grouping like transfers and (2) using priority and arbitration schemes.
Grouping like transfers to minimize memory bus turnarounds
Accesses to external memory are most efficient when they are made in the same direction (e.g. consecutive reads or consecutive writes). For example, when accessing off-chip synchronous memory, 16 reads followed by 16 writes are always completed sooner than 16 individual read/write sequences. This is because a write followed by a read incurs latency.
Random accesses to external memory generate a high probability of bus turnarounds. This added latency can easily halve available bandwidth. Therefore, it is important to take advantage of the ability to control the number of transfers in a given direction. This can be done either automatically or by manually scheduling your data movements.
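To make the cost of turnarounds concrete, here is a minimal software model comparing grouped and interleaved accesses. The one-cycle-per-word and 8-cycle turnaround figures are illustrative assumptions, not datasheet numbers; real penalties depend on the memory device.

```python
# Minimal model of external-bus efficiency: each same-direction transfer
# costs 1 cycle, and every read<->write direction change adds a fixed
# turnaround penalty. All numbers here are illustrative.

def bus_cycles(transfers, turnaround=8):
    """transfers: sequence of 'R'/'W'; returns total cycles consumed."""
    cycles, prev = 0, None
    for t in transfers:
        if prev is not None and t != prev:
            cycles += turnaround           # bus turnaround latency
        cycles += 1                        # one data word per cycle
        prev = t
    return cycles

grouped     = ['R'] * 16 + ['W'] * 16      # 16 reads, then 16 writes
interleaved = ['R', 'W'] * 16              # alternating read/write

print(bus_cycles(grouped))      # 32 transfers + 1 turnaround  = 40 cycles
print(bus_cycles(interleaved))  # 32 transfers + 31 turnarounds = 280 cycles
```

Even in this toy model, grouping the transfers recovers most of the bandwidth that alternating accesses lose to turnarounds.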
A DMA channel garners access according to its priority, signified on Blackfin processors by its channel number. Higher priority channels are granted access to the DMA bus(es) first. Because of this, you should always assign higher priority DMA channels to peripherals with the highest data rates or with requirements for lowest latency.
To this end, Memory DMA streams are always lower in priority than peripheral DMA activity. This is because with Memory DMA, no external device will be held off or starved of data. Since a MemDMA channel requests access to the DMA bus for as long as the channel is active, any time slots left unused by peripheral DMA are applied efficiently to MemDMA transfers. By default, when more than one MemDMA stream is enabled and ready, only the highest-priority MemDMA stream is granted.
When it is desirable for the MemDMA streams to share the available DMA bus bandwidth, however, the DMA controller can be programmed to select each stream in turn for a fixed number of transfers.
This direction control feature is an important consideration in optimizing use of resources on each DMA bus. By grouping same-direction transfers together, it provides a way to manage how frequently the transfer direction on the DMA buses changes. This is a handy way to perform a first level of optimization without real-time processor intervention. More importantly, there's no need to manually schedule bursts into the DMA streams.
When direction control features are used, the DMA controller preferentially grants data transfers on the DMA or memory buses that are going in the same read/write direction as in the previous transfer, until either the direction control counter times out, or until traffic stops or changes direction on its own. When the direction counter reaches zero, the DMA controller changes its preference to the opposite flow direction.
In this case, reversing direction wastes no bus cycles other than any physical bus turnaround delay time. This type of traffic control represents a trade-off of increased latency for improved utilization (efficiency). Higher traffic timeout values might increase the length of time each request waits for its grant, but they can dramatically improve the maximum attainable bandwidth in congested systems, often to above 90%.
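The grant policy described above can be sketched in a few lines. This is an idealized behavioral model of direction control, not a register-accurate description of any particular controller; the timeout value and request names are made up.

```python
# Sketch of a direction-control counter: the controller keeps granting
# requests that match the previous transfer direction until the counter
# expires (or that direction runs dry), then flips its preference.
from collections import deque

def grant_order(requests, timeout=4):
    """requests: list of (id, 'R' or 'W'); returns the order of grants."""
    reads  = deque(r for r in requests if r[1] == 'R')
    writes = deque(r for r in requests if r[1] == 'W')
    order, direction, count = [], 'R', timeout
    while reads or writes:
        preferred = reads if direction == 'R' else writes
        if preferred and count > 0:
            order.append(preferred.popleft())   # same-direction grant
            count -= 1
        else:                                    # counter expired or no
            direction = 'W' if direction == 'R' else 'R'   # same-direction
            count = timeout                      # requests: flip preference
    return order

reqs = [('a', 'R'), ('b', 'W'), ('c', 'R'), ('d', 'W'), ('e', 'R')]
print([r[0] for r in grant_order(reqs)])   # reads batched before writes
```

With a large timeout, all pending reads are batched ahead of the writes; with a small one, the bus alternates more often, trading bandwidth for latency, exactly the trade-off described above.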
Here's a rule of thumb that puts these concepts into perspective: during moderate system activity, it is best to maximize same-direction contiguous transfers. For the most taxing system flows, however, it is best to select a timeout value in the middle of the range, to ensure that no one peripheral gets locked out of accesses to external memory. This is especially crucial when at least two high-bandwidth peripherals (like PPIs) are used in the system.
In addition to using direction control, transfers among MemDMA streams can be alternated in a "round-robin" fashion on the bus as the application requires. With this type of arbitration, the first DMA process is granted access to the DMA bus for some number of cycles, followed by the second DMA process, and then back to the first. The channels alternate in this pattern until all of the data is transferred. This capability is most useful on dual-core processors (for example, when both core processors have tasks that are awaiting a data stream transfer).
Without this round-robin feature, the first set of DMA transfers would occur, and the second DMA process would be held off until the first one completes. Round-robin prioritization can help ensure that both transfer streams complete back-to-back.
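The difference between strict priority and round-robin sharing can be modeled in a few lines; the slot size here is arbitrary, and the stream contents are placeholders.

```python
# Toy model of round-robin MemDMA arbitration: two streams alternate on the
# bus in fixed-size slots instead of one stream running to completion first.

def round_robin(stream_a, stream_b, slot=4):
    """Interleave two transfer lists in slots of `slot` words each."""
    out, i, j = [], 0, 0
    while i < len(stream_a) or j < len(stream_b):
        out += stream_a[i:i + slot]; i += slot   # stream A's slot
        out += stream_b[j:j + slot]; j += slot   # stream B's slot
    return out

a = ['A%d' % n for n in range(6)]
b = ['B%d' % n for n in range(6)]
print(round_robin(a, b, slot=4))
# With strict priority instead, all of stream A would finish before B starts.
```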
Of course, this type of scheduling can be performed manually by buffering data bound for L3 memory in on-chip memory. The processor core can access on-chip buffers for pre-processing functions with much lower latency than it can by going off-chip for the same accesses. This leads to a direct increase in system performance. Moreover, buffering this data in on-chip memory allows more efficient peripheral DMA access to this data.
For instance, transferring a video frame on-the-fly through a video port and into L3 memory creates a situation where other peripherals might be locked out from accessing the data they need, because the video transfer is a high-priority process. However, by transferring lines incrementally from the video port into L1 or L2 memory, a MemDMA stream can be initiated that will quietly transfer this data into L3 as a low-priority process, allowing system peripherals access to the needed data.
Using priority and arbitration schemes between system resources
Another important consideration is the priority and arbitration schemes that regulate how processor subsystems behave with respect to one another. For instance, on Blackfin processors, the core has priority over DMA accesses, by default, for transactions involving L3 memory that arrive at the same time. This means that if a core read from L3 occurs at the same time a DMA controller requests a read from L3, the core will win, and its read will be completed first.
Let's look at a scenario that can cause trouble in a real-time system. When the processor has priority over the DMA controller on accesses to a shared resource like L3 memory, it can lock out a DMA channel that also may be trying to access the memory. Consider the case where the processor executes a tight loop that involves fetching data from external memory.
DMA activity will be held off until the processor loop has completed. It's not only a loop with a read embedded inside that can cause trouble. Activities like cache line fills or nonlinear code execution from L3 memory can also cause problems because they can result in a series of uninterruptible accesses.
Another issue meriting consideration is the priority scheme between the processor core(s) and the DMA controller(s). Highly useful, when available, is a setting that makes DMA activity always appear "urgent," allowing DMA to win whenever its requests arrive concurrently with those of the processor core(s).
There is always a temptation to rely on core accesses (instead of DMA) at early stages in a project, for a couple of reasons. The first is that this mimics the way data is accessed on a typical prototype system. The second is that you don't always want to dig into the internal workings of DMA functionality and performance. However, with flexible core and DMA arbitration, using the memory DMA controller to move data in and out of internal memory gives you more control of your destiny early on in the project.

Six practical uses for DMA in multimedia systems
#1. Using the DMA Controller to Eliminate Unneeded Data. The DMA controller can be used to "filter" the amount of data that flows into a system from a camera. Consider the case where an active video stream is being brought into memory for some type of processing. When the data does not need to be sent back out for display purposes, there is no need to transfer the blanking data into the buffer in memory.
A processor's video port is often connected directly to a video decoder or a CMOS sensor and receives samples continuously. That is, the external device continues to clock in data and blanking information. The DMA controller can be set to transfer only the active video to memory. Using this type of functionality saves both memory space and bus bandwidth.
In the case of an NTSC video stream, this blanking represents over 20% of the total input video bandwidth. Saving memory is a minor benefit, because extra memory is usually available externally in the form of SDRAM at a small system cost delta. More important is the bandwidth saved in the overall processing period; the time ordinarily used to bring in the blanking data can be re-allocated to some other task in your system. For example, it can be used to send out compressed data or to bring in reference data from past frames.
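The 20% figure is easy to sanity-check from standard ITU-R BT.601/656 NTSC timing: 858 total samples per line across 525 total lines, versus 720 x 480 active pixels.

```python
# Back-of-the-envelope check of the ">20% blanking" claim for NTSC timing.
total_samples  = 858 * 525          # active + blanking, per frame
active_samples = 720 * 480          # active video only

blanking_fraction = 1 - active_samples / total_samples
print(f"blanking = {blanking_fraction:.1%}")   # roughly 23%
```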
#2. Double Buffering. We have previously discussed the need for double-buffering as a means of ensuring that current data is not overwritten by new data until you're ready for this to happen. Managing a video display buffer serves as a perfect example of this scheme. Normally, in systems involving different rates between source video and the final displayed content, it's necessary to have a smooth switchover between the old content and the new video frame.
This is accomplished using a double-buffer arrangement. One buffer points to the present video frame, which is sent to the display at a certain refresh rate. The second buffer fills with the newest output frame. When this latter buffer is full, a DMA interrupt signals that it's time to output the new frame to the display. At this point, the first buffer starts filling with processed video for display, while the second buffer outputs the current display frame. The two buffers keep switching back and forth in a "ping-pong" arrangement.
It should be noted that multiple buffers can be used, instead of just two, in order to provide more margin for synchronization, and to reduce the frequency of interrupts and their associated latencies.
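A minimal sketch of the ping-pong state machine might look like the following; the buffer sizes are placeholders, and the "frame complete" method stands in for the DMA interrupt handler.

```python
# Minimal ping-pong (double-buffer) state machine: one buffer is displayed
# while the other fills; a "frame complete" event swaps the two roles.

class PingPong:
    def __init__(self):
        self.buffers = [bytearray(16), bytearray(16)]  # toy frame buffers
        self.display = 0                               # index being displayed
        self.fill    = 1                               # index being filled

    def on_frame_complete(self):
        """Called from the DMA 'buffer full' interrupt: swap the roles."""
        self.display, self.fill = self.fill, self.display

pp = PingPong()
pp.on_frame_complete()
print(pp.display, pp.fill)   # roles swapped: 1 0
```

Extending the two-element list to N buffers gives the multi-buffer variant mentioned above, at the cost of slightly more bookkeeping.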
#3. Two-dimensional DMA Considerations. A feature we discussed in a previous installment, two-dimensional DMA (2D DMA) capability, offers several system-level benefits. Let's briefly revisit this feature to discuss how it plays a role in both audio and video applications.
When data is transferred across a digital link like I2S, it may contain several channels. These may all be multiplexed on one data line going into the same serial port, for instance. In such a case, 2D DMA can be used to de-interleave the data so that each channel is linearly arranged in memory. Take a look at Figure 1 below for a graphical depiction of this arrangement, where samples from the left and right channels are de-multiplexed into two separate blocks. This automatic data arrangement is extremely valuable for those systems that employ block processing.
|Figure 1: A 2D DMA engine used to de-interleave (a) I2S stereo data into (b) separate left and right buffers|
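In software terms, the de-interleaving that the 2D DMA engine performs during the transfer itself is equivalent to the following; this is a behavioral model only, since the real engine achieves the same layout with count/stride settings rather than a copy loop.

```python
# Software model of a 2D DMA de-interleave: interleaved L/R stereo samples
# are separated into contiguous left and right blocks, the way a 2D DMA
# engine with suitable count/stride settings arranges them in memory.

def deinterleave_stereo(interleaved):
    left  = interleaved[0::2]   # every even sample: left channel
    right = interleaved[1::2]   # every odd sample: right channel
    return left, right

samples = ['L0', 'R0', 'L1', 'R1', 'L2', 'R2']
left, right = deinterleave_stereo(samples)
print(left)    # ['L0', 'L1', 'L2']
print(right)   # ['R0', 'R1', 'R2']
```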
For video transfers, 2D DMA offers several system-level benefits. For starters, two-dimensional DMA can facilitate transfers of macroblocks to and from external memory, allowing data manipulation as part of the actual transfer. This eliminates the overhead typically associated with transferring non-contiguous data. It can also allow the system to minimize data bandwidth by selectively transferring, say, only the desired region of an input image, instead of the entire image.
As another example, 2D DMA allows data to be placed into memory in a sequence more natural to processing. For example, as shown in Figure 2 below, RGB data may enter a processor's L2 memory from a CCD sensor in interleaved RGB888 format, but using 2D DMA, it can be transferred to L3 memory in separate R, G and B planes. Interleaving/deinterleaving color space components for video and image data saves additional data movement prior to processing.
|Figure 2: Deinterleaving data with 2D DMA|
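The address arithmetic behind such a planar transfer can be modeled directly. The function below mimics Blackfin-style XCOUNT/XMODIFY/YCOUNT/YMODIFY parameters in software; the memory contents and geometry are toy values chosen for readability.

```python
# Address-generation model of a 2D DMA channel: XMODIFY strides between
# elements within a row, YMODIFY strides between rows (applied after each
# row's last element). Reading every 3rd byte of an interleaved RGB buffer
# gathers one color plane, as a stand-in for the hardware transfer.

def dma_2d_read(mem, start, xcount, xmodify, ycount, ymodify):
    out, addr = [], start
    for _ in range(ycount):
        for x in range(xcount):
            out.append(mem[addr])
            addr += xmodify if x < xcount - 1 else ymodify
    return out

# 2x2 image, interleaved RGB: R G B R G B per row
mem = ['R00', 'G00', 'B00', 'R01', 'G01', 'B01',
       'R10', 'G10', 'B10', 'R11', 'G11', 'B11']
red_plane = dma_2d_read(mem, start=0, xcount=2, xmodify=3, ycount=2, ymodify=3)
print(red_plane)   # ['R00', 'R01', 'R10', 'R11']
```

Starting at offset 1 or 2 instead gathers the G or B plane with the same stride settings, which is exactly the "one transfer per plane" arrangement shown in Figure 2.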
#4. Synchronizing audio and video streams. In a multimedia system, the streaming content usually consists of both an audio and a video component. Because of the data rates at which these streams run, DMA channels must be used to interface with the audio codec and the video encoder. It is important from the viewer's standpoint to ensure that the streams are synchronized, because this coordination is a major contribution to perceived quality.
There are multiple ways to maintain synchronization. The technique most often used involves setting up a set of DMA descriptor lists for each of the audio and video buffers and having the processor manage those lists.
Because a glitch in an audio stream is more noticeable than a drop in a frame of video, the audio buffer is usually set to be the "master" stream. That is, it is more important to keep the audio buffer rotating in a continuous pattern. By keeping the audio stream continuous, the processor can make any necessary adjustments on the video frame display.
From a DMA buffer standpoint, a set of descriptors is created for each of the audio and video buffers. Each descriptor list is set up to be controlled with a pair of pointers, one for filling the buffers and one for emptying the buffers.
Each descriptor list needs to be maintained to ensure that the read and write pointers do not get "crossed". That is, the processor shouldn't be updating a buffer that is in the process of being sent out. Likewise, the DMA controller should not empty a buffer that the processor is still filling.
For audio and video sync, an overall system time base is maintained. Each of the decoded buffers can be built in memory with a corresponding time tag. If the audio stream is the master stream, the write buffer is completely circular. If video frames have to be dropped, the DMA pointer that empties the buffer is re-programmed to match the time closest to the time stamp of the current audio buffer. Figure 3 below shows a general overview of this method for maintaining audio/video synchronization.
|Figure 3: Conceptual diagram of audio-video synchronization|
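The frame-skip logic at the heart of this scheme reduces to choosing the video frame whose time tag best matches the current audio time. Here is a minimal sketch, with made-up time tags roughly 33 ms apart (about 30 fps):

```python
# Audio-master sync: each decoded video frame carries a time tag; when
# playback falls behind, the "empty" pointer skips ahead to the frame
# whose tag is closest to the current audio time.

def next_video_frame(frames, audio_time):
    """frames: list of (time_tag_ms, frame_id); pick closest to audio_time."""
    return min(frames, key=lambda f: abs(f[0] - audio_time))

frames = [(0.0, 'f0'), (33.3, 'f1'), (66.7, 'f2'), (100.0, 'f3')]
print(next_video_frame(frames, audio_time=70.0))   # (66.7, 'f2')
```

If the audio clock has drifted past f1, the descriptor pointer is simply re-programmed to f2, dropping f1 rather than stalling the audio stream.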
#5. Achieving Power Savings Using the DMA Controller. On a processor that has been designed with power management capabilities, the DMA controller can provide a valuable tool to reduce the overall power consumption in the system. Let's take a look at how this can be accomplished.
While a processor isn't actively working on a buffer, it can be put into an idle state. While in this inactive state, clocks can be shut off, and sometimes voltage can be reduced -- both of which will reduce the processor's power consumption.
Consider an audio decoding system. The processor performs the decoding function on encoded content. The decoded buffer accumulates in memory, and as soon as the buffer hits a "high-water" mark, the processor can be put into a sleep mode. While in this mode, the DMA controller (which is decoupled from the processor core) sources data from the buffer to feed an audio codec. It has to run continuously to ensure good audio quality.
A "low-water" mark can be implemented by programming the DMA controller to generate an interrupt after some number of data samples has been transferred. The interrupt can be programmed to serve as a wake-up event, which in turn will bring the processor out of its sleep mode. The processor can refill the buffer so that when the DMA controller wraps back around to the beginning of the buffer, new data is available.
If this routine runs continuously, the net effect is that processor duty cycle (i.e., the time the processor is active) is greatly reduced.
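A rough, illustrative calculation shows why the duty cycle drops so sharply. All of the rates below are made-up round numbers for a hypothetical decoder, not measurements of any particular part.

```python
# Illustrative duty-cycle estimate for the watermark scheme: the core wakes
# at the low-water mark, refills up to the high-water mark, then sleeps
# while the DMA controller drains the buffer to the audio codec.

fill_rate_sps  = 480_000   # samples/s the core can decode while awake
drain_rate_sps = 48_000    # samples/s the DMA feeds the audio codec
refill_samples = 43_200    # high-water mark minus low-water mark

awake_time = refill_samples / fill_rate_sps              # seconds awake
sleep_time = refill_samples / drain_rate_sps - awake_time
duty_cycle = awake_time / (awake_time + sleep_time)
print(f"core duty cycle ~= {duty_cycle:.0%}")            # about 10%
```

With a core that decodes ten times faster than real time, the processor needs to be awake only about a tenth of the time; the rest is available for sleep-mode power savings.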
#6. Implementing a DMA Queue Manager. A complicated system design can involve a surprisingly large number of DMA channels running in parallel. When the DMA channels are programmed using descriptors, the number of descriptors can explode. In these situations, it is best to implement some form of DMA queue manager. An example of this type of manager is provided in a PDF document at Analog Devices.
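As a rough illustration of what such a manager might look like (a hypothetical design sketch, not the one described in the Analog Devices document), work is submitted as descriptors into a FIFO, and a "transfer done" callback pops and starts the next one. The descriptor fields and addresses below are toy values.

```python
# Minimal DMA queue manager sketch: software descriptors queue up behind a
# single channel; the completion interrupt kicks off the next transfer.
from collections import deque

class DmaQueueManager:
    def __init__(self, start_transfer):
        self.queue = deque()
        self.start_transfer = start_transfer   # callback into the HW driver
        self.busy = False

    def submit(self, descriptor):
        self.queue.append(descriptor)
        if not self.busy:                      # channel idle: kick it off
            self._start_next()

    def on_transfer_done(self):                # called from the DMA interrupt
        self.busy = False
        if self.queue:
            self._start_next()

    def _start_next(self):
        self.busy = True
        self.start_transfer(self.queue.popleft())

started = []
mgr = DmaQueueManager(start_transfer=started.append)
mgr.submit({'src': 0x1000, 'dst': 0x2000, 'len': 256})  # toy descriptors
mgr.submit({'src': 0x3000, 'dst': 0x4000, 'len': 128})
mgr.on_transfer_done()
print(len(started))   # both descriptors have been started, in order
```

One such queue per channel keeps the descriptor bookkeeping in a single place, instead of scattering hand-built descriptor chains throughout the application.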
In sum, a DMA controller is an integral part of any multimedia system, and it's crucial to appreciate its complexities in order to fully optimize an application. However, it's by no means the only player in the system; other system resources like memory and the processor core must arbitrate with the DMA controller, and achieving the right balance among them involves gaining a fundamental understanding of how data moves around the system.

To read Part 1 in this four part series, go to "The basics of direct memory access."
To read Part 2 in this four part series, go to "Classifying DMA transactions and the constructs associated with setting them up."
To read Part 3 in this four part series, go to "Important system designs regarding data movement."
This series of four articles is based on material from "Embedded Media Processing," by David Katz and Rick Gentile, published by Newnes/Elsevier.

Rick Gentile and David Katz are senior DSP applications engineers in the Blackfin Applications Group at Analog Devices, Inc.