Using Direct Memory Access effectively in media-based embedded applications: Part 4

In Part 3 of this series, we finished with a discussion on how best to make decisions on choosing between cache and DMA mechanisms for both instruction and data memory.

In this installment, we'll pick up where we left off, by discussing how advanced DMA features help move data effectively. We'll focus on how to optimize data transfers to and from memory, regardless of how you're handling instruction or data memory.

We'll then proceed to give an overview of some practical uses for DMA in multimedia applications.

Advanced DMA Controller Features
To effectively use DMA in a multimedia system, there must be enough DMA channels to support the processor's peripheral set fully, with more than one pair of Memory DMA (MemDMA) streams. This is an important point, because there are bound to be raw media streams incoming to external memory (via high-speed peripherals), while at the same time data blocks will be moving back and forth between external memory and L1 memory for core processing.

What's more, DMA engines that allow direct data transfer between peripherals and external memory, rather than requiring a “stopover” in L1 memory, can save extra data passes in numerically intensive algorithms.

A common mistake programmers make complicates debugging during development: peripherals and their corresponding DMA channels usually provide an optional error interrupt, yet it often goes unused. This interrupt should always be enabled during development; doing so can save hours of debug time. Error interrupts typically indicate either that something has been programmed incorrectly (an easy fix) or that a peripheral has underflowed or overflowed (a more complicated situation).
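
As a rough illustration, here's the shape of a development-time error handler. The register addresses and bit positions are hypothetical placeholders, not any specific part's memory map:

```c
#include <stdint.h>

/* Hypothetical DMA-channel MMRs; substitute your part's actual map. */
#define DMA0_CONFIG   (*(volatile uint16_t *)0xFFC00C08)
#define DMA0_STATUS   (*(volatile uint16_t *)0xFFC00C28)
#define DMA_ERR_EN    (1u << 1)   /* enable error interrupt (assumed bit) */

void dma0_enable_error_irq(void)
{
    DMA0_CONFIG |= DMA_ERR_EN;    /* cheap insurance during bring-up */
}

void dma0_error_isr(void)
{
    uint16_t status = DMA0_STATUS;
    /* Two usual suspects: a misprogrammed channel (bad address, count,
       or modify value), or a peripheral underflow/overflow. Parking
       here during development pinpoints the failure early. */
    (void)status;
    for (;;) ;                    /* halt so the debugger catches it */
}
```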

Many times, when programmers set up a data-flow framework at the beginning of a project, these types of issues only show up later, when the processing component of the application is added.

Other important DMA features include the ability to prioritize DMA channels to meet current peripheral task requirements, as well as the capacity to configure the corresponding DMA interrupts to match these priority levels. These functions help ensure that data buffers do not overflow due to DMA activity on other peripherals, and they provide the programmer with extra degrees of freedom in optimizing the entire system based on the data traffic on each DMA channel.

System Performance Tuning
As data rates and performance demands increase, it becomes critical for designers to have “system performance tuning” controls at their disposal. For example, the DMA controller might be optimized to transfer a data word on every clock cycle. When there are multiple transfers ongoing in the same direction (e.g., all from internal memory to external memory), this is usually the most efficient way to operate the controller because it prevents idle time on the DMA bus.

But in cases involving multiple bidirectional video and audio streams, “direction control” becomes obligatory in order to prevent one stream from usurping the bus entirely. For instance, if the DMA controller always granted the DMA bus to any peripheral that was ready to transfer a data word, overall throughput would degrade when using external DRAM. In situations where data transfers switch direction on nearly every cycle, the latency associated with turn-around time on the external memory bus will lower throughput significantly.

As a result, DMA controllers with a channel-programmable burst size deliver higher performance than those with a fixed transfer size. Because each DMA channel can connect a peripheral to either internal or external memory, it is also important to be able to automatically service a peripheral that may issue an urgent request for the bus.

For multimedia applications, on-chip memory is almost always insufficient for storing entire video frames. Therefore, the system must usually rely on L3 DRAM to support relatively fast access to large buffers. The processor interface to off-chip memory constitutes a major factor in designing efficient media frameworks, because access patterns to external memory must be well thought out in order to guarantee optimal data throughput.

Using DMA and/or cache will always help memory performance, because they involve transfers of large data blocks in the same direction. A DMA transfer typically moves a large data buffer from one location to another, while a cache-line fill moves a set of consecutive memory locations into or out of the device, utilizing block transfers in the same direction.

Aside from using DMA or cache, there are several high-level steps that can ensure that data flows smoothly through memory in any system. Two key steps are (1) grouping like transfers and (2) using priority and arbitration schemes.

(1) Grouping like transfers to minimize memory bus turnarounds.
Accesses to external memory are most efficient when they are made in the same direction (e.g., consecutive reads or consecutive writes). For example, when accessing off-chip synchronous memory, 16 reads followed by 16 writes are always completed sooner than 16 individual read/write sequences. This is because a write followed by a read incurs latency.

Random accesses to external memory generate a high probability of bus turnarounds. This added latency can easily halve available bandwidth. Therefore, it is important to take advantage of the ability to control the number of transfers in a given direction. This can be done either automatically or by manually scheduling your data movements.
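
To get a feel for the numbers, here's a back-of-the-envelope comparison of grouped versus fully interleaved traffic. The one-cycle transfer and ten-cycle turnaround costs are illustrative assumptions, not figures for any particular DRAM:

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions: 1 cycle per same-direction word,
       10 cycles lost on each read<->write turnaround. */
    const int words = 32;            /* 16 reads + 16 writes */
    const int turnaround = 10;

    int grouped     = words + 1 * turnaround;            /* RRR...WWW */
    int interleaved = words + (words - 1) * turnaround;  /* RWRW...   */

    printf("grouped: %d cycles, interleaved: %d cycles\n",
           grouped, interleaved);
    printf("interleaved bandwidth is %.0f%% of grouped\n",
           100.0 * grouped / interleaved);
    return 0;
}
```

Under these assumptions, the interleaved pattern achieves only about 12% of the grouped pattern's throughput, which is why grouping pays off so handsomely.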

A DMA channel garners access according to its priority, signified on Blackfin processors by its channel number. Higher-priority channels are granted access to the DMA bus(es) first. Because of this, you should always assign higher-priority DMA channels to peripherals with the highest data rates or with requirements for lowest latency.

To this end, Memory DMA streams are always lower in priority than peripheral DMA activity. This is because with Memory DMA, no external devices will be held off or starved of data. Since a memory DMA channel requests access to the DMA bus as long as the channel is active, any time slots left unused by peripheral DMA are applied efficiently to MemDMA transfers. By default, when more than one MemDMA stream is enabled and ready, only the highest-priority MemDMA stream is granted.

When it is desirable for the MemDMA streams to share the available DMA bus bandwidth, however, the DMA controller can be programmed to select each stream in turn for a fixed number of transfers.

This direction control feature is an important consideration in optimizing use of resources on each DMA bus. By grouping same-direction transfers together, it provides a way to manage how frequently the transfer direction on the DMA buses changes. This is a handy way to perform a first level of optimization without real-time processor intervention. More importantly, there's no need to manually schedule bursts into the DMA streams.

When direction control features are used, the DMA controller preferentially grants data transfers on the DMA or memory buses that are going in the same read/write direction as in the previous transfer, until either the direction control counter times out, or until traffic stops or changes direction on its own. When the direction counter reaches zero, the DMA controller changes its preference to the opposite flow direction.

In this case, reversing direction wastes no bus cycles other than any physical bus turnaround delay time. This type of traffic control represents a trade-off of increased latency for improved utilization (efficiency). Higher traffic timeout values might increase the length of time each request waits for its grant, but they can dramatically improve the maximum attainable bandwidth in congested systems, often to above 90%.

Here's an example that puts these concepts into perspective:

As a rule of thumb, it is best to maximize same-direction contiguous transfers during moderate system activity. For the most taxing system flows, however, it is best to select a value in the middle of the range to ensure that no one peripheral gets locked out of accesses to external memory. This is especially crucial when at least two high-bandwidth peripherals (like PPIs) are used in the system.
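
In practice, applying this rule of thumb can be as simple as writing a timeout count to a traffic-control register. The name and address below are hypothetical stand-ins; Blackfin parts, for example, expose a comparable traffic-period setting:

```c
/* Hypothetical traffic-control MMR; consult your part's manual for
   the real register, address, and field layout. */
#define DMA_TRAFFIC_PERIOD  (*(volatile unsigned short *)0xFFC00B0C)

void set_direction_control(unsigned short same_dir_transfers)
{
    /* Prefer the current direction for up to this many transfers
       before the controller re-evaluates; 0 disables the feature. */
    DMA_TRAFFIC_PERIOD = same_dir_transfers;
}
```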

In addition to using direction control, transfers among MemDMA streams can be alternated in a “round-robin” fashion on the bus as the application requires. With this type of arbitration, the first DMA process is granted access to the DMA bus for some number of cycles, followed by the second DMA process, and then back to the first. The channels alternate in this pattern until all of the data is transferred. This capability is most useful on dual-core processors (for example, when both core processors have tasks that are awaiting a data stream transfer).

Without this round-robin feature, the first set of DMA transfers will occur, and the second DMA process will be held off until the first one completes. Round-robin prioritization can help ensure that both transfer streams complete back-to-back.

Of course, this type of scheduling can be performed manually by buffering data bound for L3 memory in on-chip memory. The processor core can access on-chip buffers for pre-processing functions with much lower latency than it can by going off-chip for the same accesses. This leads to a direct increase in system performance. Moreover, buffering this data in on-chip memory allows more efficient peripheral DMA access to it.

For instance, transferring a video frame on-the-fly through a video port and into L3 memory creates a situation where other peripherals might be locked out from accessing the data they need, because the video transfer is a high-priority process. However, by transferring lines incrementally from the video port into L1 or L2 memory, a MemDMA stream can be initiated that will quietly transfer this data into L3 as a low-priority process, allowing system peripherals access to the needed data.

(2) Priority and arbitration schemes between system resources
Another important consideration is the priority and arbitration schemes that regulate how processor subsystems behave with respect to one another. For instance, on Blackfin processors, the core has priority over DMA accesses, by default, for transactions involving L3 memory that arrive at the same time. This means that if a core read from L3 occurs at the same time a DMA controller requests a read from L3, the core will win, and its read will be completed first.

Let's look at a scenario that can cause trouble in a real-time system. When the processor has priority over the DMA controller on accesses to a shared resource like L3 memory, it can lock out a DMA channel that also may be trying to access the memory. Consider the case where the processor executes a tight loop that involves fetching data from external memory.

DMA activity will be held off until the processor loop has completed. It's not only a loop with a read embedded inside that can cause trouble. Activities like cache-line fills or nonlinear code execution from L3 memory can also cause problems because they can result in a series of uninterruptible accesses.

Another issue meriting consideration is the priority scheme between the processor core(s) and the DMA controller(s). Highly useful, when available, is a setting that can make DMA activity always appear “urgent,” allowing DMA to win whenever it makes requests concurrent with the processor core(s).

There is always a temptation to rely on core accesses (instead of DMA) at early stages in a project, for a number of reasons. The first is that this mimics the way data is accessed on a typical prototype system. The second is that you don't always want to dig into the internal workings of the DMA functionality and performance. However, with core and DMA arbitration flexibility, using the memory DMA controller to bring data in and out of internal memory gives you more control of your destiny early on in the project.

Six practical uses for DMA in multimedia systems

#1. Using the DMA Controller to Eliminate Data. The DMA controller can be used to “filter” the amount of data that flows into a system from a camera. Let's consider the case where an active video stream is being brought into memory for some type of processing. When the data does not need to be sent back out for display purposes, it isn't necessary to transfer the blanking data into the buffer in memory.

A processor's video port is often connected directly to a video decoder or a CMOS sensor and receives samples continuously. That is, the external device continues to clock in data and blanking information. The DMA controller can be set to transfer only the active video to memory. Using this type of functionality saves both memory space and bus bandwidth.

In the case of an NTSC video stream, this blanking represents over 20% of the total input video bandwidth. Saving memory is a minor benefit, because extra memory is usually available externally in the form of SDRAM at a small system cost delta. More important is the bandwidth that is saved in the overall processing period; the time ordinarily used to bring in the blanking data can be re-allocated to some other task in your system. For example, it can be used to send out the compressed data or to bring in reference data from past frames.
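
That 20% figure falls straight out of the ITU-R BT.601/656 frame geometry, as this quick check shows (the 720 x 480 active region is an assumption; some systems count 486 active lines):

```c
#include <stdio.h>

int main(void)
{
    /* 525-line (NTSC) video: 858 total samples per line x 525 total
       lines, versus a 720 x 480 active region. */
    double total  = 858.0 * 525.0;
    double active = 720.0 * 480.0;

    printf("blanking fraction: %.1f%%\n", 100.0 * (1.0 - active / total));
    /* -> about 23%, i.e. "over 20%" of the input bandwidth */
    return 0;
}
```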

#2. Double Buffering. We have previously discussed the need for double buffering as a means of ensuring that current data is not overwritten by new data until you're ready for this to happen. Managing a video display buffer serves as a perfect example of this scheme. Normally, in systems involving different rates between source video and the final displayed content, it's necessary to have a smooth switchover between the old content and the new video frame.

This is accomplished using a double-buffer arrangement. One buffer points to the present video frame, which is sent to the display at a certain refresh rate. The second buffer fills with the newest output frame. When this latter buffer is full, a DMA interrupt signals that it's time to output the new frame to the display. At this point, the first buffer starts filling with processed video for display, while the second buffer outputs the current display frame. The two buffers keep switching back and forth in a “ping-pong” arrangement.

It should be noted that multiple buffers can be used, instead of just two, in order to provide more margin for synchronization, and to reduce the frequency of interrupts and their associated latencies.
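
Here's a minimal sketch of the ping-pong swap. The display-DMA helper and the buffer dimensions are hypothetical placeholders:

```c
#include <stdint.h>

#define FRAME_WORDS (720 * 480)          /* illustrative frame size */

static uint16_t frame_buf[2][FRAME_WORDS];
static volatile int display_idx = 0;     /* buffer currently displayed */

/* Hypothetical helper that re-points the display DMA at a new frame. */
extern void display_dma_set_source(const uint16_t *buf);

/* Called from the DMA interrupt when the "fill" buffer is complete. */
void frame_done_isr(void)
{
    display_idx ^= 1;                    /* swap ping and pong */
    display_dma_set_source(frame_buf[display_idx]);
    /* processing code now owns frame_buf[display_idx ^ 1] */
}
```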

#3. Two-dimensional DMA Considerations. A feature we discussed in a previous installment, two-dimensional DMA (2D DMA) capability, offers several system-level benefits. Let's briefly revisit this feature to discuss how it plays a role in both audio and video applications.

When data is transferred across a digital link like I2S, it may contain several channels. These may all be multiplexed on one data line going into the same serial port, for instance. In such a case, 2D DMA can be used to de-interleave the data so that each channel is linearly arranged in memory. Take a look at Figure 1 below for a graphical depiction of this arrangement, where samples from the left and right channels are de-multiplexed into two separate blocks. This automatic data arrangement is extremely valuable for those systems that employ block processing.

Figure 1: A 2D DMA engine used to de-interleave (a) I2S stereo data into (b) separate left and right buffers
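
To make this concrete, here's what the transfer looks like in the X/Y count-and-modify terms that Blackfin-style 2D DMA controllers use. The structure and field names below are our own illustration, not a specific device's register set; N is the number of samples per channel:

```c
#include <stdint.h>

#define N 256                     /* samples per channel (illustrative) */

static int32_t audio_buf[2 * N];  /* left block followed by right block */

/* Illustrative 2D DMA parameters for writing the incoming
   L,R,L,R,... stream into separate left/right blocks. */
struct dma2d_cfg {
    void *start;   /* destination base                         */
    int   x_count; /* inner loop: 2 channels                   */
    int   x_mod;   /* jump from left block to right block      */
    int   y_count; /* outer loop: N sample pairs               */
    int   y_mod;   /* step back to the next left-channel slot  */
};

static const struct dma2d_cfg i2s_deint = {
    .start   = audio_buf,
    .x_count = 2,
    .x_mod   = N * sizeof(int32_t),                       /* +N samples */
    .y_count = N,
    .y_mod   = (int)sizeof(int32_t) - (int)(N * sizeof(int32_t)),
};
```

Each pass of the inner loop writes one left and one right sample into their respective blocks; the negative y_mod then steps the address back to the next left-channel slot.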

For video transfers, 2D DMA offers several system-level benefits. For starters, two-dimensional DMA can facilitate transfers of macroblocks to and from external memory, allowing data manipulation as part of the actual transfer. This eliminates the overhead typically associated with transferring non-contiguous data. It can also allow the system to minimize data bandwidth by selectively transferring, say, only the desired region of an input image, instead of the entire image.

As another example, 2D DMA allows data to be placed into memory in a sequence more natural to processing. For example, as shown in Figure 2 below, RGB data may enter a processor's L2 memory from a CCD sensor in interleaved RGB888 format, but using 2D DMA, it can be transferred to L3 memory in separate R, G and B planes. Interleaving/deinterleaving color-space components for video and image data saves additional data movement prior to processing.

Figure 2: Deinterleaving data with 2D DMA
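
The same count-and-modify convention describes plane extraction from interleaved RGB888. Again, this is a sketch using an illustrative structure layout rather than real register programming:

```c
#include <stdint.h>

#define WIDTH  720
#define HEIGHT 480

static uint8_t rgb_frame[WIDTH * HEIGHT * 3];   /* interleaved RGB888 */
static uint8_t r_plane[WIDTH * HEIGHT];

struct dma2d {                 /* same X/Y convention as the audio sketch */
    const void *src;
    void       *dst;
    int x_count, x_mod, y_count, y_mod;
};

/* The read side steps through memory 3 bytes at a time, so only the
   red component of each pixel is fetched into a linear destination;
   repeat with src offsets 1 and 2 for the green and blue planes. */
static const struct dma2d r_extract = {
    .src = rgb_frame + 0, .dst = r_plane,
    .x_count = WIDTH * HEIGHT, .x_mod = 3,
    .y_count = 1, .y_mod = 0,
};
```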

#4. Synchronizing audio and video streams. In a multimedia system, the streaming content usually consists of both an audio and a video component. Because of the data rates at which these streams run, DMA channels must be used to interface with the audio codec and the video encoder. It is important from the viewer's standpoint to ensure that the streams are synchronized, because this coordination is a major contribution to perceived quality.

There are multiple ways to maintain synchronization. The technique most often used involves setting up a set of DMA descriptor lists for each of the audio and video buffers and having the processor manage those lists.

Because a glitch in an audio stream is more noticeable than a drop in a frame of video, the audio buffer is usually set to be the “master” stream. That is, it is more important to keep the audio buffer rotating in a continuous pattern. By keeping the audio stream continuous, the processor can make any necessary adjustments on the video frame display.

From a DMA buffer standpoint, a set of descriptors is created for each of the audio and video buffers. Each descriptor list is set up to be controlled with a pair of pointers, one for filling the buffers and one for emptying the buffers.

Each of the descriptor lists needs to be maintained to ensure that the read and write pointers do not get “crossed.” That is, the processor shouldn't be updating a buffer that is in the process of being sent out. Likewise, the DMA controller shouldn't fill a buffer that the processor is still emptying.

For audio and video sync, an overall system time base is maintained. Each of the decoded buffers can be built in memory with a corresponding time tag. If the audio stream is the master stream, the write buffer is completely circular. If video frames have to be dropped, the DMA pointer that empties the buffer is re-programmed to match the time closest to the time stamp of the current audio buffer. Figure 3 below shows a general overview of this method for maintaining audio/video synchronization.

Figure 3: Conceptual diagram of audio-video synchronization
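
Here's a sketch of that selection policy; the queue layout and names are illustrative. The point is that dropping frames amounts to re-aiming the emptying pointer at the entry whose time tag best matches the audio clock:

```c
#include <stdint.h>

#define NFRAMES 8

struct vframe {
    uint32_t    time_tag;     /* presentation time written at decode */
    const void *pixels;
};

static struct vframe vq[NFRAMES];

/* Pick the queued frame whose tag is closest to the current audio
   time; the emptying DMA pointer is then re-programmed to this entry. */
const struct vframe *frame_for_audio_time(uint32_t audio_time)
{
    const struct vframe *best = &vq[0];
    uint32_t best_delta = (uint32_t)-1;
    for (int i = 0; i < NFRAMES; i++) {
        uint32_t d = vq[i].time_tag > audio_time
                   ? vq[i].time_tag - audio_time
                   : audio_time - vq[i].time_tag;
        if (d < best_delta) { best_delta = d; best = &vq[i]; }
    }
    return best;
}
```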

#5. Achieving Power Savings Using the DMA Controller. On a processor that has been designed with power-management capabilities, the DMA controller can provide a valuable tool to reduce the overall power consumption in the system. Let's take a look at how this can be accomplished.

While a processor isn't actively working on a buffer, it can be put into an idle state. While in this inactive state, clocks can be shut off, and sometimes voltage can be reduced — both of which will reduce the processor's power consumption.

Consider an audio decoding system. The processor performs the decoding function on encoded content. The decoded buffer accumulates in memory, and as soon as the buffer hits a “high-water” mark, the processor can be put into a sleep mode. While in this mode, the DMA controller (which is decoupled from the processor core) sources data from the buffer to feed an audio codec. It has to run continuously to ensure good audio quality.

A “low-water” mark can be implemented by programming the DMA controller to generate an interrupt after some number of data samples has been transferred. The interrupt can be programmed to serve as a wake-up event, which in turn will bring the processor out of its sleep mode. The processor can refill the buffer so that when the DMA controller wraps back around to the beginning of the buffer, new data is available.

If this routine runs continuously, the net effect is that the processor duty cycle (i.e., the time the processor is active) is greatly reduced.
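
Putting the pieces together, the control flow can be as simple as the loop below. The decoder call, idle primitive, and thresholds are all hypothetical placeholders:

```c
#include <stdint.h>

#define BUF_SAMPLES  4096
#define LOW_WATER    1024   /* illustrative thresholds */

static int16_t pcm_buf[BUF_SAMPLES];

extern int  decode_more(int16_t *dst, int max);  /* hypothetical decoder */
extern void cpu_idle(void);                      /* hypothetical sleep   */

static volatile int need_refill = 0;

/* DMA interrupt, programmed to fire once only LOW_WATER samples
   remain unplayed: this is the wake-up event. */
void dma_low_water_isr(void) { need_refill = 1; }

void audio_main_loop(void)
{
    for (;;) {
        while (!need_refill)
            cpu_idle();       /* core sleeps; DMA keeps feeding the codec */
        need_refill = 0;
        decode_more(pcm_buf, BUF_SAMPLES);   /* top the buffer back up */
    }
}
```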

#6. Implementing a DMA Queue Manager. A complicated system design can involve a surprisingly large number of DMA channels running in parallel. When the DMA channels are programmed using descriptors, the number of descriptors can explode. In these situations, it is best to implement some form of DMA Queue Manager. An example of this type of manager is provided in a PDF document available online at Analog Devices.

The DMA Manager is best used to manage the scheduling of concurrent transfers. The programming interface is configured to take in new descriptors, which represent new work units.
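
At its heart, such a queue manager is a linked list of descriptors with an append operation, as sketched below with an illustrative descriptor layout. (A production version must also handle the race against a channel that finishes the chain while a new descriptor is being appended.)

```c
#include <stddef.h>

/* A descriptor as a DMA controller might chain them; the layout is
   illustrative, not any specific device's format. */
struct dma_desc {
    struct dma_desc *next;
    void            *addr;
    size_t           count;
};

struct dma_queue {
    struct dma_desc *head, *tail;
};

/* Append a new work unit to the tail of the descriptor chain. */
void dma_queue_push(struct dma_queue *q, struct dma_desc *d)
{
    d->next = NULL;
    if (q->tail)
        q->tail->next = d;   /* splice onto the running chain */
    else
        q->head = d;         /* queue was empty */
    q->tail = d;
}
```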

In sum, a DMA controller is an integral part of any multimedia system, and it's crucial to appreciate its complexities in order to fully optimize an application. However, it's by no means the only player in the system; other system resources like memory and the processor core must arbitrate with the DMA controller, and achieving the perfect balance between them all involves gaining a fundamental understanding of how data moves around a system.

To read Part 1 in this four-part series, go to “The basics of direct memory access.”
To read Part 2 in this four-part series, go to “Classifying DMA transactions and the constructs associated with setting them up.”
To read Part 3 in this four-part series, go to “Important system designs regarding data movement.”

This series of four articles is based on material from “Embedded Media Processing,” by David Katz and Rick Gentile, published by Newnes/Elsevier.

Rick Gentile and David Katz are senior DSP applications engineers in the Blackfin Applications Group at Analog Devices, Inc.
