
Using Direct Memory Access effectively in media-based embedded applications: Part 3

In Part 2 of this series, we discussed the different register-based and descriptor-based DMA modes. In this installment, we'll cover some important system decisions regarding data movement choices in an application. But first, let's revisit DMA modes to cover just a couple more guidelines for choosing when to pick one over another.

For continuous, unidirectional data transfers of the same size, an autobuffer scheme makes the most sense. The DMA configuration registers are set up once and are automatically reloaded at the end of the transfer. If multi-dimensional addressing is available, multiple buffers can be set up, and individual interrupts can be set to trigger at the end of each buffer.
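As a concrete illustration, here is a minimal C sketch of a one-time autobuffer setup. The register block, bit definitions, and field widths are hypothetical stand-ins; the actual memory-mapped registers vary by processor, so consult your part's hardware reference for the real programming model.

/* Minimal autobuffer (circular) receive setup. All register and bit
   names here are hypothetical; real DMA controllers differ. */
#include <stdint.h>

#define NUM_SAMPLES 512
static int16_t rx_buf[NUM_SAMPLES];     /* destination buffer in SRAM */

typedef struct {
    volatile uint32_t start_addr;       /* buffer base address */
    volatile uint16_t x_count;          /* transfers per buffer */
    volatile int16_t  x_modify;         /* address increment, in bytes */
    volatile uint16_t config;           /* mode and enable bits */
} dma_channel_t;

#define DMA_EN      (1u << 0)           /* channel enable */
#define DMA_AUTOBUF (1u << 4)           /* reload registers at buffer end */
#define DMA_DI_EN   (1u << 7)           /* interrupt at end of buffer */

void start_autobuffer_rx(dma_channel_t *ch)
{
    ch->start_addr = (uintptr_t)rx_buf;
    ch->x_count    = NUM_SAMPLES;       /* 16-bit words per buffer */
    ch->x_modify   = sizeof(int16_t);   /* step one sample at a time */
    /* In autobuffer mode the controller wraps back to start_addr on
       completion, so this one-time setup services the stream forever. */
    ch->config     = DMA_AUTOBUF | DMA_DI_EN | DMA_EN;
}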

Transfers to an audio codec are perfect candidates for this type of transaction. The number of sub-buffers that you select should be consistent with the type of processing you need to perform. For continuous transfers, just make sure to keep the maximum processing interval for each buffer less than the time it takes to collect a buffer.
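To put a number on that budget: assuming, hypothetically, a 48 kHz codec and a DMA buffer of 256 samples per channel, each buffer fills in 256 / 48,000 ≈ 5.3 ms. All per-buffer processing, including interrupt overhead, must fit inside that 5.3 ms window, or the application will fall behind the incoming stream.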

If the transfers on a given channel will change in direction or in size, the descriptor mode is best to use. Consider a series of small transfers between internal and external memory. If the block size varies, or if you want to navigate through a buffer in a non-contiguous fashion, descriptors can be set up for this purpose.
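The short C sketch below shows what such a list might look like. The descriptor layout and the dma_start_list() call are illustrative assumptions only; real controllers define their own field order, widths, and list-fetch mechanism.

/* Two chained descriptors describing blocks of different sizes.
   Layout is illustrative; not a real controller's format. */
#include <stddef.h>
#include <stdint.h>

typedef struct dma_descriptor {
    struct dma_descriptor *next;   /* next descriptor; NULL ends the list */
    void    *addr;                 /* block base address */
    uint16_t count;                /* number of transfers in this block */
    uint16_t config;               /* per-descriptor control bits */
} dma_descriptor_t;

static int16_t work_small[64];     /* destination blocks of different */
static int16_t work_large[128];    /* sizes in internal memory */

/* Second block: 128 words into work_large */
static dma_descriptor_t desc1 = {
    .next = NULL, .addr = work_large, .count = 128, .config = 0
};
/* First block: 64 words into work_small, chained to the second */
static dma_descriptor_t desc0 = {
    .next = &desc1, .addr = work_small, .count = 64, .config = 0
};

/* Hypothetical driver entry point */
void dma_start_list(int channel, dma_descriptor_t *head);

void queue_transfers(void)
{
    /* Hand the chain to the controller; it walks the list on its
       own, reconfiguring itself per descriptor with no core help.
       (The matching source stream from external memory would be
       described by a similar list on a companion channel.) */
    dma_start_list(2, &desc0);
}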

Soon, we'll look at some system data movement scenarios involving choices between caching and DMA, but in order to do so, we first need to take a look at the types of data movement that exist within an application.

Because it is the most straightforward to illustrate, let's start with data that is transferred into or out of a system via an on-chip peripheral. Many peripherals offer a choice between using core accesses and using a DMA channel to move data.

In general, given the option, you should use the DMA channel. The DMA controller is the better choice because data usually arrives either too slowly or too quickly for the processor to handle it efficiently in real time. Let's consider a few examples:

When we use a slow serial device such as an SPI port or UART, data is transferred at a rate much lower than the rate at which the processor core runs. Core accesses to these types of peripherals typically involve polling some bit in a memory-mapped register.

Even if the peripheral operating speed is low compared to the processor clock (which means accesses will occur less frequently), polling is wasteful. In some cases, the peripheral can signal an interrupt to indicate that a core transfer has occurred. In that case, however, the cost of servicing the interrupt, including time for context switches, is incurred after every incremental data transfer.

On the other hand, using the DMA controller to perform the transfer allows fine control over the number of transfers that occur before an interrupt is raised. Moreover, this interrupt can occur at the end of each “block” of data, not just after each byte or word.
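The difference shows up clearly in code. In the sketch below, the UART registers and the DMA helper calls are hypothetical; the point is that the polled version costs the core attention on every byte, while the DMA version costs one setup plus a single end-of-block interrupt.

/* Core-driven receive: the core polls a status bit for every byte. */
#include <stdint.h>

#define UART_RX_READY (1u << 0)
extern volatile uint16_t UART_STATUS;   /* hypothetical UART registers */
extern volatile uint8_t  UART_RXDATA;

void uart_read_polled(uint8_t *dst, unsigned len)
{
    for (unsigned i = 0; i < len; i++) {
        while (!(UART_STATUS & UART_RX_READY))
            ;                           /* core is stalled right here */
        dst[i] = UART_RXDATA;           /* one byte per iteration */
    }
}

/* Hypothetical DMA driver interface */
#define DMA_CH_UART_RX 3
void dma_setup(int channel, void *dst, unsigned len);
void dma_enable(int channel);

void uart_read_dma(uint8_t *dst, unsigned len)
{
    dma_setup(DMA_CH_UART_RX, dst, len);  /* program the channel once */
    dma_enable(DMA_CH_UART_RX);           /* hardware moves the bytes */
    /* The core is now free to do useful work; one interrupt fires
       after all 'len' bytes have landed in dst. */
}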

On the other end of the throughput spectrum, a higher-speed parallel peripheral (running at, say, 10-100 MHz) may not have the option of core transfers, for two reasons. First, set up this way, the processor would constantly be accessing the peripheral. Second, the processing associated with high-speed peripherals is almost always done on data blocks.

Whether working through an FFT in a signal processing application or a two-dimensional convolution in an image processing system, the processor can begin its work once the last data sample for that buffer has arrived. Here, the interrupt that signals the end of a block transfer can be spread over hundreds or thousands of transfers.

Regardless of the peripheral transfer type, the DMA channel should be set up with multiple buffers so that the processor can access the current buffer while the next is being filled. If the system is more complicated, it may involve multiple simultaneous block transfers. For example, in addition to accessing the current block and collecting the next block, it may be necessary to send out the last processed block for future use. Similarly, blocks of reference data may be required to process the current frame. This is true for a variety of applications, including most types of video compression.

Example: Double-buffered audio
There are a number of ways to get audio data into the processor's core. For example, a foreground program can poll a serial port for new data, but this type of transfer is uncommon in embedded media processors, because it makes inefficient use of the core.

Instead, a processor connected to an audio codec usually uses a DMA engine to transfer the data from the codec link (like a serial port) to some memory space available to the processor. This transfer of data occurs in the background without the core's intervention. The only overhead is in setting up the DMA sequence and handling the interrupts once the data buffer has been received or transmitted.

In a block-based processing system that uses DMA to transfer data to and from the processor core, a “double buffer” must exist to arbitrate between the DMA transfers and the core. This is done so that the processor core and the core-independent DMA engine do not access the same data at the same time, causing a data coherency problem.

To facilitate the processing of a buffer of length N, simply create a buffer of length 2×N. For a bidirectional system, two buffers of length 2×N must be created.

As shown in Figure 1a below, the core processes the in1 buffer and stores the result in the out1 buffer, while the DMA engine is filling in0 and transmitting the data from out0.

Figure 1b, below, shows that once the DMA engine is done with the left half of the double buffers, it starts transferring data into in1 and out of out1, while the core processes data from in0 and into out0. This configuration is sometimes called “ping-pong buffering,” because the core alternates between processing the left and right halves of the double buffers.

Note that, in real-time systems, the serial port DMA (or another peripheral's DMA tied to the audio sampling rate) dictates the timing budget. For this reason, the block processing algorithm must be optimized in such a way that its execution time is less than or equal to the time it takes the DMA to transfer data to/from one half of a double buffer.

Figure 1: Double-buffering scheme for stream processing
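A minimal C sketch of the ping-pong arrangement in Figure 1 follows. The DMA-complete interrupt hookup is a hypothetical assumption; what matters is that the core always trails the DMA engine by one half-buffer, and the wait loop makes the timing budget explicit.

/* Ping-pong (double) buffering: DMA works one half of each 2*N
   buffer while the core works the other. */
#include <stdint.h>

#define N 256
static int16_t in_buf[2 * N];    /* DMA fills one half; core reads other */
static int16_t out_buf[2 * N];   /* core writes one half; DMA drains other */

static volatile int dma_half;    /* half currently owned by the DMA: 0 or 1 */

/* Hooked to the DMA block-complete interrupt (hypothetical wiring) */
void dma_block_done_isr(void)
{
    dma_half ^= 1;               /* DMA engine moves to the other half */
}

static void process_block(const int16_t *in, int16_t *out, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        out[i] = in[i];          /* placeholder for real block processing */
}

void stream_loop(void)
{
    int core_half = 0;
    for (;;) {
        while (dma_half == core_half)
            ;                    /* wait until DMA vacates our half */
        /* Safe: DMA is busy with the opposite half. process_block()
           must finish before the DMA completes that half, or data
           will be overwritten; this is the timing budget noted above. */
        process_block(&in_buf[core_half * N],
                      &out_buf[core_half * N], N);
        core_half ^= 1;          /* follow the DMA to the next half */
    }
}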

System Guidelines for Choosing between DMA and Cache
The more complicated the data flow in an application, the more time you should spend at the beginning of the project laying out a framework. A fundamental challenge in architecting a system is deciding which data buffers should be moved using a DMA channel, which should be accessed via cache, and which should be accessed using processor core reads and writes. Let's consider three widely used system configurations to shed some light on which approach works best for different system classifications: (1) instruction cache, data DMA; (2) instruction cache, data DMA/cache; and (3) instruction DMA, data DMA.

Instruction Cache, Data DMA. This is perhaps the most popular system model, because media processors are often architected with this usage profile in mind. Caching the code alleviates complex instruction flow management, assuming the application can afford this luxury. This works well when the system has no hard real-time constraints, so that a cache miss would not wreak havoc on the timing of tightly coupled events (for example, video refresh or audio/video synchronization).

Also, in cases where processor performance far outstrips processing demand, caching instructions is often a safe path to follow, since cache misses are then less likely to cause bottlenecks. Although it might seem unusual to consider that an “oversized” processor would ever be used in practice, consider the case of a portable media player that can decode and play both compressed video and audio. In its audio-only mode, its performance requirements will be only a fraction of its needs during video playback. Therefore, the instruction/data management mechanism could be different in each mode.

Managing data through DMA is the natural choice for most multimedia applications, because these usually involve manipulating large buffers of compressed and uncompressed video, graphics and audio. Except in cases where the data is quasi-static (for instance, a graphics icon constantly displayed on a screen), caching these buffers makes little sense, since the data changes rapidly and constantly.

Furthermore, as discussed above, there are usually multiple data buffers moving around the chip at one time: unprocessed blocks headed for conditioning, partly conditioned sections headed for temporary storage, and completely processed segments destined for external display or storage. DMA is the logical management tool for these buffers, since it allows the core to operate on them without having to worry about how to move them around.

Instruction Cache, Data DMA/Cache. This approach is similar to the one we just described, except that in this case, part of L1 data memory is partitioned as cache, and the rest is left as SRAM for DMA access. This structure is very useful for handling algorithms that involve a lot of static coefficients or lookup tables. For example, storing a sine/cosine table in data cache facilitates quick computation of FFTs. Or, quantization tables could be cached to expedite JPEG encoding or decoding.

Keep in mind that this approach involves an inherent tradeoff. While the application gains single-cycle access to commonly used constants and tables, it relinquishes the equivalent amount of L1 data SRAM, thus limiting the buffer size available for single-cycle access to data. A useful way to evaluate this tradeoff is to try alternate scenarios (data DMA/cache versus DMA only) in a statistical profiler (offered in many development tool suites) to determine the percentage of time spent in code blocks under each circumstance.

Instruction DMA, Data DMA. In this scenario, data and code dependencies are so tightly intertwined that the developer must manually schedule when instruction and data segments move through the chip. In such hard real-time systems, determinism is mandatory, and thus cache isn't ideal.

Although this approach requires more planning, the reward is a deterministic system where code is always present before the data needed to execute it, and no data blocks are lost via buffer overruns. Because DMA processes can link together without core involvement, the start of a new process guarantees that the last one has finished, so the data or code movement is verified to have happened. This is the most efficient way to synchronize data and instruction blocks.
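A hedged C sketch of that idea appears below. The descriptor layout and the dma_run_chain() call are assumptions, but they illustrate the guarantee just described: because the data descriptor is linked behind the code descriptor, the data block cannot begin moving until the code block has finished.

/* Chained memory-to-memory descriptors enforcing code-before-data
   ordering. Layout and driver call are illustrative. */
#include <stdint.h>

typedef struct dma_desc {
    struct dma_desc *next;       /* next transfer in the chain */
    void *src, *dst;             /* memory-to-memory endpoints */
    uint32_t nbytes;             /* block length in bytes */
} dma_desc_t;

extern uint8_t code_overlay_src[], code_overlay_dst[];
extern uint8_t frame_src[], frame_dst[];

/* Data block runs second... */
static dma_desc_t data_desc = {
    .next = NULL, .src = frame_src, .dst = frame_dst, .nbytes = 4096
};
/* ...code overlay runs first and links to it. */
static dma_desc_t code_desc = {
    .next = &data_desc, .src = code_overlay_src,
    .dst = code_overlay_dst, .nbytes = 2048
};

/* Hypothetical driver entry point */
void dma_run_chain(dma_desc_t *head);

void load_segment(void)
{
    /* The chain itself is the synchronization: when data_desc starts,
       code_desc has by definition completed, so the code is resident
       before the data it operates on arrives. */
    dma_run_chain(&code_desc);
}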

The instruction/data DMA combination is also noteworthy for another reason. It provides a convenient way to test code and data flows in a system during emulation and debug, when direct access to cache is not typically available. The programmer can then make adjustments or highlight “trouble spots” in the system configuration.

Figure 2: Checklist for choosing between instruction cache and DMA

An example of a system that might require DMA for both instructions and data is a video encoder/decoder. Certainly, video and its associated audio need to be deterministic for a satisfactory user experience. If the DMA signaled an interrupt to the core after each complete buffer transfer, this could introduce significant latency into the system, since the interrupt would need to compete in priority with other events.

What's more, the context switch at the beginning and end of an interrupt service routine would consume several core processor cycles. All of these factors interfere with the primary objective of keeping the system deterministic.

Figure 2 above and Figure 3 below provide guidance in choosing between cache and DMA for instructions and data, as well as in navigating the trade-off between using cache and using SRAM, based on the guidelines we discussed earlier.

Figure 3: Checklist for choosing between data cache and DMA

In our fourth and final installment of this DMA series, we will discuss some advanced DMA topics, including priority and arbitration schemes, maximizing efficiency of memory transfers, and other complex issues.

To read Part 1 in this four-part series, go to “The basics of direct memory access.”
To read Part 2 in this four-part series, go to “Classifying DMA transactions and the constructs associated with setting them up.”

This series of four articles is based on material from “Embedded Media Processing” by David Katz and Rick Gentile, published by Newnes/Elsevier.

Rick Gentile and David Katz are senior DSP applications engineers in the Blackfin Applications Group at Analog Devices, Inc.
