Using Direct Memory Access effectively in media-based embedded applications – Part 1

An embedded processor core is capable of doing multiple operations in a single cycle, including calculations, data fetches, data stores and pointer increments/decrements. In addition, the core can orchestrate data transfer between internal and external memory spaces by moving data into and out of the register file.

All this sounds great, but in reality, you can only achieve optimum performance in your application if data can move around without constantly bothering the core to perform the transfers.

This is where a direct memory access (DMA) controller comes into play. Processors need DMA capability to relieve the core from these transfers between internal/external memory and peripherals, or between memory spaces (Memory DMA, or “MemDMA”).

There are two main types of DMA controllers. “Cycle-stealing” DMA uses spare (idle) core cycles to perform data transfers. This is not a workable solution for systems with heavy processing loads like multimedia flows. Instead, it is much more efficient to employ the second type: a DMA controller that operates independently from the core.

Why is this so important? Well, imagine if a processor's video port has a FIFO that needs to be read every time a data sample is available. In this case, the core has to be interrupted tens of millions of times each second. As if that's not disruptive enough, the core has to perform an equal number of writes to some destination in memory. For every core processing cycle spent on this task, a corresponding cycle would be lost in the processing loop.

As good as DMA sounds in theory, PC-based software designers transitioning to the embedded world are hesitant to rely on a DMA controller for moving data around in an application. This reluctance usually stems from their impression that the complexity of the programming model increases exponentially when DMA is factored in.

Our goal, however, is to put your mind at ease, to show you how DMA is truly your friend. In this series of articles, we'll focus on the DMA controller itself, then show you how to optimize performance with DMA, and finally offer ideas on how best to manage the DMA controller as part of an overall framework.

Let's take a quick aside to discuss memory space nomenclature. Embedded processors have hierarchical memory architectures that strive to balance several levels of memory with differing sizes and performance levels. The memory closest to the core processor (known as “Level 1,” or “L1,” memory) operates at the full core clock rate.

The use of the term “closest” is literal, in that L1 memory is physically close to the core processor on the silicon die, so as to achieve the highest access and operating speeds. L1 memory is most often partitioned into Instruction and Data segments for efficient utilization of memory bus bandwidth.

Of course, L1 memory is necessarily limited in size. For systems that require larger code sizes, additional on-chip and off-chip memory is available, with increased latency. Larger on-chip memory is called Level 2 (“L2”) memory, and we refer to external memory as Level 3 (“L3”) memory. While the L1 memory size usually comprises tens of kBytes, the L2 memory on-chip is measured in hundreds of kBytes, and L3 can easily be megabytes.

The basics of DMA control
OK, back to the discussion at hand. A DMA controller is a unique peripheral devoted to moving data around a system. Think of it as a controller that connects internal and external memories with each DMA-capable peripheral via a set of dedicated buses. It is a peripheral in the sense that the processor programs it to perform transfers.

It is unique in that it interfaces to both memory and selected peripherals. Notably, only peripherals where data flow is significant (kBytes per second or greater) need to be DMA-capable. Good examples of these are video, audio and network interfaces. Lower-bandwidth peripherals can also be equipped with DMA capability, but it's less of an imposition on the core to step in and assist with data transfer on these interfaces.

In general, DMA controllers will include an address bus, a data bus, and control registers. An efficient DMA controller will possess the ability to request access to any resource it needs, without having the processor itself get involved. It must have the capability to generate interrupts. Finally, it has to be able to calculate addresses within the controller.

A processor might contain multiple DMA controllers. Each controller has multiple DMA channels, as well as multiple buses that link directly to the memory banks and peripherals, as shown in Figure 1 below. There are two types of DMA controllers in many high-performance processors. The first category, usually referred to as a System DMA Controller, allows access to any resource (peripherals and memory).

Cycle counts for this type of controller are measured in System Clocks (SCLKs) at frequencies up to 133 MHz (using ADI's Blackfin processor as an example). The second type, an Internal Memory DMA controller (IMDMA), is dedicated to accesses between internal memory locations. Because the accesses are internal (L1 to L1, L1 to L2, or L2 to L2), cycle counts are measured in Core Clocks (CCLKs), which can exceed 600 MHz rates.

Figure 1: System and internal memory DMA architecture

Each DMA controller has a set of FIFOs that act as a buffer between the DMA subsystem and peripherals or memory. For MemDMA, a FIFO exists on both the source and destination sides of the transfer. The FIFO improves performance by providing a place to hold data while busy resources are preventing a transfer from completing.

Configuring a DMA controller
Because you'll typically configure a DMA controller during code initialization, the core should only need to respond to interrupts after data set transfers are complete. You can program the DMA controller to move data in parallel with the core, while the core is doing its basic processing tasks, the jobs on which it's supposed to be focused!

In an optimized application, the core would never have to move any data, but rather only access it in L1 memory. The core wouldn't need to wait for data to arrive, because the DMA engine would have already made it available by the time the core was ready to access it. Figure 2 below shows a typical interaction between the processor and the DMA controller. The steps allocated to the processor involve setting up the transfer, enabling interrupts, and running code when an interrupt is generated. The interrupt input back to the processor can be used to signal that data is ready for processing.

Figure 2: DMA Controller
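
The three processor-side steps (set up the transfer, enable the interrupt, run code when it fires) can be sketched as a small software model in C. Everything named here is hypothetical: `dma_channel_t`, `dma_configure`, `dma_run` and `buffer_ready_isr` stand in for device-specific registers and an interrupt service routine, and the copy runs inside a function call, whereas real hardware moves the data in parallel with the core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one DMA channel; none of these names is a device API. */
typedef void (*dma_isr_t)(void *context);

typedef struct {
    const uint8_t *src;      /* source start address */
    uint8_t       *dst;      /* destination start address */
    size_t         nbytes;   /* total bytes to move */
    dma_isr_t      on_done;  /* stands in for the completion interrupt */
    void          *context;  /* handed to the handler */
} dma_channel_t;

/* Step 1 (core): describe the transfer and hook up the interrupt handler. */
void dma_configure(dma_channel_t *ch, const uint8_t *src, uint8_t *dst,
                   size_t nbytes, dma_isr_t on_done, void *context)
{
    assert(ch != NULL && src != NULL && dst != NULL);
    ch->src = src;
    ch->dst = dst;
    ch->nbytes = nbytes;
    ch->on_done = on_done;
    ch->context = context;
}

/* Step 2 (controller): move the data, then raise the "interrupt".
 * Real hardware does this in parallel with the core; the model just copies. */
void dma_run(const dma_channel_t *ch)
{
    assert(ch != NULL);
    memcpy(ch->dst, ch->src, ch->nbytes);
    if (ch->on_done != NULL) {
        ch->on_done(ch->context);
    }
}

/* Step 3 (core): the handler flags that the buffer is ready to process. */
void buffer_ready_isr(void *context)
{
    int *ready = (int *)context;
    *ready = 1;
}
```

In use, the core would call `dma_configure` once during initialization, carry on with its processing, and only look at the destination buffer after the handler has set the ready flag.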

In addition to moving to and from peripherals, data also needs to move from one memory space to another. For example, source video might flow from a video port straight to L3 memory, because the working buffer size is too large to fit into internal memory. We don't want to make the processor fetch pixels from external memory every time we need to perform a calculation, so a memory-to-memory DMA (“MemDMA”) can bring pixels into L1 or L2 memory for more efficient access times. Figure 3 below shows some typical DMA data flows.

Figure 3: Typical DMA flows

So far we've focused on data movement, but a DMA transfer doesn't always have to involve data. We can use code overlays to improve performance, configuring the DMA controller to move code into L1 Instruction memory before execution. The code is usually staged in larger external memory and selectively brought into L1 as needed.

Programming the DMA controller
Let's take a look at what options we have in specifying DMA activity. We will start with the simplest model and build up to more flexible models that, in turn, increase in setup complexity.

For any type of DMA transfer, we always need to specify a starting source and destination address for data. In the case of a peripheral DMA, the peripheral's FIFO serves as either the source or the destination. When the peripheral serves as the source, a memory location (internal or external) serves as the destination address. When the peripheral serves as the destination, a memory location (internal or external) serves as the source address.

In the simplest MemDMA case, we need to tell the DMA controller the source address, the destination address and the number of words to transfer. With a peripheral DMA, we specify either the source or the destination, depending on the direction of the transfer. The word size of each transfer can be either 8, 16 or 32 bits. This type of transaction represents a simple one-dimensional (“1D”) transfer with a unity “stride.”

As part of this transfer, the DMA controller keeps track of the source and destination addresses as they increment. With a unity stride, the address increments by 1 byte for 8-bit transfers, 2 bytes for 16-bit transfers, and 4 bytes for 32-bit transfers. The above parameters configure a basic 1D DMA transfer, as shown in Figure 4, below.

Figure 4: 1D DMA examples, (a) with unity stride, (b) with non-unity stride

We can add more flexibility to a one-dimensional DMA simply by changing the stride. For example, with non-unity strides, we can skip addresses in multiples of the transfer sizes. That is, specifying a 32-bit transfer and striding by 4 samples results in an address increment of 16 bytes (four 32-bit words) after each transfer.
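
A 1D transfer with a configurable stride can be modeled in a few lines of C. This is only a software sketch of the addressing, not a device API: `dma_1d` and `gather_every_fourth` are hypothetical names, and the modify values are expressed in bytes. The helper reproduces the case above, 32-bit transfers striding by 4 samples, so the source address advances 16 bytes per transfer while the destination uses a unity stride of 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Software model of a 1D DMA transfer (illustrative only).
 *   count      : number of transfers
 *   word_size  : bytes per transfer (1, 2 or 4)
 *   src_modify : source address increment in bytes per transfer
 *   dst_modify : destination address increment in bytes per transfer
 */
void dma_1d(const uint8_t *src, uint8_t *dst, size_t count, size_t word_size,
            ptrdiff_t src_modify, ptrdiff_t dst_modify)
{
    assert(word_size == 1 || word_size == 2 || word_size == 4);
    for (size_t i = 0; i < count; i++) {
        const uint8_t *s = src + (ptrdiff_t)i * src_modify;
        uint8_t *d = dst + (ptrdiff_t)i * dst_modify;
        memcpy(d, s, word_size);   /* one transfer of word_size bytes */
    }
}

/* Gather every fourth 32-bit word: a stride of 4 samples is 16 bytes. */
void gather_every_fourth(const uint32_t in[16], uint32_t out[4])
{
    dma_1d((const uint8_t *)in, (uint8_t *)out, 4, sizeof(uint32_t),
           4 * (ptrdiff_t)sizeof(uint32_t), (ptrdiff_t)sizeof(uint32_t));
}
```

Setting both modify values equal to the word size gives the unity-stride case, which is simply a block copy.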

While the 1D DMA capability is widely used, the two-dimensional (2D) capability is even more useful, especially in video applications. The 2D feature is a direct extension to what we discussed for 1D DMA. In addition to an XCOUNT and XMODIFY value, we also program corresponding YCOUNT and YMODIFY values. It is easiest to think of the 2D DMA as a nested loop, where the inner loop is specified by XCOUNT and XMODIFY, and the outer loop is specified by YCOUNT and YMODIFY. A 1D DMA can then be viewed simply as an “inner loop” of the 2D transfer of the form:

for (y = 0; y < YCOUNT; y++) {        /* 2D outer loop */
    for (x = 0; x < XCOUNT; x++) {    /* 1D inner loop */
        /* Transfer loop body goes here */
    }
}

While XMODIFY determines the stride the DMA controller takes every time XCOUNT decrements, YMODIFY determines the stride taken whenever YCOUNT decrements. As is the case with XCOUNT and XMODIFY, YCOUNT is specified in terms of the number of transfers, while YMODIFY is specified as a number of bytes. Notably, YMODIFY can be negative, which allows the DMA controller to wrap back around to the beginning of the buffer. We'll explore this feature shortly.
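
The nested-loop view can be turned into a small address generator. The C sketch below is a software model, with `dma_2d_offsets` as a hypothetical name; it assumes, consistent with the worked examples later in this article, that the address advances by XMODIFY after every transfer in a row except the last, where YMODIFY is applied instead. It records the byte offset of each transfer relative to the start address.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Record the byte offset of every transfer in a 2D DMA, relative to the
 * start address. Returns the total number of transfers; at most out_len
 * offsets are stored in out.
 */
size_t dma_2d_offsets(size_t xcount, ptrdiff_t xmodify,
                      size_t ycount, ptrdiff_t ymodify,
                      ptrdiff_t *out, size_t out_len)
{
    assert(out != NULL || out_len == 0);
    ptrdiff_t addr = 0;
    size_t n = 0;
    for (size_t y = 0; y < ycount; y++) {          /* outer loop: YCOUNT */
        for (size_t x = 0; x < xcount; x++) {      /* inner loop: XCOUNT */
            if (n < out_len) {
                out[n] = addr;
            }
            n++;
            /* XMODIFY inside a row, YMODIFY after the row's last transfer */
            addr += (x + 1 < xcount) ? xmodify : ymodify;
        }
    }
    return n;
}
```

With XCOUNT = 3, XMODIFY = 1, YCOUNT = 2 and YMODIFY = -2, the model visits offsets 0, 1, 2 and then returns to 0, 1, 2, which is the wrap-around behavior a negative YMODIFY makes possible.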

For a peripheral DMA, the “memory side” of the transfer can be either 1D or 2D. On the peripheral side, though, it is always a 1D transfer. The only constraint is that the total number of bytes transferred on each side (source and destination) of the DMA has to be the same. For example, if we were feeding a peripheral from three 10-byte buffers, the peripheral would have to be set to transfer 30 bytes, using any supported combination of transfer width and transfer count.

MemDMA offers a bit more flexibility. For example, we can set up a 1D-to-1D transfer, a 1D-to-2D transfer, a 2D-to-1D transfer, and of course a 2D-to-2D transfer, as shown in Figure 5, below. The only constraint is that the total number of bytes being transferred on each end of the DMA transfer block has to be the same.

Figure 5: Possible Memory DMA configurations
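
The byte-count rule is easy to check before starting a transfer. The following C sketch uses hypothetical names (`dma_side_t`, `dma_side_bytes`, `dma_sides_match`) and treats a YCOUNT of 0 as a 1D transfer, which is an assumption of this model rather than a statement about any particular controller.

```c
#include <assert.h>
#include <stddef.h>

/* One side (source or destination) of a transfer; illustrative only. */
typedef struct {
    size_t xcount;     /* transfers per row */
    size_t ycount;     /* rows; 0 is treated here as a 1D transfer */
    size_t word_size;  /* bytes per transfer: 1, 2 or 4 */
} dma_side_t;

/* Total bytes moved on one side of the transfer. */
size_t dma_side_bytes(const dma_side_t *side)
{
    assert(side != NULL);
    size_t rows = (side->ycount == 0) ? 1 : side->ycount;
    return side->xcount * rows * side->word_size;
}

/* Nonzero when both sides move the same number of bytes. */
int dma_sides_match(const dma_side_t *a, const dma_side_t *b)
{
    return dma_side_bytes(a) == dma_side_bytes(b);
}
```

For the three 10-byte buffers mentioned earlier, a memory side of 10 x 3 byte-wide transfers matches a peripheral side of fifteen 16-bit transfers (30 bytes each), but not one of seven 32-bit transfers (28 bytes).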

Let's now look at some DMA configuration examples:

DMA Example 1: Pixel array
Consider a 4-pixel (per line) x 5-line array, with byte-sized pixel values, ordered as shown in Figure 6a, below.

Figure 6: Source and destination arrays for Example 1

While this data is shown as a matrix, it appears consecutively in memory as shown in Figure 6b, above.

We now want to create the array shown in Figure 6c using the DMA controller.

The source and destination DMA register settings for this transfer are:

Source            Destination
XCOUNT = 5        XCOUNT = 20
XMODIFY = 4       XMODIFY = 1
YCOUNT = 4        YCOUNT = 0
YMODIFY = -15     YMODIFY = 0

Source and destination word transfer size = 1 byte per transfer.

Let's walk through the process. In this example, we can use a MemDMA, with a 2D-to-1D transfer configuration. Because the source is 2D, it should be clear that the source channel's XCOUNT and YCOUNT are 5 and 4, respectively, since the array size is 4 pixels/line x 5 lines. Because we will use a 1D transfer to fill the destination buffer, we only need to program XCOUNT and XMODIFY on the destination side. In this case, the value of XCOUNT is set to 20, because that is the number of bytes that will be transferred. The YCOUNT value for the destination side is simply 0, and YMODIFY is also 0. You can see that the count values obey the rule we discussed earlier (i.e., 5 x 4 = 20 bytes on the source side and 20 bytes on the destination side).

Now let's talk about the correct values for XMODIFY and YMODIFY for the source buffer. We want to take the first value (0x1) and then advance 4 bytes to the next value of 0x1. We will repeat this five times (source XCOUNT = 5). The value of the source XMODIFY is 4, because that is the number of bytes the controller moves to get to the next pixel (counting the pixel it has just read). XCOUNT decrements by 1 every time a pixel is collected.

When the DMA controller reaches the end of the first row, XCOUNT decrements to 0, and YCOUNT decrements by 1. The value of YMODIFY on the source side then needs to bring the address pointer back to the second element in the array (0x2). At the instant this happens, the address pointer is still pointing to the last element collected for the first row (the fifth 0x1, at byte offset 16 from the start of the array). Counting back from that point in the array to the second pixel in the first line (byte offset 1), we traverse back by 15 elements. Therefore, the source YMODIFY = -15.

If the core carried out this transfer without the aid of a DMA controller, it would consume valuable cycles to read and write each pixel. Additionally, it would have to keep track of the addresses on the source and destination sides, tracking the stride values with each transfer.
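
The register settings for Example 1 can be checked with a byte-wide software model of a MemDMA. This is a sketch under stated assumptions, not controller code: `dma_cfg_t`, `memdma_bytes` and `example1` are hypothetical names, a YCOUNT of 0 is treated as a 1D transfer so that the table's destination values can be used as-is, and the source is assumed to hold five lines of 0x1 to 0x4, as the walkthrough implies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register-style settings for one side of a transfer (illustrative). */
typedef struct {
    size_t    xcount;
    ptrdiff_t xmodify;  /* bytes */
    size_t    ycount;   /* 0 is treated as a 1D transfer in this model */
    ptrdiff_t ymodify;  /* bytes */
} dma_cfg_t;

static size_t total_transfers(const dma_cfg_t *c)
{
    return c->xcount * (c->ycount == 0 ? 1 : c->ycount);
}

/* Address step after a transfer: XMODIFY inside a row, YMODIFY at its end. */
static ptrdiff_t next_step(const dma_cfg_t *c, size_t *x)
{
    if (++(*x) < c->xcount) {
        return c->xmodify;
    }
    *x = 0;
    return c->ymodify;
}

/* Byte-wide MemDMA model: both sides must move the same number of bytes. */
void memdma_bytes(const uint8_t *src, const dma_cfg_t *s,
                  uint8_t *dst, const dma_cfg_t *d)
{
    size_t total = total_transfers(s);
    assert(total == total_transfers(d));
    ptrdiff_t src_off = 0, dst_off = 0;
    size_t sx = 0, dx = 0;
    for (size_t n = 0; n < total; n++) {
        dst[dst_off] = src[src_off];
        src_off += next_step(s, &sx);
        dst_off += next_step(d, &dx);
    }
}

/* Example 1: 2D source (5, 4, 4, -15) into a 1D destination (20, 1, 0, 0). */
void example1(uint8_t dst[20])
{
    static const uint8_t src[20] = {
        1, 2, 3, 4,  1, 2, 3, 4,  1, 2, 3, 4,  1, 2, 3, 4,  1, 2, 3, 4
    };
    const dma_cfg_t s = {5, 4, 4, -15};
    const dma_cfg_t d = {20, 1, 0, 0};
    memdma_bytes(src, &s, dst, &d);
}
```

Under these assumptions the destination fills with five 0x1 values, then five 0x2 values, and so on, because the source pointer walks down one column at a time and YMODIFY = -15 moves it from byte offset 16 back to byte offset 1.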

Here's a more complex example involving a 2D-to-2D transfer.

Example 2: 2D-to-2D transfer
Let's assume now we start with the array that has a border of 0xFF values, shown in Figure 7 below.

Figure 7: Source and Destination arrays for Example 2

We want to keep only the inner square of the source matrix (shown in bold), but we also want to rotate the matrix 90 degrees as shown on the right side of Figure 7, above.

The register settings below will produce the transformation shown in this example, and now we will explain why.

Source            Destination
XCOUNT = 4        XCOUNT = 4
XMODIFY = 1       XMODIFY = 4
YCOUNT = 4        YCOUNT = 4
YMODIFY = 3       YMODIFY = -13

As a first step, we need to determine how to access data in the source array. As the DMA controller reads each byte from the source array, the destination builds the output array one byte at a time.

How do we get started? Well, let's look at the first byte that we want to move in the input array. It is shown in italics as (0x1). This will help us select the start address of the source buffer. We then want to sequentially read the next three bytes before we skip over the “border” bytes. The transfer size is assumed to be 1 byte for this example.

Because the controller reads 4 bytes in a row before skipping over some bytes to move to the next line in the array, the source XCOUNT is 4. Because the controller increments the address by 1 as it collects 0x2, 0x3, and 0x4, the source XMODIFY = 1. When the controller finishes the first line, the source YCOUNT decrements by 1. Since we are transferring four lines, the source YCOUNT = 4. Finally, the source YMODIFY = 3, because as we discussed earlier, the address pointer does not increment by XMODIFY after XCOUNT goes from 1 to 0. Setting YMODIFY = 3 ensures the next fetch will be 0x5.

On the destination side of the transfer, we will again program the location of the 0x1 byte as the initial destination address. Since the second byte fetched from the source address was 0x2, the controller will need to write this value to the destination address next. As you can see in the destination array in Figure 7, above, the destination address has to first be incremented by 4, which defines the destination XMODIFY value.

Since the destination array is 4×4 in size, the values of both the destination XCOUNT and YCOUNT are 4. The only value left is the destination YMODIFY. To calculate this value, we must compute how many bytes the destination address moves back in the array. After the destination YCOUNT decrements for the first time, the destination address is pointing to the value 0x4. The resulting destination YMODIFY value of -13 will ensure that a value of 0x5 is written to the desired location in the destination buffer.
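
The Example 2 settings can be exercised with the same kind of byte-wide software model. The layout used here is inferred from the register values rather than taken from the figure: a source YMODIFY of 3 with XCOUNT = 4 implies a 6-byte line pitch (a 6x6 array with a one-byte 0xFF border), the inner values are assumed to run from 0x1 to 0x10, and a destination XMODIFY of 4 with YMODIFY of -13 places 0x1 in the top-right corner of the 4x4 output. All function and type names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register-style settings for one side of a 2D transfer (illustrative). */
typedef struct {
    size_t    xcount;
    ptrdiff_t xmodify;  /* bytes */
    size_t    ycount;
    ptrdiff_t ymodify;  /* bytes */
} dma_cfg_t;

/* Address step after a transfer: XMODIFY inside a row, YMODIFY at its end. */
static ptrdiff_t next_step(const dma_cfg_t *c, size_t *x)
{
    if (++(*x) < c->xcount) {
        return c->xmodify;
    }
    *x = 0;
    return c->ymodify;
}

/* Byte-wide 2D-to-2D MemDMA model; src and dst are the start addresses. */
void memdma_2d_bytes(const uint8_t *src, const dma_cfg_t *s,
                     uint8_t *dst, const dma_cfg_t *d)
{
    size_t total = s->xcount * s->ycount;
    assert(total == d->xcount * d->ycount);
    ptrdiff_t src_off = 0, dst_off = 0;
    size_t sx = 0, dx = 0;
    for (size_t n = 0; n < total; n++) {
        dst[dst_off] = src[src_off];
        src_off += next_step(s, &sx);
        dst_off += next_step(d, &dx);
    }
}

/* Example 2: strip the 0xFF border and rotate the inner 4x4 block. */
void example2(uint8_t dst[16])
{
    static const uint8_t src[36] = {
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF,    1,    2,    3,    4, 0xFF,
        0xFF,    5,    6,    7,    8, 0xFF,
        0xFF,    9,   10,   11,   12, 0xFF,
        0xFF,   13,   14,   15,   16, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF
    };
    const dma_cfg_t s = {4, 1, 4, 3};     /* source settings from the table */
    const dma_cfg_t d = {4, 4, 4, -13};   /* destination settings */
    /* Start at the 0x1 byte on both sides: source row 1, column 1;
     * destination row 0, column 3. */
    memdma_2d_bytes(src + 7, &s, dst + 3, &d);
}
```

With this layout, each source row lands in a destination column, working from right to left, so the first output row reads 13, 9, 5, 1.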

In Part 2 in this series, we'll dig deeper into DMA, discussing the two main transfer classifications, Register-based and Descriptor-based, and when to use each type.

This series of four articles is based on material from “Embedded Media Processing,” by David Katz and Rick Gentile, published by Newnes/Elsevier.

Rick Gentile and David Katz are senior DSP applications engineers in the Blackfin Applications Group at Analog Devices, Inc.
