Using Direct Memory Access effectively in media-based embedded applications - Part 1

An embedded processor core is capable of performing multiple operations in a single cycle, including calculations, data fetches, data stores and pointer increments/decrements. In addition, the core can orchestrate data transfer between internal and external memory spaces by moving data into and out of the register file.
All this sounds great, but in reality, you can only achieve optimum performance in your application if data can move around without constantly bothering the core to perform the transfers.
This is where a direct memory access (DMA) controller comes into play. Processors need DMA capability to relieve the core from these transfers between internal/external memory and peripherals, or between memory spaces (Memory DMA, or "MemDMA").
There are two main types of DMA controllers. "Cycle-stealing" DMA uses spare (idle) core cycles to perform data transfers. This is not a workable solution for systems with heavy processing loads like multimedia flows. Instead, it is much more efficient to employ the second type: a DMA controller that operates independently from the core.
Why is this so important? Well, imagine if a processor's video port has a FIFO that needs to be read every time a data sample is available. In this case, the core has to be interrupted tens of millions of times each second. As if that's not disruptive enough, the core has to perform an equal amount of writes to some destination in memory. For every core processing cycle spent on this task, a corresponding cycle would be lost in the processing loop.
As good as DMA sounds in theory, PC-based software designers transitioning to the embedded world are often hesitant to rely on a DMA controller for moving data around in an application. This reluctance usually stems from the impression that the programming model becomes exponentially more complex when DMA is factored in.
Our goal, however, is to put your mind at ease, to show you how DMA is truly your friend. In this series of articles, we'll focus on the DMA controller itself, then show you how to optimize performance with DMA, and finally offer ideas on how best to manage the DMA controller as part of an overall framework.
Let's take a quick aside to discuss memory space nomenclature. Embedded processors have hierarchical memory architectures that strive to balance several levels of memory with differing sizes and performance levels. The memory closest to the core processor (known as "Level 1," or "L1," memory) operates at the full core clock rate.
The use of the term "closest" is literal, in that L1 memory is physically close to the core processor on the silicon die, so as to achieve the highest access and operating speeds. L1 memory is most often partitioned into Instruction and Data segments for efficient utilization of memory bus bandwidth.
Of course, L1 memory is necessarily limited in size. For systems that require larger code sizes, additional on-chip and off-chip memory is available—with increased latency. Larger on-chip memory is called Level 2 ("L2") memory, and we refer to external memory as Level 3 ("L3") memory. While the L1 memory size usually comprises tens of kBytes, the L2 memory on-chip is measured in hundreds of kBytes, and L3 can easily be megabytes.
The basics of DMA control
OK, back to the discussion at hand. A DMA controller is a unique peripheral devoted to moving data around a system. Think of it as a controller that connects internal and external memories with each DMA-capable peripheral via a set of dedicated buses. It is a peripheral in the sense that the processor programs it to perform transfers.
It is unique in that it interfaces to both memory and selected peripherals. Notably, only peripherals where data flow is significant (kBytes per second or greater) need to be DMA-capable. Good examples of these are video, audio and network interfaces. Lower-bandwidth peripherals can also be equipped with DMA capability, but it's less of an imposition on the core to step in and assist with data transfer on these interfaces.
In general, DMA controllers will include an address bus, a data bus, and control registers. An efficient DMA controller will possess the ability to request access to any resource it needs, without having the processor itself get involved. It must have the capability to generate interrupts. Finally, it has to be able to calculate addresses within the controller.
A processor might contain multiple DMA controllers. Each controller has multiple DMA channels, as well as multiple buses that link directly to the memory banks and peripherals, as shown in Figure 1 below. There are two types of DMA controllers in many high-performance processors. The first category, usually referred to as a System DMA Controller, allows access to any resource (peripherals and memory).
Cycle counts for this type of controller are measured in System Clocks (SCLKs) at frequencies up to 133MHz (using ADI's Blackfin processor as an example). The second type, an Internal Memory DMA controller (IMDMA), is dedicated to accesses between internal memory locations. Because the accesses are internal (L1 to L1, L1 to L2, or L2 to L2), cycle counts are measured in Core Clocks (CCLKs), which can exceed 600MHz rates.
|Figure 1: System and internal memory DMA architecture|
Each DMA controller has a set of FIFOs that act as a buffer between the DMA subsystem and peripherals or memory. For MemDMA, a FIFO exists on both the source and destination sides of the transfer. The FIFO improves performance by providing a place to hold data while busy resources are preventing a transfer from completing.
Configuring a DMA controller
Because you'll typically configure a DMA controller during code initialization, the core should only need to respond to interrupts after data set transfers are complete. You can program the DMA controller to move data in parallel with the core, while the core is doing its basic processing tasks - the jobs on which it's supposed to be focused!
In an optimized application, the core would never have to move any data, but rather only access it in L1 memory. The core wouldn't need to wait for data to arrive, because the DMA engine would have already made it available by the time the core was ready to access it. Figure 2 below shows a typical interaction between the processor and the DMA controller. The steps allocated to the processor involve setting up the transfer, enabling interrupts, and running code when an interrupt is generated. The interrupt input back to the processor can be used to signal that data is ready for processing.
|Figure 2: DMA Controller|
In addition to moving to and from peripherals, data also needs to move from one memory space to another. For example, source video might flow from a video port straight to L3 memory, because the working buffer size is too large to fit into internal memory. We don't want to make the processor fetch pixels from external memory every time we need to perform a calculation, so a memory-to-memory DMA ("MemDMA") can bring pixels into L1 or L2 memory for more efficient access times. Figure 3 below shows some typical DMA data flows.
|Figure 3: Typical DMA flows|
So far we've focused on data movement, but a DMA transfer doesn't always have to involve data. We can use code overlays to improve performance, configuring the DMA controller to move code into L1 Instruction memory before execution. The code is usually staged in larger external memory and selectively brought into L1 as needed.
Programming the DMA controller
Let's take a look at what options we have in specifying DMA activity. We will start with the simplest model and build up to more flexible models that, in turn, increase in setup complexity.
For any type of DMA transfer, we always need to specify a starting source and destination address for data. In the case of a peripheral DMA, the peripheral's FIFO serves as either the source or the destination. When the peripheral serves as the source, a memory location (internal or external) serves as the destination address. When the peripheral serves as the destination, a memory location (internal or external) serves as the source address.
In the simplest MemDMA case, we need to tell the DMA controller the source address, the destination address and the number of words to transfer. With a peripheral DMA, we specify either the source or the destination, depending on the direction of the transfer. The word size of each transfer can be either 8, 16 or 32 bits. This type of transaction represents a simple one-dimensional ("1D") transfer with a unity "stride."
As part of this transfer, the DMA controller keeps track of the source and destination addresses as they increment. With a unity stride, the address increments by 1 byte for 8-bit transfers, 2 bytes for 16-bit transfers, and 4 bytes for 32-bit transfers. The above parameters configure a basic 1D DMA transfer, as shown in Figure 4, below.
|Figure 4: 1D DMA examples -- (a) with unity stride, (b) with non-unity stride|
We can add more flexibility to a one-dimensional DMA simply by changing the stride. For example, with non-unity strides, we can skip addresses in multiples of the transfer sizes. That is, specifying a 32-bit transfer and striding by 4 samples results in an address increment of 16 bytes (four 32-bit words) after each transfer.
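To make the address arithmetic concrete, here is a minimal C sketch of 1D DMA address generation. The function name and signature are our own illustration, not a real driver API:

```c
/* Sketch of 1D DMA address generation (illustrative only; not a real
   driver API).  'xcount' transfers are issued starting at 'start',
   and the address advances by 'stride_bytes' after each transfer.
   For a unity stride, stride_bytes equals the transfer word size. */
void dma_1d_addresses(long start, int xcount, long stride_bytes, long *out)
{
    long addr = start;
    for (int i = 0; i < xcount; i++) {
        out[i] = addr;         /* address used for transfer i */
        addr += stride_bytes;  /* advance by the programmed stride */
    }
}
```

With 32-bit words and a unity stride, the addresses step by 4 bytes each transfer; striding by 4 samples steps them by 16 bytes, exactly as described above.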
While the 1D DMA capability is widely used, the two-dimensional (2D) capability is even more useful, especially in video applications. The 2D feature is a direct extension to what we discussed for 1D DMA. In addition to an XCOUNT and XMODIFY value, we also program corresponding YCOUNT and YMODIFY values. It is easiest to think of the 2D DMA as a nested loop, where the inner loop is specified by XCOUNT and XMODIFY, and the outer loop is specified by YCOUNT and YMODIFY. A 1D DMA can then be viewed simply as an "inner loop" of the 2D transfer of the form:
for y = 1 to YCOUNT          /* 2D outer loop */
    for x = 1 to XCOUNT      /* 1D inner loop */
        /* Transfer loop body goes here */
While the XMODIFY determines the stride value the DMA controller takes every time XCOUNT decrements, YMODIFY determines the stride taken whenever YCOUNT decrements. As is the case with XCOUNT and XMODIFY, YCOUNT is specified in terms of the number of transfers, while YMODIFY is specified as a number of bytes. Notably, YMODIFY can be negative, which allows the DMA controller to wrap back around to the beginning of the buffer. We'll explore this feature shortly.
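Putting the nested loops and the modify rules together, the full 2D address sequence can be modeled in C as follows. This is an illustrative sketch with our own naming, not a vendor API; the key point is that XMODIFY is applied after every transfer except the last one in a row, where YMODIFY is applied instead:

```c
/* Sketch of 2D DMA address generation (illustrative only).  Counts
   are in transfers; modify values are in bytes and may be negative.
   Writes xcount * ycount addresses into 'out'. */
void dma_2d_addresses(long start, int xcount, long xmodify,
                      int ycount, long ymodify, long *out)
{
    long addr = start;
    int n = 0;
    for (int y = 0; y < ycount; y++) {        /* outer loop: YCOUNT */
        for (int x = 0; x < xcount; x++) {    /* inner loop: XCOUNT */
            out[n++] = addr;
            /* XMODIFY within a row; YMODIFY at the row boundary */
            addr += (x < xcount - 1) ? xmodify : ymodify;
        }
    }
}
```

For instance, with XCOUNT = 5, XMODIFY = 4, YCOUNT = 4 and YMODIFY = -15 (the source settings used in Example 1 later in this article), the address sequence begins 0, 4, 8, 12, 16, and the negative YMODIFY then wraps it back to 1.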
For a peripheral DMA, the "memory side" of the transfer can be either 1D or 2D. On the peripheral side, though, it is always a 1D transfer. The only constraint is that the total number of bytes transferred on each side (source and destination) of the DMA has to be the same. For example, if we were feeding a peripheral from three 10-byte buffers, the peripheral side would have to be set up to transfer 30 bytes, using any combination of supported transfer width and count values.
MemDMA offers a bit more flexibility. For example, we can set up a 1D-to-1D transfer, a 1D-to-2D transfer, a 2D-to-1D transfer, and of course a 2D-to-2D transfer, as shown in Figure 5, below. The only constraint is that the total number of bytes being transferred on each end of the DMA transfer block has to be the same.
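The equal-byte-count rule is easy to check in code. Here is a tiny helper (our own, for illustration only) that computes one side's total, treating a YCOUNT of 0 as a 1D channel:

```c
/* Total bytes moved by one DMA channel: word size in bytes times the
   number of transfers.  A YCOUNT of 0 denotes a 1D channel, which we
   treat as a single "row".  (Illustrative helper, not a driver API.) */
long dma_total_bytes(int word_size, int xcount, int ycount)
{
    long rows = (ycount == 0) ? 1 : ycount;
    return (long)word_size * xcount * rows;
}
```

A 2D side with XCOUNT = 5 and YCOUNT = 4 thus matches a 1D side with XCOUNT = 20, since both move 20 bytes.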
|Figure 5: Possible Memory DMA configurations|
Let's now look at some DMA configuration examples:
DMA Example 1: Pixel array
Consider a 4-pixel (per line) x 5-line array, with byte-sized pixel values, ordered as shown in Figure 6a, below.
|Figure 6: Source and destination arrays for Example 1|
While this data is shown as a matrix, it appears consecutively in memory as shown in Figure 6b, above.
We now want to create the array shown in Figure 6c using the DMA controller.
The source and destination DMA register settings for this transfer are:
Source:             Destination:
XCOUNT  = 5         XCOUNT  = 20
XMODIFY = 4         XMODIFY = 1
YCOUNT  = 4         YCOUNT  = 0
YMODIFY = -15       YMODIFY = 0
Source and destination word transfer size = 1 byte per transfer.
Let's walk through the process. In this example, we can use a MemDMA with a 2D-to-1D transfer configuration. Because the source side is 2D, we program both count registers: the source XCOUNT is 5 because each column read spans the 5 lines, and the source YCOUNT is 4 because there are 4 columns (pixels per line). Because we will use a 1D transfer to fill the destination buffer, we only need to program XCOUNT and XMODIFY on the destination side. In this case, the destination XCOUNT is set to 20, because that is the total number of bytes that will be transferred, and the destination YCOUNT and YMODIFY are simply 0. You can see that the count values obey the rule we discussed earlier (5 x 4 = 20 bytes on each side).
Now let's talk about the correct values of XMODIFY and YMODIFY for the source buffer. We want to take the first value (0x1) and then skip 4 bytes to reach the next value of 0x1, repeating this five times in all (source XCOUNT = 5). The source XMODIFY is 4, because that is the number of bytes the controller advances past each pixel it has just read to reach the next one. XCOUNT decrements by 1 every time a pixel is collected.
When the DMA controller reaches the end of the first row, XCOUNT decrements to 0, and YCOUNT decrements by 1. The value of YMODIFY on the source side then needs to bring the address pointer back to the second element in the array (0x2). At the instant this happens, the address pointer is still pointing to the last element in the first row (0x1). Counting back from that point in the array to the second pixel in the first row, we traverse back by 15 elements. Therefore, the source YMODIFY=-15.
If the core carried out this transfer without the aid of a DMA controller, it would consume valuable cycles to read and write each pixel. Additionally, it would have to keep track of the addresses on the source and destination sides, tracking the stride values with each transfer.
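The whole transfer can be simulated in a few lines of C. The sketch below assumes, as the walkthrough implies, that the Figure 6 source array holds the values 0x1 through 0x4 on each of its 5 lines; the loop structure and modify values mirror the register settings above:

```c
/* Simulation of the Example 1 MemDMA: a 2D source (XCOUNT=5,
   XMODIFY=4, YCOUNT=4, YMODIFY=-15) feeding a 1D destination
   (XCOUNT=20, XMODIFY=1).  Illustrative model, not a driver API. */
void example1_transfer(const unsigned char src[20], unsigned char dst[20])
{
    long addr = 0;                      /* source start address */
    int n = 0;                          /* 1D destination index  */
    for (int y = 0; y < 4; y++) {       /* source YCOUNT = 4 */
        for (int x = 0; x < 5; x++) {   /* source XCOUNT = 5 */
            dst[n++] = src[addr];
            addr += (x < 4) ? 4 : -15;  /* XMODIFY = 4, YMODIFY = -15 */
        }
    }
}
```

Feeding it five lines of {1, 2, 3, 4} yields a destination of five 1s, then five 2s, and so on: the column-gathering behavior described above.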
Here's a more complex example involving a 2D-to-2D transfer.
Example 2: 2D-to-2D transfer
Let's assume now we start with the array that has a border of 0xFF values, shown in Figure 7 below.
|Figure 7: Source and Destination arrays for Example 2|
We want to keep only the inner square of the source matrix (shown in bold), but we also want to rotate the matrix 90 degrees as shown on the right side of Figure 7, above.
The register settings below will produce the transformation shown in this example, and now we will explain why.
Source:             Destination:
XCOUNT  = 4         XCOUNT  = 4
XMODIFY = 1         XMODIFY = 4
YCOUNT  = 4         YCOUNT  = 4
YMODIFY = 3         YMODIFY = -13
As a first step, we need to determine how to access data in the source array. As the DMA controller reads each byte from the source array, the destination builds the output array one byte at a time.
How do we get started? Well, let's look at the first byte that we want to move in the input array. It is shown in italics as 0x1. This determines the start address of the source buffer. We then want to sequentially read the next three bytes before skipping over the "border" bytes. The transfer size is assumed to be 1 byte for this example.
Because the controller reads 4 bytes in a row before skipping over some bytes to move to the next line in the array, the source XCOUNT is 4. Because the controller increments the address by 1 as it collects 0x2, 0x3, and 0x4, the source XMODIFY=1. When the controller finishes the first line, the source YCOUNT decrements by 1. Since we are transferring four lines, the source YCOUNT=4. Finally, the source YMODIFY=3, because as we discussed earlier, the address pointer does not increment by XMODIFY after XCOUNT goes from 1 to 0. Setting YMODIFY=3 ensures the next fetch will be 0x5.
On the destination side of the transfer, we will again program the location of the 0x1 byte as the initial destination address. Since the second byte fetched from the source was 0x2, the controller needs to write that value to the destination next. As you can see in the destination array in Figure 7, above, the destination address must first be incremented by 4, which defines the destination XMODIFY value.
Since the destination array is 4x4 in size, the values of both the destination XCOUNT and YCOUNT are 4. The only value left is the destination YMODIFY. To calculate this value, we must compute how many bytes the destination address moves back in the array. After the destination YCOUNT decrements for the first time, the destination address is pointing to the value 0x4. The resulting destination YMODIFY value of -13 will ensure that a value of 0x5 is written to the desired location in the destination buffer.
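As before, the transfer can be checked with a short simulation. We assume a 6x6 source array (the 4x4 inner square holding 0x1 through 0x10, plus a one-byte 0xFF border on every side), with the source start address at the 0x1 byte (offset 7) and the destination start at the 0x1 slot of the output array (offset 3). These offsets and values are our reading of Figure 7, not register values given in the text:

```c
/* Simulation of the Example 2 MemDMA: 2D-to-2D.  Source: XCOUNT=4,
   XMODIFY=1, YCOUNT=4, YMODIFY=3.  Destination: XCOUNT=4, XMODIFY=4,
   YCOUNT=4, YMODIFY=-13.  Illustrative model, not a driver API. */
void example2_transfer(const unsigned char src[36], unsigned char dst[16])
{
    long s = 7, d = 3;                   /* assumed start offsets */
    for (int y = 0; y < 4; y++) {        /* YCOUNT = 4 on both sides */
        for (int x = 0; x < 4; x++) {    /* XCOUNT = 4 on both sides */
            dst[d] = src[s];
            s += (x < 3) ? 1 : 3;        /* src XMODIFY=1, YMODIFY=3  */
            d += (x < 3) ? 4 : -13;      /* dst XMODIFY=4, YMODIFY=-13 */
        }
    }
}
```

Under these assumptions the result is the inner square rotated 90 degrees clockwise: the first source row {1, 2, 3, 4} lands in the rightmost destination column.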
In Part 2 in this series, we'll dig deeper into DMA, discussing the two main transfer classifications - Register-based and Descriptor-based - and when to use each type.
This series of four articles is based on material from "Embedded Media Processing," by David Katz and Rick Gentile, published by Newnes/Elsevier.
Rick Gentile and David Katz are senior DSP applications engineers in the Blackfin Applications Group at Analog Devices, Inc.