Audio engineers face the challenge of designing equipment that provides
better audio fidelity, supports more channels of audio, and handles
higher sampling rates and bit depths, while maintaining a tight,
real-time processing budget.
In many professional audio applications, the primary bottleneck in
system performance is the efficient movement of audio data. Over the
years, various innovations have been introduced to digital signal
processor (DSP) architectures that offload many I/O or data movement
tasks from the DSP core, allowing it to concentrate on signal
processing tasks.
The Direct Memory Access (DMA) engine is a critical component of
most high-performance DSPs today. Instead of having to explicitly
access memory or peripherals, the DSP can configure the DMA engine to
access the on- and off-chip resources, and facilitate the transfers
between them. These DMA transfers can be performed in parallel with
critical DSP core processing for optimal performance.
Standard DMA engines are well-suited for traditional 1-D and 2-D
algorithm processing, such as block copies and basic data sorting.
However, many audio algorithms require more complex data transfers. An
example of this would be a delay line, which is made up of audio
samples from a previous point in time and used to create a desired
audio effect (e.g., an echo). A traditional DMA engine does not manage
a delay line efficiently, so the DMA architecture itself must be
extended to process these audio algorithms efficiently.
Is there a need for DMA acceleration?
The answer to this question is yes, for two reasons. First, the number
of DMA channels available in many high-performance DSPs limits
professional (pro) audio applications. Second, because of the demands
of high-quality audio, traditional DMA in pro audio applications often
requires more CPU involvement.
Figure 1. Pro audio application block diagram
The block diagram above describes data flow in a typical pro audio
application. Each effect takes the output from the previous effect,
processes the data, and forwards its output to the next effect in the
data processing chain (e.g., output from the Phaser effect is input to
Delay effect, and output from Delay effect is sent to the Reverb).
The digital audio effects pictured above rely on delay lines for
their implementation, so a complete system of effects requires
multiple delay lines. Varying the length of the delays used
in the design changes the quality of the audio effect.
A delay line is a linear time-invariant system, with an output
signal that is a copy of the input signal delayed by x samples. The
most effective way of implementing a delay line on a DSP is to use a
circular buffer. A circular buffer is stored in a dedicated section of
linear memory; when the buffer is filled, new data is written, starting
at the beginning of the buffer.
Circular buffer data is written by one process and read by another
process, which requires a separate read and write pointer. The read and
write pointers are not allowed to cross each other, so that unread data
cannot be overwritten by new data. The size of the circular buffer is
dictated by the largest delay required by the effect. In this article,
the First In First Out (FIFO) and circular buffer names are used
interchangeably.
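A minimal Python sketch of such a delay line is shown here for illustration. The class name `DelayLine` and its methods are hypothetical and are not taken from any DSP library. For brevity, the read position is derived from the write pointer rather than kept as a separate read pointer.

```python
class DelayLine:
    """Fixed-size circular buffer holding the most recent `size` samples."""

    def __init__(self, size):
        self.buf = [0.0] * size
        self.size = size
        self.wp = 0  # write pointer: index of the next slot to be written

    def write(self, sample):
        """Store one sample and advance the write pointer."""
        self.buf[self.wp] = sample
        # Wrap back to the start of the buffer once the end is reached.
        self.wp = (self.wp + 1) % self.size

    def read(self, delay):
        """Return the sample written `delay` writes ago (1 <= delay <= size)."""
        if not 1 <= delay <= self.size:
            raise ValueError("delay must be between 1 and the buffer size")
        return self.buf[(self.wp - delay) % self.size]
```

For example, after the samples 1 through 6 are written into a four-entry buffer, `read(1)` returns 6 and `read(4)` returns 3, the oldest sample still held. The buffer size bounds the largest delay that can be requested.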
When a traditional DMA engine is used to move data in a delay-based
audio effect, a separate circular buffer is assigned to each effect
from the signal processing chain. The input data fed to a particular
audio effect is stored in the circular buffer assigned to the effect. A
more detailed data flow is shown in the block diagram in Figure 2
below, where circular buffers are represented by rings. The ring
representation is used because it shows the wrapping of the linear
address space assigned to a circular buffer. As a pointer advances
through a circular buffer, its address increases until the wraparound
condition is hit, at which point the pointer resets to the lowest
memory address (the starting point) of the circular buffer.
Figure 2. Pro audio application block diagram of data flow when
traditional DMA engine is used
To produce different delays, a DMA must retrieve delayed data from
different locations within the delay line. If block processing is used,
a block of data is retrieved instead of just one sample.
Traditional DMA engines usually allow programmers to specify several
parameters that entirely describe the desired transfer. Typically,
these parameters are the source address, destination address, the
indexes for the source and destination, as well as the transfer count.
Each DMA transfer occupies one of the channels that a typical DMA
engine provides.
In the block diagram above, there are five circular buffers. A
traditional DMA engine must be programmed to move data into and out of
each of these buffers. In the application shown above, a minimum of 11
DMA transfers are required to process one block of data.
This is the absolute minimum number of DMA transfers required,
assuming that only one delay per effect is retrieved from each circular
buffer. In a typical application, the number of DMA transfers per data
block would be much higher. For example, a reverb implementation
always requires more than one delay from its circular buffer.
The number of required traditional DMA transfers increases as
the number of implemented audio effects increases. Therefore, the
maximum number of traditional DMA channels available in a system can
limit the number of audio effects that can be implemented.
Limits of traditional DMA in pro audio applications
Standard DMA engines perform quite well when moving long blocks of data
that are either contiguous or spaced at fixed intervals. An example of
a fixed-interval transfer is when the DMA engine accesses every fourth
data sample of the delay line.
Typical DMA performance is not optimal when accesses are neither
contiguous nor at fixed intervals. When a traditional DMA engine moves
circular buffer data to generate digital audio effects, the CPU must
intervene to program the DMA parameters at least twice while processing
one data block. The CPU must program the DMA parameters and also step
in to manage the delay lines when data accesses wrap around a
circular buffer boundary.
Figure 3. Chorus block diagram
A simple algorithm that illustrates this point is the chorus
effect, shown in Figure 3 above. The chorus effect is often used to
alter the sound of an instrument so that it sounds as if multiple
instruments are playing. If the instrument were a human voice,
this effect would tend to make the single voice sound like a choir. We
perceive multiple voices or instruments because there is always
imprecise synchronization and slight pitch variation when multiple
voices or instruments play at the same time. These are the
principal characteristics of a chorus effect.
In Figure 3, the chorus is presented as a combination of the input with
two of its delayed copies. The pitch deviation is modeled by a slowly
varying amount of delay in the delayed input copies. The amount of
this deviation and its frequency are controlled by a low-frequency
oscillator (LFO).
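As a rough illustration of the slowly varying delay, the following Python sketch computes a delay that an LFO sweeps around a base value. The function name, its parameters, the sinusoidal LFO shape, and the rounding to whole samples are all assumptions made for this example; the article does not specify them.

```python
import math


def lfo_delay(n, base_delay, depth, lfo_hz, sample_rate):
    """Delay in whole samples at sample index n.

    The delay swings sinusoidally between base_delay - depth and
    base_delay + depth at a rate of lfo_hz cycles per second.
    """
    phase = 2.0 * math.pi * lfo_hz * n / sample_rate
    return base_delay + round(depth * math.sin(phase))
```

With a 1 Hz LFO at a 48 kHz sample rate, a base delay of 100 samples, and a depth of 10 samples, the delay starts at 100, peaks at 110 a quarter of a second later, and bottoms out at 90 after three quarters of a second.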
As shown in the chorus implementation diagram in Figure 4 below,
the delay line is implemented by using a circular buffer (represented by
two concentric circles). The chorus implementation presented in Figure
4 uses block processing, with a block size of four samples. Incoming
samples are stored in the circular buffer in the clockwise direction.
Figure 4. Block diagram of chorus implementation
Block processing handles a block (multiple samples) of data at the
same time, instead of only one sample at a time. In this example, the
CPU waits until four input samples are available, and then calculates
four output samples. It processes these samples by combining a block of
input samples with two blocks of delayed data fetched from the circular
buffer.
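The block computation described above can be sketched in Python as follows. This is a simplified model under stated assumptions: the function name `chorus_block`, the fixed (non-LFO) delays, and the `mix` gain are hypothetical, and the delayed samples are read directly from the buffer rather than fetched by a DMA engine.

```python
def chorus_block(buf, wp, block_in, delay_a, delay_b, mix=0.5):
    """Process one block of input samples through a two-tap chorus.

    buf      -- circular buffer (list) holding past input samples
    wp       -- write pointer: index where the first new sample goes
    block_in -- the new input samples (four per block in the article)
    delay_a, delay_b -- tap delays in samples, each >= 1
    Returns the output block and the updated write pointer.
    """
    size = len(buf)
    out = []
    for i, x in enumerate(block_in):
        pos = (wp + i) % size
        # Fetch the two delayed copies, wrapping around the buffer.
        a = buf[(pos - delay_a) % size]
        b = buf[(pos - delay_b) % size]
        buf[pos] = x  # store the input sample for future reference
        out.append(x + mix * (a + b))
    return out, (wp + len(block_in)) % size
```

Starting from an empty eight-sample buffer, the input block `[1.0, 2.0, 3.0, 4.0]` with delays of 2 and 3 samples produces `[1.0, 2.0, 3.5, 5.5]` and advances the write pointer to 4.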
In the case that a traditional DMA controller is used (Figure 5,
below), the CPU is notified by an interrupt every time a block of input
data is ready. The CPU then calculates chorus output.
Figure 5. Chorus implementation timeline when traditional DMA is employed
The DMA engine in this example must perform two key operations:
1) Store a block of input samples to the circular buffer (for future
reference)
2) Retrieve two blocks of delayed data from the circular buffer
(prepare delayed data for the next block of input samples).
In this case, the CPU must assist the DMA by tracking and
programming the source and destination addresses—and by intervening
when data accesses wrap around the buffer boundaries. This requires a
reconfiguration of the DMA engine before each transfer.
Each offset must be calculated by the CPU (or taken from a
pre-calculated table) before the CPU reconfigures the DMA. This
consumes CPU bandwidth, since the CPU must reconfigure the DMA engine
before each transfer. In Figure 5, the CPU activity is presented on
two timelines: the first shows the CPU activity required to process
the chorus effect, and the second shows the CPU activity required to
configure the DMA.
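The CPU bookkeeping described above can be sketched as follows. This is an illustrative Python model, not driver code for any real DMA controller: the function names, the (offset, length) descriptor format, and the convention that a delay is measured back from the write pointer are all assumptions.

```python
def split_at_wrap(start, count, size):
    """Split a circular-buffer access into linear (offset, length) segments.

    A linear DMA transfer cannot cross the end of the buffer, so an access
    that wraps must be programmed as two separate transfers.
    """
    if not (0 <= start < size and 0 < count <= size):
        raise ValueError("access must fit inside the buffer")
    first = min(count, size - start)
    segments = [(start, first)]
    if first < count:
        segments.append((0, count - first))  # remainder after wraparound
    return segments


def chorus_block_transfers(wp, block, delays, size):
    """List the linear transfers the CPU would program for one block.

    One write of the new input block, plus one read per delayed block,
    each split in two if it crosses the buffer boundary.
    """
    transfers = [("write", seg) for seg in split_at_wrap(wp, block, size)]
    for d in delays:
        start = (wp - d) % size  # offset recomputed by the CPU every block
        transfers += [("read", seg) for seg in split_at_wrap(start, block, size)]
    return transfers
```

With a 16-sample buffer, a write pointer at 14, a block of four samples, and delays of 6 and 10 samples, the write wraps and must be split in two, so the CPU would program four linear transfers for this one block.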
In the case of complex digital audio effects, such as reverb, the
number of delayed blocks that must be retrieved from a circular buffer
can reach 256 or more. In addition, these delay blocks are not
at fixed intervals, and their offsets change continually as the
algorithm runs. With this dramatic increase in the number of accesses
to data in the circular buffer, the more complicated digital audio
effect algorithms, like reverb, demand more CPU cycles. This leaves
less CPU bandwidth for the actual application.
When several digital audio effects follow one after another (as
shown in Figure 1), the CPU will have to assist the DMA in moving data
required and produced by each processing stage. The CPU and the DMA
must be synchronized during these tasks. The synchronization is
facilitated by the DMA, which interrupts the CPU.
Therefore, the number of interrupts in the system rises as
system complexity increases. These interrupts come with a high
overhead, since registers must be saved to preserve context. In
addition, interrupts go through the processing pipeline
and disrupt the efficiency of the instruction cache.
Preserving the context can consume a good number of cycles and
further degrade the performance of the instruction cache. Excessive
interruptions of the pipeline also directly affect overall
performance.
The advantages of audio-enhanced DMA
Traditional DMA engines are not a perfect fit for pro audio
applications. An ideal DMA engine tailored for a pro audio application
must:
1) Be smart enough to relieve the CPU core of data movement
calculations
2) Keep the number of required transfer-related interrupts in a system
low (the number of data transfer-related interrupts should not rise as
system complexity increases)
3) Keep the number of DMA channels required for the pro audio
application independent of the complexity of the application.
An audio-enhanced DMA engine should be designed to accelerate audio
processing algorithms through architectural innovations that offload
processing from the CPU core and increase the overall level of
concurrent and parallel processing the DSP is able to achieve. TI
introduced an audio-enhanced DMA engine in TI’s C672x family of
high-performance DSPs. This architecture can perform the 1-D, 2-D and
3-D transfers of a standard DMA engine, as well as perform optimized
delay-based transfers common in audio effects.
To perform circular buffer transfers efficiently, we implemented the
audio-enhanced DMA engine in the C672x family as a dual data movement
accelerator (dMAX) that makes use of Table-Guided FIFO Transfers.
A Table-Guided FIFO Transfer moves a number of taps (blocks of data)
between the circular buffer and the linear memory. With a circular
buffer read, a delay table guides the audio-enhanced DMA engine
controller to retrieve only taps at specified offsets from the FIFO
Read Pointer (RP). In the same way, with a circular buffer write, a
delay table guides the controller to write taps at specified offsets
from the FIFO Write Pointer (WP).
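The sketch below models a Table-Guided FIFO read and write in plain Python to make the idea concrete. It is not the dMAX programming interface: the function names, the argument layout, and the choice to add each table offset to the pointer (with wraparound) are assumptions made for illustration, and the real controller's pointer update rules are not modeled.

```python
def table_guided_read(fifo, rp, delay_table, tap_len):
    """Gather one tap of tap_len samples at each table offset from rp.

    The taps are packed back to back into a linear list, in table order.
    """
    size = len(fifo)
    linear = []
    for offset in delay_table:
        start = (rp + offset) % size
        # Each tap may itself wrap around the end of the FIFO.
        linear.extend(fifo[(start + i) % size] for i in range(tap_len))
    return linear


def table_guided_write(fifo, wp, delay_table, tap_len, linear):
    """Scatter consecutive taps from linear memory into the FIFO.

    Tap t is written starting at wp + delay_table[t], with wraparound.
    """
    size = len(fifo)
    for t, offset in enumerate(delay_table):
        start = (wp + offset) % size
        for i in range(tap_len):
            fifo[(start + i) % size] = linear[t * tap_len + i]
```

In this model the caller supplies the pointer, the table, and the tap length once per transfer, and every offset and wraparound is resolved inside the loop. That corresponds to the work the text attributes to the controller rather than to the CPU.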
Adding support for Table-Guided FIFO Transfers enables the
audio-enhanced DMA engine controller to divide one circular buffer into
multiple sections, where each section can correspond either to
different channels or to different audio effects. In other words,
support for Table-Guided FIFO Transfers enables the controller to
use only one circular buffer per system. The block diagram in Figure 6
below shows the data flow for an audio-enhanced DMA engine
implementation of the system from Figure 1.
Figure 6. Pro audio application block diagram of data flow when dMAX
engine is used
The circular buffer in Figure 6 is divided into five sections, and
each section is assigned to one audio effect from Figure 1. With only
one circular buffer per system, all required circular buffer reads can
be handled by a single FIFO read transfer parameter entry. Likewise,
all required circular buffer writes can be handled by a single FIFO
write transfer parameter entry. In other words, only two FIFO transfer
parameter entries are required to describe all circular buffer
transfers in the system.
A maximum of 32,768 taps can be retrieved from or stored to the
circular buffer during one FIFO read or write. An increase in
application complexity does not require an increase in the number of
audio-enhanced DMA engine channels. Rather, it increases the number of
taps moved between the FIFO and the linear memory, while all data
transfers can still be managed by using only two channels.
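One way to picture a single sectioned buffer served by one delay table is the Python sketch below. The section layout, the effect names, and the helper `build_delay_table` are hypothetical; the article does not describe how offsets are assigned to sections, so this is only one plausible arrangement. The tap limit is the figure quoted above.

```python
MAX_TAPS = 32768  # per-transfer tap limit quoted in the text


def build_delay_table(sections, delays):
    """Flatten per-effect delays into one table of buffer offsets.

    sections -- {effect_name: (base_offset, length)} within the one buffer
    delays   -- {effect_name: [delay, ...]} relative to the section base
    """
    table = []
    for name, (base, length) in sections.items():
        for d in delays.get(name, []):
            if not 0 <= d < length:
                raise ValueError(f"delay {d} does not fit in section {name!r}")
            table.append(base + d)
    if len(table) > MAX_TAPS:
        raise ValueError("too many taps for one FIFO transfer")
    return table
```

Adding an effect or another delay to this model lengthens the table but leaves the number of transfers unchanged, which mirrors the claim that complexity increases the tap count rather than the channel count.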
For an application in which one block of data is moved to or from a
circular buffer, the accompanying code listings contrast the CPU flow
when a traditional DMA engine is used with the flow for the
alternative audio-enhanced DMA engine architecture.
Note that as the number of required delays grows, the traditional DMA
code listing becomes more complex. The dMAX listing, however,
demonstrates that when the audio-enhanced DMA engine is used, the CPU
workload is no longer a function of the number of delays that must be
accessed in the system.
With a traditional DMA, each offset must be individually computed
and transferred to the DMA engine for every tap in the filter. In the
alternative audio-enhanced approach, the controller requires only
that its parameters be initialized; after that, it automatically
maintains the transfer state for all taps.
A programmer can pre-calculate several offset tables for a
Table-Guided FIFO Transfer. The tables are referenced by pointers, and
developers can employ a ping-pong technique to switch between different
tables on the fly. The term ping-pong refers to alternating between
multiple tables that the developer has prepared.
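A minimal sketch of the ping-pong idea, assuming two tables and a single index that selects the active one. The class and method names are hypothetical and do not correspond to any dMAX register or API.

```python
class PingPongTables:
    """Two offset tables: one in use, one free to be updated."""

    def __init__(self, table_a, table_b):
        self.tables = [list(table_a), list(table_b)]
        self.active = 0  # index of the table the transfer currently uses

    def active_table(self):
        """Table that guides the current transfer."""
        return self.tables[self.active]

    def idle_table(self):
        """Table that can be rewritten without disturbing the transfer."""
        return self.tables[1 - self.active]

    def swap(self):
        """Switch tables by flipping the reference, without copying data."""
        self.active = 1 - self.active
```

New offsets are written into the idle table while the active one is in use. A later call to `swap()` makes them take effect in a single step.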
As shown in the dMAX code listings below, while the Table-Guided
FIFO Transfers significantly reduce the interaction between the CPU and
the dual data movement accelerator compared to a traditional DMA
engine, the most significant savings come from a reduction in the
number of interrupts the CPU must handle.

This simple and efficient scheme keeps the number of interrupts in the
system low, even when system complexity increases. The reduction in the
number of required interrupts follows from the fact that the dMAX
requires fewer data channels for its transfers.
Real-World acceleration
In effect, the architectural enhancements of audio-enhanced DMA
engines significantly offload the CPU core from having to perform data
movement calculations and management, enabling the CPU to commit more
cycles to what it does best: Multiply Accumulate (MAC) processing.
The benefits to an audio DSP's performance, when paired with an
audio-enhanced DMA, can be quite substantial. In one experiment
involving the implementation of a Schroeder reverb algorithm,
leveraging the Table-Guided FIFOs of the dual data movement
accelerator dropped the CPU utilization from 20 percent to 3
percent, a more than 6x reduction in CPU load (see Table 1,
below).
Table 1: Performance of a stereo 6-tap delay line reverb using varying
DMA architectures
Another important architectural characteristic of the audio-enhanced
DMA engine approach is that it has dual engines that operate
independently of each other. Each engine has its own master to the
crossbar switch that connects all the peripherals on the device.
Such comprehensive interconnection enables it to facilitate and
accelerate transfers between I/O, memory and processing resources. The
availability of two engines further enhances the movement of data
throughout the device.
For example, one engine can move data from an SDRAM connected to the
External Memory Interface (EMIF) to data memory, while real-time data
from a Multi-Channel Audio Serial Port (McASP) interface is stored in
another section of memory without any overlap or contention between the
two engines. Multi-channel audio applications will particularly benefit
from the concurrent transfer operations that are enabled by the dual
data movement accelerator structure.
The dual engine design of this alternative DMA architecture can
effectively double the data transfer capacity of the DSP, since
transfers can take place in an interleaved fashion. One dMAX
unit can transfer to the port during the processing overhead of the
other dMAX unit's transfer.
By reducing CPU core involvement in data transfers, audio-enhanced
DMA engines compound performance savings by not only offloading data
transfer management functions from the CPU, but also by executing them
in parallel to the CPU, enabling audio engineers to bring higher
quality audio processing to a new level.
Zoran Nikolic is Senior Applications Engineer and Gerard Andrews
is DSP Audio Marketing Manager at Texas
Instruments Inc.