Accelerating complex audio DSP algorithms with audio-enhanced DMA



Embedded.com
Audio engineers face the challenge of designing equipment that provides better audio fidelity, supports more channels of audio, and handles higher sampling rates and bit depths, while maintaining a tight, real-time processing budget.

In many professional audio applications, the primary bottleneck in system performance is the efficient movement of audio data. Over the years, various innovations have been introduced to digital signal processor (DSP) architectures that offload many I/O or data movement tasks from the DSP core, allowing it to concentrate on signal processing tasks.

The Direct Memory Access (DMA) engine is a critical component of most high-performance DSPs today. Instead of having to explicitly access memory or peripherals, the DSP can configure the DMA engine to access the on- and off-chip resources, and facilitate the transfers between them. These DMA transfers can be performed in parallel with critical DSP core processing for optimal performance.

Standard DMA engines are well-suited for traditional 1-D and 2-D algorithm processing, such as block copies and basic data sorting. However, many audio algorithms require more complex data transfers. One example is a delay line, which is made up of audio samples from a previous point in time and is used to create a desired audio effect (e.g., an echo). A traditional DMA engine does not manage a delay line efficiently, so the DMA architecture itself needs innovation to handle the data movement these audio algorithms require.

Is there a need for DMA acceleration?
The answer to this question is yes, for two reasons. First, the number of DMA channels available in many high-performance DSPs limits what a professional (pro) audio application can implement. Second, because of the demand for high-quality audio, a traditional DMA engine in a pro audio application often requires more CPU involvement.

Figure 1. Pro audio application block diagram

The block diagram above describes data flow in a typical pro audio application. Each effect takes the output from the previous effect, processes the data, and forwards its output to the next effect in the data processing chain (e.g., output from the Phaser effect is input to Delay effect, and output from Delay effect is sent to the Reverb).

The digital audio effects pictured above rely on delay lines for their implementation, so a complete system of effects requires multiple delay lines. Varying the length of the delays used in the design changes the quality of the audio effect.

A delay line is a linear time-invariant system, with an output signal that is a copy of the input signal delayed by x samples. The most effective way of implementing a delay line on a DSP is to use a circular buffer. A circular buffer is stored in a dedicated section of linear memory; when the buffer is filled, new data is written, starting at the beginning of the buffer.

Circular buffer data is written by one process and read by another process, which requires a separate read and write pointer. The read and write pointers are not allowed to cross each other, so that unread data cannot be overwritten by new data. The size of the circular buffer is dictated by the largest delay required by the effect. In this article, the First In First Out (FIFO) and circular buffer names are used interchangeably.
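The circular-buffer delay line described above can be sketched in a few lines of Python. This is only an illustrative model, not DSP code: the class name `DelayLine`, the helper `delay_signal`, and the choice to express the read position as a distance behind the write pointer are all assumptions made for the example.

```python
class DelayLine:
    """Delay line backed by a fixed-size circular buffer (illustrative model)."""

    def __init__(self, max_delay):
        # One extra slot so a delay of max_delay samples is still in the
        # buffer after the newest sample has been written.
        self.size = max_delay + 1
        self.buf = [0.0] * self.size
        self.wp = 0  # write pointer: index where the next sample is stored

    def write(self, sample):
        self.buf[self.wp] = sample
        self.wp = (self.wp + 1) % self.size  # wrap to the start when full

    def read(self, delay):
        """Return the sample written `delay` writes before the newest one."""
        if not 0 <= delay < self.size:
            raise ValueError("delay exceeds buffer length")
        # The newest sample sits one slot behind the write pointer.
        return self.buf[(self.wp - 1 - delay) % self.size]


def delay_signal(samples, delay):
    """y[n] = x[n - delay], with silence before the first delayed sample."""
    line = DelayLine(max_delay=delay)
    out = []
    for x in samples:
        line.write(x)
        out.append(line.read(delay))
    return out
```

Sizing the buffer from `max_delay` mirrors the point made above: the largest delay the effect needs dictates the buffer length.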

When a traditional DMA engine is used to move data in a delay-based audio effect, a separate circular buffer is assigned to each effect in the signal processing chain. The input data fed to a particular audio effect is stored in the circular buffer assigned to that effect. A more detailed data flow is shown in the block diagram in Figure 2 below, where circular buffers are represented by rings. The ring representation is used because it shows the wrapping of the linear address space assigned to a circular buffer. As a pointer advances through a circular buffer, its address increases until the wraparound condition is hit, causing the pointer to reset to the lowest memory address, the starting point of the circular buffer.

Figure 2. Pro audio application block diagram of data flow when traditional DMA engine is used

To produce different delays, a DMA must retrieve delayed data from different locations within the delay line. If block processing is used, a block of data is retrieved instead of just one sample.

Traditional DMA engines usually allow programmers to specify several parameters that entirely describe the desired transfer. Typically, these parameters are the source address, the destination address, the indexes for the source and destination, and the transfer count. Each DMA transfer typically occupies one of the DMA engine's channels.
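As a rough illustration of how those parameters describe a transfer, the following Python model executes one descriptor against a flat list that stands in for memory. The `DmaTransfer` field names and the `run_transfer` helper are hypothetical and do not correspond to any particular device's register layout.

```python
from dataclasses import dataclass


@dataclass
class DmaTransfer:
    """Hypothetical 1-D transfer descriptor; field names are illustrative."""
    src_addr: int    # first source element
    dst_addr: int    # first destination element
    src_index: int   # source stride, in elements
    dst_index: int   # destination stride, in elements
    count: int       # number of elements to move


def run_transfer(memory, t):
    """Execute one descriptor against a list that stands in for memory."""
    for i in range(t.count):
        memory[t.dst_addr + i * t.dst_index] = memory[t.src_addr + i * t.src_index]
```

With `src_index=4` and `dst_index=1`, for instance, the model gathers every fourth element into a contiguous destination, which is the kind of fixed-interval access a standard engine handles well.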

In the block diagram above, there are five circular buffers. A traditional DMA engine must be programmed to move data in and out from each of these buffers. In the application shown above, a minimum of 11 DMA transfers are required to process one block of data.

This is the absolute minimum number of DMA transfers required, assuming that only one delay per effect is retrieved from each circular buffer. In a typical application, the number of DMA transfers per data block would be much higher. For example, reverb effect implementation always requires more than one delay from its circular buffer.

The number of required traditional DMA transfers will increase as the number of implemented audio effects increases. Therefore, the maximum number of traditional DMA channels available in a system can limit the number of audio effects that can be implemented.

Limits of traditional DMA in pro audio applications
Standard DMA engines perform quite well when moving long blocks of data at either contiguous or fixed intervals. An example of a fixed interval transfer is when the DMA engine accesses every fourth data sample of the delay line.

Typical DMA performance is not optimal when accesses are neither contiguous nor at fixed intervals. When a traditional DMA engine moves circular buffer data to generate digital audio effects, the CPU must intervene to program the DMA parameters at least twice while processing one data block. The CPU must also intervene in managing the delay lines whenever a data access wraps around a circular buffer boundary.
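To make the wraparound case concrete, the sketch below shows the kind of bookkeeping the CPU ends up doing for a linear-only DMA engine: a block that crosses the end of the circular buffer has to be split into two linear pieces, each needing its own transfer setup. The function name and argument names are assumptions made for illustration.

```python
def split_circular_read(base, size, start, count):
    """Return the linear (address, length) pieces covering a circular read.

    base  : lowest address of the circular buffer
    size  : buffer length in samples
    start : offset of the first sample to read, 0 <= start < size
    count : number of consecutive samples wanted, 1 <= count <= size
    """
    if not (0 <= start < size and 1 <= count <= size):
        raise ValueError("request does not fit the circular buffer")
    first = min(count, size - start)       # samples before the wrap point
    pieces = [(base + start, first)]
    if count > first:                      # remainder restarts at the base
        pieces.append((base, count - first))
    return pieces
```

A block that stays inside the buffer yields one piece; a block that crosses the boundary yields two, which is where the extra CPU intervention comes from.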

Figure 3. Chorus block diagram

A simple algorithm that illustrates this point is the chorus effect, shown in Figure 3 above. The chorus effect is often used to alter the sound of an instrument so that it sounds as if multiple instruments are playing. If the instrument were a human voice, the effect would tend to make the single voice sound like a choir. We perceive multiple voices or instruments because there is always imprecise synchronization and slight pitch variation when several voices or instruments perform at the same time. These are the principal characteristics of a chorus effect.

In Figure 3, the chorus is presented as a combination of the input with two of its delayed copies. The pitch deviation is modeled by a slowly varying amount of delay in the delayed copies of the input. The amount of this deviation and its frequency are controlled by a low-frequency oscillator (LFO).
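The structure in Figure 3 can be approximated with a short sample-by-sample Python sketch. Everything numeric here (base delays, modulation depth, LFO rate, mix level, the per-voice phase offset) is an assumed illustrative value, and the code indexes the input list directly instead of using a circular buffer, purely to keep the example short.

```python
import math


def chorus(samples, sample_rate, base_delays=(0.020, 0.030),
           depth=0.002, lfo_hz=0.5, mix=0.5):
    """Input plus two delayed copies whose delays are swept by an LFO.

    Delays are rounded to whole samples; a production effect would
    normally interpolate between samples instead.
    """
    out = []
    for n, x in enumerate(samples):
        y = x
        for k, base in enumerate(base_delays):
            # Offset the LFO phase per voice so the copies move independently.
            phase = 2 * math.pi * lfo_hz * n / sample_rate + k * math.pi / 2
            delay = max(0, round((base + depth * math.sin(phase)) * sample_rate))
            # Anything before the start of the signal is treated as silence.
            delayed = samples[n - delay] if n - delay >= 0 else 0.0
            y += mix * delayed
        out.append(y)
    return out
```

Because the delay changes from sample to sample, the read position in the delay line is never at a fixed interval, which is exactly the access pattern that is awkward for a traditional DMA engine.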

As shown in the chorus implementation diagram in Figure 4 below, the delay line is implemented with a circular buffer (represented by two concentric circles). The implementation in Figure 4 assumes block processing, with a block size of four samples. Incoming samples are stored to the circular buffer in the clockwise direction.

Figure 4. Block diagram of chorus implementation

Block processing handles a block (multiple samples) of data at a time, instead of only one sample at a time. In this example, the CPU waits until four input samples are available, and then calculates four output samples. It processes these samples by combining a block of input samples with two blocks of delayed data fetched from the circular buffer.

In the case that a traditional DMA controller is used (Figure 5, below), the CPU is notified by an interrupt every time a block of input data is ready. The CPU then calculates chorus output.

Figure 5. Chorus implementation timeline when traditional DMA is employed

The DMA engine in this example must perform two key operations:

1) Store a block of input samples to the circular buffer (for future reference)
2) Retrieve two blocks of delayed data from the circular buffer (to prepare delayed data for the next block of input samples).

In this case, the CPU must assist the DMA by tracking and programming the source and destination addresses—and by intervening when data accesses wrap around the buffer boundaries. This requires a reconfiguration of the DMA engine before each transfer.

Each offset must be calculated by the CPU (or taken from a pre-calculated table) before the CPU reconfigures the DMA. This consumes CPU bandwidth, since the reconfiguration is repeated before each transfer. In Figure 5, the CPU activity is presented on two timelines: the first shows the CPU activity required to process the chorus effect, and the second shows the CPU activity required to configure the DMA.

In the case of complex digital audio effects, such as reverb, the number of delayed blocks that must be retrieved from a circular buffer can reach 256 or more. In addition, these delay blocks are not at fixed intervals, and their offsets change continually as the algorithm runs. With the dramatic increase in the number of accesses to data in the circular buffer, the more complicated digital audio effect algorithms, like reverb, demand more CPU cycles. This leaves less CPU bandwidth for the actual application.

When several digital audio effects follow one after another (as shown in Figure 1), the CPU will have to assist the DMA in moving data required and produced by each processing stage. The CPU and the DMA must be synchronized during these tasks. The synchronization is facilitated by the DMA, which interrupts the CPU.

Therefore, the number of interrupts in the system will rise as system complexity increases. These interrupts come with a high overhead, since registers must be saved to preserve context. In addition, interrupts go through the processing pipeline and disrupt the delicate efficiency of the instruction cache. Preserving the context can consume a good number of cycles and further degrade the performance of the instruction cache. Excessive interruptions of the pipeline also directly affect overall performance.

The advantages of audio-enhanced DMA
The traditional DMA engines are not a perfect fit for pro audio applications. An ideal DMA engine tailored for a pro audio application must:

1) Be smart enough to offload the CPU core from having to perform data movement calculations
2) Keep the number of required transfer-related interrupts in a system low (the number of data transfer-related interrupts should not rise as system complexity increases)
3) Keep the number of DMA channels required for the pro audio application independent from the complexity of the application.

An audio-enhanced DMA engine should be designed to accelerate audio processing algorithms through architectural innovations that offload processing from the CPU core and increase the overall level of concurrent and parallel processing the DSP is able to achieve. TI introduced an audio-enhanced DMA engine in TI’s C672x family of high-performance DSPs. This architecture can perform the 1-D, 2-D and 3-D transfers of a standard DMA engine, as well as perform optimized delay-based transfers common in audio effects.

To perform circular buffer transfers efficiently, we implemented the audio-enhanced DMA engine in the C672x family as a dual data movement accelerator (dMAX) that makes use of Table-Guided FIFO Transfers.

A Table-Guided FIFO Transfer moves a number of taps (blocks of data) between the circular buffer and the linear memory. With a circular buffer read, a delay table guides the audio-enhanced DMA engine controller to retrieve only taps at specified offsets from the FIFO Read Pointer (RP). In the same way, with a circular buffer write, a delay table guides the controller to write taps at specified offsets from the FIFO Write Pointer (WP).
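The read side of this idea can be modeled in Python as a gather driven by a delay table. This is a behavioural sketch only, not the dMAX programming interface: the function name, the convention that offsets are added to the read pointer, and the choice to advance the pointer by one tap afterwards are all assumptions made for the example.

```python
def table_guided_fifo_read(fifo, read_ptr, delay_table, tap_size):
    """Gather one tap per table entry from a circular buffer into a list.

    fifo        : list standing in for the circular buffer
    read_ptr    : index that the table offsets are measured from
    delay_table : offset (in samples) of each tap relative to read_ptr
    tap_size    : samples per tap, i.e. the processing block size
    """
    size = len(fifo)
    linear = []
    for offset in delay_table:
        start = (read_ptr + offset) % size
        # The modulo handles wraparound, so no tap needs special treatment.
        linear.extend(fifo[(start + i) % size] for i in range(tap_size))
    new_read_ptr = (read_ptr + tap_size) % size  # assumed: advance one block
    return linear, new_read_ptr
```

The caller supplies the table once and gets every tap back in linear memory; a write would mirror this by scattering taps from linear memory to offsets from the write pointer.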

Adding support for table-guided FIFO transfers enables the audio-enhanced DMA engine controller to divide one circular buffer into multiple sections, where each section can correspond to either different channels or to different audio effects. In other words, support for the table-guided FIFO transfers enables the controller to use only one circular buffer per system. The block diagram in Figure 6 below shows the data flow for an audio-enhanced DMA engine implementation of the system from Figure 1.

Figure 6. Pro audio application block diagram of data flow when dMAX engine is used

The circular buffer in Figure 6 is divided into five sections, and each section is assigned to one audio effect from Figure 1. With only one circular buffer per system, all required circular buffer reads can be handled by a single FIFO read transfer parameter entry, and all required circular buffer writes can be handled by a single FIFO write transfer parameter entry. In other words, two FIFO transfer parameter entries are enough to describe all required circular buffer transfers in the system.

A maximum of 32,768 taps can be retrieved or stored from/to the circular buffer during one FIFO read/write. An increase in the application complexity does not require an increase of audio-enhanced DMA engine channels. Rather, it increases the number of taps moved between the FIFO and the linear memory, while all data transfers can still be managed by using only two channels.

For an application in which one block of data is moved to or from a circular buffer, the code listings to the right illustrate the CPU flow when a traditional DMA engine is used, and the listing below and to the left shows the alternative flow for the audio-enhanced DMA engine architecture.

Note that when the number of required delays grows, the code listing presented in the panel to the right becomes more complex. However, the dMAX listing shown below demonstrates that when the audio-enhanced DMA engine is used the CPU workload is no longer a function of the number of delays that must be accessed in the system.

With a traditional DMA, each offset must be individually computed and transferred to the DMA engine for every tap in the filter. In the alternative audio-enhanced approach, the controller requires only initialization of parameters and after that the controller is capable of automatically maintaining transfer states for all taps.

A programmer can pre-calculate several offset tables for a Table Guided FIFO Transfer. The tables are referenced by pointers, and developers can employ a ping-pong technique to switch between different tables on-the-fly. The term ping-pong refers to the use of multiple tables that the developer can move between.
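The ping-pong idea can be sketched as two tables plus an index that selects the active one. The class and method names below are hypothetical; the sketch only models the table switch, not how a real controller would be told which table to use.

```python
class PingPongTables:
    """Two pre-calculated delay tables with a pointer to the active one."""

    def __init__(self, table_a, table_b):
        self.tables = [list(table_a), list(table_b)]
        self.active = 0  # index of the table the transfer engine would read

    def current(self):
        return self.tables[self.active]

    def update_and_swap(self, new_offsets):
        """Fill the idle table, then make it the active one."""
        idle = 1 - self.active
        self.tables[idle] = list(new_offsets)
        self.active = idle  # a single pointer switch, no per-tap work
```

The active table is never modified while it is in use; new offsets are always written to the idle table first, and the switch itself is one assignment.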

As shown in the dMAX code listings below, while the Table Guided FIFO Transfers significantly reduce the interaction between the CPU and the dual data movement accelerator compared to a traditional DMA engine, the most significant savings come from a reduction in the number of interrupts the CPU must handle.

dMAX code listing
This simple and efficient scheme keeps the number of interrupts in the system low, even when system complexity increases. The reduction in the number of required interrupts follows from the fact that the dMAX requires fewer data channels for its transfers.

Real-World acceleration

In effect, the architectural enhancements of audio-enhanced DMA engines significantly offload the CPU core from having to perform data movement calculations and management, enabling the CPU to commit more cycles to what it does best: Multiply Accumulate (MAC) processing.

The benefits to an audio DSP’s performance, when paired with an audio-enhanced DMA, can be quite substantial. In one experiment involving the implementation of a Schroeder reverb algorithm, leveraging the Table-Guided FIFOs of the dual data movement accelerator dropped CPU utilization from 20 percent to 3 percent, a better than 6x improvement in performance (see Table 1, below).

Table 1: Performance of a stereo 6-tap delay line reverb using varying DMA Architectures

Another important architectural characteristic of the audio-enhanced DMA engine approach is the fact that it has dual engines, operating independently of each other. Each engine has its own master to the crossbar switch that connects all the peripherals on the device.

Such comprehensive interconnection enables it to facilitate and accelerate transfers between I/O, memory and processing resources. The availability of two engines further enhances the movement of data throughout the CPU.

For example, one engine can move data from an SDRAM connected to the External Memory Interface (EMIF) to data memory, while real-time data from a Multi-Channel Audio Serial Port (McASP) interface is stored in another section of memory without any overlap or contention between the two engines. Multi-channel audio applications will particularly benefit from the concurrent transfer operations that are enabled by the dual data movement accelerator structure.

In effect, the dual engine design of this alternative DMA architecture can effectively double the data transfer capacity of the DSP, since transfers can take place in an interleaved fashion. One dMAX unit can transfer to the port during the processing overhead of the other dMAX unit’s transfer.

By reducing CPU core involvement in data transfers, audio-enhanced DMA engines compound performance savings by not only offloading data transfer management functions from the CPU, but also by executing them in parallel to the CPU, enabling audio engineers to bring higher quality audio processing to a new level.

Zoran Nikolic is Senior Applications Engineer and Gerard Andrews is DSP Audio Marketing Manager at Texas Instruments Inc.
