Accelerating complex audio DSP algorithms with audio-enhanced DMA - Embedded.com

Audio engineers face the challenge of designing equipment that provides better audio fidelity, supports more channels of audio, and handles higher sampling rates and bit depths, while maintaining a tight, real-time processing budget.

In many professional audio applications, the primary bottleneck in system performance is the efficient movement of audio data. Over the years, various innovations have been introduced to digital signal processor (DSP) architectures that offload many I/O or data movement tasks from the DSP core, allowing it to concentrate on signal processing tasks.

The Direct Memory Access (DMA) engine is a critical component of most high-performance DSPs today. Instead of having to explicitly access memory or peripherals, the DSP can configure the DMA engine to access the on- and off-chip resources, and facilitate the transfers between them. These DMA transfers can be performed in parallel with critical DSP core processing for optimal performance.

Standard DMA engines are well-suited for traditional 1-D and 2-D algorithm processing, such as block copies and basic data sorting. However, many audio algorithms require more complex data transfers. An example is a delay line, which is made up of audio samples from a previous point in time and is used to create a desired audio effect (e.g., an echo). A traditional DMA engine does not manage a delay line efficiently, so innovations in the DMA architecture are required to process such audio algorithms efficiently.

Is there a need for DMA acceleration?
The answer to this question is yes, for two reasons. First, the number of DMA channels in many high-performance DSP engines limits professional (pro) audio applications. Second, because of the demand for high-quality audio, traditional DMA engines in pro audio applications often require more CPU involvement.

Figure 1. Pro audio application block diagram

The block diagram above describes data flow in a typical pro audio application. Each effect takes the output from the previous effect, processes the data, and forwards its output to the next effect in the data processing chain (e.g., output from the Phaser effect is input to the Delay effect, and output from the Delay effect is sent to the Reverb).

The digital audio effects pictured above rely on delay lines for their implementation. A complete system of effects therefore requires multiple delay lines. Varying the length of the delays used in the design changes the quality of the audio effect.

A delay line is a linear time-invariant system whose output signal is a copy of the input signal delayed by x samples. The most effective way of implementing a delay line on a DSP is to use a circular buffer. A circular buffer is stored in a dedicated section of linear memory; when the buffer is filled, new data is written starting at the beginning of the buffer.

Circular buffer data is written by one process and read by another process, which requires separate read and write pointers. The read and write pointers are not allowed to cross each other, so that unread data cannot be overwritten by new data. The size of the circular buffer is dictated by the largest delay required by the effect. In this article, the terms First In First Out (FIFO) and circular buffer are used interchangeably.
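The delay line described above can be sketched in a few lines. This is a minimal behavioral model in Python, not DSP code: the class name `DelayLine` and its methods are hypothetical and chosen only for illustration. It keeps a single write pointer and derives each read position from it, which is one common way to avoid the pointers crossing.

```python
class DelayLine:
    """Circular buffer holding the most recent `size` samples (illustrative model)."""

    def __init__(self, size):
        self.buf = [0.0] * size
        self.size = size
        self.wp = 0  # write pointer: index of the next slot to be written

    def write(self, sample):
        self.buf[self.wp] = sample
        # When the end of the linear memory is reached, wrap to the start.
        self.wp = (self.wp + 1) % self.size

    def read(self, delay):
        """Return the sample stored `delay` writes ago (1 <= delay <= size)."""
        if not 1 <= delay <= self.size:
            raise ValueError("delay must fit inside the buffer")
        return self.buf[(self.wp - delay) % self.size]


line = DelayLine(4)
for s in [1.0, 2.0, 3.0, 4.0, 5.0]:
    line.write(s)        # the fifth write wraps and overwrites the oldest sample
print(line.read(1))      # most recent sample
print(line.read(4))      # oldest sample still held
```

The buffer size bounds the largest delay that can be read, which matches the sizing rule stated above.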

When a traditional DMA engine is used to move data in a delay-based audio effect, a separate circular buffer is assigned to each effect in the signal processing chain. The input data fed to a particular audio effect is stored in the circular buffer assigned to that effect. A more detailed data flow is shown in the block diagram in Figure 2 below, where circular buffers are represented by rings. The ring representation is used because it shows the wrapping of the linear address space assigned to a circular buffer. As a pointer advances through a circular buffer, its address increases until the wraparound condition is hit, causing the pointer to reset to the lowest memory address, the starting point of the circular buffer.

Figure 2. Pro audio application block diagram of data flow when a traditional DMA engine is used

To produce different delays, the DMA engine must retrieve delayed data from different locations within the delay line. If block processing is used, a block of data is retrieved instead of just one sample.

Traditional DMA engines usually allow programmers to specify several parameters that entirely describe the desired transfer. Typically, these parameters are the source address, the destination address, the indexes for the source and destination, and the transfer count. Each DMA transfer occupies one channel of a typical DMA engine.
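To make these parameters concrete, here is a small Python model of a single transfer. It is a sketch under assumptions: the `Transfer` fields mirror the parameters named above, but the names, the flat `memory` list, and `run_transfer` are hypothetical and do not correspond to any particular DMA controller's register set.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    src: int        # source start address (an index into `memory`)
    dst: int        # destination start address
    src_index: int  # step applied to the source address after each element
    dst_index: int  # step applied to the destination address after each element
    count: int      # number of elements to move


def run_transfer(memory, t):
    """Move `t.count` elements, stepping both addresses by their index values."""
    for i in range(t.count):
        memory[t.dst + i * t.dst_index] = memory[t.src + i * t.src_index]


memory = list(range(16)) + [0] * 4
# Fixed-interval read: every fourth source sample, packed contiguously.
run_transfer(memory, Transfer(src=0, dst=16, src_index=4, dst_index=1, count=4))
print(memory[16:])
```

A descriptor like this covers contiguous and fixed-interval accesses, but it has no notion of a buffer boundary, so it cannot wrap on its own.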

In the block diagram above, there are five circular buffers. A traditional DMA engine must be programmed to move data into and out of each of these buffers. In the application shown above, a minimum of 11 DMA transfers is required to process one block of data.

This is the absolute minimum number of DMA transfers required, assuming that only one delay per effect is retrieved from each circular buffer. In a typical application, the number of DMA transfers per data block would be much higher. For example, a reverb effect implementation always requires more than one delay from its circular buffer.

The number of required traditional DMA transfers increases as the number of implemented audio effects increases. Therefore, the maximum number of traditional DMA channels available in a system can limit the number of audio effects that can be implemented.

Limits of traditional DMA in pro audio applications
Standard DMA engines perform quite well when moving long blocks of data that are either contiguous or spaced at fixed intervals. An example of a fixed-interval transfer is when the DMA engine accesses every fourth data sample of the delay line.

Typical DMA performance is not optimal when accesses are neither contiguous nor at fixed intervals. When a traditional DMA engine moves circular buffer data to generate digital audio effects, the CPU must intervene to program the DMA parameters at least twice while processing one data block. The CPU needs to program the DMA parameters, and it must also intervene in managing delay lines when data accesses wrap around a circular buffer boundary.
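The wraparound case is where the extra CPU work comes from: a block that straddles the end of the buffer is not one linear run, so it has to be described as two separate linear transfers. The helper below is a hypothetical Python sketch of that bookkeeping (the function name and the `(offset, count)` tuple format are assumptions made for illustration).

```python
def linear_segments(start, length, size):
    """Split a circular-buffer access into one or two linear (offset, count) runs.

    `start` is the first sample position, `length` the block length and
    `size` the circular buffer size, all in samples.
    """
    if not 0 < length <= size:
        raise ValueError("block must be non-empty and fit in the buffer")
    start %= size
    first = min(length, size - start)      # samples available before the boundary
    segments = [(start, first)]
    if first < length:                     # the block wraps: continue from offset 0
        segments.append((0, length - first))
    return segments


print(linear_segments(2, 4, 8))   # fits without wrapping: one linear transfer
print(linear_segments(6, 4, 8))   # crosses the boundary: two linear transfers
```

With a traditional engine, each returned segment corresponds to one set of DMA parameters the CPU has to program.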

Figure 3. Chorus block diagram

A simple algorithm example illustrating this point is the chorus effect, shown in Figure 3 above. The chorus effect is often used to alter the sound of an instrument to make it sound as if multiple instruments are playing. If the instrument were a human voice, this effect would tend to make the single voice sound like a choir. We perceive multiple voices or instruments because there is always imprecise synchronization and slight pitch variation when multiple voices or instruments are playing at the same time. These are the principal characteristics of a chorus effect.

In Figure 3, chorus is presented as a combination of the input with two of its delayed copies. The pitch deviation is modeled by a slowly varying amount of delay in the delayed input copies. The amount of the deviation and its frequency are controlled by a low frequency oscillator (LFO).
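The structure just described (input plus two delayed copies, each delay slowly modulated by an LFO) can be sketched as follows. This is an illustrative sample-by-sample Python model, not the block-based implementation the article goes on to discuss. The default delays, modulation depth, LFO rate, gain, the quarter-cycle phase offset between the two taps, and the rounding of delays to whole samples are all assumptions made for the sketch.

```python
import math


def chorus(x, fs, base_delays=(20, 30), depth=5.0, lfo_hz=0.5, gain=0.5):
    """Mix the input with two delayed copies whose delays follow a sine LFO.

    Delays are in samples and rounded to integers (no interpolation).
    """
    if min(base_delays) < depth:
        raise ValueError("base delays must be at least the modulation depth")
    size = int(max(base_delays) + depth) + 2   # buffer sized by the largest delay
    buf = [0.0] * size
    wp = 0
    out = []
    for n, sample in enumerate(x):
        buf[wp] = sample                       # store the input for future reference
        acc = sample
        for k, base in enumerate(base_delays):
            lfo = math.sin(2.0 * math.pi * lfo_hz * n / fs + k * math.pi / 2.0)
            delay = int(round(base + depth * lfo))
            acc += gain * buf[(wp - delay) % size]
        out.append(acc)
        wp = (wp + 1) % size
    return out


# With the modulation switched off, an impulse yields the input plus two echoes.
impulse = [1.0] + [0.0] * 9
print(chorus(impulse, fs=48000, base_delays=(2, 3), depth=0.0))
```

Because the two read positions move independently of each other and of the write position, they are neither contiguous nor at fixed intervals, which is the access pattern a traditional DMA engine handles poorly.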

As shown in the chorus implementation diagram in Figure 4, below, the delay line is implemented by using a circular buffer (represented by two concentric circles). The chorus implementation presented in Figure 4 uses block processing. The block size in this chorus example is four samples. Incoming samples are stored to the circular buffer in the clockwise direction.

Figure 4. Block diagram of chorus implementation

Block processing handles a block (multiple samples) of data at the same time, instead of only one sample at a time. In this example, the CPU waits until four input samples are available, and then calculates four output samples. It processes these samples by combining a block of input samples with two blocks of delayed data fetched from the circular buffer.

When a traditional DMA controller is used (Figure 5, below), the CPU is notified by an interrupt every time a block of input data is ready. The CPU then calculates the chorus output.

Figure 5. Chorus implementation timeline when traditional DMA is employed

The DMA engine in this example must perform two key operations:

1) Store a block of input samples to the circular buffer (for future reference)
2) Retrieve two blocks of delayed data from the circular buffer (prepare delayed data for the next block of input samples).

In this case, the CPU must assist the DMA by tracking and programming the source and destination addresses, and by intervening when data accesses wrap around the buffer boundaries. This requires a reconfiguration of the DMA engine before each transfer.

Each offset must be calculated by the CPU (or taken from a pre-calculated table) before the CPU reconfigures the DMA. This consumes CPU bandwidth, since the CPU must reconfigure the DMA engine before each transfer. In Figure 5, the CPU activity is presented on two timelines: the first shows the CPU activity required to process the chorus effect, and the second shows the CPU activity required to configure the DMA.

In the case of complex digital audio effects, such as reverb, the number of delayed blocks that must be retrieved from a circular buffer can reach 256 or more. In addition, these delay blocks are not at fixed intervals, and the offsets change continually as the algorithm runs. With this dramatic increase in the number of accesses to data in the circular buffer, the more complicated digital audio effect algorithms, like reverb, demand more CPU cycles. This leaves less CPU bandwidth for the actual application.

When several digital audio effects follow one after another (as shown in Figure 1), the CPU has to assist the DMA in moving the data required and produced by each processing stage. The CPU and the DMA must be synchronized during these tasks. The synchronization is facilitated by the DMA, which interrupts the CPU.

Therefore, the number of interrupts in the system rises as system complexity increases. These interrupts come with a high overhead, since registers must be saved to preserve context. In addition, interrupts go through the processing pipeline and disrupt the delicate efficiency of the instruction cache. Preserving the context can consume a good number of cycles, as well as further degrade the performance of the instruction cache. Excessive interruptions of the pipeline also directly affect overall performance.

The advantages of audio-enhanced DMA
Traditional DMA engines are not a perfect fit for pro audio applications. An ideal DMA engine tailored for a pro audio application must:

1) Be smart enough to relieve the CPU core of data movement calculations
2) Keep the number of required transfer-related interrupts in a system low (the number of data transfer-related interrupts should not rise as system complexity increases)
3) Keep the number of DMA channels required for the pro audio application independent of the complexity of the application.

An audio-enhanced DMA engine should be designed to accelerate audio processing algorithms through architectural innovations that offload processing from the CPU core and increase the overall level of concurrent and parallel processing the DSP is able to achieve. TI introduced an audio-enhanced DMA engine in its C672x family of high-performance DSPs. This architecture can perform the 1-D, 2-D and 3-D transfers of a standard DMA engine, as well as perform optimized delay-based transfers common in audio effects.

To perform circular buffer transfers efficiently, we implemented the audio-enhanced DMA engine as a dual data movement accelerator (dMAX), used in the C672x family of high-performance DSPs, that makes use of Table-Guided FIFO Transfers.

A Table-Guided FIFO Transfer moves a number of taps (blocks of data) between the circular buffer and the linear memory. With a circular buffer read, a delay table guides the audio-enhanced DMA engine controller to retrieve only taps at specified offsets from the FIFO Read Pointer (RP). In the same way, with a circular buffer write, a delay table guides the controller to write taps at specified offsets from the FIFO Write Pointer (WP).
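The behavior described above can be modeled in software as a gather (read) and a scatter (write) driven by a table of offsets. The Python sketch below is only a behavioral illustration: the function names, the sign convention of the offsets (added to the pointer), and the fact that the pointers are not advanced are assumptions, and it does not reproduce the dMAX parameter format.

```python
def fifo_read(fifo, rp, delay_table, tap_size):
    """Gather one tap of `tap_size` samples per table entry into linear memory.

    Each tap starts at `rp + offset`, wrapping around the circular buffer.
    """
    size = len(fifo)
    linear = []
    for offset in delay_table:
        start = (rp + offset) % size
        linear.extend(fifo[(start + i) % size] for i in range(tap_size))
    return linear


def fifo_write(fifo, wp, delay_table, tap_size, linear):
    """Scatter consecutive taps from linear memory to `wp + offset` positions."""
    size = len(fifo)
    for t, offset in enumerate(delay_table):
        start = (wp + offset) % size
        for i in range(tap_size):
            fifo[(start + i) % size] = linear[t * tap_size + i]


fifo = list(range(8))
# Two taps of two samples each; the second table entry makes the tap wrap.
print(fifo_read(fifo, rp=6, delay_table=[0, 1], tap_size=2))
```

The point of the model is that the whole set of taps, including any that wrap, is described once by the table rather than by one descriptor per tap.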

Adding support for Table-Guided FIFO Transfers enables the audio-enhanced DMA engine controller to divide one circular buffer into multiple sections, where each section can correspond either to different channels or to different audio effects. In other words, support for Table-Guided FIFO Transfers enables the controller to use only one circular buffer per system. The block diagram in Figure 6, below, shows the data flow for an audio-enhanced DMA engine implementation of the system from Figure 1.

Figure 6. Pro audio application block diagram of data flow when the dMAX engine is used

The circular buffer in Figure 6 is divided into five sections, and each section is assigned to one audio effect from Figure 1. With only one circular buffer per system, all required circular buffer reads can be handled by a single FIFO read transfer parameter entry. Likewise, all required circular buffer writes can be handled by a single FIFO write transfer parameter entry. In other words, two FIFO transfer parameter entries are enough to describe all required circular buffer transfers in the system.

A maximum of 32,768 taps can be retrieved from or stored to the circular buffer during one FIFO read or write. An increase in application complexity does not require an increase in the number of audio-enhanced DMA engine channels. Rather, it increases the number of taps moved between the FIFO and the linear memory, while all data transfers can still be managed by using only two channels.

For an application where one block of data is moved to or from a circular buffer, the traditional DMA code listing illustrates the CPU flow when a traditional DMA engine is used, and the dMAX code listing shows the alternative flow for the audio-enhanced DMA engine architecture.

Note that when the number of required delays grows, the traditional DMA code listing becomes more complex. However, the dMAX listing demonstrates that when the audio-enhanced DMA engine is used, the CPU workload is no longer a function of the number of delays that must be accessed in the system.

With a traditional DMA, each offset must be individually computed and transferred to the DMA engine for every tap in the filter. In the alternative audio-enhanced approach, the controller requires only initialization of parameters, and after that it is capable of automatically maintaining transfer states for all taps.
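A rough way to see the difference is to count how many times the CPU has to program the DMA for one block. The function below is a hypothetical Python cost model for the traditional case only: it assumes one programming step per linear run, so a tap that wraps around the buffer boundary costs two. It is not a measurement and says nothing about cycle counts.

```python
def traditional_setups_per_block(read_ptr, offsets, block, size):
    """Count DMA programming steps the CPU performs for one block of taps.

    One step per tap, plus one more for any tap whose `block` samples
    cross the end of a circular buffer of `size` samples.
    """
    setups = 0
    for off in offsets:
        start = (read_ptr + off) % size
        setups += 1 if start + block <= size else 2
    return setups


# Illustrative numbers only: 6 taps of 4 samples in a 64-sample buffer.
offsets = [0, 9, 17, 30, 45, 58]
print(traditional_setups_per_block(read_ptr=4, offsets=offsets, block=4, size=64))
```

In this model the count grows with the length of `offsets`. In the table-guided approach described above, the same list is handed to the controller once at initialization, so the per-block CPU work does not depend on the number of taps.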

A programmer can pre-calculate several offset tables for a Table-Guided FIFO Transfer. The tables are referenced by pointers, and developers can employ a ping-pong technique to switch between different tables on the fly. The term ping-pong refers to the use of multiple tables that the developer can alternate between.
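The ping-pong idea can be sketched as two tables and an index that selects the active one. In this hypothetical Python model (the class and method names are invented for illustration), the inactive table is rewritten while the active one is in use, and the switch is a single index flip, standing in for the pointer update mentioned above.

```python
class PingPongTables:
    """Two offset tables; one is active while the other is being prepared."""

    def __init__(self, ping, pong):
        self.tables = [list(ping), list(pong)]
        self.active = 0  # index of the table the transfer currently uses

    def active_table(self):
        return self.tables[self.active]

    def update_inactive(self, new_offsets):
        # Safe to rewrite: the transfer is not reading this table.
        self.tables[1 - self.active] = list(new_offsets)

    def swap(self):
        # Stand-in for redirecting the table pointer to the other table.
        self.active = 1 - self.active


tables = PingPongTables(ping=[10, 20], pong=[11, 21])
tables.update_inactive([12, 22])   # prepare the next set of delays
tables.swap()                      # switch tables between blocks
print(tables.active_table())
```

This keeps a table from being modified while it is guiding a transfer, which is the usual reason for double-buffering parameters of this kind.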

As shown in the dMAX code listing below, Table-Guided FIFO Transfers significantly reduce the interaction between the CPU and the dual data movement accelerator compared to a traditional DMA engine, but the most significant savings come from a reduction in the number of interrupts the CPU must handle.

dMAX code listing
This simple and efficient scheme keeps the number of interrupts in the system low, even when system complexity increases. The reduction in the number of required interrupts follows from the fact that the dMAX requires fewer data channels for its transfers.

Real-World acceleration

In effect, the architectural enhancements of audio-enhanced DMA engines significantly relieve the CPU core of data movement calculations and management, enabling the CPU to commit more cycles to what it does best: Multiply Accumulate (MAC) processing.

The benefits to an audio DSP’s performance, when paired with an audio-enhanced DMA, can be quite substantial. In one experiment involving the implementation of a Schroeder reverb algorithm, leveraging the Table-Guided FIFOs of the dual data movement accelerator reduced CPU utilization from 20 percent to 3 percent, a better than 6x improvement (see Table 1, below).

Table 1: Performance of a stereo 6-tap delay line reverb using varying DMA architectures

Another important architectural characteristic of the audio-enhanced DMA engine approach is that it has dual engines, operating independently of each other. Each engine has its own master port to the crossbar switch that connects all the peripherals on the device.

Such comprehensive interconnection enables it to facilitate and accelerate transfers between I/O, memory and processing resources. The availability of two engines further enhances the movement of data throughout the CPU.

For example, one engine can move data from an SDRAM connected to the External Memory Interface (EMIF) to data memory, while real-time data from a Multi-channel Audio Serial Port (McASP) interface is stored in another section of memory without any overlap or contention between the two engines. Multi-channel audio applications will particularly benefit from the concurrent transfer operations that are enabled by the dual data movement accelerator structure.

In effect, the dual engine design of this alternative DMA architecture can effectively double the data transfer capacity of the DSP, since transfers can take place in an interleaved fashion. One dMAX unit can transfer to the port during the processing overhead of the other dMAX unit’s transfer.

By reducing CPU core involvement in data transfers, audio-enhanced DMA engines compound performance savings: they not only offload data transfer management functions from the CPU but also execute them in parallel with the CPU, enabling audio engineers to bring audio processing quality to a new level.

Zoran Nikolic is Senior Applications Engineer and Gerard Andrews is DSP Audio Marketing Manager at Texas Instruments Inc.
