CMP EMBEDDED.COM


Integrating and evaluating speech algorithms



Embedded Systems Design
Maintaining the proper level of performance is the key to integrating speech algorithms.

Designing and developing an embedded system from scratch and making it stable is always a challenge. Integrating and evaluating a digital signal processing (DSP) algorithm on that system is equally tricky and can bring even the strongest programmers to their knees. Today, countless algorithms are embedded in various electronic systems. How does an embedded system developer know which algorithm to use for speech processing, such as that found in basic telephony systems?

The audio frequency spectrum that stretches to 40 kHz is divided into two bands. The speech components occupy the lower part of the spectrum, from 5 Hz to 7 kHz, with other audio components residing in the remaining higher portion, as shown in Figure 1.


Speech processing mainly involves compression-decompression, recognition, conditioning, and enhancement algorithms. Signal-processing algorithms are very dependent on system resources, such as available memory and clock capacity. As these resources add cost to systems, they're often restricted by the product vendor to keep the product cost low. Basic traits, such as memory and clock consumption, are inherent parts of an algorithm's complexity. The lower the complexity, the better the algorithm, provided it does its job efficiently.

Measuring an algorithm's complexity is the first step in evaluating it. The clocks required to run the algorithm on a specific processor determine the processing load, which is architecture dependent and varies from processor to processor. The algorithm's memory requirements, by contrast, remain essentially the same. Most DSP algorithms work on a collection of samples, better known as a frame. Collecting the samples to form a frame introduces an unavoidable delay, which is followed by the actual processing delay. The International Telecommunication Union (ITU) standardizes the acceptable delay for each algorithm.

The algorithm's processing load is typically expressed in "millions of clocks per second," or MCPS. To better understand MCPS, assume an algorithm that processes a 64-sample frame of data sampled at 8 kHz and requires 300,000 clocks to process each frame. The time required to collect the frame is 64/8,000 s, or 8 ms, so 125 frames must be processed per second. To process all the frames, the algorithm consumes at least 300,000 × 125 = 37,500,000 clocks from the core per second, or 37.5 MCPS.

Put another way, MCPS equals the clocks required to process one frame, multiplied by the sampling frequency and divided by the frame size, all divided by one million:

MCPS = (clocks per frame × sampling frequency / frame size) / 1,000,000
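The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation; the function and variable names are illustrative and don't come from any particular toolchain.

```python
def frame_time_ms(frame_size, sample_rate_hz):
    """Time needed to collect one frame, in milliseconds."""
    return 1000 * frame_size / sample_rate_hz


def mcps(clocks_per_frame, sample_rate_hz, frame_size):
    """Processing load in millions of clocks per second."""
    frames_per_second = sample_rate_hz / frame_size
    return clocks_per_frame * frames_per_second / 1_000_000


# The worked example from the text: 64-sample frames at 8 kHz,
# 300,000 clocks per frame.
collect_ms = frame_time_ms(64, 8_000)
load = mcps(300_000, 8_000, 64)
print(f"frame collection time: {collect_ms} ms, load: {load} MCPS")
```

For the numbers in the text, this gives 8 ms per frame and 37.5 MCPS.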

A second term that's often used to define an algorithm's processing load is MIPS, or millions of instructions per second. The calculation of MIPS for an algorithm can also be a little tricky. If the processor effectively executes one instruction per cycle, the MCPS and MIPS ratings for that processor are the same. On the other hand, if the processor architecture takes more than one cycle to execute an instruction, there's a ratio between MCPS and MIPS. For example, an ARM7TDMI processor effectively requires about 1.1 cycles per instruction.
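As a rough illustration of that ratio, the conversion can be sketched as a single division. The helper name is hypothetical, and the cycles-per-instruction value is simply the figure quoted above; real values depend on the instruction mix and the memory system.

```python
def mips_from_mcps(mcps, cycles_per_instruction):
    """Convert a clock-based load into an instruction-based load."""
    return mcps / cycles_per_instruction


# A 37.5-MCPS algorithm on a core that averages 1.1 cycles per instruction.
mips = mips_from_mcps(37.5, 1.1)
print(f"{mips:.1f} MIPS")

# On a core that effectively executes one instruction per cycle,
# the two ratings coincide.
same = mips_from_mcps(37.5, 1.0)
```

With these inputs the load works out to roughly 34 MIPS.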

Before practicing integration
The right time to start integrating and evaluating any speech algorithm on an embedded system is when the system has reached a predictable, stable stage. "Stable" means that the audio front-end's interrupt structure is consistent. In other words, not even one byte of data is lost when a decent amplitude level is maintained. It's wise to have all the statistics on system memory and clock usage available. Integrating algorithms on a properly functioning existing system is comparatively easy. If the system is under development, be sure to test the audio front-end thoroughly before trying to integrate or evaluate any algorithms on it. Also, verify that no interrupts conflict with each other within the system. If any problems exist in the system, algorithm debugging can be an unpleasant experience.

In a system that will incorporate audio/speech algorithms, robust audio firmware is a must. It must provide accurate data for the algorithms to perform efficiently. One simple mistake engineers often make is to interrupt the core on each sample arrival. If the algorithm operates only on a frame of some fixed number of samples, the other interrupts are simply redundant. Direct memory access (DMA) controllers and internal FIFOs can be configured to interrupt the core only after a complete frame is collected.
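The difference between per-sample and per-frame notification can be modeled in a few lines. The sketch below is a software stand-in for the DMA/FIFO behavior, not driver code; the class and callback names are hypothetical.

```python
class FrameCollector:
    """Software model of a DMA/FIFO that signals once per complete frame."""

    def __init__(self, frame_size, on_frame):
        self.frame_size = frame_size
        self.on_frame = on_frame  # plays the role of the frame-complete interrupt
        self.buffer = []

    def push_sample(self, sample):
        # On real hardware the DMA engine does this without involving the core.
        self.buffer.append(sample)
        if len(self.buffer) == self.frame_size:
            frame, self.buffer = self.buffer, []
            self.on_frame(frame)


frames = []
collector = FrameCollector(frame_size=64, on_frame=frames.append)

# One second of audio at 8 kHz: 8,000 samples arrive...
for sample in range(8_000):
    collector.push_sample(sample)

# ...but the "core" is notified only once per frame.
print(f"samples: 8000, frame notifications: {len(frames)}")
```

In this model the core is notified 125 times per second rather than 8,000 times.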

Example algorithms
When developing any telecom system, engineers often start testing for voice quality with the typical pulse-code modulation (PCM) codec, known as the G.711 standard. This narrow-band codec restricts the sample amplitude to 8-bit precision and produces a 64-kbit/s throughput. The encoder and decoder can work on each data sample individually. It's a lightweight algorithm with trivial complexity and almost no processing delay, which gives engineers the option to play with the codec, verify the system, and most importantly, thoroughly validate the audio front-end design. Engineers can check the signal levels, adjust the hardware codec gains, synchronize the near- and far-end interrupts, verify the DMA function, and conduct any other experiment successfully using this basic telephony standard. During this process, don't be surprised to find that the compressed data received from the other end is bit-reversed. A piece of bit-reversing code takes care of that problem.
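A bit-reversing routine of the kind mentioned above might look like the following sketch. It assumes the reversal applies within each 8-bit codeword; the function names are illustrative.

```python
def reverse_bits8(byte):
    """Reverse the bit order of one 8-bit value (bit 0 becomes bit 7)."""
    result = 0
    for _ in range(8):
        result = (result << 1) | (byte & 1)
        byte >>= 1
    return result


# Precompute all 256 results so a whole payload can be fixed in one pass.
REVERSE_TABLE = bytes(reverse_bits8(b) for b in range(256))


def reverse_payload(data):
    """Bit-reverse every byte of a block of compressed samples."""
    return bytes(data).translate(REVERSE_TABLE)


payload = bytes([0x01, 0xD5, 0xFF])
fixed = reverse_payload(payload)
print(fixed.hex())
```

On an embedded target the same lookup table is usually the cheapest option, since it costs one memory read per byte.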

Any wideband speech codec is an example of a speech algorithm that's a little heavy in terms of memory and clock consumption. One example is the sub-band ADPCM (adaptive differential pulse-code modulation) algorithm, standardized as G.722. It operates on data sampled at 16 kHz and thus covers the entire speech spectrum. It retains the unvoiced frequency components (those that exist between 4 and 7 kHz) and provides high-quality natural speech. Before any codec is integrated in the system, it's highly recommended that you test it carefully. Although G.711 encoding and decoding can be tested on a sample-by-sample basis, codecs that involve filters and other frequency-domain algorithms are tested differently, with a stream of at least a few thousand samples. Codec verification engages the engineer in unit testing with ITU vectors, signal-level testing, and interoperability testing with other available codecs. Interoperability issues, such as how the encoded data bytes are arranged into 16-bit words before transmission and mismatches in signal levels, should not be new to system integration engineers.
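The byte-arrangement issue is easy to reproduce. The sketch below packs encoded bytes into 16-bit words in either order; which order a given peer expects is an assumption to be confirmed during interoperability testing, and the function name is hypothetical.

```python
import struct


def pack_words(encoded, high_byte_first=True):
    """Arrange encoded bytes into 16-bit words for transmission."""
    if len(encoded) % 2:
        raise ValueError("need an even number of encoded bytes")
    count = len(encoded) // 2
    byte_order = ">" if high_byte_first else "<"
    return list(struct.unpack(f"{byte_order}{count}H", encoded))


encoded = bytes([0x12, 0x34, 0x56, 0x78])
words_a = pack_words(encoded, high_byte_first=True)
words_b = pack_words(encoded, high_byte_first=False)
print([hex(w) for w in words_a], [hex(w) for w in words_b])
```

If the two ends disagree on this ordering, every pair of codewords is swapped and the decoded speech is badly distorted, even though each side passes its own unit tests.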

The codecs discussed so far aren't necessarily the only algorithms system designers will integrate; others demand more memory and clock cycles from the system. Examples of such processor-intensive algorithms include echo cancellers, noise suppressors, and Viterbi algorithms. Evaluating their performance isn't as easy as evaluating speech codecs.

Generally, any telecom system that involves a hands-free or speaker mode employs an acoustic echo canceller to prevent the secondary party from hearing his own voice as an echo. If the system is used in a noisy environment, a noise-control algorithm is also needed. The echo canceller-noise reducer (ECNR) demands a lot of memory and clock cycles from the system.

Time- and frequency-domain techniques exist to alleviate the problem of acoustic echo, as shown in Table 1.


The frequency-domain techniques have proven to be more efficient, with lower computational cost. Such a technique uses an adaptive FIR filter that updates its coefficients only when it finds that the residual echo error is larger than a threshold. The error is obtained by subtracting the estimated echo from the input signal. The signals from the secondary party serve as the reference from which these algorithms estimate the echo, so providing the proper reference is essential to good echo estimation and cancellation.
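To make the idea of a threshold-gated adaptive FIR filter concrete, here is a minimal sketch. Note the substitution: it is a simple time-domain normalized LMS (NLMS) filter, chosen for brevity, not the frequency-domain technique described above. The step size, threshold, tap count, and synthetic echo path are all assumptions made for illustration.

```python
import random


def nlms_echo_canceller(reference, mic, taps=8, step=0.5, threshold=1e-4):
    """Estimate the echo from the reference and subtract it from the mic signal."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference sample first
    errors = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate  # input minus estimated echo
        if abs(error) > threshold:
            # Adapt only while the residual echo is above the threshold.
            power = sum(h * h for h in history) + 1e-9
            gain = step * error / power
            weights = [w + gain * h for w, h in zip(weights, history)]
        errors.append(error)
    return errors, weights


random.seed(0)
far_end = [random.uniform(-1, 1) for _ in range(2000)]  # reference signal
echo_path = [0.5, 0.3, -0.2, 0.1]                       # made-up room response
mic = [
    sum(c * far_end[n - k] for k, c in enumerate(echo_path) if n - k >= 0)
    for n in range(len(far_end))
]

errors, weights = nlms_echo_canceller(far_end, mic)
start = sum(e * e for e in errors[:200])
end = sum(e * e for e in errors[-200:])
print(f"residual energy, first 200 samples: {start:.4f}, last 200: {end:.2e}")
```

The synthetic test has no near-end speech or noise, so it converges far more cleanly than a real room would; it only shows the structure of reference, estimate, error, and gated update.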

Another factor, echo tail length, is the echo reverberation time measured in milliseconds. In simple terms, it's the time taken in echo formation. This factor depends on the dimensions of the environment. Although detailed filter design can be a complex topic, choosing the filter length isn't too complex (see Table 2):

Filter length = echo tail length × sampling frequency
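As a quick worked example of this formula, the sketch below computes the number of filter taps for an illustrative 50-ms tail at two sampling rates. The function name is hypothetical; the tail length is given in milliseconds, hence the division by 1,000.

```python
def filter_length(tail_ms, sample_rate_hz):
    """Number of FIR taps needed to span the echo tail."""
    return tail_ms * sample_rate_hz // 1000


narrowband_taps = filter_length(50, 8_000)
wideband_taps = filter_length(50, 16_000)
print(narrowband_taps, wideband_taps)
```

Doubling the sampling rate doubles the tap count for the same tail, which in turn raises both the memory and the clock requirements of the filter.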


One basic requisite for any echo-cancellation (EC) implementation is to support data sampled at rates up to at least 16 kHz, to ensure that wideband speech is covered. Integrating EC with wideband speech codecs requires more attention. Because the tail length a fixed-length filter covers depends on the sampling frequency, an EC that handles tails up to 72 ms with data sampled at 8 kHz will cancel echo over only half that span when applied to data sampled at 16 kHz. Also, compared with 8 kHz, collecting a frame takes only half the time. Hence, engineers find integrating a half-effective EC with wideband codecs a doubly challenging job.

Noise-reduction techniques have also been used for many years. The approach is chosen, implemented, and applied according to the application. For example, one technique treats noise as more stationary than human speech. The algorithm models the noise and then subtracts it from the input signal. The noise-reduction level is measured in decibels (dB). A reduction of 10 to 30 dB is decent for many applications.
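The halving of echo coverage at 16 kHz noted above follows directly from the filter-length relation. The sketch below assumes a filter with a fixed number of taps, sized for 72 ms at 8 kHz; the function name is illustrative.

```python
def covered_tail_ms(taps, sample_rate_hz):
    """Echo tail, in milliseconds, that a fixed-length filter can span."""
    return 1000 * taps / sample_rate_hz


taps = 72 * 8_000 // 1000  # filter sized for a 72-ms tail at 8 kHz
at_8k = covered_tail_ms(taps, 8_000)
at_16k = covered_tail_ms(taps, 16_000)
print(f"{taps} taps cover {at_8k} ms at 8 kHz but only {at_16k} ms at 16 kHz")
```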

The EC tail-length requirement for the application discussed is around 50 ms, and the NR level required may vary from 12 to 25 dB, depending on the noise attributes and expected output voice quality. Generally, high levels of noise reduction carry a risk of degrading speech quality. So, dynamically select a level that gives a reasonable amount of noise reduction, yet still maintains adequate voice quality. The ECNR combination for this application may require 15 to 20 kbytes of memory from the system. The processing of each 64-sample frame can consume anywhere from 150,000 to 300,000 clocks, depending on the processor.
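To put those clock counts in perspective, the sketch below converts them into MCPS and into a share of a core's clock budget. The sampling rates are assumptions (the text doesn't fix one for this application), and the 100-MHz core clock is a purely hypothetical figure.

```python
def mcps(clocks_per_frame, sample_rate_hz, frame_size=64):
    """Processing load in millions of clocks per second."""
    return clocks_per_frame * (sample_rate_hz / frame_size) / 1_000_000


CORE_CLOCK_MHZ = 100  # hypothetical core clock, for illustration only

loads = {}
for rate in (8_000, 16_000):
    low, high = mcps(150_000, rate), mcps(300_000, rate)
    loads[rate] = (low, high)
    share = 100 * high / CORE_CLOCK_MHZ
    print(f"{rate} Hz: {low} to {high} MCPS (up to {share:.1f}% of the core)")
```

The same per-frame cost doubles in MCPS terms at 16 kHz because twice as many frames arrive each second.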

Evaluating the performance of the ECNR combination can be tricky. It's common to tune the hardware codec gains, correct the placement of the microphone and speaker, find synchronization between the far- and near-end speech and interrupts, find the audio hardware with linear attributes, and conduct trials with various EC tail lengths and NR levels to achieve the best possible echo cancellation and noise reduction.

When evaluating the complexity of any algorithm, it's important for beginners to consider the worst cases. The execution time of an algorithm may vary from frame to frame. This data dependency comes from the fact that a processor might take more time to multiply two samples of higher amplitude than to multiply two samples of lower amplitude.

One specific way adaptive algorithms can mislead you is that cycle consumption is lower when the filter coefficients aren't being updated. Adapting the filter can take several thousand cycles, which must be included when analyzing MCPS measurements. Don't rely on a single run of the algorithm; try various test vectors to find the most accurate MCPS and performance measurements.
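One way to avoid being misled is to record the cycle count of every frame across several test vectors and report the worst case alongside the average. The sketch below assumes the per-frame counts have already been captured (for example, from a hardware cycle counter); the numbers are made up for illustration.

```python
def mcps_stats(cycles_per_frame, sample_rate_hz, frame_size):
    """Average and worst-case load from per-frame cycle counts."""
    frames_per_second = sample_rate_hz / frame_size
    average = sum(cycles_per_frame) / len(cycles_per_frame)
    worst = max(cycles_per_frame)
    return {
        "average_mcps": average * frames_per_second / 1_000_000,
        "worst_mcps": worst * frames_per_second / 1_000_000,
    }


# Hypothetical measurements: most frames skip adaptation, one does not.
measured = [150_000, 152_000, 300_000, 151_000]
stats = mcps_stats(measured, sample_rate_hz=8_000, frame_size=64)
print(stats)
```

Budgeting against the worst-case figure rather than the average is what keeps the system from dropping frames when the filter adapts.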

Collecting the bits and pieces
The algorithms discussed are enough to build a basic telephony system. When the system has more than one enhancement algorithm, the order in which the algorithms are called makes a difference. A few speech algorithms, such as noise reducers, may introduce nonlinearity into their output, and this may hamper the performance of other algorithms. Such an algorithm must be placed as the last module in the speech-enhancement chain.
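The ordering rule can be captured in a small sketch. The stage descriptions and the "nonlinear" flag are hypothetical bookkeeping, not part of any real framework; the point is only that stages known to introduce nonlinearity are moved to the end of the enhancement chain.

```python
def order_stages(stages):
    """Place stages that introduce nonlinearity after all the others."""
    # sorted() is stable, so stages keep their relative order within each group.
    return sorted(stages, key=lambda stage: stage["nonlinear"])


stages = [
    {"name": "noise_reducer", "nonlinear": True},
    {"name": "echo_canceller", "nonlinear": False},
]

call_order = [stage["name"] for stage in order_stages(stages)]
print(call_order)
```

Here the echo canceller sees the unmodified input, and the noise reducer runs last.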

Nitin Jain currently works in the research and development group of MindTree Consulting in Bangalore, India. He holds an engineering degree in electronics and communications. Jain can be reached at nitin_jain@mindtree.com.
