Tutorial: Using high precision complex-valued arithmetic in audio DSP applications

Many new embedded consumer applications, such as computer audio and video reproduction, home theatre, electronic musical instruments, and car infotainment, require high quality digital audio. New formats, more demanding than the classic audio CD, are emerging in these mass markets to fill user needs for better sound quality at lower cost.

Meeting the high performance requirements of such mainstream audio applications requires moving beyond the approaches traditionally employed to reduce the computational burden, such as forcing algorithms into the real domain.

In the present environment, the modelling of audio is better achieved using complex-domain representations such as the STFT, the Gabor transform or complex wavelets, especially if done on an advanced multicore DSP/RISC system-on-a-chip (Figure 1, below) design that natively supports floating-point arithmetic in the complex domain, with an extended 32-bit mantissa offering greater numeric precision without scaling problems.

Typical of the kinds of audio that can be addressed with such approaches is Dolby Digital (used by most DVD-Video titles, DTV, satellite and cable systems, and supported by the new PC audio interfaces), a lossy format providing up to 5.1 discrete channels of information, with compression ratios ranging from 3.4:1 to 15:1.

After decoding, Dolby's discrete channels emerge for left, right, left rear, right rear and center channels, with a separate Low Frequency Effects (bass) channel for added high-impact effects. Dolby Digital uses a bit pool, which assigns bits to channels as needed (quantization varies between 16, 18, 20 and 24 bits/sample) rather than maintaining a fixed number of bits per channel.

Another is DTS (Digital Theatre Systems), a lossy audio format capable of delivering up to 6.1 discrete channels of audio at 24-bit precision. It is less compressed than Dolby Digital, since it compresses the seven channels independently at a 3:1 ratio, removing two-thirds of the redundant PCM data and multiplexing the remaining one-third into a single bitstream. More complex schemes have also been proposed, such as THX Ultra2 Cinema Mode and THX MusicMode.

For high quality audio-only material, the Super Audio CD has been proposed: it is capable of reproducing up to 5.1 high quality channels using the DST (Direct Stream Transfer) system. Another high-end audio format is the DVD-Audio disc, which allocates most of its space to audio. It reproduces better-than-CD audio recordings using higher sampling rates (88.2, 96, and in some cases 192 kHz, compared with 44.1 kHz for regular CDs) and higher resolution (20 to 24 bits, compared with 16 bits for CDs), in stereo or in up to full 6-channel schemes.

To deal with such a high volume of data without reducing quality, DVD-Audio can employ a form of lossless compression, MLP (Meridian Lossless Packing), a data packing method that provides a variable, signal-dependent compression ratio averaging only about 2:1 but capable of exactly reconstructing the original audio stream.

A common claim of all the new formats is that they overcome the limitations of the audio CD standard, which uses just a stereo stream of uncompressed 16-bit PCM data sampled at 44.1 kHz. Better audio quality is achieved not only by increasing the number of channels, but also by using more quantization bits (up to 24 per sample), higher sampling frequencies (up to 192 kHz) and more complex processing algorithms. Moreover, post-processing systems able to actively deal with the acoustic characteristics of the room where the audio is reproduced are becoming concrete options (Room Processors, Adaptable Speaker Arrays, etc.).

In such scenarios, specialized high performance DSP processors are essential to guarantee the required quality during all the processing stages needed for compressing, decompressing and processing high-precision, high-sample-rate audio material. Besides the classical audio algorithms, such processors make it possible to exploit more powerful and innovative algorithms, typically multichannel based, for new speech and music applications.

Audio applications employing complex arithmetic
The number of audio applications that make use of complex-valued arithmetic, to a greater or lesser degree, is large and growing. They include:

(1) Fast and accurate filtering of audio streams
The classical family of FFT-based fast convolution algorithms, including new very-low-latency techniques.
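As an illustration, here is a minimal overlap-save sketch in Python/NumPy; the FFT size and the sanity check are our own illustrative choices, not prescriptions from the article:

```python
import numpy as np

def overlap_save(x, h, nfft=1024):
    """FIR filtering via overlap-save: complex FFTs replace direct convolution."""
    M = len(h)                                   # filter length (assumed > 1)
    L = nfft - M + 1                             # new input samples per block
    H = np.fft.fft(h, nfft)                      # filter spectrum, computed once
    hist = np.zeros(M - 1)                       # last M-1 inputs carried over
    out = []
    for start in range(0, len(x), L):
        block = x[start:start + L]
        frame = np.concatenate([hist, block])
        hist = frame[-(M - 1):]                  # history for the next block
        y = np.fft.ifft(np.fft.fft(frame, nfft) * H).real
        out.append(y[M - 1:M - 1 + len(block)])  # drop circularly aliased part
    return np.concatenate(out)

# Sanity check against direct convolution:
# x = np.random.randn(5000); h = np.random.randn(300)
# assert np.allclose(overlap_save(x, h), np.convolve(x, h)[:len(x)])
```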

(2) Audio encoding/decoding
The number of encoding/decoding schemes for speech and audio signals is vast. Recently, a great deal of research activity has been devoted to new techniques for encoding the latest multichannel high quality formats. Among all the encoding/decoding techniques, many algorithms work in the frequency domain, requiring complex-valued computations.
Even when common transform-based compression schemes are considered (mostly based on the widely used real-valued DCT), complex-valued implementations have been proposed, for example for the fast computation of the IDCT.

(3) Acoustic echo cancellation
The problem of acoustic echo cancellation (AEC) in hands-free devices or audio conference systems is mainly an identification and filtering problem. Due to the increasing demand for wideband speech quality, the sampling frequency of many hands-free systems is now high, and therefore very long FIR filters have to be identified to model the impulse responses of typical office rooms.
The computational burden of AEC in these cases can easily become very high, especially in stereo devices, often requiring efficient complex-valued computations such as the block or partitioned-block frequency-domain adaptation schemes (e.g. generalizations of the basic fast LMS). More advanced approaches make direct use of time-frequency analyses that can easily require complex-valued operations; see for example the use of Gabor transforms in stereo AEC devices.
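As a concrete, if simplified, illustration of such schemes, here is a single-partition fast (block) LMS sketch in Python/NumPy; the block size and step size are illustrative assumptions:

```python
import numpy as np

def fast_block_lms(x, d, M=256, mu=0.1):
    """Fast block LMS: overlap-save filtering plus a gradient-constrained
    weight update, all carried out with complex-valued FFTs."""
    nfft = 2 * M
    W = np.zeros(nfft, dtype=complex)            # frequency-domain weights
    x_prev = np.zeros(M)
    errors = []
    for k in range(len(x) // M):
        x_blk = x[k * M:(k + 1) * M]
        X = np.fft.fft(np.concatenate([x_prev, x_blk]))
        x_prev = x_blk
        y = np.fft.ifft(X * W).real[M:]          # last M output samples are valid
        e = d[k * M:(k + 1) * M] - y             # error against desired signal
        E = np.fft.fft(np.concatenate([np.zeros(M), e]))
        g = np.fft.ifft(np.conj(X) * E).real[:M] # gradient constraint step
        W += mu * np.fft.fft(np.concatenate([g, np.zeros(M)]))
        errors.append(e)
    return np.concatenate(errors), W

# In an AEC setting, x would be the far-end reference, d the microphone
# signal, and the returned error the echo-compensated output.
```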

(4) Noise reduction
Common companions of AEC devices are noise reduction algorithms, for example those based on the spectral subtraction idea. Most techniques work in the frequency domain and can require complex-valued computations, taking advantage of tight integration with the AEC itself to save complexity. Recently, wavelet-based techniques have also been proposed.
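A bare-bones magnitude spectral subtraction sketch; the window, hop and spectral floor beta are illustrative assumptions, and the noise estimate is assumed to come from a noise-only segment:

```python
import numpy as np

def spectral_subtract(noisy, noise_mag, nfft=512, hop=256, beta=0.02):
    """Magnitude spectral subtraction with a spectral floor. noise_mag is the
    average |FFT| of a noise-only segment (same nfft and window)."""
    win = np.hanning(nfft)
    out = np.zeros(len(noisy))
    wsum = np.zeros(len(noisy))
    for s in range(0, len(noisy) - nfft + 1, hop):
        spec = np.fft.fft(win * noisy[s:s + nfft])       # complex analysis
        mag = np.maximum(np.abs(spec) - noise_mag,       # subtract noise level
                         beta * np.abs(spec))            # ...but keep a floor
        frame = np.fft.ifft(mag * np.exp(1j * np.angle(spec))).real
        out[s:s + nfft] += frame * win                   # overlap-add resynthesis
        wsum[s:s + nfft] += win ** 2
    return out / np.maximum(wsum, 1e-12)
```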

(5) Audio conditioning (3D)
New techniques are emerging to accurately reproduce audio scenes or to modify the actual listening environment. When these techniques are applied to quality audio, the required high sampling frequencies can make the modelling filters very long. Cross-talk reduction and 3D filtering are therefore often implemented using efficient complex-valued frequency-domain techniques, to reduce computational cost and enhance performance. Subband-based techniques are also used in these applications.

(6) Microphone array processing
Effective static or adaptive beamforming algorithms are particularly difficult to realize in the microphone-array case, due to the characteristics of the problem (wideband signals, highly reverberant environments). Many different proposals have been made to improve performance in real-world arrays; they can easily require heavy computations implemented in the complex domain, or rely on time-frequency analysis/synthesis techniques such as subband approaches.

(7) Spatial localization
Both binaural and multisensor approaches to the spatial localization of sound can imply complex-domain computations. A representative case is the self-aiming video camera employed in many new teleconferencing devices. The problem of automatically localizing the active talker is often mapped into the problem of repeated time-delay estimations between pairs of microphones taken from a suitable (small) microphone array. Both complex-valued full-band FFT computations and subband approaches have been proposed to robustly estimate these delays.
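One widely used complex-domain estimator for this task is the generalized cross-correlation with phase transform (GCC-PHAT); a minimal sketch under our own illustrative conventions:

```python
import numpy as np

def gcc_phat_delay(a, b, fs, max_tau):
    """Estimate the relative delay between two microphone signals via GCC-PHAT.
    A positive result means a is a delayed copy of b."""
    nfft = 2 * max(len(a), len(b))
    R = np.fft.rfft(a, nfft) * np.conj(np.fft.rfft(b, nfft))
    R /= np.maximum(np.abs(R), 1e-12)            # PHAT: keep only phase information
    cc = np.fft.irfft(R, nfft)
    k = int(fs * max_tau)                        # search range, assumed >= 1 sample
    cc = np.concatenate([cc[-k:], cc[:k + 1]])   # lags -k ... +k
    return (np.argmax(np.abs(cc)) - k) / fs      # estimated delay in seconds
```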

(8) Bandwidth enhancement
In applications such as speech transmission, telephony and digital radio, bandwidth enhancers are becoming more and more appealing. These non-linear devices try to reconstruct missing parts of a spectrum from the parts already known, e.g. obtaining a wideband speech signal (16 kHz) from its narrowband counterpart (8 kHz) by predicting the missing spectral information. Such devices often employ complex-valued non-linear algorithms such as complex neural networks or other frequency-domain techniques.

High precision arithmetic for better audio quality
It is well-known that the perceived quality of a digitally sampled musical piece depends on the number of bits used to quantize it, provided that an adequate sampling frequency as dictated by the Nyquist theorem is chosen.

For many years, 16-bit quantization with a 44.1 kHz sampling rate (the standard format for audio CDs) has been considered sufficient to achieve good quality. Although this choice is adequate in many cases, it can no longer be considered so if we want to achieve high audio quality in all situations.

Many musical genres are in fact characterized by strong dynamic variations inside the same musical piece (classical music comes to mind immediately) that stress the quantization process. If we set the input gain so that no (or little) clipping occurs in the high amplitude segments of the signal, we actually use only a small number of bits to represent the remaining low amplitude parts of the same signal.

On the other hand, if we set the input gain higher in order to have more bits used to quantize the low amplitude segments, we must cope with the non-linear effects due to the severe clipping in the high amplitude parts of the signal.

To overcome this problem, new audio formats quantize signals with more bits per sample, e.g. 20 or 24 bits, to reproduce larger dynamic ranges without audible loss of quality. They also use higher sampling frequencies, typically 96 kHz or more (up to 192 kHz), in order to gain quality and reduce the negative effects of the anti-aliasing/anti-imaging stages. Both strategies, even as they effectively improve the quality of the reproduced audio, demand more powerful and more accurate processors to maintain that quality through all the processing stages.
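The standard back-of-the-envelope numbers make the trade-off concrete: for an ideal uniform quantizer with a full-scale sine input, each bit contributes about 6.02 dB of dynamic range. A two-line check:

```python
# Ideal N-bit uniform quantizer, full-scale sine input: SNR ~ 6.02*N + 1.76 dB.
for bits in (16, 20, 24):
    print(f"{bits} bits -> {6.02 * bits + 1.76:.2f} dB")
# 16 bits -> 98.08 dB, 20 bits -> 122.16 dB, 24 bits -> 146.24 dB
```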

Among the consequences of raising the sampling frequency, two are particularly remarkable: the increase in computational cost due to the higher bit rate per second, and the increase in the complexity of the filtering algorithms (which means longer filters and, again, higher computational costs) due to the relative narrowing of the frequency bands.

To understand the consequences of using more bits per sample, let's briefly recall the effects of quantization in linear filtering. They are mainly three: the quantization of the filter coefficients, the quantization of the internal computations, and the non-linear effects of overflows. It is well known that the first effect causes alterations in the frequency response of the filters; recall the Kaiser rule: the denser the filter's clusters of poles and/or zeros (i.e. narrowband, high-Q filters), the larger the alteration. Moreover, the alteration depends on the structure used to implement the filter: low-sensitivity structures such as cascade, lattice, coupled forms, etc. are known to be much more robust to quantization than others such as direct forms.

About the second effect: a classical, simple linear analysis, which treats quantization operators as localized noise sources, shows that great improvements in signal-to-noise ratio are obtained using double-precision accumulators, although the final value also depends on the spectral characteristics of the filter.

Finally the presence of overflows causes complex non-linear behaviours, such as limit cycles, which can be reduced (or avoided) using extra bits and suitable signal scaling strategies (which however can lower the final signal-to-noise ratio).

In general we can say that the generated noise levels will depend on the algorithm itself (recursive computations are worse), on its spectral features (sharp, non-stationary is worse), on the chosen implementation structure (some are more robust than others), on the number of bits available for computation and on the selected arithmetic (fixed or floating point). The tolerable noise level will in turn depend on the input/output precision and on the requested quality.

Consider the exemplary case of implementing a difficult linear IIR filter (e.g. narrowband, non-stationary) with 24-bit input/output data sampled at 96 kHz. The high sampling frequency makes the filter even more challenging, increasing its complexity and narrowing its bands. The 24-bit input/output requires that such precision be maintained throughout the computations of the recursive filter. Even if a robust implementation structure such as a lattice or coupled-form cascade is employed, the widely used 24-bit fixed-point and 32-bit floating-point (24-bit mantissa) formats no longer guarantee the quality needed by future audio applications.
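A small numerical experiment illustrates the point. The sketch below (our own illustrative setup, not taken from the article) runs the same narrow resonator near DC at 96 kHz once with a 24-bit mantissa (float32) and once in double precision, and measures the divergence caused purely by round-off in the recursion:

```python
import numpy as np

def biquad(x, b, a, dtype):
    """Direct-form-II-transposed biquad with all arithmetic in `dtype`."""
    b = np.asarray(b, dtype); a = np.asarray(a, dtype)
    z1, z2 = dtype(0.0), dtype(0.0)
    y = np.empty(len(x), dtype)
    for n, xn in enumerate(np.asarray(x, dtype)):
        yn = b[0] * xn + z1
        z1 = b[1] * xn - a[1] * yn + z2
        z2 = b[2] * xn - a[2] * yn
        y[n] = yn
    return y

# Narrow resonator: 60 Hz center, ~20 Hz bandwidth, fs = 96 kHz, so the
# poles crowd the z = 1 point, the hard case described in the text.
fs, f0, bw = 96000.0, 60.0, 20.0
r, w0 = 1 - np.pi * bw / fs, 2 * np.pi * f0 / fs
a = [1.0, -2 * r * np.cos(w0), r * r]
g = (1 - r * r) / 2
b = [g, 0.0, -g]

x = np.random.default_rng(0).standard_normal(20000)
err = biquad(x, b, a, np.float32) - biquad(x, b, a, np.float64)
print("RMS round-off divergence:", float(np.sqrt(np.mean(err ** 2))))
```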

The 40-bit floating-point solution
As previously seen, to achieve the high quality required by the new audio applications, higher precision arithmetic must be used. Such arithmetic can be implemented in a digital signal processor using fixed-point or floating-point formats. We do not wish to contribute here to the endless discussion among the supporters of the former and the supporters of the latter technology; however at this point some conclusions can be drawn.

Fixed-point formats implemented by integer arithmetic were the first to appear in commercial DSP products and now are commonly implemented in products with precision ranging from 16 to 24 bits. Among the drawbacks of this solution two are particularly manifest:

(1) The generated noise floor is constant in level, i.e. at low signal levels the noise is not far below the signal, worsening the signal-to-noise ratio; and,
(2) To exploit the available dynamic range and to avoid non-linear effects of overflows a more or less esoteric scaling policy must be implemented.

From a software point of view, designing and implementing a complex DSP algorithm on a fixed-point processor can be quite difficult and time-consuming, generally requiring highly trained personnel.

On the other hand, a very skilful programmer can take full advantage of DSP hardware that can control every single round-off operation. Recently, for advanced audio applications, a double-precision 48-bit fixed-point format has been proposed and shown to outperform 32-bit floating point on particularly difficult-to-implement filters [2]. This format typically offers 32 bits for the audio sample with 8 bits of headroom and 8 guard bits, but it again leaves to the programmer the problem of correctly scaling each signal inside the algorithm in order to achieve maximum quality.

At the other end, floating-point formats appeared later in commercial DSP processors, one of the first being the DSP32 chip from AT&T Labs. Nowadays an increasing number of processors makes floating point available to programmers. A well-known standard exists (IEEE-754) which specifies 32-bit single precision, extended precision (commonly implemented as a 40-bit format), and 64-bit double precision. Considering the previous drawbacks of fixed-point formats, we can now say:

(1) The generated noise floor is not constant in level, i.e. it is directly related to the amplitude of the audio itself: at low audio levels the noise will be accordingly low (“pumping” noise floor); and,
(2) No scaling is usually needed, i.e. it is possible to design digital filters without worrying about scaling the internal state variables (practically no overflow).

The pumping noise effect is generally a good feature, although it can become a drawback if not enough bits are used. From a software point of view, designing and implementing a complex DSP algorithm on a floating-point processor can become a (relatively) quick and easy job, since floating point frees programmers from scaling and dynamic-range problems. Considering also the rising costs of software development, the floating-point solution appears today to be the preferred choice, given appropriately capable hardware.
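The contrast between the two noise floors is easy to demonstrate numerically. In this illustrative sketch (our own construction), the same white signal is attenuated and quantized both to an ideal 24-bit fixed-point grid and to float32; the fixed-point SNR falls dB-for-dB with the signal level, while the floating-point SNR stays roughly constant:

```python
import numpy as np

x0 = np.random.default_rng(1).uniform(-1, 1, 100_000)
snr = lambda x, q: 10 * np.log10(np.mean(x**2) / np.mean((x - q)**2))

for gain_db in (0, -40, -80):                        # loud ... very quiet signal
    x = x0 * 10 ** (gain_db / 20)
    q_fix = np.round(x * 2**23) / 2**23              # ideal 24-bit fixed point
    q_flt = x.astype(np.float32).astype(np.float64)  # 24-bit-mantissa float
    print(f"{gain_db:4d} dBFS: fixed {snr(x, q_fix):6.1f} dB,"
          f" float {snr(x, q_flt):6.1f} dB")
```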

Among floating-point formats, the most commonly used for audio processing is currently the IEEE-754 32-bit single precision. This format, with its 24-bit mantissa, is adequate for many simple audio tasks; however, when the new high-precision, high-sample-rate, multichannel applications are considered, it quickly becomes insufficient. From the brief analysis above, it is easy to see that 24 bits of mantissa are not sufficient to cope with difficult-to-implement algorithms working on, for example, a single channel of 24-bit digital audio sampled at 96 kHz. Practical experiments have recently demonstrated this fact [2].

At the other end, double-precision 64-bit floating point, with its 53-bit mantissa, has room to spare even for difficult computational tasks in high-precision, high-sample-rate, multichannel applications. However, the hardware complexity required by this solution can make it impractical in terms of speed and cost for most applications. It is interesting to note that, on one hand, such a high number of bits is often useless, because a sufficiently low noise floor quickly becomes inaudible; on the other hand, it does not guarantee the absence of negative effects when very poorly designed algorithms are implemented (e.g. direct-form IIR filters).

It is easy to understand that reasonably designed algorithms can achieve professional performance even with less demanding floating-point formats having more than 32 but fewer than 64 bits. The 40-bit extended-precision floating-point format therefore appears to be the best compromise between the need for precision and computational complexity, as far as the existing IEEE-754 standard is concerned.

Thanks to its 32-bit mantissa, and the consequently low pumping noise floor, it can offer at reasonable cost all the precision and processing power needed by the new mass-market advanced audio applications, together with the ease and speed of software development offered by floating-point arithmetic.

With a high-performance DSP processor that offers full support for the IEEE-754 extended-precision 40-bit floating-point format, the designer can concentrate more on the algorithms than on their implementation details, producing innovative solutions with shorter time-to-market.

Digital Signal Processing and complex arithmetic
The relationship between Digital Signal Processing and complex-valued arithmetic is of long standing. Since the seminal publication of the FFT in 1965, complex-valued computations have become very common in many DSP applications, even when real-valued signals are processed.

It is in fact well known that real-valued discrete-time signals can be represented in the frequency domain as complex-valued functions. The Discrete-Time Fourier Transform (DTFT) is the mathematical tool for modelling signals in the frequency domain, while from an implementation point of view its sampled version (the DFT, computed via the FFT) is utilized. The FFT is an essential tool for DSP, widely used in a large variety of algorithms (a recent survey puts the number of published papers on the subject in the tens of thousands).

Starting from FFT, various complex-valued tools have been proposed and widely used in practice such as complex cepstrum, CQT (Constant Q Transform), STFT (Short Time Fourier Transform) and others.

Other complex-valued signal analysis/synthesis tools have been introduced for DSP applications, such as Gabor transform, complex wavelets, Hilbert transform and complex lapped transforms.

Recently, non-linear techniques and algorithms have been proposed to solve various problems in many fields of DSP; among these, techniques based on the artificial neural network paradigm have attracted great interest. Here too, complex-valued extensions of known neural techniques exist: non-linear structures such as complex-valued dynamic artificial neural networks have been proposed and used for non-linear filtering, signal reconstruction and recognition.

Complex arithmetic for fast algorithms
Static and adaptive filtering, identification, synthesis and recognition can be made more efficient using such complex-valued computations. In this context, frequency-warped filters and predictors have also been proposed and used. A classical application of complex-valued arithmetic is the implementation of fast algorithms exploiting the FFT's computational efficiency, either in its vanilla form or in more efficient forms specialized for real-valued signals.

The first and most common such algorithms are those employed for the fast computation of convolution and correlation, possibly in their windowed versions, i.e. the overlap-save and overlap-add algorithms, which are in turn common building blocks of many other algorithms. Besides the usual fast implementation of static FIR filters, these FFT-based techniques have been extended to the implementation of optimal and adaptive filters such as FDAF (Frequency-Domain Adaptive Filters), fast frequency-domain block convolution, fast LMS, partitioned-block FDAF and many others.

Complex arithmetic for Time-Frequency analysis/synthesis
Another large group of complex-valued algorithms is related to time-frequency analysis/synthesis. Spectral estimation for stationary or non-stationary signals in fact depends heavily on complex-valued computations, for both classical and parametric approaches.

Classical spectral estimation techniques for deterministic and random signals, such as the periodogram and Blackman-Tukey estimators, employ the direct computation of the complex-valued FFT, which ties the resolution of the estimate to the number of available samples.
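A textbook periodogram, for instance, is just a windowed complex FFT, with resolution fs/N by construction (a minimal sketch):

```python
import numpy as np

def periodogram(x, fs):
    """Windowed periodogram PSD estimate; resolution is fs / len(x)."""
    win = np.hanning(len(x))
    X = np.fft.rfft(x * win)                         # complex-valued FFT
    psd = np.abs(X) ** 2 / (fs * np.sum(win ** 2))   # power spectral density
    return np.fft.rfftfreq(len(x), 1 / fs), psd
```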

Parametric spectral estimation techniques instead overcome the resolution constraints of their classical counterparts by approximating the signal with a parametric model and taking advantage of some form of a-priori knowledge about the signal itself. While indirect methods (ARMA spectral estimation, minimum-variance spectral estimation), based on a biased or unbiased estimate of auto- and cross-correlation functions, usually employ mostly real-valued computations, direct methods generally deal with models in the frequency domain, resulting in direct complex-valued computations.

A basic assumption of these spectral estimation algorithms is that the involved signals are stationary. However, discrete-time sequences coming from real-world sampling generally do not satisfy this property. Time-frequency analysis is hence introduced to obtain a description of the frequency content of signals that takes their non-stationarity into account.

Nowadays the use of time-frequency analysis/synthesis techniques is growing rapidly, due to the greater accuracy offered by these models and the processing power now available to implement them. The complex-valued Short Time Fourier Transform (STFT) is the most common approach to time-frequency analysis/synthesis, as it is the natural evolution of the Fourier transform to the non-stationary case. In its continuous form it is defined as:

STFT(τ, ω) = ∫ s(t) w(t − τ) e^(−jωt) dt

It is a function on the (τ, ω) plane, built from modulated versions of the window signal, which act as basis functions. The STFT can thus be seen either as a sequence of Fourier transforms of windowed segments w(t − τ)s(t) of the signal, or as a modulated analysis filter bank, in which the signal is filtered at each frequency with a bandpass filter whose impulse response is the window function modulated to that frequency. Obviously, the STFT is not the only way to represent non-stationary signals in the frequency domain, and other techniques have been proposed to overcome some of its drawbacks.
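In discrete form the STFT amounts to one windowed complex FFT per hop, which is how it is computed in practice; a minimal sketch with illustrative window and hop choices:

```python
import numpy as np

def stft(x, nfft=1024, hop=256):
    """Discrete STFT: complex spectrogram, one windowed FFT per hop."""
    win = np.hanning(nfft)
    frames = [np.fft.fft(win * x[s:s + nfft])
              for s in range(0, len(x) - nfft + 1, hop)]
    return np.array(frames)    # shape (time frames, nfft), complex-valued
```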

The complex-valued Gabor transform addresses the high redundancy of the STFT by means of a natural discretization in both domains:

τ = nN,  ω = 2πk/N  (n integer, k = 0, 1, …, N − 1)

with N being the window length and the number of frequency bins respectively. The constitutive equation of the Gabor transform hence becomes:

G(n, k) = Σ_m s(m) w(m − nN) e^(−j2πkm/N)

and the interpretation of this transform as a suitable filter bank is straightforward (see Figure 2, below). This transform is also referred to as the sampled STFT, and the two names are often used interchangeably.

Figure 2: STFT and Gabor transform as an N-band filter bank: analysis and synthesis sections.

Recently, wavelet-based techniques have attracted great interest in the DSP community as significant processing tools. Wavelet transforms (WT) in fact overcome the problem of the fixed time/frequency resolution (Δτ and Δω respectively) of the STFT.

The product Δτ·Δω is lower bounded by a constant, as stated by Heisenberg's uncertainty principle. Δτ and Δω do not vary with frequency in the STFT, while they do in wavelet analysis (the relative bandwidth Δω/ω is constant, so Δτ shrinks as frequency grows), resulting in a more accurate analysis of the signals under inspection. This is particularly true when the information they contain is not uniformly distributed over the spectrum, but localized in particular areas of it (e.g. the low frequencies, in the case of speech). The Continuous Wavelet Transform (CWT) is defined as follows:

CWT(α, τ) = (1/√α) ∫ x(t) ψ*((t − τ)/α) dt

where x(t) is a generic continuous-time signal, ψ is the analyzing (mother) wavelet, and α, τ are the scale and time-shift parameters, which can be read in terms of the time-frequency plane. The discrete counterpart of the previous expression, for discrete-time signals x(n), is obtained by discretizing the α, τ parameters on a suitable grid.

Two different schemes have been proposed: the multiresolution pyramid and subband coding. The former performs an overcomplete transform, while critical sampling occurs in the latter, which is based on a binary subband tree structure. Depending on the regularity of this structure, various types of wavelet transforms are obtained; the most used are the Discrete Wavelet Transform (DWT) and the Packet Wavelet Transform (PWT), implementing a dyadic and a regular structure respectively.

Thanks to its tree structure, a subband coding scheme can easily be interpreted as a filter bank. The DWT, for example, is nothing but an octave-band filter bank, with its analysis and synthesis sections, as depicted in Figure 3 below.

Figure 3: DWT as an octave band filter bank: analysis and synthesis sections.
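The filter-bank view translates directly into code. A minimal dyadic DWT sketch using the Haar filter pair (the simplest possible choice, used here only for illustration):

```python
import numpy as np

H0 = np.array([1.0, 1.0]) / np.sqrt(2)    # Haar lowpass (analysis)
H1 = np.array([1.0, -1.0]) / np.sqrt(2)   # Haar highpass (analysis)

def dwt(x, levels=3):
    """Dyadic (octave-band) DWT: filter and decimate, recursing on the low band."""
    bands = []
    for _ in range(levels):
        d = np.convolve(x, H1)[1::2]       # detail coefficients at this octave
        x = np.convolve(x, H0)[1::2]       # approximation fed to the next stage
        bands.append(d)
    bands.append(x)                        # final lowpass residue
    return bands
```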

The two filter-bank structures in Figures 2 and 3 explain the different coverage of the signal spectrum by the filter-bank output sequences corresponding to the STFT and the wavelet transform, as shown in Figure 4, below. As previously noted, time and frequency resolutions vary with frequency only in the second case.

Figure 4: Coverage of spectrum of original signal through coefficient sequences delivered by DWT (case a) and STFT (case b) filter banks.

Wavelet transforms based on critical sampling exhibit a relevant lack of shift invariance: small shifts in the input signal can produce major variations in the energy distribution across the wavelet coefficient sequences. This is why the complex-valued wavelet transform is used: it represents a valid alternative to undecimated forms of the subband structure, especially from a computational point of view.

Two main paradigms have been proposed in the literature to implement a complex wavelet transform: a dual-tree DWT [8], or the DWT of a complex sequence obtained through a suitable mapping function acting on the signal to be transformed [9]. In both cases, the coefficient sequences delivered by the transform are complex-valued, not real-valued as in the common DWT or PWT. This approach is quite recent, but seems very promising for the audio research community.

Another field where complex-valued computations have recently been introduced with success is that of the modulated lapped transforms (MLT). These are cosine-modulated filter banks with the perfect-reconstruction property. They exhibit none of the blocking artifacts that occur with the common Discrete Cosine Transform (DCT), and offer almost optimal performance for transform coding of a wide variety of signals.

Because of those properties, MLTs are used in most modern audio coding systems, such as Dolby AC-3, MPEG-2 Layer III, and others. One disadvantage of the MLT for some applications is that its transform coefficients are real, and so they do not explicitly carry phase information. In audio processing, two applications that need complex subbands are noise reduction via spectral subtraction and acoustic echo cancellation.

The complex lapped transform (CLT) was first introduced with the purpose of using its phase information for motion estimation in video coding. Recently, the modulated complex lapped transform (MCLT) has been proposed as a very simple extension of the MLT, obtained by using both cosine and sine modulation; it has been applied to noise reduction and echo cancellation with promising results [10]. The MCLT has approximately twice the complexity of the MLT: its application can be thought of as an audio signal enhancement step preliminary to coding.

An example: blind audio signal separation/deconvolution
Although not directly related to the new multichannel audio formats, blind signal separation/deconvolution algorithms have found applications in various fields such as biomedical and geophysical signal analysis and processing, data mining, speech enhancement, image recognition and wireless communications. The largest number of applications, however, can probably be found in the area of audio signal processing, usually for enhancement, separation and extraction purposes.

Nowadays, interesting applications of such blind multichannel algorithms are becoming possible with the use of a DSP engine equipped for high complex-valued arithmetic performance. This class of audio problems is very difficult and computationally demanding, due to the required simultaneous inversion of multiple long impulse responses. Even in the simple case of an exactly determined system (i.e. the same number of microphones and sound sources), heavy complex-domain computations are required to obtain reasonable performance. In the underdetermined case, complex-valued time-frequency analyses are usually employed to increase sparsity.

An extensively studied area in digital signal processing, Blind Signal Processing (BSP) has many practical applications, among them digital audio processing, as examination of a typical BSP problem makes clear. Suppose we have a collection of recorded signals represented by a signal vector

x(k) = [x_1(k), x_2(k), …, x_m(k)]^T

which is the output of a MIMO (multiple input, multiple output) dynamical system, whose input signal vector is

s(k) = [s_1(k), s_2(k), …, s_n(k)]^T

The goal consists of recovering the unknown vector s(k) on the basis of some available a-priori knowledge about the source signals and the MIMO system. In many cases such a system is linear, performing a convolution operation on the input signals (see Figure 5, below), so the objective becomes estimating the original sources without any knowledge of the parameters of the mixing/filtering system A(z). This means determining the parameters of the reconstruction system W(z) through an adaptive scheme, in order to deliver output signals y(k) as similar to s(k) as possible, using only the mixture vector x(k) (with noise v(k) possibly added).

Figure 5: General structure of BSP problem

When the model is instantaneous (x(k) = A s(k)) and the sources are mutually statistically independent, the problem is referred to as Blind Source Separation (BSS) and/or Independent Component Analysis (ICA), even though the two terms cannot rigorously be interchanged.

In such a framework, the parameters of the demixing matrix W are adaptively adjusted through the minimization of the Kullback-Leibler divergence between the probability density function of the random variable y and the probability density function in which all components of y are forced to be statistically independent. The choice of the former probability density function determines the learning algorithm for W: maximum likelihood, entropy maximization, proper ICA, nonlinear PCA. The general form for on-line adaptation becomes (in its common natural-gradient form):

W(k+1) = W(k) + μ(k) [ I − φ(y(k)) y(k)^T ] W(k)

where φ(·) is a suitable nonlinearity derived from the assumed source densities.

The demixing matrix at convergence corresponds to the inverse of A, up to permutation and scaling indeterminacies. A simple example is shown in Figure 6, below.

Figure 6: An example of a BSS problem with two sources and two mixtures: original sources, mixtures and recovered sources are depicted.
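A toy numerical counterpart of Figure 6 takes only a few lines. This sketch (our own, assuming a tanh nonlinearity appropriate for super-Gaussian sources) applies the natural-gradient rule above to an instantaneous two-by-two mixture:

```python
import numpy as np

def natural_gradient_bss(x, mu=0.05, iters=300):
    """Batch natural-gradient ICA: W <- W + mu (I - phi(y) y^T / T) W."""
    m, T = x.shape
    W = np.eye(m)
    for _ in range(iters):
        y = W @ x
        W += mu * (np.eye(m) - np.tanh(y) @ y.T / T) @ W
    return W

rng = np.random.default_rng(2)
s = rng.laplace(size=(2, 5000))            # two super-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])     # unknown mixing matrix
W = natural_gradient_bss(A @ s)
print("W @ A (ideally a scaled permutation):\n", W @ A)
```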

A typical situation involving BSP is the “cocktail party problem” (Figure 7, below), where we are interested in simulating the human ability to focus attention on one sound among the many occurring in a reverberant room.

Figure 7: The cocktail party problem.

Humans are in fact able to concentrate on listening to a single sound in the midst of other sounds and noises, but the mechanisms supporting this ability are not yet completely understood. BSP represents an interesting approach to simulating such a capability when we only have access to the mixture audio signals, i.e. real sequences derived from A/D conversion of the acoustic signals detected by a suitable sensor (microphone) array.

However, when real-world audio signals are considered, the instantaneous BSS model above becomes unrealistic, since rooms are not instantaneous but characterized by long impulse responses. Fortunately, the previous framework can be generalized and extended to this more realistic case, called Multichannel Blind Deconvolution (MBD), where dynamical systems are involved, i.e. x(k) = A ∗ s(k), a convolutive mixture.

The simplest way of doing this consists of employing complex-valued frequency-domain techniques: a convolutive mixture in the time domain corresponds to an instantaneous mixture of complex-valued signals in the frequency domain, x(ω) = A(ω) s(ω). It is therefore possible to apply a common ICA-based learning algorithm at each frequency bin ω to obtain the corresponding demixing matrix W(ω) and estimated source signals y(ω). This justifies the need for suitable algorithms able to operate on complex-valued signals, as for example [11].
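Structurally, the frequency-domain approach is just one small complex-valued ICA per bin. A skeleton sketch, using a complex natural-gradient rule with the y/|y| nonlinearity often assumed for speech (the per-bin permutation and scaling alignment, a known difficulty, is omitted here):

```python
import numpy as np

def complex_ng_ica(Xk, mu=0.02, iters=200):
    """Complex natural-gradient ICA for one bin. Xk: (mics, frames), complex."""
    m, T = Xk.shape
    W = np.eye(m, dtype=complex)
    for _ in range(iters):
        Y = W @ Xk
        phi = Y / (np.abs(Y) + 1e-9)               # nonlinearity for speech
        W += mu * (np.eye(m) - phi @ Y.conj().T / T) @ W
    return W

def freq_domain_bss(X):
    """X: (mics, frames, bins) complex STFTs of the mixtures.
    Runs one complex-valued ICA independently in every frequency bin."""
    Y = np.empty_like(X)
    for k in range(X.shape[2]):
        Y[:, :, k] = complex_ng_ica(X[:, :, k]) @ X[:, :, k]
    return Y
```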

Other, more specific MBD techniques have been proposed which work directly in the time domain, using real-valued signals and convolution operators. It is important to stress, however, that complex-valued computations are usually employed in this case too: to reduce the computational burden of the demixing system, i.e. a matrix of long adaptive FIR filters (easily thousands of taps long), the efficiency of the complex-valued FFT is exploited for the computation of convolutions and other operations.

The MBD approach has been proposed for many different audio situations: for example, the introduction of suitable a-priori assumptions has allowed efficient ICA-based algorithms to be proposed for speech enhancement in real noisy environments.

The scenario depicted above is only valid in the (over)determined case, i.e. when the number of sources n does not exceed the number of mixtures m. This assumption is fundamental to obtaining an inverse system, which is usually what we seek in BSP. Underdetermined BSP problems instead address the n > m case, which is of great practical interest since the mixtures we have access to are usually a few mono or stereo recordings, as in the case of soundtrack extraction from real musical pieces.

Since ICA-based algorithms cannot be directly applied in this case, different assumptions are generally made to face this difficult situation, one of the most used being the sparsity of the sources: only a small number of sources is simultaneously present in the mixture vector at any given instant.

Several algorithms have recently been developed in the literature to solve the underdetermined BSS problem (m ≥ 2) [12]: the sparsity prior influences all the learning steps, i.e. both the recovery of the mixing matrix and the source separation. It is interesting to note that conversion from the time domain to the frequency domain can significantly improve the sparsity of signals; this is usually achieved using complex-valued time-frequency analysis techniques, as previously described.

Recently, the underdetermined BSS problem has been extended to the case of delayed sources, where the mixing-matrix recovery step also comprises the estimation of the delays occurring at each entry of A(z) [13]. Application of the complex-valued STFT to the involved signals is strictly required to obtain a proper representation of the overall system and to achieve a good level of source separation.

It must be noted that the monaural case (m = 1) is much more demanding, since the techniques employed to estimate A in the previous algorithms cannot be easily extended. Nevertheless, many works suggest using overcomplete representations of signals in order to describe any possible mixture as the sparse superposition of a set of atoms defined in the frequency domain.

The goal consists of detecting the occurrences of the atoms belonging to the components of s(k) in order to reconstruct each source stream; this can be achieved following different procedures, making no (or only mild) a-priori assumptions on the nature of the sources. Complex-valued computations are needed in these cases too.

Francesco Piazza is a professor of electronics at Università Politecnica delle Marche (Ancona, Italy) and Piergiovanni Bazzana is the director of Advanced DSP Applications at Atmel's Advanced DSP Division in Rome, Italy.

References:
[1] A.V. Oppenheim and R.W. Schafer, “Discrete-Time Signal Processing”, Prentice Hall, 1989.
[2] J.A. Moorer, “48-bit Integer Processing Beats 32-bit Floating-Point for Professional Audio Applications”, Proc. of the 107th AES Convention, September 1999.
[3] S.K. Mitra and J.F. Kaiser, “Handbook for Digital Signal Processing”, Wiley, 1993.
[4] S. Haykin, “Neural Networks: A Comprehensive Foundation”, Prentice Hall.
[5] S. Haykin, “Adaptive Filter Theory”, 3rd ed., Prentice Hall.
[6] H.G. Feichtinger and T. Strohmer, “Gabor Analysis and Algorithms: Theory and Applications”, Birkhäuser, 1998.
[7] M. Vetterli and J. Kovačević, “Wavelets and Subband Coding”, Prentice Hall, 1995.
[8] N.G. Kingsbury, “Complex wavelets for shift invariant analysis and filtering of signals”, Journal of Applied and Computational Harmonic Analysis, vol. 10, no. 3, pp. 234-253, May 2001.
[9] F.C.A. Fernandes, R.L.C. van Spaendonck, and C.S. Burrus, “A New Framework for Complex Wavelet Transforms”, IEEE Transactions on Signal Processing, vol. 51, no. 7, pp. 1825-1837, July 2003.
[10] H.S. Malvar, “A modulated complex lapped transform and its applications to audio processing”, Proc. ICASSP'99, pp. 1421-1424, Phoenix, AZ, March 1999.
[11] E. Bingham and A. Hyvärinen, “A fast fixed-point algorithm for independent component analysis of complex valued signals”, Neural Computation, vol. 9, pp. 1483-1492, 1997.
[12] P. Bofill and M. Zibulevsky, “Underdetermined blind source separation using sparse representation”, Signal Processing, vol. 81, pp. 2353-2362, 2001.
[13] P. Bofill, “Underdetermined blind separation of delayed sound sources in the frequency domain”, Neurocomputing, vol. 55, pp. 627-641, October 2003.
