Achieving better voice quality: why smartphones need 3 microphones
While mobile devices continue to play a prominent role in people’s lives, voice quality has not improved significantly since the introduction of mobile phones. This is due primarily to network and device limitations, since the sound frequency range used by mobile devices has historically been constrained by narrowband, circuit-switched networks, resulting in lower voice quality than that of a face-to-face conversation.
In addition, mobile devices have been unable to adequately separate the user’s voice from background noise, forcing users to tolerate noisy, poor-quality voice communication. After years of mobile network infrastructure investments in bandwidth and connectivity, mobile network operators (MNOs) are now turning their attention to voice and audio quality as a way to differentiate their service offerings by improving the user experience, satisfaction, and loyalty.
The transition from narrowband to wideband communications (e.g., HD Voice and VoLTE) yields networks capable of carrying signals of higher sound quality. As users become aware of these network and device improvements, we expect they will demand improved voice quality in the devices they depend upon, even when those devices are used on legacy narrowband networks.
Dedicated voice and audio processors are expanding rapidly as a new product category, not only in mobile handsets but also in market segments such as automobile infotainment systems, desktop PCs, digital cameras, digital televisions, headsets, and set-top boxes. IDC, an independent market research firm, estimates that voice and audio processor unit sales will grow from 63 million units in 2012 to over 1.6 billion units in 2015, representing a CAGR of 92%.
A variety of trends are driving demand for high-quality voice and audio solutions in mobile devices, including:
- users requiring more freedom in how and where they communicate;
- users expecting high-quality voice and audio from their mobile devices;
- voice becoming a preferred interface for mobile device applications;
- users increasingly relying on their mobile devices for far-field interaction, where the mobile device is held away from the user, such as in speakerphone mode or video conferencing;
- users’ perception of the HD video experience being negatively affected by poor-quality audio;
- OEMs continuing to expand functionality in mobile devices; and
- MNOs deploying wideband communications networks.
These trends, in turn, introduce challenges to delivering high-quality voice and audio in mobile devices, including:
- providing high quality even when the device is used in noisy environments;
- working with the significant limitations on acoustics and signal processing imposed by the size, cost and power constraints of mobile devices; and
- implementing voice and audio signal processing techniques that are scalable and adaptable to dynamic sound environments in a way traditional technology has not been able to provide.
A field of auditory research called computational auditory scene analysis (CASA) aims to mimic the ability of the human auditory system to separate sound sources, using conventional digital signal processing principles. The classic example is the ‘cocktail party effect’: a listener is able to home in on a particular conversation and separate it from the others. From the perspective of speech intelligibility, these other conversations are considered ‘babble’ noise, a standard type of ‘distractor’ used to test phones equipped with voice processing.
In a single-microphone system, monaural cues such as pitch and onset time can be used to separate sound sources. But humans have two ears for a reason: the additional binaural information gives the brain the ability to distinguish subtle differences in both the time at which audio signals arrive at each ear, the inter-aural time difference (ITD), and the level at which they arrive, the inter-aural level difference (ILD). The principle of binaural processing in CASA relies on the interpretation of ITD and ILD cues to separate sound sources.
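To make the two binaural cues concrete, the following is a minimal sketch of how ITD and ILD could be estimated from a pair of sampled channels. It is an illustration only, not the processing used in any particular product: the function names, the 16 kHz sample rate, and the synthetic test signal are all assumptions, and the ITD is taken simply as the lag that maximizes the cross-correlation between the channels.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def estimate_ild_db(left, right):
    """Level difference in dB; positive when the left channel is louder."""
    return 20.0 * math.log10(rms(left) / rms(right))


def estimate_itd_seconds(left, right, sample_rate_hz, max_lag):
    """Delay of `right` relative to `left`, from the cross-correlation peak.

    A positive result means the sound reached the left channel first.
    """
    n = len(left)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(
            left[i] * right[i + lag]
            for i in range(n)
            if 0 <= i + lag < n
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate_hz


# Synthetic scene (every value here is an illustrative assumption).
SAMPLE_RATE_HZ = 16000
DELAY_SAMPLES = 3
rng = random.Random(0)
source = [rng.uniform(-1.0, 1.0) for _ in range(256)]
left = source
# The right channel hears the same source later and at half the amplitude.
right = [0.0] * DELAY_SAMPLES + [0.5 * s for s in source[:-DELAY_SAMPLES]]

itd = estimate_itd_seconds(left, right, SAMPLE_RATE_HZ, max_lag=8)
ild = estimate_ild_db(left, right)
print(f"ITD = {itd * 1e6:.1f} microseconds, ILD = {ild:.1f} dB")
```

In this synthetic scene the right channel is a copy of the left delayed by three samples (187.5 microseconds at 16 kHz) and halved in amplitude (roughly 6 dB), so the two estimates should recover approximately those values. A real system would compute such cues per frequency band and per short time frame rather than over a whole signal.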
For human listeners, ITD and ILD provide cues in complementary frequency ranges, at least for the ideal scenario of a point sound source in a free field, such as an outdoor location or an anechoic chamber (i.e., an enclosed space insulated from noise and echoes).
ILDs are most pronounced at frequencies above approximately 1.5 kHz because it is at these frequencies that a person’s head is large compared to the wavelength of the incoming sound, so the head reflects and shadows the sound and produces a level difference between the two ears.
ITDs, on the other hand, exist at all frequencies. However, they can only be decoded unambiguously if the ITD is less than half the period of the sound at that frequency. For the spacing between the ears on a human head, this limits the use of ITD to frequencies below about 1.5 kHz. Note that since the two ‘ears’ of a phone are closer together, this frequency range can actually be extended. This explains why it can be advantageous to have closely spaced microphones; we will come back to this later.
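The half-period condition can be turned into a rough worked calculation. The sketch below assumes a simplified free-field model in which the largest possible ITD is the sensor spacing divided by the speed of sound (a straight-line path that ignores diffraction around the head or the handset). The speed of sound, the function names, and both spacing values are assumptions chosen for illustration: 0.115 m is an effective spacing at which this simple model lands near the 1.5 kHz figure quoted above, and 0.02 m is a hypothetical microphone spacing on a phone.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed value for air at about 20 degrees C


def max_itd_seconds(spacing_m, c=SPEED_OF_SOUND_M_PER_S):
    """Largest ITD for a source on the axis through both sensors."""
    return spacing_m / c


def max_unambiguous_frequency_hz(spacing_m, c=SPEED_OF_SOUND_M_PER_S):
    """Highest frequency at which the largest ITD is still under half a period.

    ITD_max < T / 2 = 1 / (2 f)  implies  f < 1 / (2 * ITD_max) = c / (2 * spacing).
    """
    return 1.0 / (2.0 * max_itd_seconds(spacing_m, c))


# Illustrative spacings (assumptions, not measurements).
for label, spacing_m in [("head-like spacing", 0.115), ("phone-like spacing", 0.02)]:
    limit_hz = max_unambiguous_frequency_hz(spacing_m)
    print(f"{label}: {spacing_m * 100:.1f} cm -> ITD usable up to about {limit_hz:.0f} Hz")
```

Under these assumptions, reducing the spacing raises the limit in inverse proportion, which is the sense in which closely spaced microphones extend the frequency range over which ITD can be used.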
ILD, ITD, and other acoustic attributes such as pitch, harmonics, onset time, and time spacing of sound components are used to characterize, separate, and eventually spatially locate multiple sound sources. Following this sound separation, a classification or grouping of sound sources is made to distinguish between desired (talker) and undesired (distractor) sound sources. Finally, a ‘voice isolation’ stage identifies the sound of interest, eliminating noise and other audio to deliver clear voice.
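As a highly simplified illustration of the classify-then-isolate idea, the sketch below labels each time-frequency cell as talker or distractor using a single cue, the level difference between a primary and a secondary microphone, and then keeps only the talker cells. This binary-mask approach is a technique swapped in here for illustration and should not be read as the method used in any specific voice processor; the function names, the 6 dB threshold, and the energy values are all hypothetical.

```python
import math


def ild_db(primary_energy, secondary_energy, eps=1e-12):
    """Level difference in dB between two energy values (eps avoids log of 0)."""
    return 10.0 * math.log10((primary_energy + eps) / (secondary_energy + eps))


def talker_mask(primary, secondary, threshold_db=6.0):
    """Mark a cell 1 (talker) when the primary mic is louder by the threshold.

    `primary` and `secondary` are lists of frames, each a list of per-band
    energies. The threshold is an assumed value, not a tuned one.
    """
    return [
        [1 if ild_db(p, s) > threshold_db else 0 for p, s in zip(p_frame, s_frame)]
        for p_frame, s_frame in zip(primary, secondary)
    ]


def isolate(cells, mask):
    """Zero out every cell that was classified as a distractor."""
    return [
        [c * m for c, m in zip(c_frame, m_frame)]
        for c_frame, m_frame in zip(cells, mask)
    ]


# Two frames by two bands of hypothetical energies.
primary = [[4.0, 1.0], [0.5, 8.0]]
secondary = [[0.5, 1.0], [0.5, 1.0]]

mask = talker_mask(primary, secondary)
voice = isolate(primary, mask)
print(mask)
print(voice)
```

The cells where the primary microphone is much louder (a ratio of 8, about 9 dB) are kept, while cells with equal energy at both microphones are treated as distant noise and removed. A practical system would combine several cues, such as ITD, pitch, and onset time, and would apply smoother gains than a hard 0/1 mask.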