While mobile devices continue to play a prominent role in people’s lives, voice quality has not improved significantly since the introduction of mobile phones. This is due primarily to network and device limitations, since the sound frequency range used by mobile devices has historically been constrained by narrowband, circuit-switched networks, resulting in lower voice quality than that of a face-to-face conversation.
In addition, mobile devices have been unable to adequately separate the user’s voice from background noise, forcing users to tolerate noisy, poor quality voice communication. After years of mobile network infrastructure investments in bandwidth and connectivity, mobile network operators (MNOs) are now turning their attention to voice and audio quality as a way to differentiate their service offerings by improving the user experience, satisfaction, and loyalty.
The transition from narrowband to wideband communications (e.g., HD Voice, VoLTE) yields networks capable of carrying signals of higher sound quality. As users become aware of these network and device improvements, we expect they will demand improved voice quality in the devices they depend upon, even when used on legacy narrowband networks.
Dedicated voice and audio processors are expanding rapidly as a new product category, not only in mobile handsets but also in market segments such as automobile infotainment systems, desktop PCs, digital cameras, digital televisions, headsets, and set-top boxes. IDC, an independent market research firm, estimates that voice and audio processor unit sales will grow from 63 million units in 2012 to over 1.6 billion units in 2015, representing a CAGR of 92%.
A variety of trends are driving demand for high-quality voice and audio solutions in mobile devices, including:
- users requiring more freedom in how and where they communicate;
- users expecting high-quality voice and audio from their mobile devices;
- voice becoming a preferred interface for mobile device applications;
- users increasingly relying on their mobile devices for far-field interaction, where the mobile device is held remotely from the user, such as in speakerphone mode or video conferencing;
- users’ perception of the HD video experience as negatively impacted by poor quality audio;
- OEMs continuing to expand functionality in mobile devices; and
- MNOs deploying wideband communications networks.
These trends, in turn, introduce challenges to delivering high-quality voice and audio in mobile devices, including:
- providing high quality even when used in noisy environments;
- working with the significant limitations on acoustics and signal processing imposed by the size, cost and power constraints of mobile devices; and
- implementing voice and audio signal processing techniques that are scalable and adaptable to dynamic sound environments in a way traditional technology has not been able to provide.
A field of auditory research called computational auditory scene analysis (CASA) aims to mimic the human auditory system’s ability to separate sound sources by using conventional digital signal processing principles. The classic example is the ‘cocktail party effect’: the human ear is able to home in on a particular conversation and separate it from others. From the perspective of speech intelligibility, these other conversations are considered ‘babble’ noise, a standard type of ‘distractor’ used to test phones equipped with voice processing.
In a single-microphone system, monaural cues such as pitch and onset time can be used to separate sound sources. But humans have two ears for a reason: the additional binaural information allows the brain to distinguish subtle differences in both the time of arrival – the inter-aural time difference (ITD) – and the level of the audio signals arriving at each ear – the inter-aural level difference (ILD). The principle of binaural processing in CASA relies on interpreting ITD and ILD cues to separate sound sources.
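As an illustration, both binaural cues can be estimated from a pair of microphone signals in a few lines of NumPy. This is only a sketch: the cross-correlation approach, frame length, and sample rate are assumptions for demonstration, not a description of any commercial voice processor.

```python
import numpy as np

def estimate_itd_ild(left, right, fs):
    """Estimate ITD and ILD for one frame of a two-channel signal.

    left, right: equal-length 1-D arrays; fs: sample rate in Hz.
    Returns (itd_seconds, ild_db). A positive ITD means the left
    channel is delayed relative to the right.
    """
    # ITD: lag of the peak of the full cross-correlation
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)  # lag in samples
    itd = lag / fs
    # ILD: ratio of RMS levels, expressed in dB
    rms_left = np.sqrt(np.mean(left ** 2))
    rms_right = np.sqrt(np.mean(right ** 2))
    ild = 20.0 * np.log10(rms_left / rms_right)
    return itd, ild
```

For a source that arrives eight samples later at the left microphone and 6 dB quieter at the right one, this recovers an ITD of 0.5 ms (at 16 kHz) and an ILD of about +6 dB.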
For human listeners, ITD and ILD provide cues in complementary frequency ranges, at least in the ideal scenario of a point sound source in a free field, such as an outdoor location or an anechoic chamber (i.e., an enclosed space insulated from noise and echoes).
ILDs are most pronounced at frequencies above approximately 1.5 kHz, because at these frequencies a person’s head is large compared to the wavelength of the incoming sound and therefore casts an acoustic shadow that attenuates the signal at the far ear.
ITDs, on the other hand, exist at all frequencies. However, they can only be decoded unambiguously if the ITD is less than half the period of the signal at that frequency. For the spacing between the ears on a human head, this yields a maximum frequency of about 1.5 kHz below which ITD cues can be used. Note that because the two ‘ears’ of a phone are closer together, this frequency range can actually be extended. This explains why it can be advantageous to have closely spaced microphones – we will come back to this later.
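The half-period limit is easy to put into numbers: the highest unambiguously decodable frequency for a sensor spacing d is f_max = c / (2d). The speed of sound and the two spacings below (11.5 cm for human-ear-scale separation, 2 cm for a close-mic pair) are illustrative assumptions.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly room temperature

def max_unambiguous_freq(spacing_m):
    """Highest frequency whose ITD stays below half a period: c / (2d)."""
    return SPEED_OF_SOUND / (2.0 * spacing_m)

print(max_unambiguous_freq(0.115))  # ear-scale spacing: ~1.5 kHz
print(max_unambiguous_freq(0.02))   # 2 cm close-mic pair: ~8.6 kHz
```

Halving the spacing doubles the usable ITD frequency range, which is the quantitative reason closely spaced microphones help.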
ILD, ITD, and other acoustic attributes such as pitch, harmonics, onset time, and the time spacing of sound components are used to characterize, separate, and eventually spatially locate multiple sound sources. Following this separation, sound sources are classified or grouped to distinguish between desired (talker) and undesired (distractor) sources. Finally, a ‘voice isolation’ stage identifies the sound of interest, eliminating noise and other audio to deliver clear voice.

Real-world implementation challenges
To arrive at a commercially viable product, developers have to implement these theoretical principles reliably in a real-world environment. For example, users can hold a phone in various positions: in a ‘close-talk’ configuration with the phone held to the ear, or in a ‘far-talk’ speakerphone scenario with the phone held vertically in either portrait or landscape orientation in one’s hand, resting horizontally on a table, or even sitting in a cup holder inside a car.
There is always a trade-off between the amount of noise suppression and the resulting voice quality. Therefore, it becomes challenging to have the phone produce high-quality output in all scenarios when applying noise suppression. Furthermore, several other factors affect the quality – and complexity – of an implementation, including room acoustics such as reverberation that causes multi-path arrivals at different amplitudes, poor microphone sealing that causes microphone crosstalk through the mechanical enclosure, and local pickup of far-end sound into the near-end microphone through the speaker in speakerphone mode (thus driving the need for embedded acoustic echo cancellation).
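The suppression-versus-quality trade-off shows up even in the simplest noise reduction scheme, single-channel spectral subtraction – a textbook baseline, not the multi-mic processing discussed in this article. Raising the over-subtraction factor alpha removes more noise but increasingly distorts the voice, which is why a spectral floor is kept; all parameter values here are illustrative assumptions.

```python
import numpy as np

def spectral_subtract(frame, noise_mag, alpha=2.0, floor=0.05):
    """One frame of magnitude spectral subtraction (textbook sketch).

    frame: time-domain samples; noise_mag: estimated noise magnitude
    spectrum (len(frame)//2 + 1 bins). A larger alpha suppresses more
    noise but distorts voice; the floor limits 'musical noise'.
    """
    spec = np.fft.rfft(frame)
    mag = np.abs(spec)
    # subtract the scaled noise estimate, but never go below the floor
    clean = np.maximum(mag - alpha * noise_mag, floor * mag)
    # resynthesize with the original phase
    return np.fft.irfft(clean * np.exp(1j * np.angle(spec)), n=len(frame))
```

Tuning alpha and floor per environment is exactly the kind of balancing act that makes one-size-fits-all suppression perform poorly across use cases.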
For some binaural processing algorithms that rely on specific cues, mics may need to be spaced close together (CM: close mic), while for others it is acceptable to have a larger distance between mics (SM: spread mic). The correlation of sound between these microphones is important as well. For example, a particular algorithm may rely on one mic having a direct sound path while another mic receives a signal that undergoes a ‘shadowing’ effect in the path between the sound source and microphone.
Practical industrial design of today’s smartphones places additional constraints on microphone placement. For instance, for a close-talk use case, it would be logical to position microphones at the bottom-front and bottom-back locations of the phone. This creates a CM pair as well as direct and shadow paths to the user’s mouth in the so-called ‘end-fire’ configuration known from phased array theory (Figure 2). However, bottom-back mic placement is restricted in many devices by the presence of other components (e.g., WiFi antenna, battery, etc.) or by other mechanical restrictions such as an OEM’s reluctance to drill a hole in the back casing. Instead of such a back-bottom mic placement, many phones use a secondary mic at the top edge of the unit (Figure 3).
Figure 2: Ideal microphone placement
Figure 3: Realistic microphone placement
When combined with the bottom mic, this creates an SM pair instead of a CM pair. It also requires any voice algorithms to be adapted. Furthermore, it becomes more challenging to obtain good separation between voice and distractor sounds, as the ‘end-fire’ phased array approach cannot be used. Thus, there is a trade-off between industrial design, cost, and performance. It is the ability to maintain good performance under such non-ideal conditions that distinguishes one voice processing implementation from another.
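For reference, the behavior an end-fire CM pair makes possible can be sketched as a first-order differential array – a textbook technique, not Audience’s proprietary processing. Subtracting a delayed copy of the rear mic places a null toward the rear, while sound from the talker’s direction passes with only a high-pass coloration; the spacing and sample rate in the example are assumptions.

```python
import numpy as np

def endfire_cardioid(front, back, spacing_m, fs, c=343.0):
    """First-order differential end-fire pair (textbook sketch).

    front/back: equal-length mic signals; spacing_m: mic spacing in m.
    Delaying the back mic by the inter-mic travel time and subtracting
    cancels sound arriving from the rear along the array axis.
    """
    delay = int(round(spacing_m / c * fs))  # travel time in samples
    delayed_back = np.concatenate([np.zeros(delay), back])[:len(front)]
    return front - delayed_back
```

For a source directly behind the pair, the back mic hears the sound one inter-mic delay earlier than the front mic, so the two terms cancel almost exactly; a source in front is preserved.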
To still take advantage of the benefits of having a CM pair, high-end smartphones can place a third microphone at either the top-front or top-back location of the device. This third mic then forms a CM pair with the top-edge microphone. The voice processor can now select the optimal 2-mic pairing, depending on factors such as:
- Specific customer use case: close-talk, far-talk
- Phone orientation: horizontal/vertical, portrait/landscape
- Applications: narrowband voice call, wideband VOIP, multi-mic multimedia recording, etc.
- Phone placement: concealment of performance loss in case one of the mics is occluded or blocked
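A toy version of this selection step might look as follows. The mic names, the use-case-to-pair mapping, and the fallback rule are all hypothetical – the actual policies inside shipping voice processors are proprietary.

```python
MICS = ("bottom", "top_edge", "top_back")

# hypothetical default pairing per use case
DEFAULT_PAIR = {
    "close_talk": ("bottom", "top_edge"),  # spread-mic (SM) pair
    "far_talk": ("top_edge", "top_back"),  # close-mic (CM) pair
}

def select_mic_pair(use_case, blocked=frozenset()):
    """Pick a 2-mic pair, falling back when a mic is occluded.

    Falling back to any two unblocked mics conceals the performance
    loss when a preferred mic is covered (e.g., phone on a table).
    """
    pair = DEFAULT_PAIR[use_case]
    if not any(m in blocked for m in pair):
        return pair
    available = [m for m in MICS if m not in blocked]
    if len(available) < 2:
        raise RuntimeError("fewer than two usable microphones")
    return (available[0], available[1])
```

In a real device this decision would also weigh orientation sensors and the active application, as listed above.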
Today’s next-generation voice processors enable enhanced performance by utilizing input from multiple pairings. For example, Audience’s voice processor not only selects an appropriate 2-mic pair depending on these conditions but can also apply simultaneous 3-mic processing to further improve performance.
In the PC space, new ‘convertible’ designs with screens that can be rotated or flipped introduce a new set of factors that must be addressed. Specifically, users are no longer in a fixed position with respect to the PC microphones, as they are with a traditional notebook. In fact, the microphone(s) may be occluded when the screen is flipped over or the user holds the ‘detachable’ PC like a tablet.
With these types of form factors, legacy PC-based voice processing technology like beamforming runs into performance limitations. In addition, noise suppression will need to manage PC-related noise such as keyboard tapping and fan noise, as well as cancel loud music from nearby PC speakers. As outlined above, similar problems have already been resolved for mobile phones, and so PC-based voice processing is a natural evolution for this technology into applications like VOIP calls, multimedia recording, and speech recognition.
Consumers have become familiar with the benefits of audio processing on quality from their PCs and home theater systems. The technology is in such demand that an entire industry has formed around audio output IP for surround sound processing and other kinds of sound enhancement. OEMs of mobile handsets and portable electronic devices already see the need for similar technology within their own target markets. This is evidenced by the more than 350 million discrete voice processor ICs that companies like Audience have shipped to date into mobile products.
After the evolution from keyboard and GUI to touchscreens, voice input represents the next paradigm shift in how we interact with electronic devices. Yet despite the ready availability of such technology, consumers do not yet expect the same voice quality from their handsets that they have become accustomed to with their PCs and home theater systems. However, the confluence of wideband network infrastructure deployment with the need for reliable speech recognition and improved audio quality in multimedia recording will drive consumer awareness in the near future, not only for mobile handsets but for all kinds of consumer equipment.
Bart DeCanne is VP Marketing at Audience, a voice processor company for mobile devices based in Mountain View, CA. Until joining Audience earlier this year, Bart drove the business strategy for touchscreen controllers, the ‘pre-voice’ user entry method for mobile devices, at Cypress and Maxim since 2009. Prior to that, he worked for Silicon Labs and Texas Instruments in the US, and Barco in his native Belgium. Bart holds an MSEE from the University of Ghent (Belgium) and an MBA from UT-Austin.