I'm a firm believer that embedded speech — both generation and recognition — will be ubiquitous in the not-so-distant future. Voice control removes the abstraction layer for direct system manipulation and provides a more natural user experience. I love playing with my Amazon Echo, for example (see Do you hear an Echo?), but it has to be said that the poor little scamp gets a tad confused when multiple people are talking at the same time.
It's certainly true that tremendous strides have been made with regard to noise reduction and cancellation technology, but present systems still leave a lot to be desired. One problem is that existing noise reduction techniques tend to work best with constant and repetitive sources, as exhibited by noise-cancelling headphones in an aircraft environment: these are great for reducing the sound of the plane's engines — not so good when it comes to transient sounds like the crying baby in the seat next to you.
As another example, when my wife (Gina the Gorgeous) is driving, she really likes her music loud. Unfortunately, our musical tastes are somewhat different, but it's a case of “Her car, her rules” (which forms a subset of “Rule #1: The wife is always right. Rule #2: If the wife is wrong, Rule #1 applies.”). On one long trip to Louisiana a year or two ago, I did try using my noise-cancelling headphones, and these did cause the “ground rumble” to almost completely disappear. Unfortunately, this noise-cancellation technology doesn't work on things like music, so the end result was to make the songs Gina was playing sound clearer and — paradoxically — louder (see also Really Annoying Music Suppressors).
Consider the following illustration depicting the effectiveness of traditional speech recognition technology:
Not surprisingly, the recognition rate is reasonably high when the vehicle is parked, the engine is off, and the windows are closed. Things start to deteriorate when the vehicle is moving and the windows are partially open; they get worse when the windows are fully open; and you can kiss your commands goodbye when you turn your music on.
The key for human-to-machine interactions for such things as virtual assistants and automotive voice control is whether the machine did what it was told to do, quickly and accurately enough to satisfy the user. The bottom line is that, even with the latest-and-greatest noise reduction algorithms, today’s acoustic microphones and supporting systems cannot achieve adequate voice isolation for this level of control, especially in noisy environments.
And so we come to a really interesting startup company called VocalZoom. Every now and then, I see something that makes me say “Wow! I wish I'd thought of that!” This is one such occasion. The folks at VocalZoom specialize in creating Human-to-Machine Communication (HMC) optical sensors that facilitate a more natural, personalized, accurate, and secure voice-control experience. The multifunction HMC sensor gathers additional data generated during speech as facial skin vibrates around the mouth, lips, cheeks, and throat.
By integrating the VocalZoom optical HMC sensor into a voice-control solution and focusing it on these areas, facial vibrations can be acquired, measured, and converted to an isolated, near-perfect reference signal with which the system can operate — regardless of ambient noise levels.
I recall that the military developed technology similar to this years ago; they can bounce a laser off a glass window from a huge distance, detect vibrations caused by people talking, and reconstruct the conversation. The problem with these systems is that they cost so much you need a national budget behind you to afford one. But technology marches on, and VocalZoom's sensors provide nanometer resolution while remaining affordable enough and small enough to be incorporated into smartphones, tablets, automobiles, and… well, just about anything you can think of.
While I was chatting with the folks from VocalZoom, they mentioned that it was possible to use the output from their HMC sensor to fully reconstruct a waveform of the human voice. I immediately asked why it was then necessary to keep using the standard acoustic microphone. They responded that their reconstructed waveform didn't have the same richness and texture as the human voice captured using an acoustic microphone. However, by using the data from the HMC sensor to filter the signal from the acoustic microphone, you get the best of all worlds.
If you go to the VocalZoom home page and scroll to the bottom, you can play sound samples of a conversation in a tremendously noisy environment with and without VocalZoom technology. All I can say is that I'm a believer. But don't take my word for it. Do you recall the original chart shown above depicting the effectiveness of traditional speech recognition technology using only an acoustic microphone augmented with traditional noise reduction technology? Well, now compare it to real-world in-car testing of VocalZoom technology using male and female subjects driving at 60 mph (100 km/h) with windows open and shut and the sound system on and off:
I'm not the only one who is convinced as to the value of VocalZoom's technology. iFLYTEK is a multi-billion-dollar Chinese information technology company that creates voice recognition software and products. A senior researcher at iFLYTEK, Haikun Wang, says that they've tested the VocalZoom sensor in scenarios including the inside of a moving automobile with the windows rolled down and high wind noise, and the resulting improvement was “substantial.” Wang also says: “We believe that VocalZoom technology gives us the foundation for breakthrough improvements.”
iFLYTEK is the developer of “iFlyTek YuDian” — the Chinese version of Siri — along with the Voice Cloud intelligent speech technology platform. With hundreds of millions of users, Voice Cloud is the leading intelligent speech platform in China for mobile cloud and embedded applications. Now, other companies are exploring the combination of VocalZoom sensor technology with iFLYTEK’s Voice Cloud.
I don’t know about you, but I think this could be the start of something big. Does anyone want to place a bet on how long it will take for VocalZoom's technology to start appearing in iOS and Android mobile platforms?