Over the past few years, significant advances in automatic speech recognition (ASR) have led to an abundance of devices and applications that use speech as their main interface. IEEE Spectrum magazine declared 2017 the year of voice recognition; ZDNet reported from CES 2017 that voice is the next computer interface; and many others share similar views. So, just where are we with regard to the advancement of voice interfaces? This post will survey the current state of voice interfaces and their enabling technologies.
How many of your devices get conversational with you?
Voice activation is all around us. Almost every smartphone has a voice interface, with flagships like the Apple iPhone 7 and Samsung Galaxy S7 including always-listening features. Most smartwatches offer voice activation, as do other wearables, especially hearables like Apple's AirPods and Samsung's Gear IconX. In most of those devices, there is no convenient way to integrate any other interface, making voice an ideal and necessary solution. New cameras, like the GoPro Hero 5, can be operated using voice commands, which is great for selfies. Voice-activated car infotainment systems have become a commodity, making it much safer to change the station while driving.
The Amazon Echo ignited the conversational assistant trend, which is now on fire, with Google Home trying to contend and a variety of similar devices showcased at CES 2017. The Echo's voice service, named Alexa, comes with several built-in skills. For example, you can say “Alexa, tell me a joke” (very wry delivery), “Alexa, did the Warriors win?” (of course they did), or “Alexa, who starred in the movie 2001: A Space Odyssey?” (no one else seems to know). There are also a bunch of amusing Easter eggs, like the response when you say “Alexa, initiate self-destruct sequence.” (see also this video demonstrating some of Alexa's Easter eggs).
In addition to the built-in functions, new capabilities can be added to Alexa by third parties using the Alexa Skills Kit (ASK). The ASK enables developers to teach Alexa new skills so that she (or it?) can control and interact with more products and services. As you can see in this video, for example, one person hacked his iRobot Roomba and added a skill to control the vacuum cleaner robot.
Other Alexa skills include useful things, like ordering food from a variety of eateries or hailing an Uber, and random amusements, like magic 8-ball questions, Seinfeld trivia, and new facts about fruit. Collaborations between Amazon and companies like Whirlpool and GE will also strengthen Alexa's aptitude in the smart home, by adding capabilities to control household appliances like washing machines, refrigerators, lamps, and more.
Currently, Amazon seems to be in the lead in this market, but others are making huge efforts (and investments) to catch up. Mark Zuckerberg recruited Morgan Freeman to be the voice of his artificial intelligence (AI) voice assistant. According to a note describing how he built it, Zuckerberg spent a year developing the application as a simple AI to help run his home “like Jarvis in Iron Man” (he named it Jarvis, too). Jarvis purportedly identifies who is talking by their voice, and also recognizes faces, so it can let authorized people in at the door while reporting to Zuckerberg.
Another interesting contender is a Japanese Amazon Echo-like device called Gatebox, which features a holographic character named Azuma Hikari.
Japan's answer to the Amazon Echo (Source: Gatebox)
On top of a simple speaker, the device utilizes a screen and a projector to bring the virtual assistant to life visually, as well as audibly. In addition to microphones, it also has cameras and motion and temperature sensors that allow it to interact with the user in a more holistic way.
How does that far-field voice pickup work?
How does a device listen to and understand your voice commands while playing music on the other side of the room? There are many components involved in enabling this feat, but a few of them are paramount. First is the automatic speech recognition (ASR) engine, which enables machines to convert the sounds we make into executable instructions. For the ASR engine to work properly, it needs to receive a clean voice sample. This requires noise reduction and echo cancellation to filter out interference. The following are some of the most important technologies enabling far-field voice pickup:
Deep Learning has a huge role in this. Machines have been able to understand natural language to some degree for quite a few years, but recent refinements have brought that ability close to human level. Using learning-based techniques like Deep Neural Networks (DNNs), both language processing and visual object recognition have equaled or surpassed human performance in many test cases. DNNs are trained offline on massive data sets. Once training is complete, the resulting network performs its function in real time.
Adaptive Beamforming is key for a robust voice-activated user interface. It enables features like noise reduction, speaker tracking in case the user is moving while talking, and speaker separation for when several users are talking simultaneously.
Beamforming using a hexagonal microphone array (Source: CEVA)
This method uses multiple microphones in fixed positions relative to one another. For example, the Amazon Echo uses seven microphones in a hexagonal layout, with one microphone at each vertex and one in the center. The differences in the time at which the signal arrives at the various microphones enable the device to identify where the voice is coming from and to cancel out sounds coming from other directions.
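To make the time-delay idea concrete, here is a minimal sketch in Python using just two microphones. It is not how the Echo or any other product actually does it; it is a textbook delay-and-sum illustration. The function names, the 16 kHz sample rate, and the 10 cm microphone spacing are all assumptions chosen for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for room-temperature air


def estimate_delay(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `ref`.

    Brute-force cross-correlation. A positive result means the sound
    reached the `other` microphone later than the `ref` microphone.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, sample in enumerate(ref):
            m = n + lag
            if 0 <= m < len(other):
                score += sample * other[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle(delay_samples, sample_rate, mic_spacing):
    """Convert a two-microphone delay to an angle in degrees.

    0 degrees means the sound arrives broadside (same time at both mics).
    Assumes a far-away source, so the wavefront is treated as flat.
    """
    ratio = SPEED_OF_SOUND * delay_samples / (sample_rate * mic_spacing)
    ratio = max(-1.0, min(1.0, ratio))  # clamp so asin stays in its domain
    return math.degrees(math.asin(ratio))


def delay_and_sum(channels, delays):
    """Shift each channel back by its delay, then average the channels.

    Sound from the chosen direction lines up and adds coherently;
    sound from other directions is misaligned and partly averages out.
    """
    length = len(channels[0])
    out = [0.0] * length
    for signal, delay in zip(channels, delays):
        for n in range(length):
            m = n + delay
            if 0 <= m < length:
                out[n] += signal[m]
    return [value / len(channels) for value in out]


# Hypothetical test signal: a short pulse that reaches mic B 3 samples late.
mic_a = [0.0] * 10 + [1.0, 2.0, 1.0] + [0.0] * 10
mic_b = [0.0] * 3 + mic_a[:-3]

delay = estimate_delay(mic_a, mic_b, max_lag=5)
angle = arrival_angle(delay, sample_rate=16000, mic_spacing=0.1)
steered = delay_and_sum([mic_a, mic_b], [0, delay])
print(delay, round(angle, 1))
```

With the assumed numbers, the 3-sample delay corresponds to a source roughly 40 degrees off to one side. A real array repeats this for every microphone pair and keeps adapting as the speaker moves, which is where the "adaptive" part comes in.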
Acoustic Echo Cancellation is necessary because many of the products performing automatic speech recognition also produce sounds themselves; for example, playing music or delivering information. Even while performing these actions, the devices must be able to hear so that the user can interrupt (barge-in) and stop the music or request a different action. To continue listening, the machine must be able to cancel out the sound that it generates itself. This is called acoustic echo cancellation (AEC).
Acoustic echo cancellation (Source: CEVA)
To perform AEC, the device must be aware of the sound it is making, either by analyzing the output data or by listening to the generated sounds with an additional dedicated microphone. Similar technology is also applied to remove echoes of the device's own output that bounce back from walls and other objects around it.
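For the curious, the first approach (analyzing the output data) can be sketched with a normalized least-mean-squares (NLMS) adaptive filter, a common textbook technique for echo cancellation. This is an illustration under stated assumptions, not a description of any particular product: the function names, the number of filter taps, the step size, and the simulated echo path are all made up for the example.

```python
import random


def simulate_echo(reference, echo_path):
    """Hypothetical helper: what the mic hears when only the speaker plays."""
    return [
        sum(gain * reference[n - k]
            for k, gain in enumerate(echo_path) if n - k >= 0)
        for n in range(len(reference))
    ]


def nlms_echo_canceller(reference, microphone, taps=8, step=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `reference` from `microphone`.

    reference:  samples the device sent to its own loudspeaker
    microphone: samples captured by the microphone
    Returns the residual signal after the estimated echo is removed.
    """
    weights = [0.0] * taps   # running estimate of the speaker-to-mic path
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for ref_sample, mic_sample in zip(reference, microphone):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate
        # Normalizing by recent signal power keeps the update stable
        # whether the music is loud or quiet.
        power = sum(x * x for x in history) + eps
        weights = [w + step * error * x / power
                   for w, x in zip(weights, history)]
        residual.append(error)
    return residual


# Simulated "music": random noise, with a made-up three-tap echo path.
rng = random.Random(0)
played = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
heard = simulate_echo(played, [0.0, 0.6, 0.3])
cleaned = nlms_echo_canceller(played, heard)

echo_energy = sum(x * x for x in heard[-200:])
left_over = sum(x * x for x in cleaned[-200:])
print(left_over < echo_energy)
```

In this simulation the microphone hears nothing but the echo, so once the filter has adapted, very little is left over. In a real room the left-over signal is where the user's voice would be, ready to be passed on to the ASR engine.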
A multi-microphone development platform for modeling DNNs, beamforming, and echo cancellation algorithms (Source: CEVA)
Another type of echo is generated by the user's commands themselves when they bounce back from objects or from the walls. Cancelling such unpredictable echoes requires yet another algorithm, called dereverberation. Once the sound has been filtered in this way, the machine can listen for commands from the user.
Today's voice interfaces are far from perfect
On the one hand, 2017 looks set to be a noteworthy year for voice interfaces, considering how widespread they have already become. On the other hand, even with all the impressive advances of the past few years, there is still a long way to go.
There remain many problems with current implementations of voice interfaces in mass-produced devices. In my next post, I plan to look at some of the flaws and missing features that afflict the voice interfaces of today. Be sure to tune in.
Eran Belaish is Product Marketing Manager of CEVA’s audio and voice product line, cooking up exquisite solutions ranging from voice triggering and mobile voice to wireless audio and high-definition home audio. While not occupied with the fascinating world of immersive sound, Eran likes to freedive into the mesmerizing silence of the underwater world.