Voice interfaces are a red-hot topic in 2017, as I surveyed in my previous post. Many have called 2017 the year of the voice interface, or some such moniker, but anyone who uses a voice interface has surely experienced some head-smackingly irritating moments. Although voice has the potential to become the ultimate human-machine interface, it's not quite there yet. In this column we will take a look at some of the problems and missing features that need to be addressed.
Trapped in a walled garden
The first problem that is already becoming apparent in existing voice interfaces is the walled garden phenomenon: each big player has its own closed ecosystem. Monetizing a voice interface is a tricky business. Unlike visual media such as a web browser or a text search engine, a voice interface offers no simple way to place ads. Of course, companies like Amazon need to make sure that their voice services lead to revenue, so it's no surprise that Alexa is great at helping you buy things on Amazon.
What happens when you want to buy something from a different vendor? Apparently, the various voice assistants are each creating their own closed ecosystems of products and services, thereby limiting users' choices. One possible solution is to make everything voice activated. Then, the machines could be programmed to communicate with each other via voice, like Alexa telling the TV out loud to record your favorite show, regardless of which voice service provider the TV uses. That would solve the problem of the walled garden, and also enable humans to understand what the machines are saying to each other. On the other hand, it might create a rather noisy home environment, with refrigerators, TVs, vacuum cleaners, lights, and other devices all talking out loud like the appliance version of Toy Story.
Currently, there are still quirks to fix before this can become a reality. A video of an endless loop between Alexa and Google Assistant shows one of the possible shortcomings.
Although this is an orchestrated demonstration, many more glitches pertaining to accidental triggers have occurred, like a TV news report that contained the words “Alexa, buy me a dollhouse.” You can imagine what happened.
Just how intelligent are these virtual assistants?
This leads to the next issue, which is intelligence. The robust quality of automatic speech recognition (ASR) in many devices today is possible thanks to revolutionary advances in deep learning and other fields of artificial intelligence. But just how smart are these virtual assistants and what can be expected of them?
In computer science, one of the most widely acknowledged touchstones for artificial intelligence (AI) is the Turing test, named after Alan Turing, who formulated it. To pass the test, an AI must be indistinguishable from a human being when interrogated. The movie Ex Machina is a superb illustration of this. Ava, the humanoid robot character, passes the Turing test with flying colors. The key to Ava's intelligence is unlimited access to information about users, along with all of humankind's interests, desires, and opinions. In this sense, the fictional company 'Blue Book' from the film is reminiscent of data-collecting behemoths like Google and Facebook.
Ava from Ex Machina. How far is Alexa from this? (Source: DNA Films/Film 4)
Science fiction aside, it's quite hard to say how close we are to that type of intelligence. On the one hand, machine learning is advancing in leaps and bounds, repeatedly achieving milestones earlier than predicted by experts, like AlphaGo defeating Lee Sedol. On the other hand, many of the common chatbots are remarkably unintelligent. They make mistakes that no reasonable human would make, like offering pornography to a child, blurting out racial slurs, and just being frustratingly oblivious. When confronted with these incidents, an Ava-like AI seems like a distant fantasy. By the way, if you're interested in a terrific analysis of gender issues in fictional AI depictions, check out this Wired article.
Tap to use a hands-free interface?
One of the most important and useful features of a voice interface is that it is hands-free. That's the beauty of it. You can use a voice interface while your hands are engaged in another activity, like driving (“play NPR”), cooking (“set timer for eight minutes”), typing (“give me a synonym for 'a lot'”), holding a baby (“dim the lights”), carrying groceries (“open the door”), and the list goes on. The idea is that you can use your voice instead of your hands. Therefore, it's quite bewildering that many voice-enabled devices require a manual trigger, like tapping or swiping, before they start listening to your voice.
The reason for this is not a mystery. Listening is an active state that requires processing. Thus, it uses battery power, which is a limited resource in portable devices. So, portable devices have some justification for using a manual trigger. But imagine having a friend or a co-worker whom you need to poke every time you want to say something, because they fall asleep between activities. That wouldn't fly, would it? The same goes for a voice-activated device. Tapping it really doesn't make sense. The way to get the best of both worlds (portability and hands-free invocation) is to make more efficient use of the available resources. The processing behind the scenes must be extremely efficient to serve the exact purpose that it's meant for: always-listening.
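One common way to think about efficient always-listening is a two-stage pipeline: a very cheap check runs on every audio frame, and the costly recognizer wakes up only when that check passes. The sketch below illustrates the idea under stated assumptions. It is not how any particular product works; the function names, the energy threshold, and the toy detector are all hypothetical.

```python
from typing import Callable, Iterable, Sequence, Tuple

Frame = Sequence[float]  # one short chunk of audio samples in [-1.0, 1.0]


def frame_energy(frame: Frame) -> float:
    """Mean squared amplitude of a frame: a cheap stand-in for 'is anyone talking?'."""
    if len(frame) == 0:
        return 0.0
    return sum(sample * sample for sample in frame) / len(frame)


def run_gated_listener(
    frames: Iterable[Frame],
    detector: Callable[[Frame], bool],
    energy_threshold: float = 0.01,  # hypothetical tuning value
) -> Tuple[int, int]:
    """Run the expensive detector only on frames that pass the cheap energy gate.

    Returns (detections, detector_calls) so the savings are visible.
    """
    detections = 0
    detector_calls = 0
    for frame in frames:
        if frame_energy(frame) < energy_threshold:
            continue  # quiet frame: stay in the low-power path
        detector_calls += 1
        if detector(frame):
            detections += 1
    return detections, detector_calls


def toy_detector(frame: Frame) -> bool:
    """Placeholder for a real wake-word model (assumption: loud peaks mean a trigger)."""
    return max(abs(sample) for sample in frame) > 0.5


if __name__ == "__main__":
    silence = [0.0] * 160
    murmur = [0.2, -0.2] * 80  # passes the gate, but is not a trigger
    speech = [0.8, -0.8] * 80  # passes the gate and triggers
    stream = [silence] * 8 + [murmur, speech]
    print(run_gated_listener(stream, toy_detector))
```

In this toy stream, eight of the ten frames never reach the detector. That is the point of the gate: most of the time the device hears nothing interesting, so the expensive stage stays idle.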
Several devices are already always-on, so it's just a matter of time until the ones with power-guzzling processors are redesigned to be low-power and always-on. Evidence of this is the recent announcement that the Amazon Echo Tap can now be used hands-free. Amazon is delivering this feature as an over-the-air software update. This also highlights how important it is to have a flexible, updateable solution in this rapidly changing market (they obviously did not have this update planned when they named the product).
Always-listening is an extremely handy (and hands-free) feature for voice interfaces (Source: CEVA)
On the downside, following this update, battery life in always-on standby mode has been reduced to only eight hours. In my next post, I'll review technology that could potentially increase the standby time from eight hours to three months! If the suspense is too much to take, you can check out this link for a sneak preview, and for additional interesting discussions about always-on technology, you can join the Always-on Technology group on LinkedIn.
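For a rough sense of scale, the jump from eight hours to three months can be turned into a required reduction in average standby power. The calculation below is only a back-of-the-envelope sketch: it assumes the same battery capacity, a 30-day month, and standby time inversely proportional to average power draw, none of which is stated above.

```python
HOURS_PER_DAY = 24
ASSUMED_DAYS_PER_MONTH = 30  # assumption: "three months" taken as 90 days


def required_power_reduction(current_hours: float, target_hours: float) -> float:
    """Factor by which average standby power must drop, for a fixed battery."""
    return target_hours / current_hours


current_standby_hours = 8
target_standby_hours = 3 * ASSUMED_DAYS_PER_MONTH * HOURS_PER_DAY  # 2160 hours

factor = required_power_reduction(current_standby_hours, target_standby_hours)
print(f"Standby power must fall by a factor of about {factor:.0f}")
```

Under these assumptions the factor comes out to 270, which suggests why a software tweak alone is unlikely to close the gap.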
Can machines interact with humans in a completely natural manner?
There are a lot of useful voice interfaces out there, but there are still many challenges ahead to make superb, seamless voice interfaces. Many science fiction tales have given us a glimpse of a world where machines are highly intelligent and can interact with humans in a completely natural manner. In my next post, I would like to look at some of the future technologies that will bring us closer to that vision.
Eran Belaish is Product Marketing Manager of CEVA’s audio and voice product line, cooking up exquisite solutions ranging from voice triggering and mobile voice to wireless audio and high-definition home audio. While not occupied with the fascinating world of immersive sound, Eran likes to freedive into the mesmerizing silence of the underwater world.