Voice interfaces of the future: Tech that's turning Sci-Fi into reality

Eran Belaish, CEVA

April 12, 2017


As virtual assistants become more intelligent, our expectations of them grow ever higher. Now that simple voice commands are practically taken for granted, deep learning is enabling more complex interactions, like contextual conversations and emotion detection. In my previous column, I reviewed the drawbacks and missing features of today's prevalent voice interfaces. But those kinks and quirks are on the verge of elimination. In this article, I'll look ahead to tomorrow's voice interfaces and the technologies that will enable them.


Depiction of an android "host" from the TV series Westworld (Source: HBO)

Always-listening machines that can communicate with each other
A voice-first user interface (UI) needs to be always listening. This is a challenge for small, portable devices with tiny batteries, where every microwatt is precious. One of the interesting developments on this front is the use of piezoelectricity to generate electrical energy from sound waves. One company specializing in this technology, Vesper, recently raised $15 million for the development of piezoelectric MEMS microphones. Furthermore, at CES 2017, Vesper and DSP Group demonstrated near-zero-power voice activation for battery-powered devices. Their solution uses the properties of piezoelectricity to keep the system in a low-power wake-on-sound mode when the environment is quiet. The platform uses the DBMD4 always-on voice and audio processor to achieve one-fifth the power consumption of existing approaches (per the company).
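To make the flow concrete, here is a minimal Python sketch of the wake-on-sound idea described above: the system stays in a near-zero-power sleep state while the environment is quiet and only powers up the voice-trigger stage when the microphone output crosses a level threshold. The threshold, frame handling, and detect_wake_word stub are illustrative assumptions, not details of the Vesper or DSP Group implementation.

    import numpy as np

    WAKE_THRESHOLD = 0.02   # illustrative normalized sound-level threshold

    def detect_wake_word(frame):
        """Placeholder for an on-device keyword-spotting model."""
        return False

    def wake_on_sound(frames):
        """Simulate the wake-on-sound flow: sleep at near-zero power until the
        piezoelectric mic output crosses a threshold, then hand frames to the
        always-on voice trigger."""
        state = "SLEEP"
        for frame in frames:                      # each frame: 1-D numpy array of samples
            level = np.sqrt(np.mean(frame ** 2))  # RMS level of the frame
            if state == "SLEEP" and level > WAKE_THRESHOLD:
                state = "LISTENING"               # power up the voice-trigger engine
            elif state == "LISTENING":
                if detect_wake_word(frame):
                    return "WAKE_WORD"
                if level < WAKE_THRESHOLD:
                    state = "SLEEP"               # environment is quiet again
        return "TIMEOUT"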

This technology could be the answer to a truly always-listening interface on even the smallest battery-powered devices, like Apple's AirPods, which currently require a tap to operate Siri. Another device that could benefit greatly from this technology is the Amazon Echo Tap. The Tap recently got a software upgrade to make it always-listening, but the upgrade reduced standby time from three weeks down to about eight hours. Ouch. Using the above approach, standby time could be increased to several months! With future improvements in piezoelectric technology, standby time for a device like the Tap could potentially reach years.
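The scaling behind those standby figures is simple: standby time is battery capacity divided by average standby power, so every reduction in standby draw translates directly into runtime. A quick back-of-envelope in Python, with purely illustrative numbers that are not Echo Tap specifications:

    def standby_hours(capacity_mwh, avg_power_mw):
        """Standby time scales inversely with average standby power draw."""
        return capacity_mwh / avg_power_mw

    capacity = 9000.0                        # mWh, hypothetical battery capacity
    print(standby_hours(capacity, 1000.0))   # always-listening processing fully on: ~9 h
    print(standby_hours(capacity, 5.0))      # wake-on-sound standby: ~1800 h, roughly 2.5 months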


The tiny VM1010 piezoelectric MEMS microphone can
wake-on-sound at near zero power (Source: Vesper)

For a holistic UI, machines must also be able to communicate with each other, as well as with humans. To avoid being confined within the closed ecosystem of each service provider, known as "the walled garden," there should be a unified protocol for communication between devices, similar to the deep linking of smartphone apps. One solution to this problem could be to have devices communicate over inaudible ultrasonic audio, like the technology offered by LISNR. This solution uses audio waves to transmit customizable packets of data, enabling proximity-based data transfer, second-screen functionality, authentication, and device-to-device connectivity on enabled devices.
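As a rough illustration of the data-over-sound idea, the Python sketch below encodes a byte payload as short near-ultrasonic frequency-shift-keyed tones that a standard speaker can emit and a nearby microphone can decode. The frequencies, bit rate, and framing are illustrative assumptions, not LISNR's actual protocol.

    import numpy as np

    SAMPLE_RATE = 44100        # standard audio hardware rate
    F0, F1 = 18500.0, 19500.0  # illustrative near-ultrasonic tones for bit 0 / bit 1
    BIT_SECONDS = 0.02         # 20 ms per bit, i.e. 50 bit/s (illustrative)

    def encode_ultrasonic(payload: bytes) -> np.ndarray:
        """Encode a byte payload as a frequency-shift-keyed ultrasonic waveform."""
        t = np.arange(int(SAMPLE_RATE * BIT_SECONDS)) / SAMPLE_RATE
        chunks = []
        for byte in payload:
            for i in range(8):                       # most significant bit first
                bit = (byte >> (7 - i)) & 1
                freq = F1 if bit else F0
                chunks.append(np.sin(2 * np.pi * freq * t))
        return np.concatenate(chunks)                # play through a speaker; nearby devices decode it

    waveform = encode_ultrasonic(b"pair:device42")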

Biometric identification for a personalized user experience
Another desirable trait for voice interfaces is user personalization. Each person has a unique voice with its own characteristics, known as a "voice print." The ability to identify each user by their voice print would be a huge step forward for voice interfaces. It would enable a personalized experience for each user based on which services they use frequently, what music they prefer, and more. For example, if you and other household members use the same voice assistant, each of you could ask "What's my daily schedule?" and receive only your own appointments. The voice print could also be used for biometric authentication, ensuring that things like credit card purchases are only made by the cardholder or other authorized users.
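One common way to build this is with speaker embeddings: each utterance is mapped to a fixed-length vector, and that vector is compared against the voice prints of enrolled household members. The sketch below is a minimal illustration of that matching step; the embed() function is a crude stand-in for a real speaker-embedding network, and the threshold is arbitrary. It is not how Alexa or Google Home implement speaker recognition.

    import numpy as np

    def embed(audio: np.ndarray, dim: int = 8) -> np.ndarray:
        """Stand-in for a real speaker-embedding model: summarizes the signal with
        a few crude per-segment statistics. A real system would use a neural
        network trained to separate speakers."""
        segments = np.array_split(audio, dim)
        return np.array([seg.std() for seg in segments])

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def identify_speaker(utterance, enrolled, threshold=0.8):
        """Compare the utterance's voice print against enrolled household members.

        enrolled: dict mapping user name -> enrolled embedding vector.
        Returns the best-matching user, or None if no print is close enough
        (threshold is illustrative)."""
        query = embed(utterance)
        best_user, best_score = None, -1.0
        for user, voiceprint in enrolled.items():
            score = cosine(query, voiceprint)
            if score > best_score:
                best_user, best_score = user, score
        return best_user if best_score >= threshold else None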


Emotion detection and biometric authentication are skills
that your virtual assistant will soon acquire (Source: CEVA)

There are rumors that Amazon's Alexa will soon have this skill; in the meantime, however, switching between different users can only be done vocally and can't be authenticated. The same goes for Google Home, although the Voice Assistant on Google's Pixel phones does have a "trusted voice" feature. This enables users to unlock their phone by saying "Ok, Google." This shows that the technology is here. The next step is to integrate it properly in devices with far-field voice pickup that serve multiple users. The main obstacle to achieving this is the distortion introduced when processing the voice input to clean it prior to speech recognition, as described in this article about why voice assistants can't tell who's talking.

In my recent column reviewing the current technologies behind voice interfaces, I described some of the algorithms used to clean voice commands from noise and echoes. This is performed before sending the data to the automatic speech recognition (ASR) engine, which is usually in the cloud. The cleaning process tends to eliminate the unique markers that make up a voice print. The result is that the voice data sent to the cloud is enough to understand what was said, but not to identify who said it. In cases like this, performing edge analytics, meaning processing the voice on the device and not in the cloud, could solve the problem. As with edge-based processing for video analytics, an efficient edge solution can improve privacy, security, speed, and cost, versus cloud-based processing.
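One possible way to reconcile the two requirements, sketched below under the same assumptions as the voice-print example above, is to fork the raw microphone signal on the device: run speaker identification on the unprocessed audio, where the voice print is still intact, and only then send the cleaned stream to the ASR engine. Here identify_speaker refers to the illustrative function sketched earlier, and clean_for_asr and cloud_asr are placeholders, not real APIs.

    def clean_for_asr(raw_audio):
        """Placeholder for the device-side front end: acoustic echo cancellation,
        beamforming, and noise suppression prior to speech recognition."""
        return raw_audio

    def cloud_asr(cleaned_audio):
        """Placeholder for the request to the cloud ASR engine."""
        return "<transcript>"

    def handle_utterance(raw_audio, enrolled_voiceprints):
        """Edge-first ordering: identify the speaker from the raw signal before
        the cleaning stage strips out the voice-print markers, then send only
        the cleaned audio off-device for recognition."""
        user = identify_speaker(raw_audio, enrolled_voiceprints)   # on-device, raw audio
        cleaned = clean_for_asr(raw_audio)                         # on-device front end
        text = cloud_asr(cleaned)                                  # what was said
        return user, text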

Putting things into context: Human-like memory
The next challenge for virtual assistants will be to harness the power of deep learning to establish human-like memory skills. This will enable the assistant to carry on a conversation in the same way we naturally interact with other humans. This includes the ability to reference things in context; for example, let's consider the following exchange:

Human: "Do you remember that imported beer I asked you to order last month for my wife's birthday party?"

Machine: "Yes, it was Negra Modelo, would you like me to order another six-pack of those?"

Human: "Let's have two of them."

Machine: "Sure, two six packs of Negra Modelo are on their way."

For two people, this would be a simple, trivial interaction. But for a machine to understand which beer is being referred to, it would have to remember the context in which the previous order was placed. This requires combining different fields of knowledge (order history, family members, calendar occasions) in an intelligent manner to properly understand the request. Also, observe that in the scenario above, the machine understands its assistance is required even though it was never explicitly summoned with a trigger word, as today's devices require.
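A hedged sketch of the bookkeeping this kind of contextual memory demands: the assistant keeps a structured record of past orders together with their context (what, when, for which occasion), so a vague reference like "that imported beer I ordered last month for my wife's birthday party" can be resolved by matching attributes rather than exact words. The data and field names below are illustrative, not from any real assistant.

    from datetime import date

    # Illustrative order memory; a real assistant would populate this from
    # order history, the calendar, and its knowledge of household members.
    order_memory = [
        {"item": "Negra Modelo six-pack", "category": "imported beer",
         "occasion": "wife's birthday party", "ordered_on": date(2017, 3, 18)},
        {"item": "paper towels", "category": "household",
         "occasion": None, "ordered_on": date(2017, 3, 25)},
    ]

    def resolve_reference(category=None, occasion=None, month=None):
        """Match a vague reference against remembered orders by attributes."""
        for order in order_memory:
            if category and category not in order["category"]:
                continue
            if occasion and order["occasion"] != occasion:
                continue
            if month and order["ordered_on"].month != month:
                continue
            return order
        return None

    match = resolve_reference(category="imported beer",
                              occasion="wife's birthday party", month=3)
    print(match["item"])   # -> Negra Modelo six-pack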

Using deep neural networks (DNNs), machines have been inching closer and closer to human performance in tasks that require complex thought, contextual memory, and decision making. From establishing driving policy for autonomous vehicles to navigating the tube in London, sophisticated DNNs are making it possible for machines to reach the level of intelligence necessary to achieve this.

Completing the picture: Emotion detection and computer vision
Once we've established a conversational relationship with our machines, we'll immediately notice that something is missing. Besides the actual words we say, there's the way in which we say them. When you talk to another person, you expect them to be able to read between the lines -- to sense your tone and your mood and understand what you mean, not necessarily what you say. This brings us to the realm of emotion detection, or emotional analytics. Companies like Beyond Verbal specialize in analyzing emotions from vocal intonations, enabling voice-powered devices and applications to interact with users on an emotional level.
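On the signal-processing side, a minimal illustration of what such a system might consume is a set of prosodic features, such as frame-level energy and a pitch contour, fed to a trained classifier. The feature set and the (commented-out) classifier below are illustrative assumptions, not Beyond Verbal's method.

    import numpy as np

    def prosodic_features(signal, sample_rate=16000, frame=400, hop=160):
        """Frame-level energy plus a crude autocorrelation pitch estimate,
        two of the intonation cues an emotion classifier might use."""
        energies, pitches = [], []
        for start in range(0, len(signal) - frame, hop):
            x = signal[start:start + frame] * np.hanning(frame)
            energies.append(float(np.sqrt(np.mean(x ** 2))))
            ac = np.correlate(x, x, mode="full")[frame - 1:]          # lags 0..frame-1
            lag_lo, lag_hi = sample_rate // 400, sample_rate // 60    # search 60-400 Hz
            lag = lag_lo + int(np.argmax(ac[lag_lo:lag_hi]))
            pitches.append(sample_rate / lag)
        return np.array([np.mean(energies), np.std(energies),
                         np.mean(pitches), np.std(pitches)])

    # features = prosodic_features(mic_samples)
    # emotion = emotion_classifier.predict([features])   # hypothetical trained model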

Similarly, video analytics are used to decipher facial expressions for emotion detection. Here, again, deep learning is utilized to study enormous databases of faces and learn to tell what emotion the subject is displaying. Once vision is also integrated into virtual assistants, they could better understand our intent (for example, whether or not the user is addressing the machine, even without the trigger word), and we'll also be able to show them things and communicate with gestures as well as voice. The combination of face recognition, emotion detection, human-like memory, and contextual awareness will launch an entirely new era of human-machine interaction.

Of course, vision-enabled virtual assistants will raise even more concerns about personal privacy. Some of these concerns might be alleviated by smarter edge devices and the use of a 'local fog' instead of sending data to the cloud for processing. By minimizing cloud support, users will also experience faster responses and longer battery life for handheld devices.

Reality is catching up with science fiction (but which version?)
Voice-enabled devices are constantly spurring ethical debates about privacy and personal boundaries. What will happen when they become even smarter and more ubiquitous? And when they gain new skills like vision and emotion sensing? Will they suddenly reach a tipping point in which they gain human-like consciousness and emotions like a character out of Westworld? And, if so, will it end in a burst of passionate violence? Or will they just become so smart that they'll get bored with us and part with us fondly as they usher in the singularity, like Samantha in Her? Any way it goes, there are bound to be interesting times ahead. While we're still in charge, let's make the most of our technology. If you want to hear us humans talk some more about the fascinating future of human-machine interaction and the underlying enabling technologies, please join our webinar on the subject.

Ok, my hyper-intelligent, emotion-sensing, always-listening little helper, play that tune that I like to hear when I'm in a contemplative mood.

Eran Belaish is Product Marketing Manager of CEVA's audio and voice product line, cooking up exquisite solutions ranging from voice triggering and mobile voice to wireless audio and high-definition home audio. While not occupied with the fascinating world of immersive sound, Eran likes to freedive into the mesmerizing silence of the underwater world.
