The invention of the telephone nearly 150 years ago triggered a revolution in communications. Today, the voice communications revolution is in the midst of a new quantum leap, as new classes of smart devices make it possible for artificial intelligence (AI) to extract meaning from sound and give people more intuitive ways to interact with their world. This article examines where we are today and previews technologies that will make ubiquitous voice assistants a natural part of our life.
“Mr. Watson, come here….”
The famous words uttered by Alexander Graham Bell in 1876 marked the first time that sound was electrically transmitted. This world-changing innovation remains at the center of dramatic changes in how we work, live, and play — and is an integral part of new breakthroughs in how we interact with the world around us.
In its first century, the wired telephone network connected people around the world. Then the electronics revolution of the last 50 years made voice and video conversation wireless and portable. In this decade, we have moved from hands-free telephone conversations between people to conversations with machines. While still rudimentary, this new type of human-machine interaction is driving the next leap in innovation.
Computers, smartphones, and smart speakers now feature built-in voice assistants that use cloud-based deep learning systems to let us ask questions and program actions. The same capability will soon be integrated into other devices we use every day. According to Statista, as many as 1.8 billion people will have access to a voice assistant by 2020, on the devices they carry as well as on platforms in their homes and even in business environments.
Yet, the success of voice assistant systems is still challenged by limitations in today’s technologies. Advances in AI, specialized processors, and more sensitive microphones will enhance the performance of voice assistants and accelerate market adoption.
Making conversations human
One challenge facing voice assistant systems is that human conversations are incredibly rich and interactive. Sometimes, a friend may respond to your statements before you even finish a sentence. In technical terms, response times when people talk to each other are measured in tens of milliseconds. While an occasional slow, thoughtful response is very natural when you talk with friends, imagine how awkward your daily interactions would be if the normal conversational gap included delays of up to several seconds or frequent needs to restate a question or command.
The slow pace of voice-assistant “conversation” is related to several aspects of the underlying technology. The algorithms that power voice recognition and response require a lot of processing power, so today’s smartphone and smart speaker systems record speech and then relay it to computing resources in the cloud. To minimize transmission delays, systems typically transmit low-quality audio files, which leads to high error rates. And the Internet itself is a variable-speed medium, so transmission times fluctuate. Together, aggressive audio compression and unpredictable network latency will always affect the quality of voice assistants that rely on the cloud to do the heavy lifting of voice recognition.
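A rough latency budget makes the point concrete. The figures below are illustrative round numbers, not measurements of any particular assistant; the stage names and values are assumptions for the sake of the comparison against the tens-of-milliseconds gap typical of human turn-taking.

```python
# Illustrative (not measured) latency budget for a cloud-based voice
# assistant versus a hypothetical on-device pipeline. All stage
# durations are assumed round numbers in milliseconds.

HUMAN_GAP_MS = 50  # typical human conversational response time (tens of ms)

def cloud_latency_ms(capture=100, uplink=150, asr=300, nlu=100, downlink=150):
    """Sum the stages of a hypothetical cloud round trip (ms)."""
    return capture + uplink + asr + nlu + downlink

def edge_latency_ms(capture=100, on_device=150):
    """Hypothetical on-device pipeline: no network hops (ms)."""
    return capture + on_device

if __name__ == "__main__":
    cloud = cloud_latency_ms()
    edge = edge_latency_ms()
    print(f"cloud round trip: {cloud} ms ({cloud / HUMAN_GAP_MS:.0f}x human gap)")
    print(f"edge pipeline:    {edge} ms ({edge / HUMAN_GAP_MS:.0f}x human gap)")
```

Even with these generous assumptions, the cloud round trip is an order of magnitude slower than a human reply, and the network legs are the portion that edge processing removes entirely.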
Even with these drawbacks, consumers clearly are excited about the technology. Sales of smart speakers, the first entirely new product category since the smartphone to offer voice assistants, are growing at a rate not seen since the first smartphones were introduced. Device sales in the U.S. jumped by 40% in 2018, and the 66.4 million new units brought the installed base of smart speakers to 133 million, reaching a little more than 26% of U.S. adults, according to voicebot.ai.
It also is inevitable that voice assistants will continue to get better at emulating conversation. Conversational delay will shrink, and improving algorithms will make the interaction feel more natural and human. A big part of these improvements will come from bringing processing closer to the user.
Bringing conversation to the edge
The technology behind today’s cloud-based voice assistants is advancing at a pace that will make these devices far more personal. Current voice assistants relay information to and from the cloud. Tomorrow, the AI that makes this possible will reside in the edge device, providing benefits in privacy, power consumption, and the responsiveness of the system. In short, edge computing promises to make voice assistants more effective by moving AI from the cloud to our home, to our workplace, and to other devices embedded in the world around us. In a step toward this future, Infineon recently demonstrated the world’s lowest-power edge keyword recognition solution.
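To show why always-on keyword recognition can run at very low power, here is a minimal sketch of a common two-stage design: a cheap energy gate keeps the expensive classifier asleep most of the time, and a scorer (stubbed here) runs only when the gate opens. This is a generic illustration under assumed thresholds, not a description of Infineon’s or Syntiant’s implementation.

```python
# Minimal sketch of an always-on, two-stage keyword spotter.
# Stage 1: a near-free energy gate decides whether audio is worth scoring.
# Stage 2: a classifier (a stub here; a small neural net in practice)
# scores only the frames that pass the gate. Thresholds are illustrative.

from typing import Callable, Sequence

def energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in frame) / len(frame)

def spot_keyword(
    frames: Sequence[Sequence[float]],
    score: Callable[[Sequence[float]], float],
    gate: float = 0.01,      # below this energy, the classifier stays asleep
    threshold: float = 0.5,  # scores at or above this count as a detection
) -> list:
    """Return indices of frames where the keyword is detected."""
    hits = []
    for i, frame in enumerate(frames):
        if energy(frame) < gate:
            continue  # silence: classifier never wakes, power stays low
        if score(frame) >= threshold:
            hits.append(i)
    return hits
```

In a real device, `score` would be a compact neural network running on a low-power accelerator; the point of the sketch is that the costly stage executes only for the small fraction of frames containing meaningful sound.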
One area of great promise for smarter voice assistants is in medical and personal health monitoring. For example, a high-sensitivity microphone can monitor breathing sounds while a person sleeps and predict the onset of sleeping disorders such as sleep apnea. Many people may be uncomfortable having this type of personal health information transmitted to the cloud for processing. Edge processing will make it possible to monitor and analyze this information by localizing audio capture, computation, and storage of the analyzed data. Users then will be able to manage how and when the data is shared. A voice assistant that assures higher levels of privacy will make people more comfortable with monitoring for heart and respiratory health, sleep states, and overall wellness.
The advances in AI that we see today are driven by deep learning research and new types of hardware used to build specialized deep learning systems. Infineon’s partner, Syntiant, a pioneer in this area, is building a new class of chips that bring deep learning to edge devices. Within just a few years, human-machine interaction aided by voice assistant technology will be an everyday occurrence for billions of people. And the technology developed for smarter voice assistants will have power use characteristics that allow for small, battery-powered intelligent audio recognition for many other applications. To forecast where else the technology has value, consider how the sounds you hear affect the way you interact with the world. Outside of the view of everyday users, voice assistant technology will become a part of the sensor suite in smart machines operating in the Internet of things (IoT) and as part of Industry 4.0.
Autonomous vehicles will also use audio input in combination with other sensors to detect and respond to the surrounding environment. The sounds of bicycles, trains, other traffic, and shouting children are all inputs to the AI network that will enable cars to “see” objects around corners. In a factory, the sounds of operating machines can feed smart control networks that diagnose potential problems before they happen. Smart city systems will “hear” unusual events such as glass breaking or a vehicle accident and alert the proper authorities. And future generations of robots will employ audio systems as part of the sensor network supporting intelligent operation and interaction. Indeed, the list of potential applications is endless.
— Pradyumna Mishra is entrepreneur-in-residence, Infineon Technologies