Automation is the way of the future. We live in the age of now, wanting everything answered, achieved, and received at rapid speed. Despite this fundamental shift, many people don’t embrace technology. For some, it is related to lifestyle: big companies can be too clunky to transform their system, and individuals can be stuck in their ways not wanting to learn how to navigate a touch screen. For most, however, it comes down to data — who owns it and how to keep it secure.
The solution? It’s as simple as voice. Voice enablement technology can meet the need for automation while keeping data close, and voice is something we use every day, no matter the place or platform. As digital transformation continues to reach more and more applications, voice agents are the answer. More companies are exploring custom voice platforms embedded in their own technology, separate from household-name voice agents like Alexa and Google Voice. Unique voice platforms will be the way forward for companies looking to keep and control their own data.
Behind the disruption is automation
As the Internet of Things (IoT) builds on Artificial Intelligence (AI), we are starting to see the need for automation grow. When IoT works together with AI, users gain more control over their large and varied collections of internet-connected devices. We are starting to see voice enablement expand in the home and beyond, through platforms like Google Voice, Amazon Alexa, Microsoft Cortana, or uniquely created platforms. At Harman Embedded Audio, we have worked with every single voice engine on the planet, and understand the breadth of the market first-hand. We see more companies looking to build their voice-enabled products on their own custom voice assistant platforms, so they have control of the data.
The demand for voice control is growing
It’s one of the hottest trends in audio. The next big thing in user interface, now that features like touch screens are near ubiquitous, is being able to speak to a device. Voice is leading the next generation of human collaboration. Think of natural language processing on a computer: voice is processed in a way that fits what the machine would rather hear, but if you played that same processed file, it would be mechanical and unnatural. The same goes for speaking on the phone: it does not give off the same impression of being in a room with someone. This is where voice needs to go, and where unique voice platforms mentioned above will follow.
What custom voice agents look like, and what’s involved in the build
While every voice solution is different, it is important that all solutions are flexible enough to adapt to the necessary requirements of their use case while still collecting and protecting user data. To achieve this, there are three main elements involved in the build and integration of any voice agent.
The first is far-field algorithms. Use top-tier algorithms that can capture far-field voice. At my company, we use four key Sonique software algorithms: noise suppression, acoustic noise cancellation, sound separation and beamforming, and voice activity detection. These algorithms are specifically developed to be used in combination with each other to support voice-enabled applications.
How do they work? Think about comparing a smart speaker with a human. The DSP/SoC acts as the ‘brain’ of the speaker, the microphones are the ears, and the speakers are the mouth. When someone calls our name, our brain tunes out the noises around us and puts all its energy towards that keyword. This is what we’ve accomplished in a smart speaker: when the keyword is detected, the microphone array applies noise suppression techniques and focuses on the source. In the process, it cancels most of the noise around it. In real acoustic environments, there are many noise sources, such as ambient noise, local speakers, and HVAC, as well as echo fed back from the device’s own speaker to its microphones. Each of these noise sources needs its own individual solution. The Sonique algorithms suppress the noise and capture the clearest possible voice command.
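To make the idea of "putting all its power towards the source" concrete, here is a minimal sketch of delay-and-sum beamforming, the simplest textbook way to steer a microphone array. It is an illustration only, not the Sonique implementation: the function name, the two-microphone setup, and the whole-sample delays are all assumptions chosen to keep the example short.

```python
def delay_and_sum(channels, delays):
    """Steer a microphone array by aligning and averaging its channels.

    channels: one list of samples per microphone.
    delays:   how many samples later the target sound reaches each microphone
              (hypothetical values; a real system estimates them).
    """
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(usable)
    ]


# Toy "voice": a short periodic waveform.
speech = [0, 1, 0, -1] * 4

# The second microphone hears the same sound two samples later.
mic_a = speech + [0, 0]
mic_b = [0, 0] + speech

# Steered at the talker: the two copies line up and reinforce each other.
steered = delay_and_sum([mic_a, mic_b], delays=[0, 2])

# Steered elsewhere: for this waveform the misaligned copies cancel out.
unsteered = delay_and_sum([mic_a, mic_b], delays=[0, 0])
```

With the correct delays, the output reproduces the original waveform, while the wrong delays largely cancel it. Sound arriving from other directions is attenuated in the same way, which is why the array appears to "listen" towards the person speaking.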
Building a keyword spotting (KWS) engine is also crucial. KWS detects keywords such as “Alexa” or “OK Google” to start a conversation. I’ve worked with almost every KWS engine provider, and each one is powered by deep neural networks: highly customizable, always listening, lightweight and embedded. For a great customer experience in a far-field voice application, the crucial metrics are the False Accept and False Reject rates. In real-world conditions, it’s really challenging to maintain a low False Reject rate, as there are many external noises such as TVs, household appliances, showers, etc., which cause imperfect cancellation of audio playback. At the same time, experienced developers tune the KWS engine to keep the False Accept rate low.
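The trade-off between the two rates is easier to see with numbers. The sketch below assumes a KWS engine that outputs a confidence score per audio clip and a tunable detection threshold; the function name, scores, and labels are invented for illustration and do not come from any particular engine.

```python
def kws_error_rates(scores, labels, threshold):
    """Return (false_accept_rate, false_reject_rate) for one threshold.

    scores: detector confidence for each test clip (made-up values below).
    labels: True if the wake word was actually spoken in that clip.
    """
    positives = [s for s, spoken in zip(scores, labels) if spoken]
    negatives = [s for s, spoken in zip(scores, labels) if not spoken]
    false_rejects = sum(1 for s in positives if s < threshold)
    false_accepts = sum(1 for s in negatives if s >= threshold)
    return false_accepts / len(negatives), false_rejects / len(positives)


# Four clips with the wake word, four without (illustrative data only).
scores = [0.9, 0.8, 0.55, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [True, True, True, True, False, False, False, False]

far_loose, frr_loose = kws_error_rates(scores, labels, threshold=0.5)
far_strict, frr_strict = kws_error_rates(scores, labels, threshold=0.7)
```

On this toy data, raising the threshold from 0.5 to 0.7 removes the false accept but doubles the false rejects, which is the balance developers tune. In practice, false accepts are often reported per hour of audio rather than per clip.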
Finally, the Automatic Speech Recognition (ASR) engine converts voice to text. ASR consists of the core speech-to-text (STT) tool and natural language understanding (NLU), which converts the raw text into data. The engine also requires a skill, in other words a knowledge base from which answers can be provided, as well as the inverse text-to-speech tool. We have developed an ASR engine called E-NOVA, for example, which offers multi-platform, on-premises integrations, supports multiple languages (currently seven and growing), and includes trainable models, third-party integration support, and talker identification.
ASR is the first step that enables voice technologies like Amazon Alexa, OK Google, Cortana or a custom agent to respond when asked, “What is the weather in Los Angeles?” It’s the key part that detects the spoken sounds, matches them with the sounds of a given language, and ultimately identifies the words we say. Because of the ASR engine, the conversation feels natural. Most modern ASR engines also take advantage of cloud computing. With additional technologies like NLU, the conversations between humans and computers are getting smarter and more complex.
Figure 1: Basic processing pipeline in voice agents. (Source: Harman Embedded Audio)
Building custom voice agents presents a host of unique challenges, however. Understanding the product’s environment is one of the key challenges of the process, and each application will vary based on the specific use case. For example, imagine cooking at home with your hands full. When it is time to boil some water, all that’s needed is a quick request to the voice agent connected to your plumbing: “Boil water to x degrees.” The challenge here is whether the device is able to hear what you said, and how much noise it can cancel to get a clean signal. To ensure this, voice algorithms need to be tuned to hostile environments, microphone locations need to be adjusted so that they can pick up the sound, and speakers with low total harmonic distortion (THD) should be used to help keep the signal-to-noise ratio (SNR) at the microphones high. This delivers the clearest possible audio to the ASR engine, which results in the right answer to your questions.
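For readers who want to put a number on "clean signal," SNR is usually expressed in decibels as ten times the base-10 logarithm of signal power over noise power. The short sketch below computes it from two sample lists; the function name and the sample values are illustrative assumptions, not measurements from a real product.

```python
import math


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels from two lists of samples."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(x * x for x in noise) / len(noise)
    return 10 * math.log10(signal_power / noise_power)


# Illustrative values: the voice is ten times the amplitude of the noise.
voice = [1.0, -1.0] * 4
kitchen_noise = [0.1, -0.1] * 4

kitchen_snr = snr_db(voice, kitchen_noise)  # about 20 dB
```

A tenfold difference in amplitude is a hundredfold difference in power, or about 20 dB. Every improvement in microphone placement or speaker distortion shows up as a higher figure here before the audio ever reaches the ASR engine.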
Moreover, imagine being on a cruise ship: the noises around you are completely different from what you hear in a living room or kitchen. The biggest challenge is training algorithms to suppress those noises and get a clean audio signal to the system for an accurate response. Properly implemented, a virtual personal cruise assistant system such as the one we developed for MSC Cruises can reliably complete the steps shown in Figure 2.
Figure 2: Steps involved in typical voice assistant request. (Source: Harman Embedded Audio)
Here, a voice assistant unit in the passenger’s room detects the ‘Hey Zoe’ wake word. Once KWS detects the wake word, the microphone array, guided by the noise suppression algorithms, focuses on the source and cancels the surrounding noise, such as AC noise, the TV, uncorrelated noises, propeller and engine noises, wind noise, and speaker echo, which acoustic echo cancellation (AEC) handles. The Sonique algorithms are tuned to cancel all of these noises and get the cleanest possible signal to the system. Then, when the system gets the request, the ASR engine converts the voice to text. The NLU engine then converts this raw text into data that can be used to find the answer. But we’re not done yet. The knowledge skill provides the answer to the request, and the text-to-speech tool in the ASR engine converts that answer to speech and outputs it through the speaker.
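The sequence of steps can be sketched in a few lines of code. This is a toy model under loud assumptions: "audio" is represented as a plain string, every stage is a hypothetical stand-in function, and the knowledge-base entry is made up. It shows only the order in which the stages hand off to each other, not how any real engine works.

```python
from dataclasses import dataclass
from typing import Callable, Optional

WAKE_WORD = "hey zoe"  # wake word from the cruise example


@dataclass
class VoiceAgent:
    detect_wake_word: Callable[[str], bool]  # KWS
    clean_audio: Callable[[str], str]        # noise suppression front end
    speech_to_text: Callable[[str], str]     # STT
    understand: Callable[[str], dict]        # NLU
    answer: Callable[[dict], str]            # knowledge skill
    text_to_speech: Callable[[str], str]     # TTS

    def handle(self, audio: str) -> Optional[str]:
        """Run one request through the stages; stay silent without a wake word."""
        if not self.detect_wake_word(audio):
            return None
        text = self.speech_to_text(self.clean_audio(audio))
        return self.text_to_speech(self.answer(self.understand(text)))


# Stand-in stages (all hypothetical; strings play the role of audio).
def detect_wake_word(audio: str) -> bool:
    return audio.lower().startswith(WAKE_WORD)


def clean_audio(audio: str) -> str:
    return audio.replace("[engine noise]", "").strip()


def speech_to_text(audio: str) -> str:
    return audio.lower()[len(WAKE_WORD):].strip(" ,")


def understand(text: str) -> dict:
    return {"intent": "dinner_time" if "dinner" in text else "unknown"}


def answer(request: dict) -> str:
    knowledge_base = {"dinner_time": "Dinner is served at 7 pm."}  # made-up entry
    return knowledge_base.get(request["intent"], "Sorry, I did not catch that.")


def text_to_speech(reply: str) -> str:
    return f"<spoken>{reply}</spoken>"


agent = VoiceAgent(detect_wake_word, clean_audio, speech_to_text,
                   understand, answer, text_to_speech)

spoken = agent.handle("Hey Zoe, when is dinner? [engine noise]")
ignored = agent.handle("When is dinner?")  # no wake word, so no response
```

Keeping each stage behind its own function is a deliberate choice in this sketch: it mirrors how a custom platform can swap one engine (for example, the STT tool) without touching the rest of the chain.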
Another challenge surrounds the False Reject Rate (FRR). Achieving a good wake word FRR, which is one of the checkpoints used to measure smart speaker performance, is both time-consuming and costly. The FRR test verifies whether the product wakes up properly whenever the wake word is spoken. To achieve a low FRR, trained key words are essential. In our experience, combining the trained model with a top-tier algorithm allows development teams to overcome the challenge and achieve the best possible FRR. The wake word response is further tested under various conditions in a laboratory to ensure the system passes industry standards.
The advantages of employing unique voice agents
Voice agents offer great value to user experience. Music is the biggest, simplest use case, but the value of voice agents extends far beyond opening your Spotify account remotely. Voice can turn things on, interact with appliances, boil water, turn on a faucet — and more! Voice is powerful, and the agents know a lot about their users, which is why companies are looking to get ahold of their own data – own it, store it, and secure it.
Voice solutions have broad applications, but the key is to leverage a technology that works across platforms — one that is relevant to smart speakers, laptops and smartphones, on Apple, Windows or Android — and leverage the collected data to build an agent that understands, constantly learns and remembers user needs. Creating a unique voice agent enables this flexibility of use — and keeps data internal at the same time.
Nik Rathod, senior global product line manager at Harman Embedded Audio, is a hardware and software innovator committed to helping people realize the full potential of technology. He has over nine years of experience envisioning, developing and managing mass-market technology solutions with a passion for defining viable solutions and exceptional user experiences.