How audio edge processors enable voice integration in IoT devices

Dedicated audio edge processors, designed for audio fidelity and built with machine-learning-optimized cores, are the key to giving IoT devices voice user interfaces without the need for a high-bandwidth internet connection.



Voice processing capabilities are quickly emerging in consumer products such as the iOttie Aivo Connect. (Source: Knowles)

From home automation and eCommerce to healthcare and automotive, more industries are now combining IoT capabilities with voice integration to meet shifting demands and unlock business advantages. Yet voice is still in the early phases of adoption, just beginning to expand beyond mobile devices and speakers, and it is poised to become the standard method of interaction between users and their IoT devices. This shift to voice first is underpinned by more than just the idea that it increases consumer comfort levels with technology. Global mobility of voice-enabled devices for on-the-go voice search, progress in natural language processing (NLP), and advancements in artificial intelligence and machine learning will enable new applications to evolve quickly.

Enjoyable, engaging voice interaction depends on consistent sound quality in the presence of noise and other distractors. A device's ability to intelligently manage sound is what makes or breaks its ability to communicate. Always-on voice user interfaces (VUIs) are expected to become commonplace in more consumer products, including audio and video devices and white goods, as well as in a broad range of battery-powered devices such as remote controls, wearables, Bluetooth speakers, and security and outdoor activity cameras. While there are design challenges to overcome, there is a big opportunity for component suppliers and OEMs alike to deliver products that satisfy these application needs.

To take full advantage of voice integration opportunities as they mature, more processing technologies are moving to the edge, away from the cloud. The results are improved user interfaces with lower latency and reduced cost, both in dollars and bandwidth. Manufacturers designing IoT-enabled CE solutions for tomorrow must consider voice integration to be a product feature prerequisite. OEMs that can deploy dedicated voice processing at the edge will be able to scale these applications and expand their portfolios.

This article discusses the most common challenges in implementing VUIs for always-on/always-listening IoT devices. It reviews the associated requirements and the design capabilities needed to address them effectively, including integration with control interfaces, software stacks, algorithm development, and user-space application development.

Integrating Audio Edge Processors into IoT Devices

Dedicated audio edge processors, designed for audio fidelity and built with machine-learning-optimized cores, are the key to supporting high-quality audio communication devices. These processors can deliver enough compute power to process audio using traditional and ML algorithms while using a small percentage of the energy of a generic processor. And since the processing happens on the device, it is significantly faster than sending that information to the cloud and back.

IoT devices integrate audio processors to add rich capabilities like voice wake. While the cloud may offer some great benefits, edge processing allows users to harness the full capability of their device at any time without the need for a high bandwidth internet connection. For example, edge audio processors enable a superior user experience in virtual communication through low latency processing of audio with contextual data while also keeping the contextual data local and secure.

Challenges with Integrating Voice

The application opportunities for voice calling, control, and interaction continue to rise. However, with more devices comes more fragmentation, making voice integration more difficult. How you integrate voice control into each application—be it Bluetooth speakers, home appliances, headphones, wearables, or elevators—will differ. Adding a voice wake trigger could be simple, but designing an enterprise-grade Bluetooth speaker and headset is a lot more complex. If that speaker includes true wireless stereo (TWS) integration, the complexity rises once again.

Additionally, various applications require voice integration with different ecosystems. For instance, you need to work in a Linux ecosystem to implement voice on most smart TVs, but getting voice on a home appliance will require working in a microcontroller (MCU) ecosystem. For all of these integrations there is a common, recommended way to do it, but there are always variations, which adds to the complexity.

High-quality, mass-market development solutions are critical to overcoming these challenges and bringing new technology to market quickly to support the fast-evolving way that we are working, living, and communicating. In meeting these challenges, suitable solutions need to address multiple design requirements.

Addressing Key Design Requirements

Power Consumption

For a VUI device to receive commands, it must be always on and always listening. Even for devices that are plugged in, and especially for those that are battery-operated, the restriction on power consumption can be a major design challenge.

In a voice-command system, at least one microphone must always be active, and the processor tasked with recognizing the wake word must also be active. Audio edge processors designed with proprietary architectures, hardware accelerators, and special instruction sets can run audio and ML algorithms optimally, and these optimizations help reduce power consumption.
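One common power-saving pattern for always-listening systems is to duty-cycle the detection stages: a cheap voice-activity check gates the more expensive wake-word engine so the processor can stay in a low-power state most of the time. The sketch below is a hypothetical illustration of that staging; the energy threshold and the stand-in "classifier" are illustrative, not a real wake-word model.

```python
# Hypothetical sketch of a duty-cycled always-listening front end:
# a low-cost voice-activity stage gates the (more expensive)
# wake-word stage so the processor sleeps most of the time.

def voice_activity(frame, threshold=0.01):
    """Cheap energy-based VAD: mean absolute amplitude vs. a threshold."""
    return sum(abs(s) for s in frame) / len(frame) > threshold

def wake_word_detected(frame):
    """Placeholder for an ML wake-word engine (runs only when VAD fires)."""
    return max(frame) > 0.5  # stand-in for a real classifier

def process_stream(frames):
    """Return indices of frames that trigger the full voice pipeline."""
    triggers = []
    for i, frame in enumerate(frames):
        # Short-circuit: the wake-word stage never runs on silent frames.
        if voice_activity(frame) and wake_word_detected(frame):
            triggers.append(i)
    return triggers

quiet = [0.0] * 160    # silence: VAD rejects, wake engine stays asleep
speech = [0.6] * 160   # loud frame: both stages fire
print(process_stream([quiet, speech, quiet]))  # → [1]
```

In real silicon the two stages typically map to different power domains, with the VAD stage running on an ultra-low-power core or hardware block.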


Latency

Voice-activated devices have little tolerance for latency. When perceived delay exceeds roughly 200 milliseconds, humans begin talking over each other on voice calls or repeating their commands to the voice assistant. To develop voice-integrated devices that gain the necessary consumer acceptance, engineers and product designers must optimize the audio chain throughout the system to comply with industry specifications and deliver the best user experience. Low-latency processing in edge processors is therefore a critical requirement for high-quality voice communication.
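A simple way to reason about the audio chain is a latency budget: each buffering and processing stage contributes delay, and the sum must stay under the roughly 200 ms threshold. The figures below are assumed, illustrative numbers, not measurements from any particular processor.

```python
# Rough audio-chain latency budget (hypothetical numbers) against the
# ~200 ms threshold at which conversations start to feel delayed.

SAMPLE_RATE = 16_000  # Hz, typical for voice capture

def buffer_latency_ms(frames):
    """Delay contributed by buffering `frames` samples at SAMPLE_RATE."""
    return 1000.0 * frames / SAMPLE_RATE

budget = {
    "capture buffer (256 frames)":  buffer_latency_ms(256),  # 16 ms
    "beamformer + noise reduction": 20.0,                    # assumed DSP cost
    "wake-word inference":          30.0,                    # assumed model cost
    "playback buffer (256 frames)": buffer_latency_ms(256),  # 16 ms
}

total = sum(budget.values())
print(f"total: {total:.0f} ms, within 200 ms budget: {total < 200}")
```

Round-tripping audio to the cloud would add network latency on top of this budget, which is one reason edge processing is attractive for conversational use cases.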


Integration Considerations

Because there are many hardware and software options for different VUI implementations, requirements can become a challenge at various points in the integration stage. Key design considerations along the way include those discussed below.

Hardware Integration

There are various hardware architectures for implementing a VUI system depending on the device usage, application, and ecosystem. Each VUI device will include microphones, either a single microphone or a microphone array, connected to an audio processor for capturing and processing audio. In this recent Embedded article from Knowles, my colleague reviews the hardware architecture considerations for implementing a VUI system and the benefits and drawbacks to each.

Host Software Integration

As mentioned above, there are various operating systems and drivers to choose from. Ideally, the audio processor comes with firmware and a set of drivers preconfigured to connect with the host processor, on which the operating system, such as Android or Linux, usually runs.

Driver software components running in kernel space interact with the firmware over a control interface, and audio data from the audio edge processor can be read in user space via the standard Advanced Linux Sound Architecture (ALSA) interface.

Integrating the software with the rest of the host system—connecting the audio processor driver provided in the software release package into the kernel image—can be a complex job. It involves copying the driver source code into the kernel source tree, updating kernel configuration files, and adding device tree entries according to the relevant hardware configuration.
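To make the device tree step concrete, a fragment of this kind might describe the audio edge processor to the kernel. This is a hypothetical example: the node name, compatible string, I2C address, and GPIO lines are all illustrative, not taken from any real driver release.

```dts
/* Hypothetical device-tree entry for an audio edge processor on I2C;
 * all names and addresses below are illustrative. */
&i2c1 {
    audio_edge: codec@3e {
        compatible = "vendor,audio-edge-dsp";   /* matched by the driver */
        reg = <0x3e>;                           /* I2C slave address */
        reset-gpios = <&gpio2 5 GPIO_ACTIVE_LOW>;
        interrupts-extended = <&gpio2 6 IRQ_TYPE_EDGE_RISING>;
    };
};
```

Getting these entries to match the board wiring is exactly the kind of detail that pre-integrated reference designs take off the integrator's plate.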

A solution to this would be to use pre-integrated standard reference designs with exact or similar configurations.

In an ideal situation, the audio edge processor would provide streamlined software stacks for integration and come with pre-integrated and verified algorithms as a system level solution to further simplify the process. 

Algorithm Integration

On the topic of algorithm integration: there are typically multiple algorithms cascading to switch between different use cases at any given time. Even voice wake alone needs multi-mic beamformers, an edge voice-wake engine, and cloud-based verification—at least three algorithms working together to optimize performance. For any device integrating with Alexa or Google Home keywords, multiple algorithms, often from different vendors, must be optimized together in one device.
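The three-stage cascade described above can be sketched as a pipeline in which each stage only runs if the previous one fires. All three stages here are simplified stand-ins for the vendor algorithms the text describes—a real beamformer, wake engine, and cloud verifier would be far more sophisticated.

```python
# Sketch of a cascaded voice-wake pipeline: multi-mic beamformer ->
# on-device wake engine -> cloud-based verification. Each stage is a
# simplified stand-in for a real vendor algorithm.

def beamform(mic_frames):
    """Delay-and-sum stand-in: average the mic channels into one frame."""
    n = len(mic_frames)
    return [sum(ch[i] for ch in mic_frames) / n
            for i in range(len(mic_frames[0]))]

def edge_wake_engine(frame, threshold=0.4):
    """On-device detector; a real engine would run a small ML model."""
    return max(frame) > threshold

def cloud_verify(frame):
    """Placeholder for second-stage verification in the cloud."""
    return sum(frame) / len(frame) > 0.2

def wake_pipeline(mic_frames):
    frame = beamform(mic_frames)
    # Cloud verification only runs when the edge engine fires first.
    return edge_wake_engine(frame) and cloud_verify(frame)

two_mics = [[0.5, 0.6, 0.5], [0.5, 0.6, 0.5]]
print(wake_pipeline(two_mics))  # → True
```

The cascade structure is what makes joint tuning necessary: the thresholds of each stage interact, so algorithms from different vendors have to be optimized together, not in isolation.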

One solution is to choose an audio edge processor that comes pre-integrated with verified algorithms, developed and tested independent of the host system.

Form Factor Integration

Devices today take many form factors, each with its own configuration of multiple microphones. The distance and placement of microphones and speakers play a big role in performance, so tuning and optimization have to change based on the final form factor and target use cases. There are also manufacturing variations that impact performance, such as microphone sealing, acoustic treatments on the device, vibration dampening, and more.


Privacy

Many audio processors detect the wake word and immediately send the audio to the cloud, where it is interpreted and acted upon. The problem is that once the audio data is in the cloud, the user has no control over it, exposing them to significant privacy risk. The solution is to choose an edge AI processor that performs the command interpretation and response logic on the device, locally, "at the edge."

This ensures sensitive personal audio data stays local, never being sent to a cloud where it could be used against our wishes. The VUI implementation is not only much more private, it can also respond faster, making user interactions more natural. This is a great example of how edge AI processors can advance existing use cases to maximize the helpfulness of the devices we use and trust every day.
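Local command interpretation can be as simple as mapping a recognized phrase directly to a device action, with no network round trip. The sketch below is a minimal, hypothetical dispatcher; the command phrases and actions are illustrative.

```python
# Minimal sketch of on-device command interpretation: the recognized
# phrase maps directly to a local action, so audio and intent data
# never leave the device. Commands and actions are illustrative.

LOCAL_INTENTS = {
    "lights on":  lambda: "gpio: lights -> ON",
    "lights off": lambda: "gpio: lights -> OFF",
    "volume up":  lambda: "dsp: gain +3 dB",
}

def handle_command(transcript):
    """Dispatch a locally recognized phrase without any cloud round trip."""
    action = LOCAL_INTENTS.get(transcript.strip().lower())
    return action() if action else "unrecognized: ask user to repeat"

print(handle_command("Lights On"))  # → gpio: lights -> ON
print(handle_command("play jazz"))  # → unrecognized: ask user to repeat
```

A production system would replace the dictionary lookup with an on-device intent model, but the privacy property is the same: interpretation happens before any data could leave the device.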

The Hardware and Software Interface

The design requirements for VUI implementations can be complex and can make it challenging to bring devices with voice integration to market quickly. OEMs and system integrators can drastically reduce risk by working with standard solution development kits such as the Knowles AISonic Bluetooth Standard Solution Kit. Such kits offer preconfigured starting points for prototypes that allow designers to develop their own innovations on top without having to worry about the design challenges discussed above. Designers should look for development kits that have pre-integrated and verified algorithms, pre-configured microphones, and drivers that are compatible with the host processor and operating system.

Audio edge processors that open their architectures and development environments accelerate innovation by providing audio application developers the tools and support to create new devices and applications. Future audio devices will be a collaborative effort.

Scott Choi has over 20 years of experience in software and applications engineering, product management, and marketing for the technology industry. In his current role as Sr. Director, Engineering at Knowles, Scott works in the AISonic business unit to build and develop software for audio processing and audio intelligence. Scott's experience includes audio system architecture, consumer electronics, OEM partnership, and mobile audio, which he leverages to lead strategic initiatives for the AISonic group. Scott holds a BS in Computer Science from the University of Utah.
