Smart speakers and voice-controlled devices are becoming increasingly popular, with voice assistants such as Amazon’s Alexa and Google Assistant getting steadily better at understanding our requests.
One of the main appeals of this kind of interface is that it ‘just works’: there’s no user interface to learn, and we can increasingly talk to a gadget in natural language, as if it were a person, and get a useful response. But achieving this capability takes a huge amount of sophisticated processing.
In this article, we’ll look at the architecture of voice-controlled solutions and discuss what’s happening under the hood, including the hardware and software required.
Signal flow and architecture
While there are many kinds of voice-controlled devices, the basic principles and signal flow are similar. Let’s consider a smart speaker, such as Amazon’s Echo, and look at the major signal processing subsystems and modules involved.
Figure 1 shows the overall signal chain in a smart speaker.
Figure 1: Signal chain for voice assistant, based on CEVA’s ClearVox and WhisPro. (Source: CEVA)
Starting at the left of the diagram, you can see that once a voice is detected using voice activity detection (VAD), it is digitized and passed through multiple signal processing stages. These improve the clarity of the voice arriving from the direction of the desired main speaker. The digitized, processed voice data is then passed to back-end speech processing, which may take place partly at the edge (on the device) and partly in the cloud. Finally, a response, if needed, is created and played through the loudspeaker, which requires decoding and digital-to-analog conversion.
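To make the first stage of that chain concrete, here is a minimal sketch of an energy-based VAD. It is only an illustration: the frame length, the -40 dB threshold and the function names are assumptions chosen for the example, and production VADs (including whatever is used in the products named here) are typically far more robust to noise.

```python
import math

def frame_energy_db(frame):
    """Mean-square energy of one frame in dB relative to full scale (1.0)."""
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_square + 1e-12)  # small offset avoids log(0)

def detect_voice(samples, frame_len=160, threshold_db=-40.0):
    """Return one flag per full frame: True when the frame energy exceeds the threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_energy_db(frame) > threshold_db)
    return flags

# Synthetic input: 10 ms of near-silence, then 10 ms of a loud tone, at 16 kHz.
SAMPLE_RATE = 16000
quiet = [0.0005 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(160)]
loud = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(160)]
vad_flags = detect_voice(quiet + loud)
print(vad_flags)
```

Only the frames flagged as active would need to be handed to the later, more expensive stages, which is one reason a VAD sits at the very front of the chain.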
For other applications, there may be some differences, and varying priorities – for example, an in-vehicle voice interface would need to be optimized to handle typical background noise in cars. There is also an overall trend towards lower power and reduced costs, driven by demand for smaller devices such as in-ear ‘hearables’ and low-cost home appliances.
Front-end signal processing
Once a voice has been detected and digitized, multiple signal processing tasks are required. As well as external noise, we also need to consider sounds generated by the listening device itself, for example music played by a smart speaker, or the voice of the person on the other end of a call. To suppress these sounds, the device uses acoustic echo cancellation (AEC), so the user can barge in and interrupt a smart speaker, even when it’s already playing music or talking. Once these echoes are removed, noise suppression algorithms are used to clean up external noise.
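One common textbook way to build an echo canceller is a normalized least-mean-squares (NLMS) adaptive filter, which learns how the loudspeaker signal appears at the microphone and subtracts that estimate. The sketch below assumes this technique purely for illustration; the article does not say which algorithm any particular product uses, and the four-tap echo path, step size and function name are hypothetical. It also omits the double-talk handling a real device needs when the user speaks over the playback.

```python
import random

def nlms_echo_cancel(mic, reference, taps=8, step=0.5, eps=1e-6):
    """Subtract an adaptively estimated echo of `reference` from `mic`."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for mic_sample, ref_sample in zip(mic, reference):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate  # what is left after removing the echo
        power = sum(x * x for x in history) + eps
        weights = [w + step * error * x / power for w, x in zip(weights, history)]
        residual.append(error)
    return residual, weights

random.seed(0)
reference = [random.uniform(-1, 1) for _ in range(4000)]  # stand-in for playback audio
echo_path = [0.0, 0.6, 0.0, -0.3]  # hypothetical loudspeaker-to-microphone response
mic = [
    sum(h * reference[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
    for n in range(len(reference))
]
residual, weights = nlms_echo_cancel(mic, reference)
print([round(w, 3) for w in weights[:4]])
```

After adaptation, the learned weights approximate the echo path, and the residual contains very little of the playback signal. In a real device the residual would instead carry the user’s voice plus external noise, ready for the noise suppression stage.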
While there are many different applications, we can generalize them into two groups for voice-controlled devices: near-field and far-field voice pickup. Near-field devices, such as headsets, earbuds, hearables, and wearables, are held or worn near the user’s mouth, while far-field devices such as smart speakers and TVs are designed to listen to a user’s voice from across a room.
Near-field devices typically use one or two microphones, but far-field devices often use somewhere between three and eight. The reason is that a far-field device faces more challenges than a near-field one: as the user moves further away, the voice reaching the microphones becomes progressively quieter, while background noise stays at the same level. At the same time, the device has to separate the direct voice signal from reflections off walls and other surfaces, known as reverberation.
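The effect of distance can be estimated with a little arithmetic. Under the idealized free-field assumption, sound pressure falls by about 6 dB for each doubling of distance, so with constant background noise the signal-to-noise ratio (SNR) falls by the same amount. The sketch below applies that rule; the 30 dB starting figure and the function name are made up for the example, and real rooms behave differently because of the reverberation just described.

```python
import math

def snr_at_distance(snr_at_ref_db, ref_distance_m, distance_m):
    """SNR at distance_m, assuming the inverse-distance law and constant noise."""
    return snr_at_ref_db - 20.0 * math.log10(distance_m / ref_distance_m)

# Hypothetical starting point: 30 dB SNR with the talker 0.5 m from the microphone.
snr_1m = snr_at_distance(30.0, 0.5, 1.0)
snr_4m = snr_at_distance(30.0, 0.5, 4.0)
print(round(snr_1m, 1), round(snr_4m, 1))
```

Moving from half a metre to four metres is three doublings, so this idealized model predicts a loss of roughly 18 dB, which is the gap that extra microphones and array processing are meant to help close.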
To handle these issues, far-field devices employ a technique called beamforming. This uses multiple microphones, and calculates the direction of the sound source from the differences in the time at which the sound arrives at each microphone. This enables the device to suppress reflections and other sounds and focus on the user, as well as to track their movement and zoom in on the correct voice when multiple people are talking.
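As a rough sketch of the idea, the example below estimates the time difference between two microphones by cross-correlation, converts it to an arrival angle, and then applies delay-and-sum beamforming, the simplest member of the beamforming family. The two-microphone geometry, the 10 cm spacing, the 16 kHz sample rate and all function names are assumptions for illustration; commercial beamformers use more microphones and adaptive methods.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature

def estimate_delay(sig_a, sig_b, max_lag):
    """Lag in samples by which sig_b trails sig_a, via brute-force cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(sig_a[n] * sig_b[n + lag]
                    for n in range(max_lag, len(sig_a) - max_lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def arrival_angle_deg(delay_samples, sample_rate, mic_spacing_m):
    """Far-field angle from broadside for a two-microphone pair."""
    sin_theta = delay_samples * SPEED_OF_SOUND / (sample_rate * mic_spacing_m)
    return math.degrees(math.asin(max(-1.0, min(1.0, sin_theta))))

def delay_and_sum(sig_a, sig_b, delay_samples):
    """Advance sig_b by the estimated delay so it lines up with sig_a, then average."""
    out = []
    for n in range(len(sig_a)):
        m = n + delay_samples
        aligned = sig_b[m] if 0 <= m < len(sig_b) else 0.0
        out.append(0.5 * (sig_a[n] + aligned))
    return out

random.seed(1)
source = [random.uniform(-1, 1) for _ in range(2000)]  # stand-in for a voice signal
true_delay = 3                                          # samples, chosen for the demo
mic_a = source
mic_b = [0.0] * true_delay + source[:-true_delay]       # same signal, arriving later

delay = estimate_delay(mic_a, mic_b, max_lag=8)
angle = arrival_angle_deg(delay, sample_rate=16000, mic_spacing_m=0.1)
beam = delay_and_sum(mic_a, mic_b, delay)
print(delay, round(angle, 1))
```

Sound from the estimated direction adds coherently after alignment, while sound from other directions, including many reflections, arrives with different delays and is partly averaged out. Adding microphones makes that averaging more effective.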
For smart speakers, another key task is to recognize the ‘trigger’ word, such as ‘Alexa’. As the speaker is always listening, this trigger recognition raises privacy issues: if the user’s audio is always being uploaded to the cloud, even when they don’t say the trigger word, do they feel comfortable with Amazon or Google hearing all their conversations? Instead, it can be preferable to handle the trigger recognition, as well as many popular commands such as ‘volume up’, locally on the smart speaker itself, with audio only being sent to the cloud after the user has started a more complex command.
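The routing policy described here can be sketched in a few lines. To keep the example short it works on already-recognized text rather than audio, which is a deliberate simplification: a real device makes these decisions on the audio stream using an on-device detector. The command set and function name are hypothetical.

```python
WAKE_WORD = "alexa"
LOCAL_COMMANDS = {"volume up", "volume down"}  # hypothetical on-device command set

def route_utterances(utterances):
    """Decide, per utterance, whether it is ignored, handled locally, or uploaded."""
    decisions = []
    for text in utterances:
        words = text.lower().strip()
        if not words.startswith(WAKE_WORD):
            decisions.append(("ignored", None))  # nothing leaves the device
            continue
        command = words[len(WAKE_WORD):].strip(" ,")
        if command in LOCAL_COMMANDS:
            decisions.append(("local", command))  # handled without the cloud
        else:
            decisions.append(("cloud", command))  # only now is anything uploaded
    return decisions

decisions = route_utterances([
    "what's for dinner",
    "Alexa, volume up",
    "Alexa find me a hotel",
])
print(decisions)
```

The key property is that the first utterance never reaches the upload path at all, which is the privacy argument for doing trigger detection on the device.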
Finally, the clean voice sample must be encoded before being sent to the cloud back-end for further processing.
It’s clear from the description above that the front-end voice processing has to be able to handle a lot of tasks. It must do this quickly and accurately, and for battery-powered devices, the power consumption must be kept to a minimum – even when the device is always listening for a trigger word.
To meet these demands, general-purpose digital signal processors (DSPs) or microprocessors are unlikely to be up to the job in terms of cost, processing performance, size and power consumption. Instead, a better solution is likely to be an application-specific DSP, with dedicated audio processing functions and optimized software. Choosing a hardware/software solution that is already optimized for the voice input tasks will also reduce development costs and cut time to market substantially.
For example, CEVA’s ClearVox is a software suite of voice input processing algorithms that can cope with different acoustic scenarios and microphone configurations. It includes estimation of the speaker’s direction of arrival, multi-mic beamforming, noise suppression, and acoustic echo cancellation. ClearVox is optimized to run efficiently on CEVA sound DSPs, to provide a cost-effective, low-power solution.
As well as voice processing, the edge device will need the capability to handle trigger word detection. A specialized solution, such as CEVA’s WhisPro, is an excellent way to achieve the accuracy and low power consumption needed (see Figure 2). WhisPro is a neural network-based speech recognition software package, available exclusively for CEVA’s DSPs, which enables OEMs to add voice activation to their voice-enabled products. It can handle the always-on listening required, while a main processor stays asleep until needed, thus reducing overall system power consumption significantly.
Figure 2: Using voice processing and speech recognition for voice activation. (Source: CEVA)
WhisPro can achieve a recognition rate of more than 95%, and can support multiple trigger phrases, as well as customized trigger words. As anyone who has used a smart speaker can testify, getting it to respond reliably to the wake word – even in a noisy environment – can sometimes be a frustrating experience. Getting this feature right can make a huge difference to how consumers perceive the quality of a voice-controlled product.
Speech recognition: local or cloud
Once the voice has been digitized and processed, we need some kind of automatic speech recognition (ASR) capability. There is a wide range of ASR technologies, from simple keyword detection that requires a user to say specific keywords, up to sophisticated natural language processing (NLP), where a user can talk normally as if addressing another person.
Keyword detection has many uses, even if its vocabulary is extremely limited. For example, a simple smart home device such as a light switch or thermostat may just respond to a few commands, such as ‘on’, ‘off’, ‘brighter’, ‘dimmer’ and so on. This level of ASR can easily be handled locally, at the edge, without an Internet connection – thus keeping costs down, ensuring a fast response, and avoiding security and privacy concerns.
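With a vocabulary this small, the logic that follows recognition is trivial, which is part of why it fits comfortably on a low-cost device. The sketch below shows a hypothetical dimmer that reacts to the four keywords mentioned above; the brightness scale, step size and function name are assumptions, and the keyword recognizer itself is out of scope.

```python
def apply_command(brightness, keyword, step=25):
    """Return the new brightness (0 to 100) after one recognized keyword."""
    if keyword == "on":
        return 100
    if keyword == "off":
        return 0
    if keyword == "brighter":
        return min(100, brightness + step)
    if keyword == "dimmer":
        return max(0, brightness - step)
    return brightness  # words outside the vocabulary are ignored

level = 0
for word in ["on", "dimmer", "dimmer", "hello", "brighter", "off", "brighter"]:
    level = apply_command(level, word)
print(level)
```

Everything here runs locally with no network round trip, so the response is immediate and no audio leaves the device.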
As another example, many Android smartphones can be told to take a picture by saying ‘cheese’ or ‘smile’; here, sending the command to the cloud would simply take too long. And that’s assuming an Internet connection is available, which is not always going to be the case for a device such as a smartwatch or hearable.
On the other hand, many applications require NLP. If you want to ask your Echo speaker about the weather, or to find you a hotel for tonight, then you might phrase your question in many different ways. The device needs to be able to understand the possible nuances and colloquialisms in the command, and to reliably work out what’s been asked. Put simply, it needs to be able to convert speech to meaning, rather than just speech to text.
To take our hotel enquiry as an example, there’s a huge range of possible factors you might want to ask about: price, location, reviews, and many others. The NLP system has to interpret all of this complexity, as well as the many different ways a question might be phrased, and any lack of clarity in the request: saying ‘find me a good value, central hotel’ will mean different things to different people. Achieving accurate results also requires the device to consider the context of the question, and to recognize when the user asks connected follow-up questions, or asks for multiple pieces of information within one query.
This can take a huge amount of processing, typically using artificial intelligence (AI) and neural networks, which is mostly impractical to run entirely at the edge. A low-cost device with an embedded processor will not have enough processing power to handle the tasks required. In this case, the right option is to send the digitized speech for processing in the cloud. There, it can be interpreted, and an appropriate response sent back to the voice-controlled device.
You can see there are trade-offs between edge processing on the device, and remote processing in the cloud. Handling everything locally can be faster, and doesn’t rely on having an Internet connection, but will struggle to deal with a wider range of questions and information fetching. This means that for a general-purpose device, such as a smart speaker in the home, it’s necessary to push at least some processing to the cloud.
To address the drawbacks of cloud processing, the capabilities of local processors are being developed further, and in the near future we can expect to see big improvements in NLP and AI in edge devices. New techniques are reducing the amount of memory required, and processors continue to get faster and less power-hungry.
For example, CEVA’s NeuPro family of low-power AI processors provides sophisticated capabilities for the edge. Building on CEVA’s experience in neural networks for computer vision, this family delivers a flexible, scalable solution for on-device speech processing.
Voice-controlled interfaces are fast becoming a significant part of our everyday lives, and will be added to more and more products in the near future. Improvements are being driven by better signal processing and voice recognition capabilities, as well as more powerful computing resources, both locally and in the cloud.
To satisfy the requirements of OEMs, the components used for audio processing and speech recognition need to meet some tough challenges in terms of performance, cost and power. For many designers, solutions that have been specifically optimized for the tasks at hand may well prove the best approach, meeting end-customer demands and reducing time to market.
Whatever the technology they are based on, voice interfaces will get more accurate and easier to talk to in everyday language, while their falling costs will make them more appealing to manufacturers. It’s going to be an interesting journey to see just what they’re used for next.
Moshe Sheier is Vice President of Marketing at CEVA, where he oversees corporate marketing and strategic partnerships for CEVA’s core target markets and future growth areas. Moshe is engaged with leading SW and IP companies to bring innovative DSP-based solutions to the market. He holds several patents in the area of DSP architectures, a B.Sc. in Electrical Engineering from Tel-Aviv University, and an M.B.A. from the Open University of Israel.