New technology from a Canadian startup means AI models for natural language processing can run efficiently on small CPUs and even microcontrollers. Voice control functionality, typically done via internet connection to the cloud today, can now be added to all manner of appliances.
Startup PicoVoice (Vancouver, Canada) has launched a compact speech-to-text inference engine that can run on minimal compute resources. Compared to competing natural language processing solutions for the edge, PicoVoice technology uses an order of magnitude fewer compute and memory resources, the company says. This can enable voice recognition on all kinds of devices, without needing to send any data to the cloud.
While the processing-in-the-cloud model is well understood for assistants like Amazon Alexa and Google Home, it may not translate to voice recognition in edge devices that require strict privacy or low cost.
“As [voice-enabled] devices become more prevalent, processing everything on the server side wouldn’t work, financially,” said Alireza Kenarsari-Anhari, Founder and President, PicoVoice. “Compute resources are not free. To make a voice interface for everything, you need to make it cheap enough. Running on the device is the only way to do this.”
For example, according to Kenarsari-Anhari, a voice-activated coffeemaker using public cloud services, if used 10 times a day, would cost the device maker around $15 per year, per appliance.
“You can do this for free if you use the resources you already have on the CPU on your coffeemaker,” he said.
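To put that figure in perspective, the arithmetic can be sketched in a few lines of Python. The 10 requests per day and $15 per year come from Kenarsari-Anhari's example; the fleet size and product lifetime below are hypothetical values chosen only for illustration, not figures from PicoVoice.

```python
# Figures quoted in the article.
REQUESTS_PER_DAY = 10
ANNUAL_CLOUD_COST_USD = 15.0
DAYS_PER_YEAR = 365


def implied_cost_per_request(annual_cost, requests_per_day, days=DAYS_PER_YEAR):
    """Cloud cost per voice request implied by an annual per-device bill."""
    return annual_cost / (requests_per_day * days)


def fleet_cost(annual_cost_per_device, devices, years):
    """Total cloud bill for a fleet of devices over a product lifetime."""
    return annual_cost_per_device * devices * years


per_request = implied_cost_per_request(ANNUAL_CLOUD_COST_USD, REQUESTS_PER_DAY)
print(f"{per_request * 100:.2f} cents per request")  # prints "0.41 cents per request"

# Hypothetical fleet: 100,000 coffeemakers kept in service for 5 years.
total = fleet_cost(ANNUAL_CLOUD_COST_USD, devices=100_000, years=5)
print(f"${total:,.0f} over the fleet lifetime")  # prints "$7,500,000 over the fleet lifetime"
```

A fraction of a cent per request looks negligible, but under these assumed volumes it becomes a recurring bill in the millions of dollars for the device maker.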
Performing voice recognition on the edge can also allow better latency and reliability, depending on the exact application.
Voice-activated assistants, like Amazon Alexa, use the cloud for natural language processing, but this model may not work for cheaper appliances (Image: Loewe Technologies)
PicoVoice’s new product is a machine learning model for speech-to-text transcription that runs on a small CPU, like the ARM11 core on a Raspberry Pi Zero. The model can understand around 200,000 English words with a word error rate comparable to cloud-based home assistants. This could be used in devices that require transcription capability outside the cloud.
“There is some activity in the market around capturing or summarising what happens in company meetings,” said Kenarsari-Anhari, citing companies that don’t want to submit proprietary information into the cloud, or companies that have huge volumes of data to transcribe, where doing it in the cloud would be cost prohibitive.
The speech-to-text engine joins the company’s two existing products. The first, a wake word engine, can be customised to accept any wake word quickly and cheaply using transfer learning.
The second, a speech-to-intent engine for appliances, can understand voice commands within a limited domain (such as asking to turn the lights on or off).
“If I have a well-defined domain and the user is going to issue spoken commands in that domain, we can do natural language understanding in that domain, and we can do it very efficiently, to the point where the whole model is less than half a megabyte. That’s why we can do it on a sub-$1 MCU,” Kenarsari-Anhari said. “If a customer wants to make a smart fridge, with a defined set of spoken commands, we will train the model for that specific application, and they deploy it in their fridge, and pay us royalties.”
How it Works
In order to run natural language processing models on small CPUs, PicoVoice has invented a new way of training models that makes them smaller and more computationally efficient.
“We look at the instruction set on the target device and try to find mathematical operations that are efficient to implement using those instructions,” Kenarsari-Anhari said. “We mimic matrix multiplication with a different mathematical operation, that is more efficient to implement using the instructions on that device.”
This means trained models are device-specific, as they depend on the exact instruction set used, but in practice, he says, the vast majority of audio processors are based on just three options (ARM, Tensilica HiFi and Ceva TeakLite).
“We found instructions on these three different classes of CPU where we can very efficiently implement something that mimics matrix multiplication,” he said. “We can train the model for these three different targets, but the way we train our models for ARM is different from the way we train our model for a Tensilica HiFi, for example. From the user’s point of view, [the models] give similar performance, but the underlying mathematical formulation is different, which results in efficient execution on the target device.”
While Kenarsari-Anhari declined to go into further detail about exactly which instructions PicoVoice uses, he said the basic concept is similar to that of Seattle-based Xnor, which accelerates computer vision models using the XNOR instruction. However, accelerating vision models, which are typically based on convolutional neural networks (CNNs), is a simpler task than accelerating voice models, which are based on recurrent neural networks (RNNs).
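PicoVoice has not disclosed its own operations, so the following is only a sketch of the general idea of replacing multiply-accumulate with cheaper instructions. It uses the widely published XNOR-and-popcount trick from binarized neural networks, in which weights and activations are restricted to +1 or -1 and stored one per bit. The function names are hypothetical, and this should not be read as either company's actual implementation.

```python
import random


def pack_signs(signs):
    """Pack a list of +1/-1 values into an int, one bit each (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n, with no multiplies.

    XNOR sets a bit wherever the two signs agree. Each agreement contributes
    +1 and each disagreement -1, so the dot product is 2 * agreements - n.
    """
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask
    agreements = bin(agree).count("1")  # population count
    return 2 * agreements - n


def reference_dot(a, b):
    """Ordinary multiply-accumulate dot product, for comparison."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
n = 64
a = [rng.choice((-1, 1)) for _ in range(n)]
b = [rng.choice((-1, 1)) for _ in range(n)]

# The bitwise version matches the arithmetic one exactly.
assert xnor_dot(pack_signs(a), pack_signs(b), n) == reference_dot(a, b)
```

On hardware, the 64 multiplications and additions collapse into one XOR, one NOT and one population count on a single machine word, which is where the speed and memory savings of this family of techniques come from. The price is that values are restricted to two levels, so the network has to be trained with that constraint in mind.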
For a CNN looking at camera pictures, what the model sees is bounded, he explained, but RNNs include the concept of time.
“With voice, as I am talking, your brain keeps the history of what I said, and uses it to do inference on what I’m saying now,” he said. “The reason accelerating RNNs is harder is because having no memory helps you not to compound errors. Accelerated models typically have more noise in them, and for RNNs, that noise can accumulate over time and make the neural network unstable.”
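The compounding effect he describes can be illustrated with a deliberately simplified toy model. Everything in it is an assumption made for illustration: the "network" is a single scalar, the recurrence is linear, and the approximation error is modelled as a small constant added by every operation rather than as real quantization noise.

```python
def run_feedforward(inputs, weight, op_error):
    """Memoryless model: each output depends only on the current input,
    so each output carries the error of exactly one approximate operation."""
    return [weight * x + op_error for x in inputs]


def run_recurrent(inputs, retain, op_error):
    """Toy recurrence h_t = retain * h_(t-1) + x_t.
    The error added at each step is fed back into every later step."""
    h = 0.0
    states = []
    for x in inputs:
        h = retain * h + x + op_error
        states.append(h)
    return states


def max_deviation(approx, exact):
    """Largest absolute gap between the approximate and exact outputs."""
    return max(abs(p - q) for p, q in zip(approx, exact))


inputs = [1.0] * 50
eps = 0.01  # assumed per-operation error of the accelerated arithmetic

ff_err = max_deviation(run_feedforward(inputs, 0.9, eps),
                       run_feedforward(inputs, 0.9, 0.0))
rnn_err = max_deviation(run_recurrent(inputs, 0.9, eps),
                        run_recurrent(inputs, 0.9, 0.0))

print(f"feedforward error: {ff_err:.4f}, recurrent error: {rnn_err:.4f}")
```

In this sketch the memoryless model never deviates by more than the per-operation error, while the recurrent state drifts to close to ten times that amount because each step inherits 90% of the error already accumulated. With a retention factor of 1 or more, the drift would keep growing instead of levelling off.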
PicoVoice’s core team of “just under 10” people is mostly from Amazon, including Kenarsari-Anhari, who started the company in January 2018. PicoVoice received a grant from the National Research Council of Canada under the Industrial Research Assistance Program (IRAP) but has had no other external funding to date.
The decision not to raise funding has allowed the company time to “solve fundamental problems with experimental development and applied research,” Kenarsari-Anhari said.
The company already has revenue streams from a number of customers, including LG, Whirlpool and Local Motors.
>> This article was originally published on our sister site, EE Times: “Understand Speech on a Sub-$1 MCU.”