Those clever folks at XMOS have just brought us one step closer to embedding “ears” for voice control in just about every device with which we interact.
As a reminder, XMOS is a fabless semiconductor company that develops voice solutions, audio products, and multicore microcontrollers capable of concurrently executing real-time tasks, extreme digital signal processing (DSP), and control flow. XMOS microcontrollers are distinguished by their deterministic (predictable) behavior.
Let’s start with the underlying xCORE multicore microcontroller technology, which comprises multiple “processor tiles” connected by a high-speed switch. Each processor tile is a conventional RISC processor that can execute up to eight tasks concurrently. Tasks can communicate with each other over channels (which can connect to tasks on the local tile or to tasks on remote tiles) or by using shared memory (for tasks running on the same tile only).
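To give a feel for the tasks-and-channels model, here’s a minimal sketch in plain Python — purely an analogy of my own, not XMOS code (real xCORE tasks are typically written in XC or C, and the channels are hardware resources): two “tasks” run concurrently, one streaming samples over a “channel” to the other.

```python
import threading
import queue

def producer(chan):
    # Task 1: send three samples over the channel, then an end marker.
    for sample in (10, 20, 30):
        chan.put(sample)
    chan.put(None)  # end-of-stream marker

def consumer(chan, results):
    # Task 2: receive samples until the end-of-stream marker arrives.
    while True:
        sample = chan.get()
        if sample is None:
            break
        results.append(sample * 2)  # a trivial "processing" step

# The queue stands in for an xCORE channel; threads stand in for tasks.
channel = queue.Queue()
received = []
t1 = threading.Thread(target=producer, args=(channel,))
t2 = threading.Thread(target=consumer, args=(channel, received))
t1.start(); t2.start()
t1.join(); t2.join()
print(received)  # [20, 40, 60]
```

The key difference on real silicon is that the scheduling and channel communication shown here in software are implemented in hardware, which is what makes the timing deterministic.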
The xCORE architecture delivers, in hardware, many of the elements that are usually seen in a real-time operating system (RTOS). This includes the task scheduler, timers, I/O operations, and channel communication. By eliminating sources of timing uncertainty (interrupts, caches, buses, and other shared resources), xCORE devices can provide deterministic and predictable performance for many applications. A task can typically respond in nanoseconds to events such as external I/O or timers. This makes it possible to program xCORE devices to perform hard real-time tasks that would otherwise require dedicated hardware.
In 2017, XMOS acquired Setem Technologies. As I wrote in my column “XMOS + Setem Could Be a Game-Changer for Embedded Speech”: “The chaps and chapesses at Setem are the pioneers of Advanced Blind Source Signal Separation technology. Their patented algorithms enable consumer devices to focus on a specific voice or conversation within a crowded audio environment to achieve optimized input into speech-recognition systems.”
I have two Amazon Echo/Dot devices at home and one in my office (I asked my wife, Gina the Gorgeous, why she was whispering. “I heard that the folks at Amazon might be listening to us,” she said. I laughed, Gina laughed, Alexa laughed…). I think that these devices are awesome, but they do require an array of seven microphones, which increases both the cost and the physical footprint of the overall solution.
Having multiple microphones allows the system to detect and remove noise more effectively, perform echo cancellation, and determine the location of sound sources such as a person speaking. Of course, when you think about it, we manage to do all of this stuff with just two ears (I don’t know about you, but I don’t think I have enough room on my head to accommodate seven ears without at least one of them getting in the way).
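To make the two-ears point concrete, here’s a toy sketch (my own illustration, nothing to do with XMOS’s algorithms) of how two microphones can localize a source: the same sound reaches each mic with a small time offset, and cross-correlating the two captures recovers that offset — the time difference of arrival (TDOA) — from which a direction can be derived.

```python
def cross_correlate_delay(a, b, max_lag):
    """Return the lag (in samples) at which b best matches a.

    Positive lag means b is a delayed copy of a. Brute-force search,
    fine for an illustration; real systems use FFT-based correlation.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                score += x * b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# A toy "voice" burst as heard by mic 1...
mic1 = [0, 0, 1, 3, 2, -1, -2, 0, 0, 0, 0, 0]
# ...and the same burst arriving 3 samples later at mic 2.
mic2 = [0, 0, 0, 0, 0, 1, 3, 2, -1, -2, 0, 0]

delay = cross_correlate_delay(mic1, mic2, max_lag=5)
print(delay)  # 3 -- mic 2 hears the source later, so it sits farther away
```

With the delay in hand, simple geometry (mic spacing and the speed of sound) gives the bearing to the speaker — which is essentially what your brain does with the signals from your two ears.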
Not surprisingly, the folks at XMOS also spotted this, which is why they have just introduced the XVF3510, a next-generation voice processor that can pluck an individual voice out of a crowded audio landscape using just two microphones.
XVF3510 mounted on a PCB (Source: XMOS)
The algorithms running on the XVF3510 include interference cancellation (which nulls point noise sources to cancel out unwanted background noise), stereo acoustic echo cancellation (which suppresses unwanted speaker echo and enables barge-in), and adaptive delay estimation (which dynamically adjusts audio reference signal latency, thereby ensuring that the echo-cancellation algorithms deliver a smooth, real-time experience).
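To give a feel for what acoustic echo cancellation involves, here’s a toy sketch of an adaptive filter in the NLMS (normalized least mean squares) style — my own illustration under simplifying assumptions, not the XVF3510’s actual algorithm. The filter learns the path from the loudspeaker reference signal to the microphone and subtracts the predicted echo, leaving behind whatever the echo doesn’t explain (ideally, the near-end talker — which is what enables barge-in).

```python
import random

def nlms_echo_cancel(reference, mic, taps=4, mu=0.5, eps=1e-6):
    """Toy NLMS adaptive echo canceller.

    reference: samples sent to the loudspeaker.
    mic: microphone samples (echo = filtered reference, plus any near-end speech).
    Returns the error signal, which converges toward the near-end speech.
    """
    w = [0.0] * taps  # adaptive filter weights (the learned echo path)
    out = []
    for n in range(len(mic)):
        # Most recent `taps` reference samples (zero-padded at the start).
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wk * xk for wk, xk in zip(w, x))  # predicted echo
        e = mic[n] - y                             # echo-cancelled output
        norm = sum(xk * xk for xk in x) + eps
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        out.append(e)
    return out

# Synthetic check: the "echo path" is a simple two-tap filter, and the mic
# hears only echo (no near-end speech), so the residual should shrink to ~0.
random.seed(0)
ref = [random.uniform(-1, 1) for _ in range(2000)]
echo = [0.5 * ref[n] + 0.3 * (ref[n - 1] if n else 0.0) for n in range(len(ref))]
residual = nlms_echo_cancel(ref, echo)
print(max(abs(e) for e in residual[-100:]))  # tiny once the filter has adapted
```

Note the connection to the adaptive delay estimation mentioned above: an echo canceller like this only converges if the reference signal is time-aligned with the echo in the microphone signal, which is exactly the alignment that delay estimation maintains.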