XMOS + Setem could be a game-changer for embedded speech
I've been chatting a lot with the folks at XMOS recently. This is because they are focusing on an area that's dear to my heart: solutions to facilitate embedded speech technology.
For many applications and systems, an embedded speech-based user interface (a.k.a. voice control) is the obvious best-choice scenario. Just a few minutes ago, for example, someone reminded me of Paul Simon's The Boy in the Bubble. This made me want to hear the song. Rather than searching for it using my web browser or Spotify app, all I had to do was talk to the Amazon Echo on my desk and say: "Alexa, play The Boy in the Bubble by Paul Simon from Spotify." Within seconds I was enjoying this tune.
There are, of course, issues to be overcome, such as problems arising from the cocktail party effect. This is the phenomenon of being able to focus one's auditory attention on a particular stimulus while filtering out a range of other stimuli, as when a partygoer can focus on a single conversation in a noisy room. It seems that XMOS is on track to address these problems.
The cocktail party effect (Source: XMOS)
Let's start with the fact that XMOS xCORE devices boast multiple deterministic processor cores that allow concurrent independent task execution. These cores are augmented by a hardware scheduler that provides RTOS-like functionality at blinding speed. In addition to offering a mind-boggling level of raw compute power, any external interfaces and peripherals are implemented in software, thereby allowing designers to choose the exact combination of required interfaces.
A couple of weeks ago, the guys and gals at XMOS announced their XVF3000 family of voice processors, which are tailored for integrated far-field voice capture. XVF3000 devices can be equipped with speech enhancement algorithms. These include an adaptive beamformer, which uses signals from multiple microphones to track a talker as he or she moves around, coupled with high-performance, full-duplex acoustic echo cancellation.
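To give a feel for what a beamformer does with those multiple microphone signals, here is a minimal sketch of a fixed delay-and-sum beamformer in Python. This is the simplest member of the family and is much cruder than the adaptive beamformer XMOS describes; it is not their algorithm. The function names, the far-field (plane wave) assumption, the whole-sample delays, and the speed-of-sound constant are all my own choices for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def steering_delays(mic_positions, angle_deg, sample_rate):
    """Per-microphone delays (in whole samples) for a far-field talker.

    mic_positions: list of (x, y) coordinates in metres.
    angle_deg: assumed direction of the talker, measured from the x axis.
    """
    ux = math.cos(math.radians(angle_deg))
    uy = math.sin(math.radians(angle_deg))
    # A mic further along the arrival direction hears the wavefront earlier.
    advance = [(x * ux + y * uy) / SPEED_OF_SOUND for x, y in mic_positions]
    smallest = min(advance)
    # Delay the early mics so every channel lines up with the latest one.
    return [round((a - smallest) * sample_rate) for a in advance]


def delay_and_sum(channels, delays):
    """Delay each channel by its steering delay and average the results."""
    length = len(channels[0])
    out = []
    for n in range(length):
        total = 0.0
        for samples, d in zip(channels, delays):
            if n - d >= 0:
                total += samples[n - d]
        out.append(total / len(channels))
    return out
```

When the delays match the talker's true direction, the channels add coherently and the talker comes through at full strength; sounds arriving from other directions are misaligned and partially average out. An adaptive beamformer goes further by updating its steering continuously as the talker moves.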
XVF3000 devices can be easily integrated with an applications processor or host PC via either USB for data and control or a combination of I2S and I2C. Developers can quickly add custom voice and audio processing to their systems using the XMOS free development tools.
At the same time, XMOS also announced the availability of the VocalFusion Speaker development kit, which includes an XVF3000 processor card and a microphone array.
XVF3000 processor card (bottom) and circular microphone array (top) (Source: XMOS)
The microphone array can be circular (as shown in the above image) or linear. The XVF3000 processor card aggregates multiple audio streams from the microphone array, converts the microphones' 1-bit PDM (pulse-density modulation) signals into multi-bit PCM (pulse-code modulation) streams to which DSP can be applied, and passes the resulting optimized audio stream to remote or local ASR (automatic speech recognition) engines.
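For anyone wondering what turning a PDM microphone stream into PCM involves, the following Python sketch shows the basic idea: the density of ones in the 1-bit stream encodes the signal level, so averaging blocks of bits recovers multi-bit samples. Both functions are hypothetical illustrations of mine, not XMOS code. The encoder is a textbook first-order sigma-delta modulator included only to generate test data, and the boxcar average stands in for the multi-stage decimation filters a real design would typically use.

```python
def pdm_encode(samples):
    """First-order sigma-delta modulator: floats in [-1, 1] -> 1-bit stream."""
    bits = []
    integrator = 0.0
    feedback = 0.0
    for x in samples:
        # Accumulate the error between the input and the previous output bit.
        integrator += x - feedback
        bit = 1 if integrator >= 0.0 else 0
        feedback = 1.0 if bit else -1.0
        bits.append(bit)
    return bits


def pdm_to_pcm(bits, decimation):
    """Boxcar-average each block of `decimation` bits into one PCM sample."""
    pcm = []
    for start in range(0, len(bits) - decimation + 1, decimation):
        block = bits[start:start + decimation]
        density = sum(block) / decimation  # fraction of ones, 0..1
        pcm.append(2.0 * density - 1.0)    # map back to the range -1..1
    return pcm
```

With a decimation factor of 64, for example, every 64 PDM bits collapse into a single PCM sample, so the output sample rate is 1/64 of the PDM bit rate.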
When used with circular or linear arrays, the VocalFusion Speaker development kit provides 360-degree or 180-degree far-field voice capture, respectively, at distances in excess of 5m, enabling a range of products that includes smart speakers, smart TVs, computers and laptops, and robots.
But wait, there's more. The hot-off-the-press news this week is that XMOS has just acquired Setem Technologies. "Who are these folks and what do they do?" I hear you cry. Well, I'm glad you asked. The chaps and chapesses at Setem are the pioneers of Advanced Blind Source Signal Separation technology. Their patented algorithms enable consumer devices to focus on a specific voice or conversation within a crowded audio environment to achieve optimized input into speech recognition systems.
Isolating individual speakers and interferers (Source: XMOS)
To be honest, I think the combination of XMOS devices and Setem algorithms is going to be a game-changer that jumps XMOS out to the front of the speech recognition pack. What they can now do is mind-boggling in its complexity and capability.
Let's assume you have a bunch of people talking together in a room. Let's also assume that there's a mix of traditional noise sources like air conditioners, and more problematic noise sources like a television news program with lots of "talking heads."
XMOS now has the capability to analyze the entire sound space and disassemble it into the individual elements -- including live human speakers and interferers like air conditioners and television sets (I have no idea how they can distinguish someone talking in the real world from someone talking on television, but they assure me they can).
And it gets even better, because once they've isolated and identified everyone's voice print/signatures, they can track those people as they move around the room.
The important point to note is that the system doesn't simply focus on the loudest speaker. It disassembles the sound space and simultaneously tracks all of the speakers all of the time.
The folks at XMOS aren’t in the business of delivering products to end users; rather, they are solution providers who give embedded developers the tools with which to create awesome systems. For example...
One of the first potential products that popped into my mind was an XMOS-enabled conference phone system. You wouldn’t even need to be on a call; you might just have a bunch of people gathered together in a room. At the end of the conversation, you could get a printout like a script for a play:
Person #1: I'd like to go on the record as saying that I think that Max is truly Magnificent.
Person #2: I can't argue with you there. Max is a prince amongst men.
Person #3: I agree, what can you say about Max that hasn’t been said before?
Later, you could assign real names to the individual speakers. Alternatively, the system could ask all of the participants to identify themselves at the start of the session (and it could remember them for future sessions).
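The printout step itself is the easy part. As a rough sketch, assume some upstream separation and speech recognition pipeline (hypothetical here) hands us a list of (speaker ID, text) segments; the Python below turns them into a play-style script and lets us swap in real names later. The function name and data layout are my own invention for illustration.

```python
def format_transcript(segments, names=None):
    """Render (speaker_id, text) segments as a play-style script.

    Consecutive segments from the same speaker are merged into one line.
    Speakers without an entry in `names` are labeled "Person #<id>".
    """
    names = names or {}
    lines = []
    last_speaker = None
    for speaker_id, text in segments:
        label = names.get(speaker_id, f"Person #{speaker_id}")
        if speaker_id == last_speaker:
            lines[-1] += " " + text
        else:
            lines.append(f"{label}: {text}")
            last_speaker = speaker_id
    return "\n".join(lines)
```

Calling format_transcript(segments) gives the anonymous version, while passing something like names={2: "Gina"} relabels that speaker throughout without touching the underlying segments.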
This led me to another thought; my wife (Gina the Gorgeous) and I often disagree as to who said what when. "You never mentioned that you were going to build a gigantic robot in the front room," she will say. "Of course I did," I'll reply, "you just weren't listening." And she will respond with something along the lines of, "I think I'd remember you saying you were going to build something like a robot in the front room!"
Well, now I can imagine a scenario whereby I say something like "Alexa, did I tell Gina I was planning on building a robot in the front room?" and Alexa might reply, "Yes, early last month, on Saturday June 3, you said 'I'm toying with the idea of building a robot in the front room,' and Gina responded, 'I don’t care what you do so long as you clean out the garage like you promised.'" Ha! I rest my case!
Obviously, there are all sorts of ethical, sociological, and philosophical (and possibly matrimonial) considerations that have to be addressed -- along with a bunch of security-related issues -- but those are discussions for another column. The point is that the guys and gals at XMOS are providing us with power beyond the ken of mortal men (and women, of course) -- we can but hope that we use this power only for the good.