Embedded speech 'how-to' at ESC Silicon Valley - Embedded.com

A couple of weeks ago, I posted this column about a Kickstarter project for a standalone speech recognition and synthesis shield called MOVI — which stands for “My Own Voice Interface” — for use with the Arduino and other MCU boards.

Well, I'm delighted to be able to say that the MOVI Kickstarter has already met its goal, and — as I pen these words — there are still 27 days to go.

I'm even more delighted to tell you that I got to chat with MOVI's creators, Bertrand Irissou and Gerald Friedland, and that I've managed to persuade them to present a session titled “Embedding Speech Dialog Capabilities without the Cloud” at the forthcoming ESC Silicon Valley, which is now only a week away!

Bertrand is an electrical and hardware engineer, while Gerald is a software and speech engineer with a PhD in computer science. It turns out that the initial impetus for MOVI came when one of their colleagues stated that state-of-the-art speech recognition could only be achieved in the cloud using honking big servers, and Bertrand and Gerald decided to prove him wrong (I wonder how many other inventions started off in a similar fashion).


Bertrand Irissou (left) and Gerald Friedland (right).

Now, there are a few existing low-end speech recognizers out there, but they are generally user-dependent, which means you have to train them to your voice. Also, these existing solutions typically work with only a few single words or very short phrases. Bertrand and Gerald wanted their solution to work with anyone's voice and to support a large number of arbitrarily long command sentences.

Before we go further, let's first remind ourselves of the most basic use model. You start off by defining the sentences you expect to employ in your setup() function; for example:

recognizer.addSentence (1, "Turn table light on");
recognizer.addSentence (2, "Turn table light off");
recognizer.addSentence (3, "Turn ceiling light on");
recognizer.addSentence (4, "Turn ceiling light off");
// ... and so on for any further sentences

Later, in the main body of your code, you might use something like:

int speechResult = recognizer.poll();  // Number of the sentence that was recognized
if (speechResult == 1) {
   // Your commands for "Turn table light on" go here
}
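Once you have more than a couple of sentences, a chain of if statements gets unwieldy, so here's a minimal sketch of how the dispatch might be organized. It's plain C++ so it can be tried on a desktop compiler; the Lights struct and the handleSentence() function are my own hypothetical names, not part of the MOVI library. In a real sketch, the number would come from recognizer.poll() and the actions would drive your outputs. I don't know what poll() returns when nothing has been recognized, so the default branch simply ignores any number that wasn't registered.

```cpp
#include <cassert>

// Hypothetical stand-in for the two lights being controlled.
struct Lights {
    bool table = false;
    bool ceiling = false;
};

// Maps a sentence number (as registered with addSentence) to an action.
// Returns true if the number matched one of the four known sentences.
bool handleSentence(int speechResult, Lights& lights) {
    switch (speechResult) {
        case 1: lights.table = true;    return true;   // "Turn table light on"
        case 2: lights.table = false;   return true;   // "Turn table light off"
        case 3: lights.ceiling = true;  return true;   // "Turn ceiling light on"
        case 4: lights.ceiling = false; return true;   // "Turn ceiling light off"
        default:                        return false;  // Nothing we registered
    }
}
```

Keeping the sentence numbers and their actions together in one function like this makes it harder for the two to drift apart as you add more commands.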

One question I had was about the on-board dictionary (the Kickstarter website says this is 2GB, but I'm informed that they've now increased this to 4GB). Ever since I posted my first column, I've been wondering how this dictionary fits into the picture.

Well, it turns out that the dictionary contains the allophones and phonemes (the sounds) used to construct words in English. The system takes the words in the sentences you've defined and uses the dictionary to work out the sound envelopes it should be looking for. I'm assuming they are using some form of spectral slope encoding, but I'm not 100% sure.
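To make the idea of a pronunciation dictionary a bit more concrete, here's a toy sketch in plain C++ that turns a sentence into a sequence of phonemes by looking up each word. To be clear, this is my own illustration and not how MOVI is implemented: the six-word lexicon, the ARPAbet-style phoneme symbols, and the names toyLexicon() and sentenceToPhonemes() are all assumptions made up for the example.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using Phonemes = std::vector<std::string>;

// Toy lexicon: word -> phonemes (illustrative ARPAbet-style symbols).
const std::map<std::string, Phonemes>& toyLexicon() {
    static const std::map<std::string, Phonemes> lexicon = {
        {"turn",    {"T", "ER", "N"}},
        {"table",   {"T", "EY", "B", "AH", "L"}},
        {"ceiling", {"S", "IY", "L", "IH", "NG"}},
        {"light",   {"L", "AY", "T"}},
        {"on",      {"AA", "N"}},
        {"off",     {"AO", "F"}},
    };
    return lexicon;
}

// Lower-cases each whitespace-separated word, looks it up, and appends its
// phonemes to 'out'. Returns false as soon as a word is not in the lexicon.
bool sentenceToPhonemes(const std::string& sentence, Phonemes& out) {
    out.clear();
    std::istringstream words(sentence);
    std::string word;
    while (words >> word) {
        for (char& c : word) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        auto entry = toyLexicon().find(word);
        if (entry == toyLexicon().end()) {
            return false;  // Unknown word: no pronunciation available
        }
        out.insert(out.end(), entry->second.begin(), entry->second.end());
    }
    return true;
}
```

Calling sentenceToPhonemes("Turn table light on", result) fills result with T ER N, T EY B AH L, L AY T, AA N. A real recognizer would then have to match the incoming audio against sequences like this one, which is the hard part that this sketch doesn't attempt.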

This dictionary wouldn’t be required for languages like Spanish (Bertrand and Gerald intend to add multi-language support in the future) because there's only one way to pronounce each written word in those languages. By comparison, in English we can have words that are spelt differently and sound the same, or words that are spelt the same and pronounced differently, and then things start to get complicated (LOL).
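Here's a small sketch in plain C++ of why English needs a lookup table rather than simple spelling rules. Each spelling maps to a list of possible pronunciations, so two spellings can share a sound and one spelling can have several. As before, the word list, the ARPAbet-style phoneme strings, and the function names are my own assumptions for illustration, not anything taken from MOVI.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy table: spelling -> every pronunciation it can have (illustrative only).
const std::map<std::string, std::vector<std::string>>& toyPronunciations() {
    static const std::map<std::string, std::vector<std::string>> table = {
        {"two",  {"T UW"}},
        {"too",  {"T UW"}},
        {"red",  {"R EH D"}},
        {"read", {"R IY D", "R EH D"}},  // present tense, past tense
    };
    return table;
}

// True if one spelling has more than one pronunciation.
bool hasSeveralPronunciations(const std::string& word) {
    auto entry = toyPronunciations().find(word);
    return entry != toyPronunciations().end() && entry->second.size() > 1;
}

// True if the two spellings share at least one pronunciation.
bool soundAlike(const std::string& a, const std::string& b) {
    auto ea = toyPronunciations().find(a);
    auto eb = toyPronunciations().find(b);
    if (ea == toyPronunciations().end() || eb == toyPronunciations().end()) {
        return false;
    }
    for (const auto& pa : ea->second) {
        for (const auto& pb : eb->second) {
            if (pa == pb) return true;
        }
    }
    return false;
}
```

With this table, soundAlike("two", "too") and soundAlike("read", "red") are both true, while hasSeveralPronunciations("read") is true and hasSeveralPronunciations("two") is false.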

It also turns out that the polling scheme presented above is just one use model; it's the most intuitive way to get beginner users up and running as quickly and easily as possible. There's also a low-level serial interface that allows one to do truly awesome things.

What sort of awesome things? Well, I don’t want to give too much away here. Suffice it to say that, in their presentation at ESC Silicon Valley, Bertrand and Gerald will be giving practical tips for embedding speech dialog capabilities into everyday devices without relying on the cloud; they will be using MOVI as an example; and they will explain all of the cunning tricks and techniques they use to make MOVI perform its magic.

Have you signed up for ESC Silicon Valley yet? If not, why not? I'll tell you what; why not get your pass right now while there are still a few good seats left?

I tell you, I want to add voice control capabilities to all of my projects, so this is one session I'm definitely going to attend. I'll be easy to spot. Just keep your eyes open for a tall, dark, handsome stranger… then keep on looking around the room until you see a geek in a Hawaiian shirt… I'll leave it up to you to decide which one is yours truly. If you still can't decide, just shout “Max! Beer!” (or “Max! Bacon!”) and observe who responds the fastest.
