Deep learning hits a sweet note

June 20, 2016

Max The Magnificent

My mind is buzzing with all sorts of thoughts, from the fact that a Swiss Army Knife seems to have a tool for every occasion to the old saying that if the only tool you have is a hammer, then everything around you looks like a nail.

These meandering musings were sparked by my discovering a new application for deep neural networks (DNNs) and deep learning. As you may recall, a couple of weeks ago I had my mind boggled at the Embedded Vision Summit (see Day 1 and Day 2). That summit left me thinking of DNNs and deep learning primarily in the context of embedded vision applications until... chum Jay Dowling sent me the links to two very interesting videos on YouTube. First we have this video of a guy singing the same song in a variety of different locations.

One thing that's of interest here is how various aspects of the sound -- character, quality, tone, reverberation, highness, lowness, etc. -- change depending on the surrounding environment. But the thing that's truly amazing about all this is how, if we were to see any of these clips in isolation, we wouldn’t think anything of the timbre of the sounds we were hearing (except for things like "I wish I could sing like that").

The point is that we are used to things sounding one way in the outside world, for example, and the same things sounding very different in a large, empty room and different again in a smaller room jam-packed with furnishings. Of course, the other side to this coin is that we would very quickly twig to the fact that something was "off" if we saw a video of someone appearing to sing outside while being presented with the audio recording from the large empty room.

All of the above leads us to this video, in which we see the results of a novel deep learning experiment. Researchers from MIT recorded ~1,000 videos capturing ~46,000 instances of different objects and materials being prodded, scraped, and hit with a drumstick. They then used a deep learning algorithm to analyze the videos and deconstruct the resulting sounds according to a variety of acoustical qualities.

Once the algorithm has been trained, the next step is to present the system with silent videos of someone prodding, scraping, and hitting things with the drumstick. The system is charged with generating accompanying sounds that are appropriate to the situation. Amazingly enough, even though this is very early in the game, in many cases the system can generate sounds sufficiently convincing to fool human observers.
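For anyone who likes to see an idea in code, the train-then-generate flow described above can be caricatured in a few lines of Python. This is only a toy sketch and not the researchers' method: a simple nearest-match lookup stands in for the trained deep network, and the feature vectors, sound labels, and the `pick_sound` function are all invented for illustration.

```python
import math

# Hypothetical "training set": (visual feature vector, sound label) pairs.
# Both the numbers and the labels are made up for illustration only.
TRAINING_EXAMPLES = [
    ((0.9, 0.1), "sharp tap"),    # e.g., drumstick hitting a hard surface
    ((0.2, 0.8), "soft rustle"),  # e.g., drumstick scraping through leaves
    ((0.5, 0.5), "dull thud"),    # e.g., drumstick prodding a cushion
]

def pick_sound(visual_features):
    """Return the sound label whose training features lie closest (Euclidean)."""
    closest = min(
        TRAINING_EXAMPLES,
        key=lambda example: math.dist(example[0], visual_features),
    )
    return closest[1]

# A "silent video" frame, reduced to an assumed two-number feature vector.
print(pick_sound((0.85, 0.2)))
```

A real system would learn its features from the video frames rather than being handed them, but the shape of the problem is the same: look at what is happening, then choose a sound that fits.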

Of course, you may be thinking to yourself "Why would anyone actually want to do this?" Well, several reasons spring to mind. Think of all the sound effects used in films and on television programs, for example; at some time in the future, it may be that these could be added automatically.

We're all used to the idea that images can be digitally manipulated using programs like Adobe's Photoshop -- hence the use of the term "Photoshopped" to refer to an image that has been altered or enhanced.


A more nefarious application for this auditory generation system might be creating incriminating videos complete with realistic, matching sound effects.

One very interesting potential application involves helping robots navigate their way around the world. In the case of humans, our kids use a variety of techniques to create (train) their worldview, including sticking everything they can lay their hands on inside their mouths to see what things taste like and poking, prodding, banging, and stroking objects to discover what they feel like and how they respond both physically and audibly (I still use both of these methods myself).

Later in life, as we stroll around, our brains are subconsciously observing the surrounding terrain and predicting what it will feel like to our feet and sound like to our ears. We expect different sensations from different materials like grass, gravel, and carpet. Consider what would happen if we were strolling along, thinking about what to have for supper, and we stepped from a grassy lawn onto a gravel path. Suppose that, instead of hearing the expected "crunch" of gravel, we heard a "splosh" sound. In this case, we would immediately come to a grinding halt and focus our attention on the "here and now," trying to determine what was happening.

Similarly, a robot could look at the various materials in its path ahead and predict qualities such as their relative hardness and "give" and adjust its gait accordingly. When the robot did transition from walking on one material to another, various sensors would provide feedback as to the validity of its assumptions and allow it to adjust its motion to accommodate differences between the "expected" and the "actual." Adding auditory prediction and feedback to the mix will provide an extra level of sophistication to this process.

What say you? Can you think of other applications for deep learning in general and this auditory deep learning in particular?
