Deep learning hits a sweet note
My mind is buzzing with all sorts of thoughts, from the fact that a Swiss Army Knife seems to have a tool for every occasion to the old saying that if the only tool you have is a hammer, then everything around you looks like a nail.

These meandering musings were sparked by my discovering a new application for deep neural networks (DNNs) and deep learning. As you may recall, a couple of weeks ago I had my mind boggled at the Embedded Vision Summit (see Day 1 and Day 2). That summit left me thinking of DNNs and deep learning primarily in the context of embedded vision applications until…

…my chum Jay Dowling sent me the links to two very interesting videos on YouTube. First we have this video of a guy singing the same song in a variety of different locations.

One thing that's of interest here is how various aspects of the sound — character, quality, tone, reverberation, pitch, and so forth — change depending on the surrounding environment. But the thing that's truly amazing about all this is how, if we were to see any of these clips in isolation, we wouldn't think anything of the timbre of the sounds we were hearing (except for things like “I wish I could sing like that”).

The point is that we are used to things sounding one way in the outside world, for example, and the same things sounding very different in a large, empty room and different again in a smaller room jam-packed with furnishings. Of course, the other side to this coin is that we would very quickly twig to the fact that something was “off” if we saw a video of someone appearing to sing outside while being presented with the audio recording from the large empty room.

All of the above leads us to this video, in which we see the results of a novel deep learning experiment. Researchers from MIT recorded ~1,000 videos capturing ~46,000 instances of different objects and materials being prodded, scraped, and hit with a drumstick. They then used a deep learning algorithm to analyze the videos and deconstruct the resulting sounds according to a variety of acoustical qualities.

Once the algorithm has been trained, the next step is to present the system with silent videos of someone prodding, scraping, and hitting things with the drumstick. The system is charged with generating accompanying sounds that are appropriate to the situation. Amazingly enough, even though this is very early in the game, in many cases the system can generate sounds sufficiently convincing to fool human observers.
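To give a feel for the flavor of this kind of system, here's a toy sketch in Python. This is emphatically *not* the MIT researchers' actual method (they used deep networks over video frames and cochleagram-style sound features); it's a minimal stand-in that captures the same two-step idea — learn a mapping from visual features to sound features during training, then, for a silent clip, predict sound features and retrieve the best-matching recorded sound from the training library. All the data here is random placeholder data, and the linear "regressor" is a deliberate simplification.

```python
import numpy as np

# Toy sketch (not the MIT system): learn a map from per-clip visual
# features to sound features, then "sonify" a silent clip by retrieving
# the training sound whose features best match the prediction.
# All data is random stand-in; real systems use learned deep features.

rng = np.random.default_rng(0)

n_train, vis_dim, snd_dim = 200, 32, 16
V = rng.normal(size=(n_train, vis_dim))            # visual features, one row per training clip
W_true = rng.normal(size=(vis_dim, snd_dim))       # hidden "true" visual-to-sound relationship
S = V @ W_true + 0.1 * rng.normal(size=(n_train, snd_dim))  # observed sound features

# Ridge regression: W = (V^T V + lambda*I)^-1 V^T S
lam = 1e-2
W = np.linalg.solve(V.T @ V + lam * np.eye(vis_dim), V.T @ S)

def predict_sound(visual_feat):
    """Predict sound features for a silent clip, then return the index of
    the closest sound in the training library (example-based synthesis)."""
    pred = visual_feat @ W
    dists = np.linalg.norm(S - pred, axis=1)
    return int(np.argmin(dists))

# A "silent video" visually similar to training clip 7 should retrieve
# the audio that was recorded with clip 7.
query = V[7] + 0.01 * rng.normal(size=vis_dim)
print(predict_sound(query))  # expected: 7
```

The retrieval step is the interesting design choice: rather than synthesizing a waveform from scratch, the system only has to predict *features* well enough to pick a plausible real recording — which is part of why even early results can fool human listeners.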

Of course, you may be thinking to yourself “Why would anyone actually want to do this?” Well, several reasons spring to mind. Think of all the sound effects used in films and on television programs, for example; at some time in the future, it may be that these could be added automatically.

We're all used to the idea that images can be digitally manipulated using programs like Adobe's Photoshop — hence the use of the term “Photoshopped” to refer to an image that has been altered or enhanced.


A more nefarious application for this auditory generation system might be creating incriminating videos with associated audio-realistic sound effects.

One very interesting potential application involves helping robots navigate their way around the world. In the case of humans, our kids use a variety of techniques to create (train) their worldview, including sticking everything they can lay their hands on inside their mouths to see what things taste like and poking, prodding, banging, and stroking objects to discover what they feel like and how they respond both physically and audibly (I still use both of these methods myself).

Later in life, as we stroll around, our brains are subconsciously observing the surrounding terrain and predicting what it will feel like to our feet and sound like to our ears. We expect different sensations from different materials, like grass, gravel, and carpet. Consider what would happen if we were strolling along, thinking about what to have for supper, and we stepped from a grassy lawn onto a gravel path. Suppose that, instead of hearing the expected “crunch” of gravel, we were presented with a “splosh” sound. In this case, we would immediately come to a grinding halt and focus our attention on the “here and now,” trying to determine what was happening.

Similarly, a robot could look at the various materials in its path ahead and predict qualities such as their relative hardness and “give” and adjust its gait accordingly. When the robot did transition from walking on one material to another, various sensors would provide feedback as to the validity of its assumptions and allow it to adjust its motion to accommodate differences between the “expected” and the “actual.” Adding auditory prediction and feedback to the mix will provide an extra level of sophistication to this process.

What say you? Can you think of other applications for deep learning in general, and for this auditory deep learning in particular?
