As we move towards more ubiquitous, always-on sensing and computing, power becomes increasingly important. There’s perhaps no better example than the voice-activated devices on our desks, in our pockets, and distributed around our homes. As we saw last year, keyword spotting in particular is currently a target for all kinds of neuromorphic technologies.
The silicon cochlea
The 2020 winner of the Misha Mahowald Prize for Neuromorphic Engineering is Prof. Shih-Chii Liu and her team at the Institute of Neuroinformatics (INI), who have been working on low-latency, low-power sensors for detecting speech. The dynamic audio sensors they have been developing could eventually address this market. At their core is a silicon cochlea designed to mimic biology. First, the incoming sound is filtered into frequency channels using a set of analog bandpass filters, the output of which is half-wave rectified. Together, this emulates the function of the hair cells in the ear.
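The front end described above can be sketched in software. Here is a minimal digital toy model of the analog filter bank plus half-wave rectification, assuming second-order Butterworth bandpass filters and hypothetical band edges; the chip's actual analog filter design differs.

```python
import numpy as np
from scipy.signal import butter, lfilter

def cochlea_frontend(audio, fs, bands):
    """Toy model of the analog front end: a bank of bandpass filters
    followed by half-wave rectification. Filter order and band edges
    are illustrative assumptions, not the chip's actual design."""
    channels = []
    for low, high in bands:
        b, a = butter(2, [low / (fs / 2), high / (fs / 2)], btype="bandpass")
        filtered = lfilter(b, a, audio)
        channels.append(np.maximum(filtered, 0.0))  # half-wave rectify
    return np.stack(channels)

# Example: a 440 Hz tone mostly excites the 300-600 Hz channel
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)
out = cochlea_frontend(tone, fs, [(100, 300), (300, 600), (600, 1200)])
```

Each row of `out` is one frequency channel's rectified output, analogous to the drive a hair cell would provide to the next stage.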
A. In a conventional audio system, the sound is first converted using an analog-to-digital converter, and then features are extracted using a digital fast Fourier transform (FFT) and bandpass filtering (BPF). These are processed by a digital signal processor (DSP) running voice activity detection (VAD) or automatic speech recognition algorithms. B. In the INI-Zurich dynamic audio sensor, the signal is split into analog audio bands, and features and changes are encoded, in parallel, into trains of asynchronous spikes (events), which are then processed.
As happens in biology, the different channels are then readied for processing in the brain. In the ear, ganglion cells encode the signals as a rush of chemical ions; in the silicon cochlea, they are turned into electrical spikes. This can be done using either a classic integrate-and-fire function or an asynchronous delta modulator (ADM), which compares the signal to two thresholds and sends events of the appropriate polarity as these are crossed, so acting as a feature extractor. Because unchanging signals are ignored, the amount of redundant information passed on to the next stage is reduced.
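The ADM's behavior can be illustrated with a short sketch: an ON event fires when the signal rises by one threshold step above the last reference level, an OFF event when it falls by one step below it. The function name, interface, and step size here are illustrative assumptions, not the chip's actual circuit.

```python
def adm_encode(signal, times, delta):
    """Sketch of an asynchronous delta modulator (ADM). Emits
    (time, polarity) events: +1 when the signal has risen by `delta`
    since the last event, -1 when it has fallen by `delta`."""
    events = []
    ref = signal[0]  # reference level at which the last event fired
    for t, x in zip(times, signal):
        while x - ref >= delta:
            ref += delta
            events.append((t, +1))   # ON event
        while ref - x >= delta:
            ref -= delta
            events.append((t, -1))   # OFF event
    return events

# A constant signal produces no events; only changes are encoded
flat = adm_encode([0.5] * 100, range(100), delta=0.1)
ramp = adm_encode([i * 0.01 for i in range(100)], range(100), delta=0.1)
```

The flat input yields an empty event stream, which is exactly why the sensor spends almost no power when nothing is happening, while the rising ramp generates a steady train of ON events.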
From a power point of view, if nothing is happening, the silicon cochlea barely expends any energy, but as activity ramps up, so does the number of spikes. Depending on the application, that can either be a huge advantage (if there’s lots of listening but very little action) or no advantage at all (when there’s relevant stuff to decode all the time).
However, as an audio sensor operating in the low-µW regime, the chip could offer system designers a valuable option to increase power efficiency. It also allows for a very high dynamic range, as there is almost infinite scope for spikes to be far apart or close together because they operate in continuous time.
A critical part of this work has been to demonstrate usefulness. Specifically, the event-streams produced by the silicon cochlea can be used in real applications like voice activity detection, the first stage of keyword recognition. Liu and her team have succeeded in doing this by using the event-output to create 2D frames of data: histograms of the arriving spikes, by frequency, arranged over the 5ms of the frame. Called cochleagrams, these can be read into a neural network and their meaning decoded from there.
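The frame construction described above can be sketched as follows. The 5 ms frame length comes from the article; the number of time bins per frame, the event format, and the function names are illustrative assumptions.

```python
import numpy as np

def cochleagram_frame(events, num_channels, frame_ms=5.0, num_bins=5):
    """Sketch of building one 2D cochleagram frame: a histogram of
    event counts by frequency channel (rows) and time bin (columns)
    over a 5 ms window. Bin count and interface are assumptions."""
    frame = np.zeros((num_channels, num_bins), dtype=int)
    bin_ms = frame_ms / num_bins
    for t_ms, channel in events:
        b = int(t_ms // bin_ms)
        if 0 <= b < num_bins:
            frame[channel, b] += 1  # one more spike in this cell
    return frame

# Events as (timestamp in ms, channel index) pairs within one frame
events = [(0.3, 0), (0.4, 0), (2.5, 2), (4.9, 3)]
frame = cochleagram_frame(events, num_channels=4)
```

Each such frame is a small image-like array that can be fed directly into a neural network classifier, just as a spectrogram slice would be in a conventional pipeline.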
According to Liu, “The use of deep networks on a sensor is of great interest to the IEEE ISSCC community and very timely given the current huge interest in audio edge computing.” There have been many papers on low-power ASICs for keyword spotting, she says, but these use conventional spectrogram-like features. One of her goals “is to show that hybrid solutions (mixed analog-signal designs) could lead to even lower-power solutions with lower-latency responses.”
Last year INI released a video showing the system recognizing digits (you can see Liu from about 2:06). It’s far from infallible, but it’s also still relatively early days in the system’s development. The team, which has included Minhao Yang, Chang Gao, Enea Ceolini, Adrian Huber, Jithendar Anumula, Ilya Kiselev, and Daniel Neil over the years, has also experimented with sensor fusion: Liu and her colleagues combined audio and visual information to make classification more reliable. They have also been publishing initial design rules for choosing when analog sensors are advantageous and when it’s better to stick to digital.
Misha Mahowald, one of the inventors of the address-event representation, and for whom the Neuromorphic Engineering Prize is named.
Another constant effort has involved improving the power efficiency and performance of the DAS. Part of this has involved looking at the implementation of the individual functions, from the source-follower-based bandpass filters to the design of the analog feature extractors.
Reducing the effect of variability in the analog electronics has been another important area of research. To help with this, they built a hardware emulator that they could use for testing these issues much more quickly, they say, than would be possible using commercial software such as Cadence Virtuoso. By training the binary neural net they use for classification using the software emulator rather than the hardware, they were able to accurately predict the classification performance of a range of real test chips. They are now looking at adding noise to the system as a proxy for variability to make the design process even more robust.
Liu was one of the early researchers in neuromorphic engineering; she not only worked in Carver Mead’s lab at Caltech (where Mahowald had worked) but was a founding member of the Institute of Neuroinformatics when many of the group left California for Zurich.
On winning the award, Liu said, “It is a great honor for us to be awarded this prize, especially with so many good researchers in neuromorphic engineering. The work built on decades of early silicon cochlea design extending from Dick Lyon, Carver Mead, Lloyd Watts, Rahul Sarpeshkar, Eric Vittoz, and Andre van Schaik.”
On the importance of neuromorphic engineering, she says, “Even at the end of Moore’s law, digital computation will lag behind biology’s energy efficiency by at least a factor of a thousand. Thus, the potential efficiency of hybrid analog electronic systems such as DAS is becoming more important than ever.”
 D. Neil and S. C. Liu, “Effective sensor fusion with event-based sensors and deep network architectures,” in Proceedings – IEEE International Symposium on Circuits and Systems, Jul. 2016, vol. 2016-July, pp. 2282–2285, doi: 10.1109/ISCAS.2016.7539039.
 S. C. Liu, B. Rueckauer, E. Ceolini, A. Huber, and T. Delbruck, “Event-Driven Sensing for Efficient Perception: Vision and audition algorithms,” IEEE Signal Process. Mag., vol. 36, no. 6, pp. 29–37, Nov. 2019, doi: 10.1109/MSP.2019.2928127.
 M. Yang, S. C. Liu, M. Seok, and C. Enz, “Ultra-Low-Power Intelligent Acoustic Sensing using Cochlea-Inspired Feature Extraction and DNN Classification.”
 M. Yang, C. H. Chien, T. Delbruck, and S. C. Liu, “A 0.5 V 55 μW 64 × 2 Channel Binaural Silicon Cochlea for Event-Driven Stereo-Audio Sensing,” IEEE J. Solid-State Circuits, vol. 51, no. 11, pp. 2554–2569, Nov. 2016, doi: 10.1109/JSSC.2016.2604285.
>> This article was originally published on our sister site, EE Times.
|Sunny Bains teaches at University College London, is author of Explaining the Future: How to Research, Analyze, and Report on Emerging Technologies, and is currently writing a book on neuromorphic engineering.|