The potential applications of artificial intelligence (AI) continue to grow daily. As different neural network (NN) architectures are tested, tuned and refined to tackle different problems, new methods of analyzing data with AI are found. Many of today’s AI applications, such as Google Translate and Amazon Alexa’s speech recognition and vision recognition systems, leverage the power of the cloud. By relying upon always-on Internet connections, high-bandwidth links and web services, the power of AI can be integrated into Internet of Things (IoT) products and smartphone apps. To date, most attention has focused on vision-based AI, partly because it is easy to visualize in news reports and videos, and partly because it is such a human-like activity.
Sound and Vision Neural Network (Image: CEVA)
For image recognition, a 2D image is analyzed a square group of pixels at a time, with successive layers of the NN recognizing ever larger features. At the beginning, edges are detected where the contrast changes sharply. In a face, this occurs around features such as the eyes, nose, and mouth. As the detection process progresses deeper into the network, whole facial features are detected. In the final phase, the combination of features and their positions points to a specific face in the available dataset as the likely match.
Neural Network feature extraction (Image: CEVA)
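The edge-detection step described above can be sketched in a few lines of Python. This is only an illustration: the hand-written 3x3 kernel stands in for a filter that a real network would learn during training, and the toy image and the helper name `convolve2d` are assumptions made for the example.

```python
# Illustrative only: a hand-written 3x3 filter stands in for a learned
# first-layer kernel. Like most NN frameworks, this slides the kernel
# without flipping it (strictly speaking, a cross-correlation).

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (valid positions only); return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Sobel-style kernel that responds to vertical edges (dark on the left, bright on the right).
VERTICAL_EDGE = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

# Toy 5x6 grayscale image: dark left half (0), bright right half (9).
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]

response = convolve2d(image, VERTICAL_EDGE)
for row in response:
    print(row)  # each row prints [0, 36, 36, 0]
```

The response is zero over the flat regions and large only where the window straddles the dark-to-bright boundary, which is the sense in which the first layer "detects" an edge.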
The hope is that the neural network will assign its highest match probability to the face in its database that belongs to the subject photographed or captured by a camera. The clever element here is that the subject need not have been captured at exactly the same angle or pose as the photograph in the database, nor under the same lighting conditions.
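One common way to turn a network's raw output scores into match probabilities is a softmax over the candidate identities. The article does not say which method any particular system uses, so the sketch below is an assumption, and the identities and scores are made up for illustration.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw output scores, one per identity in the database.
identities = ["alice", "bob", "carol"]
scores = [2.1, 0.3, -1.0]

probabilities = softmax(scores)
best = max(range(len(identities)), key=lambda i: probabilities[i])
print(identities[best], round(probabilities[best], 3))
```

The identity with the highest probability is reported as the likely match; an application would typically also apply a threshold so that an unknown face is not forced onto the nearest database entry.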
AI has become so prevalent so quickly in large part due to open software tools, known as frameworks, that make it easy to build and train an NN for a target application in a variety of programming languages. Two such common frameworks are TensorFlow and Caffe. Where the item to be recognized is already known, an NN can be defined and trained offline. Once trained, the NN can then be easily deployed to an embedded platform. This is a clever partitioning that allows the power of a development PC or the cloud to train the NN, while the power-sensitive embedded processor simply uses the trained network for the purposes of recognition.
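The train-offline, deploy-to-device split can be illustrated with a deliberately tiny example. Here a single perceptron stands in for the NN, and JSON stands in for whatever model format a real framework would export; both are assumptions made to keep the sketch self-contained, not a description of how TensorFlow or Caffe deploy models.

```python
import json

def predict(weights, bias, inputs):
    """Inference only: a weighted sum followed by a threshold."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

def train(samples, epochs=20, rate=1):
    """Perceptron training loop; this part would run on the PC or in the cloud."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

# Offline: train on labelled data (a logical AND), then serialize the parameters.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(samples)
exported = json.dumps({"weights": weights, "bias": bias})

# On the device: load the parameters and run inference only; no training code is needed.
model = json.loads(exported)
results = [predict(model["weights"], model["bias"], x) for x, _ in samples]
print(results)  # [0, 0, 0, 1]
```

Only `predict` and the exported parameters need to live on the embedded target, which is why the power-hungry part of the work can stay on the development machine.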
The human-like ability to recognize people and objects is closely linked with trendy applications, such as industrial robots and autonomous cars. However, AI is of equal interest and capability in the field of audio. In the same way that features can be analyzed in an image, audio can be broken down into features that can be fed into an NN. One method uses Mel-Frequency Cepstral Coefficients (MFCCs) to break audio down into usable features. Initially, the audio sample is split into short time frames, e.g. 20 ms. A Fourier transform is then applied to each frame, and the powers of the resulting spectrum are mapped onto the nonlinear mel scale using triangular overlapping windows.
Sound Neural Network Breakdown (Image: CEVA)
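The MFCC steps described above can be sketched in plain Python. This is a minimal, unoptimized illustration rather than a production front end: it uses a naive DFT instead of an FFT, and the 10 ms hop, Hamming window, filter count, coefficient count and the common 2595/700 mel formula are all assumptions chosen for the example. The final logarithm and discrete cosine transform are the standard steps that turn the filterbank energies into cepstral coefficients.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum (bins 0..N/2) of a Hamming-windowed frame."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular, overlapping filters spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mel_points = [top * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    bank = []
    for f in range(1, num_filters + 1):
        left, centre, right = bins[f - 1], bins[f], bins[f + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):      # rising edge of the triangle
            row[k] = (k - left) / (centre - left)
        for k in range(centre, right):     # falling edge of the triangle
            row[k] = (right - k) / (right - centre)
        bank.append(row)
    return bank

def dct2(values, num_coeffs):
    """Unnormalized DCT-II, keeping the first `num_coeffs` coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(num_coeffs)]

def mfcc(signal, sample_rate, frame_ms=20, hop_ms=10, num_filters=20, num_coeffs=12):
    frame_len = sample_rate * frame_ms // 1000
    hop = sample_rate * hop_ms // 1000
    bank = mel_filterbank(num_filters, frame_len, sample_rate)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spectrum = power_spectrum(signal[start:start + frame_len])
        energies = [sum(w * p for w, p in zip(row, spectrum)) for row in bank]
        log_energies = [math.log(max(e, 1e-10)) for e in energies]  # floor avoids log(0)
        features.append(dct2(log_energies, num_coeffs))
    return features

# 0.1 s of a 1 kHz test tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(800)]
features = mfcc(tone, sample_rate)
print(len(features), "frames x", len(features[0]), "coefficients")  # 9 frames x 12 coefficients
```

Each row of `features` is the feature vector for one 20 ms frame, and it is these vectors, rather than the raw waveform, that would be fed into the NN.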
With these features extracted, the NN is used to determine the audio sample’s similarity to a database of audio samples of words or sounds. As with image recognition, the NN delivers a likelihood of a match to a specific word in its database. For those wanting to replicate the ‘OK Google’ or ‘Alexa’ Voice Trigger (VT) functionality of Google and Amazon, KITT.AI provides one solution with Snowboy. A trigger word can be uploaded to their platform for analysis, resulting in a file which can be integrated into a Snowboy application on embedded platforms. The VT word can then be detected without an Internet connection. Audio recognition is not limited to the spoken word. TensorFlow provides an example project for iOS that can distinguish between female and male voices.
Another application is the detection of animals and other sounds in and around our cities and homes. This has been demonstrated with a deep-learning bat monitoring system installed in the Queen Elizabeth Olympic Park in the United Kingdom. It also opens up the possibility of combining visual and audio recognition NNs in a single platform. For example, audio recognition of specific sounds could be used to trigger the video recording of a security system.
There are many applications where cloud-based AI support is undesirable, due to data privacy concerns, or untenable, due to poor data connectivity or bandwidth issues. In other cases, real-time performance is the concern. For example, industrial manufacturing systems demand a near-instantaneous response in order to undertake real-time actions on a production line, and the latency associated with a cloud-based service is simply too high.
As a result, there is increased interest in moving AI to the edge, that is, placing the power of AI at the point at which it is used. Various IP vendors provide solutions, such as CEVA's CEVA-X2 and NeuPro IP cores, that together with software integrate easily with the existing NN frameworks. This opens up the possibility of developing embedded systems that are AI capable, whilst benefiting from the flexibility of low-power processor functionality. A voice recognition system, as an example, could make use of silicon- and power-optimized integrated AI to recognize a VT and a minimal subset of voice commands (VC). More complex VCs and functionality could then be handled by a cloud-based AI once the application has woken up from its low-power, voice trigger state.
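The split between on-device and cloud handling described above can be sketched as a small routing class. Every name here is hypothetical: `TRIGGER_WORD`, `LOCAL_COMMANDS`, `send_to_cloud` and `VoiceFrontEnd` are placeholders for illustration and are not part of any CEVA, Google or Amazon API. The sketch also assumes the recognizer has already turned audio into text.

```python
# Hypothetical names throughout; nothing here belongs to a real vendor API.
TRIGGER_WORD = "hello device"
LOCAL_COMMANDS = {"lights on", "lights off", "stop"}

def send_to_cloud(utterance):
    """Stand-in for a network call to a cloud speech service."""
    return f"cloud:{utterance}"

class VoiceFrontEnd:
    def __init__(self):
        self.awake = False  # low-power state: only the trigger detector runs

    def handle(self, recognized_text):
        """Route one recognized utterance; return a string describing the action."""
        if not self.awake:
            if recognized_text == TRIGGER_WORD:
                self.awake = True
                return "woken"
            return "ignored"
        self.awake = False  # drop back to low power after one command
        if recognized_text in LOCAL_COMMANDS:
            return f"local:{recognized_text}"
        return send_to_cloud(recognized_text)

device = VoiceFrontEnd()
heard = ["lights on", "hello device", "lights on", "hello device", "what is the weather"]
actions = [device.handle(text) for text in heard]
for text, action in zip(heard, actions):
    print(text, "->", action)
```

The first "lights on" is ignored because the device is still asleep; after the trigger, the same phrase is handled locally, while the open-ended weather question is forwarded to the cloud.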
Finally, convolutional NNs (CNNs) are also being used to improve the quality of text-to-speech (TTS) systems. Historically, TTS has used a concatenative process to stitch together many tiny chunks of high-quality voice recordings from a single voice actor. The results are understandable but still sound robotic due to the strange intonation and inflection in the resulting output. Conveying different emotions requires a completely new set of voice recordings. Google’s WaveNet improves on this by using a CNN to generate the TTS waveforms from scratch at 16,000 samples per second. The resulting output is seamless, noticeably higher in quality and more natural sounding than that of earlier systems.
Youval Nachum serves as CEVA’s Senior Product Marketing Manager for the audio and voice product line. Youval brings over 20 years of multi-disciplinary experience, spanning marketing, system architecture, ASIC, and software domains at leading technology companies. He is passionate about anticipating long-term trends and leading technical programs to their successful completion. Highly proficient in combining market requirements, product definitions, industry standards and design innovations into breakthrough products, Youval holds a B.Sc. and M.Sc. in Electrical Engineering from the Technion – Israel Institute of Technology.