Once confined to cloud servers with practically infinite resources, machine learning is moving into edge devices for several reasons, including lower latency, reduced cost, energy efficiency, and enhanced privacy. The time needed to send data to the cloud for interpretation can be prohibitive in applications such as pedestrian recognition in a self-driving car. The bandwidth needed to send data to the cloud can also be costly, not to mention the cost of the cloud service itself, as in speech recognition for voice commands.
Energy is a trade-off between sending data back and forth to a server and processing it locally. Machine learning computations are complex and could easily drain the battery of an edge device if not executed efficiently. Edge decisions also keep the data on-device, which is important for user privacy, for example when sensitive emails are dictated by voice on a smartphone. Audio AI is a rich example of inference at the edge, and a new type of digital signal processor (DSP) specialized for audio machine learning use cases can enable better performance and new features at the edge of the network.
Always-on voice wake is one of the earliest examples of machine learning on the edge: listening for a keyword such as “Hey Siri” or “OK Google” before waking the rest of the system to determine the next action. If this keyword detection were run on a generic application processor, it could take well over 100 mW. Over the course of a day, this would deplete the smartphone battery. Therefore, the first phones to implement this feature had the algorithms ported onto a small DSP that could run at less than 5 mW. Nowadays these same algorithms can run on a specialized audio and machine learning DSP in a smart microphone at less than 0.5 mW.
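To put these power figures in perspective, the short sketch below converts each one into energy used over 24 hours of continuous listening. The three power values are the bounds quoted above; the 10 Wh battery capacity is an assumption chosen only for illustration and does not come from any specific phone.

```python
# Power figures quoted in the article (in milliwatts). The article gives them
# as bounds ("well over" / "less than"), so treat the results as rough.
POWER_MW = {
    "application processor": 100.0,
    "small DSP": 5.0,
    "smart-microphone DSP": 0.5,
}

# ASSUMPTION: a hypothetical 10 Wh smartphone battery, for illustration only.
BATTERY_WH = 10.0
HOURS_PER_DAY = 24.0


def daily_energy_wh(power_mw: float) -> float:
    """Energy used by one day of always-on listening, in watt-hours."""
    return power_mw / 1000.0 * HOURS_PER_DAY


def battery_share(power_mw: float, battery_wh: float = BATTERY_WH) -> float:
    """Fraction of the assumed battery consumed per day by keyword detection."""
    return daily_energy_wh(power_mw) / battery_wh


if __name__ == "__main__":
    for name, mw in POWER_MW.items():
        print(f"{name}: {daily_energy_wh(mw):.3f} Wh/day, "
              f"{battery_share(mw):.2%} of the assumed battery")
```

Under that assumed capacity, keyword detection at 100 mW would use roughly a quarter of the battery per day on its own, while the 0.5 mW case would use a small fraction of one percent.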
Once an edge device is enabled for always-on audio machine learning, it can do more than speech recognition at low power: contextual awareness such as whether the device is in a crowded restaurant or on a busy street, ambient music recognition, ultrasonic room recognition, and even recognizing whether someone nearby is shouting or laughing. These types of features will enable sophisticated new use cases that could improve the edge device and benefit the user.
The best performance and energy efficiency for machine learning inference at the edge require extensive hardware customization. Some of the most impactful techniques are collected in Table 1. Implementing them improves the efficiency of edge machine learning inference.
The majority of arithmetic operations needed for neural network inference are matrix-vector multiplies. This is because machine learning models are typically represented as matrices, which get applied to new inputs represented as vectors. The most common technique to improve edge machine learning inference is to make matrix-vector multiplication very efficient. A multiply fused with an accumulate, known as a multiply-accumulate (MAC), is a common way to address this.
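As a minimal sketch of why the MAC is the central operation, the plain-Python function below computes a matrix-vector product using nothing but multiply-accumulate steps. The function name `matvec_mac` and the example values are made up for illustration; a DSP would issue several of these MACs per cycle in hardware rather than looping in software.

```python
def matvec_mac(weights, x):
    """Matrix-vector product built from explicit multiply-accumulate steps."""
    out = []
    for row in weights:
        if len(row) != len(x):
            raise ValueError("row length must match input vector length")
        acc = 0  # accumulator, reset once per output element
        for w, xi in zip(row, x):
            acc += w * xi  # one MAC: multiply, then add into the accumulator
        out.append(acc)
    return out


def mac_count(weights):
    """Number of MAC operations needed for one matrix-vector product."""
    return sum(len(row) for row in weights)


if __name__ == "__main__":
    W = [[1, 2, 3],
         [4, 5, 6]]
    x = [1, 0, -1]
    print(matvec_mac(W, x))  # [-2, -2]
    print(mac_count(W))      # 6
```

The MAC count grows with the product of the matrix dimensions, which is why hardware that retires several MACs per cycle has such a direct effect on inference time and energy.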
Although the training phase is sensitive to numerical precision, the inference phase can achieve nearly equivalent results with low precision (e.g., 8 bits). Limiting precision can greatly reduce the complexity of the edge computation. For this reason, processor companies such as Intel and Texas Instruments have added limited-precision MACs. Texas Instruments’ TMS320C6745 can execute 8 MACs of 8 bits each per cycle. Knowles’ audio DSP supports 16 MACs of 8 bits each per cycle.
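The effect of limited precision can be illustrated with a small sketch. The code below uses symmetric linear quantization to signed 8-bit integers, which is one common scheme and an assumption here; the article does not say which scheme any particular DSP uses. All function names and sample values are hypothetical.

```python
def scale_for(values):
    """Symmetric scale so that the largest magnitude maps to 127."""
    peak = max(abs(v) for v in values)
    return peak / 127.0 if peak > 0 else 1.0


def quantize(values, scale):
    """Map floats to signed 8-bit integers, clamped to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dot_int8(w_q, x_q, w_scale, x_scale):
    """Dot product of 8-bit operands, rescaled back to a float."""
    acc = 0  # integer accumulator, wider than 8 bits so the sum cannot wrap
    for w, x in zip(w_q, x_q):
        acc += w * x  # 8-bit by 8-bit multiply-accumulate
    return acc * w_scale * x_scale


if __name__ == "__main__":
    weights = [0.42, -0.17, 0.88, -0.56]  # made-up model weights
    inputs = [0.9, 0.3, -0.4, 0.75]       # made-up input features
    w_s, x_s = scale_for(weights), scale_for(inputs)
    approx = dot_int8(quantize(weights, w_s), quantize(inputs, x_s), w_s, x_s)
    exact = sum(w * x for w, x in zip(weights, inputs))
    print(f"float: {exact:.4f}  int8: {approx:.4f}")
```

For these sample values the 8-bit result lands within about one percent of the floating-point result, which is the kind of agreement that makes low-precision MACs attractive for inference.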
Both the training and inference phases put pressure on the memory subsystem. Processors often add support for wider data words to accommodate this. Intel’s more recent high-performance processors have AVX-512, which supports transferring 512 bits per cycle into an array of 64 multipliers. The Texas Instruments 6745 uses a 64-bit bus to increase memory bandwidth. Knowles’ advanced audio processors use a 128-bit bus, striking a good balance between chip area and bandwidth. Furthermore, audio machine learning architectures (such as RNNs or LSTMs) often require feedback. This puts additional requirements on chip architecture, since data dependence can stall heavily pipelined architectures.
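The relationship between bus width and MAC throughput can be checked with simple arithmetic. The sketch below assumes 8-bit operands and counts how many fit in one bus transfer; it ignores that each MAC consumes two operands (a weight and an input), so it is a rough consistency check rather than a model of any real memory system.

```python
def operands_per_transfer(bus_bits: int, operand_bits: int = 8) -> int:
    """How many operands of the given width fit in one bus transfer."""
    if bus_bits <= 0 or operand_bits <= 0:
        raise ValueError("widths must be positive")
    return bus_bits // operand_bits


if __name__ == "__main__":
    # Bus widths quoted in the article.
    for name, bits in [("64-bit bus", 64),
                       ("128-bit bus", 128),
                       ("512-bit AVX-512", 512)]:
        print(f"{name}: {operands_per_transfer(bits)} x 8-bit operands")
```

The results (8, 16, and 64 operands) line up with the 8 MACs, 16 MACs, and 64 multipliers mentioned for the respective parts, which suggests the bus widths are sized to keep the arithmetic units fed.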
Although traditional machine learning can work with raw data, audio machine learning algorithms typically perform spectral analysis and feature extraction to feed neural networks. Acceleration of traditional signal processing functions such as FFTs, audio filters, trigonometric functions, and logarithms is necessary for energy efficiency. Subsequent operations often use a variety of non-linear vector operations, such as a sigmoid (which can be implemented with a hyperbolic tangent) or a rectified linear unit (which passes positive values through unchanged and changes all negative numbers to zero). These non-linear operations take many cycles on traditional processors. Single-cycle instructions for these functions also improve the energy efficiency of machine learning audio DSPs.
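The non-linear functions mentioned here are easy to state in code. The sketch below shows the standard identity that lets a sigmoid be computed from a hyperbolic tangent, sigmoid(x) = 0.5 * (1 + tanh(x / 2)), along with the rectified linear unit. It is a software reference only and does not represent how any particular DSP implements these instructions.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid computed directly from its definition."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_via_tanh(x: float) -> float:
    """The same sigmoid, computed from a hyperbolic tangent."""
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def relu(x: float) -> float:
    """Rectified linear unit: positive values pass, negatives become zero."""
    return max(0.0, x)


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  sigmoid={sigmoid(x):.6f}  "
              f"via tanh={sigmoid_via_tanh(x):.6f}  relu={relu(x):.1f}")
```

Because the two sigmoid forms agree, hardware that accelerates one of the two functions can serve both with a scale and an offset.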
In summary, advanced processors specialized for both machine learning and audio processing enable real-time, always-on edge inference at low cost while maintaining privacy. Energy consumption is kept low through architectural decisions such as instruction set support for multiple operations per cycle and wider memory buses that maintain high performance at low power. As companies continue to innovate on specialized compute at the edge, the machine learning use cases that rely on it will only increase.
Jim Steele is vice president of technology strategy at Knowles Corp.
>> This article was originally published on our sister site, EE Times: “Machine Learning on DSPs: Enabling Audio AI at the Edge.”