Startup Plumerai has built an AI inference engine for Cortex-M microcontrollers which outperforms the standard combination of TensorFlow Lite for Microcontrollers and Arm’s CMSIS-NN kernels. In recent tests performed by the company, Plumerai’s inference engine resulted in 40% lower inference latency and 49% less RAM usage, without reducing prediction accuracy.
Plumerai’s tests also showed that its software beat other popular AI inference engines on the market when running the same 8-bit AI model on the same Cortex-M hardware.
“Last year alone there were around 18 billion Arm Cortex-M-based chips shipped,” Roeland Nusselder, CEO of Plumerai, told EE Times Europe. “Our inference engine allows customers who use those chips to run larger and more accurate AI models, process more frames per second or save more energy, and/or deploy cheaper hardware.”
An AI inference engine is part of the software stack used to convert the AI model to run efficiently on the target hardware. It directs resource management to optimize the way the model is executed.
Plumerai’s demo showed a person detection algorithm running on a Cortex-M7 microcontroller from STMicro. The demo runs at 2 frames per second, is highly accurate and requires less than 300 kB of RAM, the company said. This demo uses Plumerai’s inference engine, together with a binarized model trained with Plumerai’s own algorithms and data pipeline. (Source: Plumerai)
“There’s no downside to using our inference engine; it doesn’t change the model itself, and there’s no extra quantization and no pruning,” Nusselder said. “We just massively optimize the way the model is executed.”
Plumerai does not rely on TensorFlow or Arm’s kernels for the most performance-critical layers; instead, the company has developed custom code optimized for the lowest possible latency and memory usage. The company has optimized code for regular convolutions, depthwise convolutions, fully connected layers, various pooling layers and more.
“To really get the best performance, we do specific optimizations for each layer in a neural network,” Nusselder said. “So for instance, rather than only optimizing convolutions in general, our inference engine makes specific improvements based on all the actual values of layer parameters such as kernel size, strides and padding. As we do not know in advance which neural networks our inference engine will have to deal with, we made these optimizations together with the compiler.”
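The idea of specializing a kernel once a layer’s parameters are known can be illustrated with a small sketch. The code below is not Plumerai’s implementation, just a minimal hypothetical example: a generic 1-D convolution whose kernel size and stride are runtime values, next to a variant generated for the known case kernel=3, stride=1, where constant bounds let the compiler unroll the loop and keep the weights in registers.

```c
#include <stdint.h>

/* Generic depthwise 1-D convolution: kernel size and stride are runtime
   parameters, so the compiler cannot fully unroll or vectorize the loop. */
static int32_t conv1d_generic(const int8_t *x, const int8_t *w,
                              int kernel, int stride, int out_idx) {
    int32_t acc = 0;
    for (int k = 0; k < kernel; k++)
        acc += (int32_t)x[out_idx * stride + k] * w[k];
    return acc;
}

/* Hypothetical specialized variant emitted at build time once the layer's
   parameters (kernel=3, stride=1) are known: the loop disappears entirely. */
static int32_t conv1d_k3_s1(const int8_t *x, const int8_t *w, int out_idx) {
    return (int32_t)x[out_idx]     * w[0]
         + (int32_t)x[out_idx + 1] * w[1]
         + (int32_t)x[out_idx + 2] * w[2];
}
```

Both functions compute the same result; the specialized one simply gives the compiler far more to work with on a Cortex-M core.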
Memory usage is optimized via a smart offline memory planner that analyzes the memory access patterns of each layer of the network.
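An offline memory planner of this kind exploits the fact that activation tensors have known lifetimes: two tensors that are never live at the same time can share the same arena addresses. The sketch below is an assumption about how such a planner might work in its simplest greedy form, not Plumerai’s actual algorithm; the `Tensor` struct and `plan_arena` function are invented for illustration.

```c
#include <stddef.h>

/* A tensor's arena lifetime: the first and last layer indices at which it
   must stay live, its size in bytes, and the offset the planner assigns. */
typedef struct { int first_use, last_use; size_t size, offset; } Tensor;

/* Greedy offline planner (simplified sketch): place each tensor at the
   lowest arena offset that does not collide with any already-placed tensor
   whose lifetime overlaps. Real planners sort tensors and search smarter. */
static size_t plan_arena(Tensor *t, int n) {
    size_t arena = 0;
    for (int i = 0; i < n; i++) {
        size_t off = 0;
        for (int j = 0; j < i; j++) {
            int overlap_time = !(t[i].last_use < t[j].first_use ||
                                 t[j].last_use < t[i].first_use);
            int overlap_mem  = !(off + t[i].size <= t[j].offset ||
                                 t[j].offset + t[j].size <= off);
            if (overlap_time && overlap_mem) {
                off = t[j].offset + t[j].size;  /* move past j, rescan all */
                j = -1;
            }
        }
        t[i].offset = off;
        if (off + t[i].size > arena) arena = off + t[i].size;
    }
    return arena;  /* total activation RAM required */
}
```

Because the network graph is fixed, all of this analysis happens offline; at inference time the tensors are simply read and written at their precomputed offsets, with no dynamic allocation on the microcontroller.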
Full stack innovation
Plumerai’s inference engine was originally developed as part of its offering for binarized neural networks (BNNs) – where weights and activations are quantized to 1-bit to save power. Nusselder explained that the engine already had 8-bit capability because the first and last layers of BNNs are often retained in 8-bit. Seeing a gap in the market, the company expanded its inference engine to work with standard 8-bit models as well.
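The power savings of binarization come from the arithmetic itself: when weights and activations are constrained to +1/-1, a dot product collapses into bitwise operations. The function below is a standard textbook sketch of that trick (not Plumerai-specific code); it assumes a GCC/Clang toolchain for the `__builtin_popcount` intrinsic.

```c
#include <stdint.h>

/* In a binarized layer, 32 weight/activation pairs are packed into one
   word (bit 1 = +1, bit 0 = -1). Their dot product reduces to XNOR plus a
   population count: (number of matching signs) - (number of mismatches). */
static int32_t bnn_dot32(uint32_t a, uint32_t w) {
    uint32_t match = ~(a ^ w);              /* XNOR: 1 where signs agree */
    int32_t pop = __builtin_popcount(match);
    return 2 * pop - 32;                    /* matches minus mismatches */
}
```

One word-wide XNOR and popcount replace 32 multiply-accumulates, which is where the large energy and memory savings of 1-bit inference come from.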
Roeland Nusselder (Source: Plumerai)
“We discovered that to build the most advanced AI for embedded devices, we have to work on the full stack,” he said. Plumerai’s stack comprises training algorithms and model architectures, the inference engine, the data pipeline, and a hardware IP core optimized for 8-bit and 1-bit AI inference.
“Training data is important for any deep learning model, but it’s extra important for tiny deep learning models,” Nusselder said. “It’s not just about the quantity, but also the quality of the data. Tiny deep learning models have limited information capacity, so you have to be very strict in what you want to teach the model.”
Plumerai has therefore been working on tools to generate “perfect” training datasets, as well as tools to identify where models need more data, determine which data needs to be over-sampled, and analyze why models make certain mistakes.
The hardware IP is currently up and running on an FPGA prototype system (Plumerai is not seeking to become an ASIC company, Nusselder said, but they may license this IP in the future).
Plumerai’s inference engine software (and full software stack) is also available for Arm Cortex-A and RISC-V architectures.
>> This article was originally published on our sister site, EE Times Europe.