Much has been written about using AI for increasingly smart vehicles. But how do you take a neural network (NN) developed on a server farm and squeeze it into resource-constrained embedded hardware in production cars? This article argues that automotive production AI R&D engineers should be empowered to refine NNs throughout the journey from prototype to production, rather than handing an NN over to an embedded software team too early, as happens today.
“We need to enable production AI teams to leverage their knowledge of NNs during the software porting process if we are to make the best use of embedded hardware resources” (Source: Marton Feher, SVP Hardware Engineering, AImotive)
Embedded AI: Embedded software – but not as we know it
With any embedded software destined for deployment in volume production, an enormous amount of effort goes into the code once the implementation of its core functionality has been completed and verified. This optimization phase is all about minimizing memory, CPU and other resources needed so that as much as possible of the software functionality is preserved, while the resources needed to execute it are reduced to the absolute minimum possible.
This process of creating embedded software from lab-based algorithms enables production engineers to cost-engineer software functionality into a mass-production ready form, requiring far cheaper, less capable chips and hardware than the massive compute datacenter used to develop it. However, it usually requires the functionality to be frozen from the beginning, with code modifications only done to improve the way the algorithms themselves are executed. For most software, that is fine: indeed, it enables a rigorous verification methodology to be used to ensure the embedding process retains all the functionality needed.
However, when embedding NN-based AI algorithms, that can be a major problem. Why? Because by freezing the functionality from the beginning, you are removing one of the main ways in which the execution can be optimized.
What is the problem?
There are two fundamentally different ways we can tackle the task of porting a complex NN from an unconstrained, resource-rich NN training environment in the lab to a tightly-constrained embedded hardware platform:
- Optimize the code executing the NN
- Optimize the NN itself
When an embedded software engineer sees a performance issue, such as a memory bandwidth bottleneck or poor utilization of the underlying embedded hardware platform, conventional embedded software techniques encourage digging deep into the low-level code to find the problem.
That is reflected in the many advanced and sophisticated tools available today for embedded MCUs and DSPs. They enable you to get to the lowest level of what is happening in the software and identify and improve the execution of the software itself – hopefully without changing its functionality.
For NNs, optimization is simply different from conventional embedded software, at least if you want to achieve the best possible results with the available hardware resources. With NNs, improvements are achieved by some combination of changing the topology of the NN itself (how the various layers of the NN are connected, and what each layer does) and re-training it using updated constraints and inputs. That is because the functionality is not defined by the NN “software”, but by the targets and constraints applied during training to create the weights that define the NN’s final behaviour.
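To make the effect of a topology change concrete, here is a minimal sketch that counts multiply-accumulate operations (MACs) for one standard 3x3 convolution and for a depthwise-separable replacement. The layer sizes, class and function names are assumptions chosen for illustration, not taken from any AImotive tool, and whether the slimmer block still meets its targets can only be established by re-training and re-evaluating it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conv2d:
    """Shape of one convolution layer (stride 1 and 'same' padding assumed)."""
    in_ch: int
    out_ch: int
    kernel: int
    out_h: int
    out_w: int
    groups: int = 1

    def weights(self) -> int:
        # Each output channel sees in_ch / groups input channels.
        return self.kernel * self.kernel * (self.in_ch // self.groups) * self.out_ch

    def macs(self) -> int:
        # One MAC per weight per output pixel.
        return self.weights() * self.out_h * self.out_w


def standard_block(ch_in, ch_out, h, w):
    return [Conv2d(ch_in, ch_out, 3, h, w)]


def separable_block(ch_in, ch_out, h, w):
    return [
        Conv2d(ch_in, ch_in, 3, h, w, groups=ch_in),  # depthwise 3x3
        Conv2d(ch_in, ch_out, 1, h, w),               # pointwise 1x1
    ]


def total_macs(layers):
    return sum(layer.macs() for layer in layers)


# Hypothetical layer: 64 -> 128 channels on a 56x56 feature map.
baseline = standard_block(64, 128, 56, 56)
candidate = separable_block(64, 128, 56, 56)

print("baseline MACs :", total_macs(baseline))
print("candidate MACs:", total_macs(candidate))
print("reduction     : %.2fx" % (total_macs(baseline) / total_macs(candidate)))
```

For this one layer the arithmetic workload drops by a factor of about 8.4. No amount of tuning the code that executes the original layer can remove that work, which is why the choice of topology matters so much.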
So, when undertaking the embedding process for NNs, you need to freeze the target performance of the NN, not how it achieves it. If you then constrain the NN topology as well from the start of the embedding process, you are removing the very tools the production engineer needs to improve performance.
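One way to picture "freeze the target, not the topology" is as an acceptance test that every candidate NN must pass, whatever its structure. The sketch below is only an illustration: the candidate names, accuracy figures, latencies and thresholds are all made-up values, and `pick_candidate` is a hypothetical helper rather than part of any real tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float    # KPI measured after re-training (made-up values below)
    latency_ms: float  # estimated latency on the target hardware (made-up)


def pick_candidate(candidates: List[Candidate],
                   min_accuracy: float,
                   max_latency_ms: float) -> Optional[Candidate]:
    """Return the fastest candidate that meets the frozen targets, or None."""
    passing = [c for c in candidates
               if c.accuracy >= min_accuracy and c.latency_ms <= max_latency_ms]
    return min(passing, key=lambda c: c.latency_ms, default=None)


# Frozen targets: what the NN must achieve, not how it is built.
MIN_ACCURACY = 0.90
MAX_LATENCY_MS = 33.0

candidates = [
    Candidate("baseline", accuracy=0.912, latency_ms=41.0),  # too slow
    Candidate("slimmer", accuracy=0.905, latency_ms=19.5),   # meets both targets
    Candidate("smallest", accuracy=0.871, latency_ms=12.0),  # not accurate enough
]

chosen = pick_candidate(candidates, MIN_ACCURACY, MAX_LATENCY_MS)
print("selected:", chosen.name if chosen else "none")
```

The targets stay fixed for the whole embedding process, while the list of candidates is free to grow as the AI engineer tries new topologies and training runs.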
That means you need new and different tools to do the task of porting NNs from the lab to embedded platforms. And low-level software engineers cannot do the job – you need the AI engineers to adapt the NN and its training based on the performance information the tools are providing you with. That is new: no longer can the R&D engineer say “job done” when they hand over a trained NN to the production engineers!
A different approach
By adopting a development workflow that puts the AI R&D engineer at the centre of the embedded software porting task, far superior results can be achieved for any chip. Using layer-centric analysis, complemented by a turnaround of minutes between compiling a modified convolutional neural network (CNN) and seeing accurate performance results for the target neural processor unit (NPU), developers can realize gains of 100% or more using the same underlying hardware. That is because modifying the CNN itself, rather than modifying just the code used to execute the same CNN, gives the AI engineer far more flexibility for identifying and implementing performance improvements.
When developing our aiWare NPU, AImotive drew on our own AI engineers’ experience of porting NNs to multiple different chips with a wide range of NPU capabilities. We wanted to find better ways to help our own AI engineers do this task, so when developing our requirements for both the aiWare NPU itself and the aiWare Studio tools supporting it, we identified several desirable features not seen on the hardware platforms we had used in the past:
- Highly deterministic NPU architecture, so that timing is very predictable
- Accurate layer-based (not timing-based or low-level code-based) performance estimation, so that any AI R&D engineer can see the impact of changing their training criteria (such as adding or changing scenarios used, or modifying target KPIs) and/or NN topologies quickly
- Accurate offline performance estimation, so that all NN optimization can be performed before the first hardware is available (because first prototypes are always scarce!)
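The idea behind layer-based offline estimation can be sketched with a simple roofline-style model: each layer takes as long as the slower of its compute time and its memory transfer time. This is an illustrative assumption, not a description of how aiWare Studio actually works, and the NPU figures and layer workloads below are hypothetical. A model this simple is only credible when the hardware timing is deterministic, which is why the first feature in the list above matters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    macs: int         # multiply-accumulates for one inference
    bytes_moved: int  # weights + activations moved to/from external memory


@dataclass(frozen=True)
class Npu:
    macs_per_s: float   # hypothetical peak compute rate
    bytes_per_s: float  # hypothetical external memory bandwidth


def estimate(layers, npu):
    """Return (layer name, estimated time in ms, limiting resource) per layer."""
    rows = []
    for layer in layers:
        compute_s = layer.macs / npu.macs_per_s
        memory_s = layer.bytes_moved / npu.bytes_per_s
        bound = "compute" if compute_s >= memory_s else "memory"
        rows.append((layer.name, max(compute_s, memory_s) * 1e3, bound))
    return rows


# Hypothetical NPU and workloads, chosen only to show both kinds of bottleneck.
npu = Npu(macs_per_s=1e12, bytes_per_s=1e10)
layers = [
    Layer("conv_a", macs=2_000_000_000, bytes_moved=4_000_000),
    Layer("fc_b", macs=50_000_000, bytes_moved=100_000_000),
]

rows = estimate(layers, npu)
for name, ms, bound in rows:
    print(f"{name:8s} {ms:6.2f} ms  ({bound}-bound)")
total_ms = sum(ms for _, ms, _ in rows)
print(f"total    {total_ms:6.2f} ms")
```

A per-layer report like this tells the AI engineer which layer to redesign and why: here the small fully connected layer dominates because of memory traffic, not arithmetic.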
Figure 1: aiWare Studio enables users to optimize their NNs, rather than the code used to execute them. That gives AI designers far greater flexibility to achieve great results faster. (Source: AImotive)
The result is a set of tools that enable AI R&D engineers to do almost all optimization for target hardware within the lab environment and demonstrate performance within 5% of final target hardware – all before anyone has even seen the hardware.
Of course, it is vital to measure the final hardware when chips and hardware prototypes become available. The availability of real-time hardware profiling capabilities in this kind of development environment lets engineers access a series of deeply embedded hardware registers and counters within NPUs supported by such tools. While the silicon overhead is minimal (since many NPUs are dominated by memory, not logic), these capabilities can enable unprecedented, non-intrusive measurement of real-time performance during execution. This can then be used to compare directly against the offline performance estimator results, to confirm accuracy.
Figure 2: Using embedded registers and counters, aiWare Studio can accurately measure final chip performance compared to the offline estimated results, usually to within 1%-5%. (Source: AImotive and Nextchip Co. Ltd)
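Checking the estimator against silicon comes down to a per-layer comparison of estimated and measured times. The sketch below assumes the hardware counters have already been read out and converted to microseconds per layer. The layer names and every number are made up for illustration, and the 5% tolerance simply mirrors the range quoted in the caption above.

```python
def compare(estimated_us, measured_us, tolerance=0.05):
    """Return {layer: (relative error, within tolerance?)} for each estimated layer."""
    result = {}
    for name, est in estimated_us.items():
        meas = measured_us[name]
        err = abs(est - meas) / meas  # error relative to the measured value
        result[name] = (err, err <= tolerance)
    return result


# Made-up figures: offline estimates vs. times derived from hardware counters.
estimated_us = {"conv1": 980.0, "conv2": 2050.0, "head": 400.0}
measured_us = {"conv1": 1000.0, "conv2": 2000.0, "head": 500.0}

result = compare(estimated_us, measured_us)
for name, (err, ok) in result.items():
    status = "ok" if ok else "investigate"
    print(f"{name:6s} error {err * 100:5.1f}%  {status}")
```

Layers that fall outside the tolerance are the ones worth investigating, either in the estimator's model of the hardware or in how the layer was mapped onto it.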
This new approach offers the automotive industry a new and better way to develop, optimize and deploy AI in production vehicles. Using synergistic NPU hardware and tools, AI engineers can design, implement and optimize better CNNs for automotive applications.
Tony King-Smith is Executive Advisor at AImotive. He has more than 40 years’ experience in semiconductors and electronics, managing R&D strategy as well as hardware and software engineering teams for a number of multi-nationals including Panasonic, Renesas, British Aerospace and LSI Logic. He is also well-known globally as an inspirational technology marketer from his role as CMO for leading semiconductor IP vendor Imagination Technologies. Tony is based near London, UK.