Managing tradeoffs in ML model deployment

Where to place your machine learning code — in the cloud, on an edge device, or on-premise — always involves tradeoffs. Here are some tips.

Engineers frequently have to make important decisions about where to place their code: in the cloud, on an edge device, or on-premise. This decision always involves tradeoffs, because the right combination of software, firmware, development tools, and hardware differs for each set of circumstances. On Samsara’s machine learning and computer vision (ML/CV) team, we build models and develop algorithms that help our customers improve the safety, efficiency, and sustainability of their operations. For instance, we build applications that detect and alert on risky driving behavior in real time and, ultimately, reduce the frequency of road accidents.

The operational environments of industries such as transportation, warehousing, and manufacturing pose unique constraints on ML solutions. For example, remote locations might be bottlenecked by limited connectivity or rely on outdated technology systems that can’t run the latest-and-greatest models. These constraints, coupled with the safety-critical nature of the applications, demand low-latency, compute-efficient ML inference: round-trip network latency and spotty cellular coverage rule out implementing these features entirely in the cloud. Thus, in addition to meeting accuracy requirements, models must run within the stricter compute, memory, and latency bounds of edge hardware platforms.

As you can imagine, there are a number of tradeoffs to analyze and consider when choosing models for this type of edge deployment. Here are some common ones you’ll likely encounter, and how to approach them.

First off, you have to consider the tradeoff between compute throughput and accuracy for your ML engines. Again, with spotty cellular network coverage, you can’t implement everything in the cloud and trust that the data will be delivered reliably. In the case of in-vehicle advanced driver-assistance systems (ADAS), you also can’t have bulky cameras or processors obstructing a vehicle’s dashboard. Striking a balance means choosing a more compact platform (for example, a processor similar to those used in smartphones) with specialized system-on-chip hardware that can handle the image and signal processing while still leaving enough headroom for ML models to run effectively.
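
One way to make this tradeoff concrete is to treat latency as a hard budget and accuracy as the quantity to maximize within it. The sketch below is only an illustration: the `Candidate` class, the model names, the accuracy and latency figures, and the 30 frames-per-second budget are all made-up assumptions, not measurements from any real device.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str          # hypothetical model label
    accuracy: float    # validation accuracy in [0, 1] (made-up numbers below)
    latency_ms: float  # per-frame latency, assumed measured on the target device


def pick_model(candidates: Sequence[Candidate], budget_ms: float) -> Optional[Candidate]:
    """Return the most accurate candidate whose latency fits the budget."""
    feasible = [c for c in candidates if c.latency_ms <= budget_ms]
    if not feasible:
        return None
    # Prefer higher accuracy; break ties with lower latency.
    return max(feasible, key=lambda c: (c.accuracy, -c.latency_ms))


# Illustrative candidates only; none of these numbers come from the article.
candidates = [
    Candidate("small", 0.86, 12.0),
    Candidate("medium", 0.90, 28.0),
    Candidate("large", 0.93, 75.0),
]

# Assumption: a 30 fps camera stream leaves roughly 33 ms per frame.
choice = pick_model(candidates, budget_ms=1000 / 30)
print(choice.name if choice else "no model fits the budget")
```

In practice the budget also has to leave room for image and signal processing on the same chip, so the usable figure is usually smaller than the raw frame interval.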

With this more compact platform, you’ll have to consider your power budget, which matters especially for mobile applications. The more watts of power you consume to run your programs, the more heat you will have to dissipate and the more drain there will be on your batteries. Some hardware co-processors support only a limited instruction set but are very power efficient per unit of computation. However, not all mathematical operations can be accurately expressed in those instruction sets. In those cases, you have to fall back to more general-purpose compute platforms (like GPUs and CPUs), which support more math ops but are more power hungry.
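
The arithmetic behind a power budget is simple enough to sketch. Energy per inference is average power multiplied by latency, so a slower but more efficient co-processor can still come out ahead. The wattages, latencies, and inference rate below are hypothetical figures chosen for round numbers, and the model ignores idle power and data-transfer costs.

```python
def energy_per_inference_mj(avg_power_w: float, latency_ms: float) -> float:
    """Energy in millijoules: watts * milliseconds = millijoules."""
    return avg_power_w * latency_ms


def sustained_power_w(energy_mj: float, inferences_per_s: float) -> float:
    """Average draw in watts when running at a fixed inference rate."""
    return energy_mj * inferences_per_s / 1000.0


# Hypothetical figures: a power-efficient co-processor vs. a general-purpose CPU.
dsp_mj = energy_per_inference_mj(avg_power_w=0.5, latency_ms=20.0)
cpu_mj = energy_per_inference_mj(avg_power_w=2.0, latency_ms=15.0)

rate = 10.0  # assumed inferences per second
print(f"co-processor: {dsp_mj:.1f} mJ/inference, {sustained_power_w(dsp_mj, rate):.2f} W sustained")
print(f"CPU:          {cpu_mj:.1f} mJ/inference, {sustained_power_w(cpu_mj, rate):.2f} W sustained")
```

With these assumed numbers the CPU finishes each inference sooner yet spends three times the energy doing so, which is the kind of result that only shows up once latency and power are considered together.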

Mobile-friendly architectures are designed to take advantage of hardware acceleration (e.g., DSPs) and can reduce overall model size and memory consumption while still providing good enough accuracy for your product application. Among these architectures, you’re again faced with a series of decisions, including model accuracy/latency tradeoffs and whether to build your own AI solutions or leverage external AI service providers to train and test your ML models.

Next, it’s important to think about how your model gets integrated into the hardware of choice. Because processors have different instruction sets that favor specific operations, it helps to review the documentation for each hardware platform to see how those benefits apply to your particular code. Each deployment environment comes with its own built-in idiosyncrasies. For instance, TensorFlow Lite (TFLite), TensorRT, and SNPE each support a slightly different set of ops. Whatever chipset you end up with, you still have to shoehorn all your math computations into the hardware that will eventually execute them.
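
A quick way to see how these differences affect a particular model is to compare its op list against each runtime's supported set before committing to a platform. The sketch below uses placeholder runtime names and invented op lists purely for illustration; the real lists come from each runtime's documentation and change between releases.

```python
from typing import Dict, FrozenSet, List

# Hypothetical op lists for illustration only. They do not describe
# TFLite, TensorRT, SNPE, or any other real runtime.
SUPPORTED_OPS: Dict[str, FrozenSet[str]] = {
    "runtime_a": frozenset({"conv2d", "depthwise_conv2d", "relu", "add", "softmax"}),
    "runtime_b": frozenset({"conv2d", "relu", "add", "resize_bilinear", "softmax"}),
}


def unsupported_ops(model_ops: List[str], runtime: str) -> List[str]:
    """Return the model's ops that the runtime lacks, deduplicated, in first-seen order."""
    supported = SUPPORTED_OPS[runtime]
    missing: List[str] = []
    for op in model_ops:
        if op not in supported and op not in missing:
            missing.append(op)
    return missing


# Hypothetical model, flattened to a list of op names.
model_ops = ["conv2d", "relu", "depthwise_conv2d", "relu", "resize_bilinear", "softmax"]

for runtime in SUPPORTED_OPS:
    print(runtime, "is missing:", unsupported_ops(model_ops, runtime))
```

Running a check like this early, ideally before training, makes it easier to steer the architecture toward ops that the target platform handles well.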

One issue you might run into is that the deployment environment may not support all of the ops and layers the network was trained with. Additionally, some operations don’t have hardware-accelerated implementations, forcing you to run them on the CPU, which can create memory and performance bottlenecks. Some of these incompatibilities will need to be addressed during the training process by modifying the model architecture itself, while others will need to be addressed when translating the model into the hardware-compatible format.
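
To illustrate why a single unaccelerated op can hurt, the sketch below splits a model into consecutive runs that execute on the accelerator or fall back to the CPU, then counts the hand-offs between them. It assumes a simple linear chain of ops (real graphs are usually DAGs), and the op names, including `custom_norm` and its `mul`/`add` replacement, are hypothetical.

```python
from itertools import groupby
from typing import List, Set, Tuple

Segment = Tuple[str, List[str]]


def partition(ops: List[str], accelerated: Set[str]) -> List[Segment]:
    """Group a linear op sequence into consecutive runs per device."""
    def device(op: str) -> str:
        return "accel" if op in accelerated else "cpu"

    return [(dev, list(run)) for dev, run in groupby(ops, key=device)]


def handoffs(segments: List[Segment]) -> int:
    """Each boundary between segments implies a transfer between devices."""
    return max(len(segments) - 1, 0)


# Hypothetical set of ops with accelerated implementations.
accelerated = {"conv2d", "relu", "softmax", "mul", "add"}

original = ["conv2d", "relu", "custom_norm", "conv2d", "relu", "softmax"]
# Assumed fix: express the unsupported op with supported primitives.
reworked = ["conv2d", "relu", "mul", "add", "conv2d", "relu", "softmax"]

for label, ops in (("original", original), ("reworked", reworked)):
    segments = partition(ops, accelerated)
    print(label, segments, "hand-offs:", handoffs(segments))
```

In this toy case the one unsupported op splits the model into three segments with two device hand-offs, while the reworked version stays on the accelerator end to end.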

One final step is to benchmark the final model version and compare its performance characteristics with the original specifications. If it falls short, you will have to get creative and slim down your model so that it can run at low latencies. This includes removing model ops and replacing sub-graphs of incompatible operations with ones that the hardware supports and can run faster. Other strategies include channel pruning, layer folding, and weight quantization.
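
As one example of these strategies, weight quantization stores each weight as a small integer plus a shared scale factor. The sketch below shows symmetric, per-tensor int8 quantization in plain Python, which is one simple variant chosen here for illustration; production toolchains typically add per-channel scales and calibration data, and the weights shown are made up.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the int8 range [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        # All-zero (or empty) tensor: any scale works, so use 1.0.
        return [0] * len(weights), 1.0
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map int8 values back to approximate float weights."""
    return [v * scale for v in quantized]


# Made-up weights for illustration.
weights = [0.50, -1.27, 0.003, 1.00]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("max round-trip error:", max_err)
```

The round-trip error per weight is bounded by half the scale, so tensors with a few large outliers lose the most precision. That is one reason to re-check accuracy after quantizing rather than assume it carries over.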

At the end of the day, sometimes you’ll be able to get your models to run both on a device and in the cloud. But when you are limited by the underlying hardware performance characteristics, network latencies, and accuracy requirements, it behooves you to think about where and how to run the model. Segmenting model execution to run on edge devices or in backend services in the cloud is still more art than science. A good product will integrate a deep understanding of the solution capabilities and customer needs, the limitations of the hardware, and the balancing act of crafting a model that meets those needs while respecting the physical constraints.

Sharan Srinivasan and Brian Tuan are software engineers on the machine learning and computer vision engineering team at Samsara, a global connected operations cloud company based in San Francisco. At Samsara, Srinivasan and Tuan tackle machine learning and computer vision challenges with TensorFlow, OpenCV, PySpark, and Go in order to build models that run on the edge.

>> This article was originally published on our sister site, EE Times.
