Machine learning has evolved rapidly from an interesting research topic to an effective solution for a wide range of applications. Its apparent effectiveness has rapidly accelerated interest from a growing developer base well outside the community of AI theoreticians. In some respects, machine learning development capabilities are evolving to a level of broad availability seen with other technologies that build on strong theoretical foundations. Developing a useful, high-accuracy machine-learning application is by no means simple. Still, a growing machine-learning ecosystem has dramatically reduced the need for a deep understanding of the underlying algorithms and made machine-learning development increasingly accessible to embedded systems developers more interested in solutions than theory. This article attempts to highlight just some of the key concepts and methods used in neural network model development – itself an incredibly diverse field and just one of the practical machine-learning methods becoming available to embedded developers.
As with machine learning, any method based on deep theory follows a familiar pattern of migration from research to engineering. Not too long ago, developers looking to achieve precise control of a three-phase AC induction motor needed to work through their own solutions to the associated series of differential equations. Today, developers can rapidly implement advanced motion-control systems using libraries that package complete motor-control solutions using very advanced techniques like field-oriented control, space vector modulation, trapezoidal control, and more. Unless they face special requirements, developers can deploy sophisticated motor-control solutions without a deep understanding of the underlying algorithms or their specific math methods. Motion-control researchers continue to evolve this discipline with new theoretical techniques, but developers can develop useful applications, relying on libraries to abstract the underlying methods.
In some respects, machine learning has reached a similar stage. While machine-learning algorithm research and machine-learning-specific hardware continue to advance dramatically, application of these algorithms has evolved to become a practical engineering method if approached with a suitable understanding of its associated requirements and current limitations. In this context, machine learning can deliver useful results, requiring less expertise in advanced linear algebra than a solid understanding of the target application data – and a willingness on the part of developers to accept a more experimental approach than they have experienced in conventional software development. Engineers interested in the foundations of machine learning will find their appetite for details fully satisfied. Yet, those with little time or interest in exploring theory will find a growing machine-learning ecosystem that promises to simplify development of useful machine-learning applications.
Engineers can find optimized libraries able to support broad classes of machine learning including unsupervised learning, reinforcement learning, and supervised learning. Unsupervised learning can reveal patterns in large amounts of data, but this method cannot specifically label those patterns as belonging to a particular class of data. Although this article does not address unsupervised learning, these techniques will likely prove important in applications such as the IoT to reveal outliers in data sets or indicate the existence of departures from data trends. In an industrial application, for example, a statistically significant departure from the norm in sensor readings from a group of machines might serve as an indicator of potential failure of machines in that group. Similarly, a significant departure from a number of measured performance parameters in a large-scale distributed application such as an IoT application might reveal hacked devices that seem to be otherwise operating satisfactorily in a network of hundreds or thousands of devices.
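As a rough illustration of the "departure from the norm" idea, the following minimal sketch flags sensor readings that lie far from the group mean. The function name find_outliers, the threshold value, and the temperature readings are all hypothetical and chosen only for illustration; practical anomaly detection typically relies on more robust statistical or unsupervised learning methods.

```python
from statistics import mean, stdev


def find_outliers(readings, threshold=3.0):
    """Return indices of readings more than `threshold` sample standard
    deviations away from the mean of the group."""
    mu = mean(readings)
    sigma = stdev(readings)
    if sigma == 0:
        return []  # all readings identical, so nothing stands out
    return [i for i, r in enumerate(readings)
            if abs(r - mu) / sigma > threshold]


# Hypothetical temperature readings (deg C) from a group of twelve machines.
temps = [70.1, 69.8, 70.3, 70.0, 69.9, 70.2,
         70.1, 69.7, 70.0, 95.0, 70.2, 69.9]

# A threshold of 2.5 is an assumption; with small groups, a single extreme
# value inflates the standard deviation and limits how large a score can get.
outliers = find_outliers(temps, threshold=2.5)
print(outliers)
```

With this data, only the machine reporting 95.0 is flagged.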
Reinforcement learning provides a method for an application to effectively learn by experiment, using positive feedback (reward) to learn successful responses to events. For example, a reinforcement learning system that detects anomalous sensor readings from a group of machines might try to return those readings to normal by taking different actions such as increasing coolant flow, reducing room temperature, reducing machine load, and the like. Having learned which action resulted in success, the system could more quickly perform that same action the next time the system sees those same anomalous readings. Although this article does not address this method, reinforcement learning will likely find growing use in large-scale complex applications (such as the IoT) where all realized operating states cannot be cost-effectively anticipated.
Supervised learning methods eliminate the guesswork associated with identifying which set of inputs corresponds to which specific state (or object). In this approach, developers explicitly identify combinations of input values, or features, that correspond to a particular object, state, or condition. In the hypothetical machine example, engineers would represent the problem of interest through a set of n features, x – for example, different sensor inputs, machine running time, last service date, machine age, and other measurable values. Based on their expertise, the engineers then create a training data set – multiple instances of these feature vectors (x1, x2, …, xn), each with n observations associated with the known output state, or label y:
(x11, x12, …, x1n) ⇒ y1
(x21, x22, …, x2n) ⇒ y2
(x31, x32, …, x3n) ⇒ y3
Given this training set with a known relationship between measured feature values and corresponding labels, developers train a model (a system of equations) able to produce the expected label yk for each feature vector (xk1, xk2, …, xkn) in the training set. During this training process, the training algorithm uses an iterative approach to minimize the difference between predicted labels and their actual labels by adjusting the parameters of the system of equations that make up the model. Each pass through the training set, called an epoch, produces a new set of parameters, a new set of predicted labels associated with those parameters, and the associated difference, or loss.
Plotted against the model parameters, the loss forms a multidimensional surface with some minimum. This minimum corresponds to the closest agreement between the actual labels provided in the training set and the predicted labels inferred by the model. Thus, the objective in training is to adjust a model's internal parameters to reach that minimum loss value, using methods designed basically to seek the fastest “downhill” path toward this minimum. On a multidimensional surface, the direction that leads to that best downhill path can be determined by calculating the slope of the loss with respect to each parameter while holding the other parameters fixed – that is, the partial derivative of the loss with respect to each parameter. Using matrix methods, training algorithms typically use this approach, called gradient descent, to adjust model parameter values after running all the training data or subsets of training data through the model at each epoch. To control the magnitude of this adjustment, training algorithms scale each step by some value, called the learning rate, which helps the training process converge. Without a controlled learning rate, gradient descent could overshoot the minimum due to an excessively large adjustment of the model parameters. After the model achieves (or acceptably converges toward) the minimum loss, engineers then test the model's ability to predict labels associated with data sets specifically held out of the training set for testing purposes.
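The gradient descent procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the method any particular framework uses: the model is a single straight line (y = w·x + b), the loss is mean squared error, and the data, learning rate, and epoch count are hypothetical values chosen so the example converges.

```python
def gradient_descent(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by full-batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):  # one pass over the training set per epoch
        # Partial derivatives of the loss with respect to each parameter.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step "downhill"; the learning rate scales the size of each step.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical training set generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

Raising learning_rate substantially in this sketch is an easy way to observe the overshoot behavior mentioned above, because each update then jumps past the minimum instead of approaching it.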
Once trained and evaluated, a suitable model can be deployed in production as an inference model to predict labels for actual application data. Note that inference generates a probability for each label used in training. Thus, a model trained with feature vectors labeled as “y1,” “y2,” or “y3” might generate inference results such as “y1: 0.80; y2: 0.19; y3: 0.01” when presented with a feature vector associated with y1. Additional software logic would monitor the output layer to select the label with the highest likelihood value and pass that selected label to the application. In this way, an application can use a machine-learning model to recognize an individual or other data pattern and take appropriate action.
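The label-selection logic mentioned above can be as simple as the following sketch. The function name select_label and the min_confidence cutoff are hypothetical additions for illustration; the source describes only picking the label with the best likelihood.

```python
def select_label(probabilities, min_confidence=0.5):
    """Return the most likely label, or None if the model is not confident.

    `probabilities` maps each label to the probability produced by inference.
    `min_confidence` is an assumed application-level cutoff.
    """
    label, p = max(probabilities.items(), key=lambda item: item[1])
    return label if p >= min_confidence else None


# Inference result from the example in the text.
result = {"y1": 0.80, "y2": 0.19, "y3": 0.01}
print(select_label(result))
```

An application might treat a None result as "unknown" and fall back to a default action rather than act on a weak prediction.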
Neural network development
Creating accurate inference models is of course the payoff to a supervised learning process able to draw on a wide range of underlying model types and architectures. Among these model types, neural networks have rapidly gained popularity for their success in image recognition, natural language processing, and other application areas. In fact, after advanced neural networks dramatically outperformed earlier algorithms in image recognition, neural network architectures have become the de facto solution for these classes of problems. With the availability of hardware including GPUs able to perform the underlying calculations quickly, these techniques rapidly became broadly accessible to algorithm developers and users. In turn, the availability of effective hardware platforms and widespread acceptance of neural networks have motivated development of a wide range of developer-friendly frameworks including Facebook's Caffe2, H2O, Intel's neon, MATLAB, Microsoft Cognitive Toolkit, Apache MXNet, Samsung Veles, TensorFlow, Theano, and PyTorch. As a result, developers can easily find a suitable environment for evaluating machine learning in general and neural networks in particular.
Development of neural networks starts with deployment of a framework using any number of available installation options. Although dependencies are typically minimal, all of the popular frameworks are able to take advantage of GPU-accelerated libraries. Consequently, developers can dramatically speed calculations by installing the NVIDIA CUDA toolkit and one or more libraries from the NVIDIA Deep Learning SDK such as NCCL (NVIDIA Collective Communications Library) for multi-node/multi-GPU platforms or the NVIDIA cuDNN (CUDA Deep Neural Network) library. Operating in their GPU-accelerated mode, machine-learning frameworks take advantage of cuDNN's optimized implementations for standard neural-network routines including convolutions, pooling, normalization, and activation layers.
Whether using GPUs or not, the installation of a framework is simple enough, typically requiring a pip install for these Python-based packages. Installing TensorFlow, for example, uses the same Python install method as with any Python module:
pip3 install --upgrade tensorflow
(or just pip for Python 2.7 environments)
In addition, developers may want to add other Python modules to speed different aspects of development. For example, the Python pandas module provides a powerful tool for creating needed data formats, performing different data transformations, or just handling the various data wrangling operations often required in machine-learning model development.
Experienced Python developers will typically create a virtual environment for Python development, and the popular frameworks are each available through Anaconda, for example. Developers using container technology to simplify devops can also find suitable containers built with their framework of choice. For example, TensorFlow is available in Docker containers on Docker Hub in CPU-only and GPU-supported versions. Some frameworks are also available in Python wheel archives. For example, Microsoft provides Linux CNTK wheel files in both CPU and GPU versions, and developers can find wheel files for installing TensorFlow on a Raspberry Pi 3.
While setting up a machine-learning framework has become simple, the real work begins with selection and preparation of the data. As described earlier, data plays a central role in model training – and thus in the effectiveness of an inference model. Not mentioned earlier is the fact that training sets have typically comprised hundreds of thousands if not millions of feature vectors and labels to achieve sufficient accuracy levels. The massive size of these data sets makes casual inspection of input data either impossible or largely ineffective. Yet, poor training data translates directly to reduced model quality. Incorrectly labeled feature vectors, missing data, and, paradoxically, data sets that are “too” clean can result in inference models unable to deliver accurate predictions or generalize well. Perhaps worse for the overall application, selection of a statistically non-representative training set implicitly biases the model away from those missing feature vectors and the entities they represent. Because of the critical role of training data and the difficulty in creating it, the industry has evolved large numbers of labeled data sets available from sources such as the UCI Machine Learning Repository, among others. For developers simply exploring different machine-learning algorithms, Kaggle datasets often provide a useful starting point.
For a development organization working on its unique machine-learning application, of course, model development requires its own unique data set. Even with a sufficiently large pool of available data, the need to label the data can introduce difficulties. In practice, the process of labeling data is by definition a human-centric activity. As a result, creating a system for accurately labeling data is a process in itself, requiring a combination of psychological understanding of how humans interpret instructions (such as how and what to label) and technological support to speed data presentation, labeling, and validation. Companies such as Edgecase, Figure Eight, and Gengo combine expertise in the broad requirements of data labeling, providing services designed to turn data into useful training sets for supervised learning. With a qualified set of labeled data in hand, developers then need to split the data into a training set and a test set – typically using a 90:10 split or so – taking care that the test set is a representative but distinct set of data from that in the training set.
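A 90:10 split of labeled data can be sketched as follows. This is a minimal illustration, not a complete data pipeline: the function name split_dataset, the seed, and the toy data are hypothetical, and a plain random shuffle does not by itself guarantee that the test set is representative (stratified sampling by label is one common refinement).

```python
import random


def split_dataset(samples, test_fraction=0.1, seed=42):
    """Shuffle labeled samples and split them into distinct train/test sets."""
    shuffled = list(samples)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled data: (feature vector, label) pairs.
data = [((i, i * 2), i % 2) for i in range(100)]

train_set, test_set = split_dataset(data)
print(len(train_set), len(test_set))
```

Because each sample lands in exactly one of the two lists, the test set stays distinct from the data used for training.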
In many ways, creating suitable training and test data can be more difficult than creating the actual model itself. With TensorFlow, for example, developers can build a model using built-in model types in TensorFlow's Estimator class. For example, a single call such as:
classifier = tf.estimator.DNNClassifier( feature_columns=this_feature_column, hidden_units=[4,], n_classes=2)
uses the built-in DNNClassifier class to automatically create a basic fully connected neural network model (Figure 1) comprising an input layer with three neurons (the number of supported features), one hidden layer with four neurons, and an output layer with two neurons (the number of supported labels). Within each neuron, a relatively simple activation function performs some transformation on its combination of inputs to generate its output.
Figure 1. Although the simplest neural network comprises an input layer, hidden layer, and output layer, useful inference relies on deep neural network models comprising large numbers of hidden layers each comprising large numbers of neurons. (Source: Wikipedia)
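To make the structure of this small network concrete, the following plain-Python sketch runs one feature vector through a fully connected network with three inputs, four hidden neurons, and two outputs. It is an illustration only, not the code TensorFlow generates: the weights and biases are hypothetical fixed values (training would normally set them), and the choice of ReLU for the hidden layer and softmax for the output layer is an assumption made for this example.

```python
import math


def relu(values):
    """Activation function: pass positive values, clamp negatives to zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Convert raw output scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully connected layer: each neuron computes a weighted sum plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


# Hypothetical parameters: one row of weights per neuron.
W_HIDDEN = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2], [-0.3, 0.8, 0.1], [0.1, 0.1, 0.1]]
B_HIDDEN = [0.0, 0.1, -0.1, 0.0]
W_OUT = [[0.3, -0.2, 0.5, 0.1], [-0.4, 0.6, 0.2, 0.3]]
B_OUT = [0.0, 0.0]


def forward(features):
    """Run one 3-element feature vector through the 3-4-2 network."""
    hidden = relu(dense(features, W_HIDDEN, B_HIDDEN))
    return softmax(dense(hidden, W_OUT, B_OUT))


probs = forward([1.0, 2.0, 3.0])
print(probs)
```

The two output values are the per-label probabilities that downstream logic would compare when selecting a label.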
To train the model, the developer would simply call the train method in the instantiated estimator object – classifier.train(input_fn=this_input_function) in this example – and use the TensorFlow Dataset API to provide properly formed data through the input function (this_input_function in this example). Such preprocessing, or “shaping,” is needed to convert input data streams to matrices with dimensions (shapes) expected by the input layers, but this preprocessing step can also include data scaling, normalization, and any number of transformations required for a particular model.
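As one example of the scaling step mentioned above, the sketch below standardizes each feature column to zero mean and unit variance. It is a minimal, framework-independent illustration; the function name standardize and the sample values are hypothetical, and in practice the statistics computed on the training set would normally be reused when scaling test and production data.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Scale each feature column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))                  # transpose rows into columns
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return [[(value - m) / s for value, m, s in zip(row, means, stds)]
            for row in rows]


# Hypothetical feature vectors: temperature, running hours, machine age.
raw = [[20.0, 1000.0, 3.0],
       [22.0, 1500.0, 5.0],
       [24.0, 2000.0, 7.0]]

scaled = standardize(raw)
print(scaled)
```

After scaling, features with very different ranges (such as hours and degrees) contribute on a comparable scale, which often helps gradient descent converge.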
Neural networks lie at the heart of many advanced recognition systems, but practical applications are based on neural networks with significantly more complex architectures than this example. These “deep neural network” architectures feature many hidden layers, each with large numbers of neurons. Although developers can simply use the built-in Estimator classes to add more layers with more neurons, successful model architectures tend to mix different types of layers and capabilities.
For example, AlexNet, the convolutional neural network (CNN), or ConvNet, that ignited use of CNNs in the ImageNet competition (and in many image recognition applications since then) had eight layers (Figure 2). Each layer comprised a very large number of neurons, starting with 253440 in the first layer and continuing with 186624, 64896, 64896, 43264, 4096, 4096, and 1000. Rather than work with a feature vector of observed data, ConvNets scan an image through a window (n x n pixel filter), moving the window a few pixels (stride) and repeating the process until the image has been fully scanned. Each filter result passes through the various layers of the ConvNet to complete the image-recognition model.
Figure 2. AlexNet demonstrated the use of deep convolutional neural network architectures in reducing error rates in image recognition. (Source: ImageNet Large Scale Visual Recognition Competition)
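The window-and-stride scan that a ConvNet performs can be sketched in plain Python as shown below. This is a simplified illustration with no padding, a single channel, and a single filter; the function name conv2d_valid, the 5 x 5 "image," and the 3 x 3 filter values are hypothetical. Frameworks implement the same idea with heavily optimized matrix routines.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide an n x n filter across a 2-D image with the given stride.

    No padding is applied, so the output shrinks to
    (size - n) // stride + 1 in each dimension.
    """
    n = len(kernel)
    out_rows = (len(image) - n) // stride + 1
    out_cols = (len(image[0]) - n) // stride + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            top, left = r * stride, c * stride  # window position in the image
            acc = 0.0
            for i in range(n):
                for j in range(n):
                    acc += image[top + i][left + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 5 x 5 image whose pixel values increase left to right.
image = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
# Hypothetical 3 x 3 filter that responds to horizontal intensity changes.
kernel = [[1.0, 0.0, -1.0],
          [1.0, 0.0, -1.0],
          [1.0, 0.0, -1.0]]

feature_map = conv2d_valid(image, kernel, stride=2)
print(feature_map)
```

With a stride of 2, the 5 x 5 input yields a 2 x 2 feature map, and every entry is the same here because the image brightens uniformly from left to right.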
Even with that “simple” configuration, use of a CNN provided a dramatic decrease in top-5 error in the ImageNet Large Scale Visual Recognition Competition (ILSVRC) compared to the leading solution just the year before. (Top-5 error is a common metric that indicates the percentage of inferences that did not include the correct label among the model's top five predictions for possible labels for that input data.) In subsequent years, leading entries featured a dramatic increase in the number of layers and equally dramatic reduction in top-5 error (Figure 3).
Figure 3. Since AlexNet dramatically reduced ImageNet top-5 error rates in 2012, the top performers in the ILSVRC featured significantly deeper model architectures. (Source: The Computer Vision Foundation)
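The top-5 error metric described above generalizes to top-k error, which can be computed as in the following sketch. The function name top_k_error, the labels, and the scores are hypothetical values used only to show the calculation.

```python
def top_k_error(predictions, true_labels, k=5):
    """Fraction of inputs whose true label is absent from the k best guesses.

    `predictions` holds one dict per input, mapping label -> predicted score.
    """
    misses = 0
    for scores, truth in zip(predictions, true_labels):
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        if truth not in top_k:
            misses += 1
    return misses / len(true_labels)


# Hypothetical inference results for two images.
predictions = [
    {"cat": 0.50, "dog": 0.30, "fox": 0.10, "owl": 0.06, "bat": 0.04},
    {"cat": 0.10, "dog": 0.20, "fox": 0.40, "owl": 0.25, "bat": 0.05},
]
true_labels = ["dog", "bat"]

print(top_k_error(predictions, true_labels, k=2))
print(top_k_error(predictions, true_labels, k=5))
```

In this toy data, the second image's true label falls outside the two best guesses, so the top-2 error is 0.5, while the top-5 error is 0.0 because only five labels exist.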
Developers can use any of the popular frameworks to create ConvNets and other complex, custom models. With TensorFlow, the developer builds a ConvNet model layer by layer using method calls to build a convolution layer, aggregating results with a pooling layer, normalizing the result – and typically repeating that combination to create as many convolution layers as needed. In fact, in a TensorFlow demo of a ConvNet designed to classify the CIFAR-10 data set, the first three layers are built using three key methods: tf.nn.conv2d, tf.nn.max_pool, and tf.nn.lrn:
# conv1
with tf.variable_scope('conv1') as scope:
    kernel = _variable_with_weight_decay('weights',
                                         shape=[5, 5, 3, 64],
                                         stddev=5e-2,
                                         wd=None)
    conv = tf.nn.conv2d(images, kernel, [1, 1, 1, 1], padding='SAME')
    # One bias per output channel (64, matching the kernel's last dimension).
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0))
    pre_activation = tf.nn.bias_add(conv, biases)
    conv1 = tf.nn.relu(pre_activation, name=scope.name)
    _activation_summary(conv1)

# pool1
pool1 = tf.nn.max_pool(conv1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                       padding='SAME', name='pool1')

# norm1
norm1 = tf.nn.lrn(pool1, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75,
                  name='norm1')
Developers train a completed TensorFlow model using a train method shown in Listing 1. Here, train_op references the cifar10 class object's train method to perform training until stop conditions defined elsewhere by the developer are satisfied. The cifar10 train method handles the actual training cycle including loss calculation, gradient descent, parameter update, and applying an exponential decay function to the learning rate itself (Listing 2).
def train():
    """Train CIFAR-10 for a number of steps."""
    with tf.Graph().as_default():
        global_step = tf.train.get_or_create_global_step()

        # Get images and labels for CIFAR-10.
        # Force input pipeline to CPU:0 to avoid operations sometimes ending up on
        # GPU and resulting in a slow down.
        with tf.device('/cpu:0'):
            images, labels = cifar10.distorted_inputs()

        # Build a Graph that computes the logits predictions from the
        # inference model.
        logits = cifar10.inference(images)

        # Calculate loss.
        loss = cifar10.loss(logits, labels)

        # Build a Graph that trains the model with one batch of examples and
        # updates the model parameters.
        train_op = cifar10.train(loss, global_step)

        class _LoggerHook(tf.train.SessionRunHook):
            """Logs loss and runtime."""

            def begin(self):
                self._step = -1
                self._start_time = time.time()

            def before_run(self, run_context):
                self._step += 1
                return tf.train.SessionRunArgs(loss)  # Asks for loss value.

            def after_run(self, run_context, run_values):
                if self._step % FLAGS.log_frequency == 0:
                    current_time = time.time()
                    duration = current_time - self._start_time
                    self._start_time = current_time

                    loss_value = run_values.results
                    examples_per_sec = FLAGS.log_frequency * FLAGS.batch_size / duration
                    sec_per_batch = float(duration / FLAGS.log_frequency)

                    format_str = ('%s: step %d, loss = %.2f (%.1f examples/sec; %.3f '
                                  'sec/batch)')
                    print(format_str % (datetime.now(), self._step, loss_value,
                                        examples_per_sec, sec_per_batch))

        with tf.train.MonitoredTrainingSession(
                checkpoint_dir=FLAGS.train_dir,
                hooks=[tf.train.StopAtStepHook(last_step=FLAGS.max_steps),
                       tf.train.NanTensorHook(loss),
                       _LoggerHook()],
                config=tf.ConfigProto(
                    log_device_placement=FLAGS.log_device_placement)) as mon_sess:
            while not mon_sess.should_stop():
                mon_sess.run(train_op)
Listing 1. In this example from the TensorFlow CIFAR-10 ConvNet demo, a TensorFlow session mon_sess run method performs the training cycle, referencing the loss function and other training parameters included in the cifar10 instance itself. (Source: TensorFlow)
# Decay the learning rate exponentially based on the number of steps.
lr = tf.train.exponential_decay(INITIAL_LEARNING_RATE,
                                global_step,
                                decay_steps,
                                LEARNING_RATE_DECAY_FACTOR,
                                staircase=True)
tf.summary.scalar('learning_rate', lr)

# Generate moving averages of all losses and associated summaries.
loss_averages_op = _add_loss_summaries(total_loss)

# Compute gradients.
with tf.control_dependencies([loss_averages_op]):
    opt = tf.train.GradientDescentOptimizer(lr)
    grads = opt.compute_gradients(total_loss)

# Apply gradients.
apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)
Listing 2. In the TensorFlow CIFAR-10 ConvNet implementation, the cifar10 object decays the learning rate exponentially, computes the gradients used for gradient descent, and updates model parameters. (Source: TensorFlow)
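The effect of the exponential decay applied in Listing 2 can be reproduced in plain Python. The sketch below follows the documented behavior of tf.train.exponential_decay, in which the learning rate is multiplied by the decay rate raised to global_step / decay_steps, with the staircase option switching to integer division. The numeric values used here are hypothetical and are not the constants used in the CIFAR-10 demo.

```python
def exponential_decay(initial_lr, global_step, decay_steps, decay_rate,
                      staircase=True):
    """Learning rate after `global_step` training steps."""
    if staircase:
        exponent = global_step // decay_steps  # drops in discrete steps
    else:
        exponent = global_step / decay_steps   # decays smoothly
    return initial_lr * decay_rate ** exponent


# Hypothetical settings: start at 0.1 and cut by 10x every 1000 steps.
for step in (0, 999, 1000, 2500):
    print(step, exponential_decay(0.1, step, decay_steps=1000, decay_rate=0.1))
```

With staircase decay, the rate holds steady within each block of decay_steps and then drops, so early training takes large steps and later training takes progressively smaller ones.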
The TensorFlow approach provides significant power and flexibility, but the code can be arguably unwieldy. The good news is that a perhaps more intuitive neural network API, Keras, runs on top of TensorFlow. Using Keras, developers can build a CIFAR-10 ConvNet in a few lines of code, adding required layers, activation functions, and aggregation (pooling), among others (Listing 3).
model = Sequential()
model.add(Conv2D(32, (3, 3), padding='same', input_shape=x_train.shape[1:]))
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
Listing 3. Keras provides an intuitive approach for building a ConvNet model layer by layer. (Source: Keras)
After the model is defined in Keras, developers call the model's compile method to specify the desired loss-calculation algorithm and optimizer algorithm such as gradient descent. To train the model, developers call the model's fit method. Alternatively, the model's fit_generator method provides an even simpler training method that uses Python's generator functionality to train the model on batches of data generated from the Keras preprocessing classes.
To deploy a model to a target device such as an edge device, developers can typically export the model from their development environment. With TensorFlow, for example, developers can use the TensorFlow freeze_graph.py utility to export the model in pb format – a serialized format based on Google's protocol buffers. In the target device, developers can use the TensorFlow C++ API to create a TensorFlow runtime session, load the pb file, and run it with the application's input data. For target platforms able to work with containers, developers can generally find Docker containers for their favorite framework on Docker Hub, add their model execution application, and export the application container to their target platform. In practice, the process of redeploying a model requires additional care in replacing functionality built to pull in training data from the development environment with functionality optimized to deal with data in the production environment.
TensorFlow, Keras, and the other neural network frameworks significantly simplify the task of building complex models. Creating a model that effectively meets design objectives is another story entirely. Unlike traditional software development, delivering an effective model can require a greater amount of guesswork in trying different solutions and “seeing what sticks.” Researchers are actively investigating algorithms able to create optimized models, but a general theory of neural network optimization remains elusive. For that matter, there are no general best practices or heuristics for creating models: Each application brings its unique data characteristics on the front end and requirements for performance, accuracy, and power consumption on the back end.
One promising approach is automated model-building tools such as Google's Cloud AutoML, which uses transfer learning and reinforcement learning to find good model architectures in specific domains. To date, Google has announced one product, AutoML Vision, which is in limited-availability alpha as of this writing. Even so, the emergence of automated model building tools is inevitable if not immediately forthcoming. AutoML-type tools remain an active area of research as AI tool vendors vie for dominance in machine learning with a potentially game-changing capability.
In the meantime, cloud-service providers and framework developers continue to advance more immediate capabilities for simplifying machine-learning solutions development. Developers can use a tool such as TensorFlow debugger to gain insight into the internal states of models and TensorBoard to more easily visualize and explore complex model topologies and node interactions. For example, TensorBoard provides an interactive view of loss vs epoch, which provides an early gauge of model effectiveness and learning rate suitability (Figure 4).
Figure 4. TensorBoard helps developers visualize model internal structures and the training process, displaying the loss function (here, using cross entropy for the loss function) vs epoch. (Source: TensorFlow)
Ultimately, however, finding the most appropriate architecture and configuration is a combination of experience and trial-and-error. Even the most experienced neural network investigators advise that the way to find the best machine-learning architecture and specific topology is to try them and see which works best. In this sense, neural network development differs significantly from conventional application development. Rather than expect that coding a model is the end game, experienced machine-learning developers and neural-network developers in particular build each model as an experiment and run multiple experiments with different model architectures and configurations to find the one best suited to the application.
Developers can nevertheless accelerate the model development process by taking advantage of the large number of prebuilt models available from framework providers and the open-source community. Designed with generic data sets or even application-specific data sets, prebuilt models will rarely be optimal for developers' own applications. Using transfer learning methods (see Embedded's article, “Transfer learning for the IoT”), however, these models not only accelerate development but often provide better results than possible in using the developers' own data set to train a naive model. In fact, developers can find open-source pretrained models from sources such as the Caffe2 Model Zoo, Microsoft CNTK Model Gallery, Keras, TensorFlow, and others.
Even with an expanding number of available model-creation techniques, however, a critical constraint in model selection is the performance capability of the target platform. A model designed to run on a multi-GPU system will simply not run effectively on a system based on a general-purpose processor. Without appropriate hardware support, a general-purpose processor cannot rapidly complete the matrix multiplication calculations that dominate machine learning algorithms (Figure 5).
Figure 5. General matrix multiply (gemm) calculations dominate machine learning in general and neural-network architectures in particular. (Source: IBM)
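To see why matrix multiplication dominates the workload, the following sketch implements a naive general matrix multiply and counts the multiply-accumulate (MAC) operations it performs. The function name matmul and the small matrices are hypothetical; the 4096 figure reuses the neuron count of AlexNet's later layers quoted earlier, under the assumption that two such layers are fully connected to each other.

```python
def matmul(a, b):
    """Naive matrix multiply; returns the product and the number of MACs."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0.0] * cols for _ in range(rows)]
    macs = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                result[i][j] += a[i][k] * b[k][j]  # one multiply-accumulate
                macs += 1
    return result, macs


a = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
b = [[7.0, 8.0],
     [9.0, 10.0],
     [11.0, 12.0]]

product, macs = matmul(a, b)
print(product, macs)

# A fully connected layer with 4096 inputs and 4096 outputs needs one MAC
# per weight for every input vector it processes.
dense_layer_macs = 4096 * 4096
print(dense_layer_macs)
```

Even this single layer requires more than 16 million multiply-accumulate operations per inference, which illustrates why hardware able to perform many of these operations in parallel matters so much.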
Developers can nevertheless develop neural network models able to target systems as simple as a Raspberry Pi using the distribution of TensorFlow for Raspberry Pi mentioned earlier. More generally, Arm's Compute Library provides machine-learning functions optimized for Arm Cortex-A-series CPUs such as that used in the Raspberry Pi 3. Developers can even create modest but effective neural networks for Arm Cortex-M7 MCUs using Arm's CMSIS-NN library. In fact, Arm has described use of a standard 216 MHz Cortex-M7 running on a NUCLEO Mbed board able to complete inference in reasonable time on a pre-built Caffe ConvNet for CIFAR-10 (Figure 6).
Figure 6. Arm demonstrated Caffe CIFAR-10 ConvNet models able to achieve reasonable inference times running on a standard Arm Cortex-M7 processor. (Source: Arm)
FPGAs can offer a significant performance upgrade and a rapid development platform. For example, Lattice Semiconductor's SensAI platform uses a neural network compiler able to compile TensorFlow pb files and others onto Lattice neural-network IP cores for implementation on Lattice FPGAs.
Specialized AI devices go further with dedicated hardware designed to accelerate machine-learning applications targeting mass market segments. Hardware devices such as Intel Movidius Neural Compute Stick, NVIDIA Jetson TX2 modules, and Qualcomm Snapdragon modules, among many others, allow developers to embed high-performance machine-learning algorithms in a wide range of systems.
Architectures specialized for AI applications seek to reduce the CPU:memory bottleneck. For example, the AI accelerator chip that IBM described at the 2018 VLSI Circuits Symposium combines processing elements designed to speed matrix multiplication with a “scratchpad memory” hierarchy used to reduce returns to off-chip memory (Figure 7). Emerging advanced AI chips similarly utilize various methods to merge logic and memory in microarchitectures needed to speed the kinds of operations underlying AI applications.
Figure 7. Emerging AI chip architectures seek to reduce the effects of the CPU:Memory bottleneck using techniques such as a scratchpad memory hierarchy described in an IBM AI chip. (Source: IBM)
Specialized frameworks and architectures
Specialized hardware is essential for fully realizing the potential of machine learning, and the IC industry continues to offer more powerful AI processors. At the same time, framework developers and machine-learning algorithm experts continue to find ways to optimize frameworks and neural network architectures to more effectively support resource-constrained platforms including smartphones, IoT edge devices, and other systems with modest performance capabilities. While Apple Core ML and Android Neural Networks API provide solutions for their respective environments, frameworks such as TensorFlow Mobile (and its emerging evolution as TensorFlow Lite) provide more general solutions.
Model architectures themselves continue to evolve to address resource limitations of embedded devices. SqueezeNet literally squeezes elements of the model architecture to dramatically reduce model size and parameters. MobileNet takes a different architectural approach to deliver top-1 accuracy comparable to other methods but with far fewer expensive multiply-add operations (Figure 8). The machine-learning company Neurala uses its own novel approach to deliver small-footprint models able to deliver high accuracy results (see Embedded's article, “Bringing machine learning to the edge: A Q&A with Neurala's Anatoli Gorshechnikov”).
Figure 8. Specialized model architectures such as Google's MobileNet provide competitive accuracy levels with reduced matrix multiply-add operations needed for resource-constrained embedded applications. (Source: TensorFlow)
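One way to appreciate the reduction in multiply-add operations is to compare operation counts directly. The MobileNet paper describes replacing standard convolutions with depthwise separable convolutions (a per-channel spatial filter followed by a 1 x 1 pointwise convolution); that detail comes from the paper rather than from this article, and the layer dimensions below are hypothetical values picked only to show the arithmetic.

```python
def standard_conv_macs(kernel, in_channels, out_channels, out_size):
    """Multiply-adds for a standard convolution producing an
    out_size x out_size feature map."""
    return kernel * kernel * in_channels * out_channels * out_size * out_size


def separable_conv_macs(kernel, in_channels, out_channels, out_size):
    """Multiply-adds for a depthwise separable convolution."""
    depthwise = kernel * kernel * in_channels * out_size * out_size
    pointwise = in_channels * out_channels * out_size * out_size  # 1 x 1 conv
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 filters, 64 input channels, 128 output channels,
# and a 56 x 56 output feature map.
standard = standard_conv_macs(3, 64, 128, 56)
separable = separable_conv_macs(3, 64, 128, 56)
ratio = separable / standard

print(standard, separable, round(ratio, 3))
```

The ratio works out to 1/out_channels + 1/kernel², so for 3 x 3 filters the separable form needs roughly one-eighth to one-ninth of the multiply-adds of the standard form.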
The convergence of algorithms and tools has reduced the need to construct machine-learning models based on first principles. Increasingly, developers can draw on prebuilt models to address their own requirements using increasingly powerful frameworks and tools. As a result, developers can deploy inference models on popular platforms to perform simple tasks such as image recognition. Extending these same techniques to create useful high-performance production models is by no means simple but eminently accessible to any developer. As specialized AI hardware evolves for low-power systems, machine learning will likely become a familiar tool in the embedded developers' solution set.