Machine learning platform speeds optimization of vision systems

October 26, 2016

Max the Magnificent

When people talk about deep neural networks (DNNs), deep learning, and embedded vision, they typically start with the idea of defining a network architecture or structure. It's not so long ago that we could only support linear networks with a very limited number of layers between the input and output stages. By comparison, today's network technologies, like Google's TensorFlow, support multiple inputs, multiple outputs, and multiple layers per level.

(Source: Max Maxfield)

TensorFlow is incredibly powerful, but defining a TensorFlow architecture by hand is akin to writing a complex piece of software in assembly language. Thus, companies like Bonsai are working on raising the level of abstraction, thereby empowering more developers to integrate richer intelligence models into their work (see Unlocking the power of AI for all developers). Once the network structure has been defined, the next step is to train it and generate a new version with 32-bit floating-point coefficients ("weights"). Assuming we are creating some sort of embedded vision image processing application, this process -- which may employ hundreds of thousands or millions of categorized images -- can be depicted at a high level as shown below.

(Source: Max Maxfield)

After the network has been trained, the next step is to prepare it for deployment, which depends on the target platform. If we assume a performance-limited, power-conscious deployment platform, then the floating-point network will need to be converted into a fixed-point equivalent as illustrated below (although 16-bit fixed-point implementations are common, a lot of success is being seen with realizations as small as 8 bits).

(Source: Max Maxfield)
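
To give a feel for what the floating-point to fixed-point conversion involves, here is a minimal sketch in Python. It assumes one common scheme (symmetric linear quantization with a single scale factor shared by all weights); the function names and the sample weights are made up for illustration, and this is not a description of how any particular vendor's tool does the job.

```python
def quantize(weights, total_bits=8):
    """Map float weights to signed integers that share one scale factor."""
    qmax = 2 ** (total_bits - 1) - 1              # 127 for 8 bits, 32767 for 16
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0    # float value of one integer step
    # Round to the nearest step and clamp into the representable range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


def max_error(weights, total_bits):
    """Largest absolute difference introduced by a quantize/dequantize round trip."""
    quantized, scale = quantize(weights, total_bits)
    restored = dequantize(quantized, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical 32-bit floating-point weights from a trained network.
weights = [0.75, -0.5, 0.031, -1.0, 0.25]

q8, scale8 = quantize(weights, 8)
err8 = max_error(weights, 8)
err16 = max_error(weights, 16)
print("8-bit integers:", q8)
print("8-bit max error:", err8)
print("16-bit max error:", err16)
```

The round-trip error is bounded by half a quantization step, so the 16-bit version tracks the original weights far more closely than the 8-bit one. Whether the 8-bit network is still accurate enough is something that has to be measured on real data.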

The folks at CEVA are doing some very interesting work in this area, including a network generator that can take a floating-point representation of a network, whether Caffe-based or TensorFlow-based and of any topology, and transmogrify it into a small, fast, energy-efficient, fixed-point equivalent targeted at the CEVA-XM4 intelligent vision processor (see Push-Button Generation of Deep Neural Networks).

The last stage before actual usage is for the network to be deployed into the target system, which might be MCU-, FPGA-, or SoC-based, and which can now be used as part of an object detection and recognition system, for example.

(Source: Max Maxfield)

So far, so good, but...

There's an (un-optimized) elephant in the room
Like most things, the description above sounds great if you say it quickly and wave your arms about a lot. However, the developers who work in the trenches to create real-world systems know that there's a lot of stuff to worry about "under the hood" as it were.

Let's take the images used to train the network in the first place, for example. What equipment was used to capture them? On the physical side, we could be talking about things like the lens and the image sensor and the analog front end (AFE). On top of this we must consider all the algorithmic stages employed in the image processing pipeline (these can be implemented as software functions or using hardware accelerators), such as gain control, white balance, noise reduction and sharpening, color space transformation, interpolation, compression... the list goes on.

Of course, all of this also applies to any back-end camera systems used to capture and process the images that will ultimately be fed into the artificial neural network for the purposes of detection, recognition, classification, and any other desired actions.
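
As a rough illustration of the algorithmic stages mentioned above, here is a toy pipeline in Python with just three of them (gain control, white balance, and a color space transformation to luma). The function names, the parameter names, and the pixel values are all hypothetical; a real pipeline has many more stages and far more knobs.

```python
def apply_gain(pixels, gain):
    """Gain control: scale every channel, clipping at full scale (1.0)."""
    return [tuple(min(1.0, c * gain) for c in p) for p in pixels]


def white_balance(pixels, r_gain, b_gain):
    """White balance: scale red and blue relative to green."""
    return [(min(1.0, r * r_gain), g, min(1.0, b * b_gain)) for r, g, b in pixels]


def to_luma(pixels):
    """Color space transformation: RGB to luma using the BT.601 weights."""
    return [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels]


def run_pipeline(pixels, params):
    """Chain the stages; every entry in params is one tuning knob."""
    out = apply_gain(pixels, params["gain"])
    out = white_balance(out, params["r_gain"], params["b_gain"])
    return to_luma(out)


# Two made-up RGB pixels with channel values in the range 0.0 to 1.0.
raw = [(0.2, 0.4, 0.1), (0.5, 0.5, 0.5)]
params = {"gain": 1.5, "r_gain": 1.2, "b_gain": 0.9}
luma = run_pipeline(raw, params)
print(luma)
```

Even in this toy version, changing any one of the three parameters changes what a downstream neural network would see, which is why the capture and processing chain matters for both training and deployment.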

The number of companies integrating cameras and intelligent vision technology into their products is increasing rapidly, and the image quality and accuracy of those systems are core to their value. In addition to the physical components like lenses and sensors, a typical image processing pipeline employs ~10 stages, each of which may have ~25 tuning parameters. Optimizing these systems across combinations of optics, sensors, processors, and algorithms requires intensive human effort, and this painstaking work must be performed for each product and variant, thereby limiting the number of alternative configurations that can be evaluated.
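
A quick back-of-the-envelope calculation shows why trying everything by hand is out of the question. The stage and parameter counts below are the approximate figures quoted above; the number of candidate settings per parameter is purely an assumption chosen for illustration.

```python
stages = 10                # approximate number of pipeline stages
params_per_stage = 25      # approximate tuning parameters per stage
settings_per_param = 10    # ASSUMPTION: candidate values tried per parameter

total_params = stages * params_per_stage
combinations = settings_per_param ** total_params

print("Tuning parameters:", total_params)
print("Digits in the number of combinations:", len(str(combinations)))
```

With roughly 250 parameters, even a coarse grid of ten values each yields a number of combinations that is 251 digits long, so any practical approach has to search the space selectively rather than exhaustively.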

To address this issue, the folks at Algolux have architected an optimization platform called CRISP-ML (Computationally Reconfigurable Image Signal Platform). It is based on their machine learning solver, which holistically tunes the imaging and computer vision algorithms using standard image test charts, tagged training images, and key performance indicator (KPI) targets. The aim is to achieve the desired image quality, vision accuracy, power, and performance across the specified imaging conditions. This dramatically reduces the time and costs associated with optimizing a new vision system while also allowing expert resources to be deployed towards higher-value assignments.

When I first heard about all of this, my knee-jerk reaction was to assume that the guys and gals at Algolux were using genetic algorithms to perform their magic (see this Introduction to Genetic Algorithms column that I penned a few years ago and also this Approximating Nonlinear Functions with Genetic Algorithms column). However, the CTO at Algolux, Paul Green, says that they don't use genetic algorithms per se, but instead they use "a combination of guided random search and calculus-based search." Oooh, now my interest has really been aroused ("Down boy!"). I look forward to learning more and reporting more in the not-so-distant future. Until then, I await your comments and/or questions in dread anticipation.
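
For anyone wondering what combining a guided random search with a calculus-based search might look like in general, here is a generic toy sketch in Python. To be clear, this is not Algolux's algorithm, about which the details are not given here. The cost function, the target values, and every function name are invented for illustration: a random search that narrows its sampling radius finds a rough answer, and a finite-difference gradient descent then refines it.

```python
import random


def cost(params):
    """Hypothetical stand-in for a KPI error; smaller is better."""
    target = [0.3, -1.2, 2.0]
    return sum((p - t) ** 2 for p, t in zip(params, target))


def guided_random_search(f, start, radius, iters, rng):
    """Sample around the best point so far, narrowing the radius on misses."""
    best, best_cost = list(start), f(start)
    for _ in range(iters):
        candidate = [b + rng.uniform(-radius, radius) for b in best]
        candidate_cost = f(candidate)
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
        else:
            radius *= 0.98
    return best


def numeric_gradient(f, x, h=1e-5):
    """Central finite-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def gradient_descent(f, x, learning_rate=0.1, steps=200):
    """Calculus-based refinement: repeatedly step against the gradient."""
    x = list(x)
    for _ in range(steps):
        grad = numeric_gradient(f, x)
        x = [xi - learning_rate * gi for xi, gi in zip(x, grad)]
    return x


rng = random.Random(0)      # fixed seed so the run is repeatable
start = [0.0, 0.0, 0.0]
coarse = guided_random_search(cost, start, radius=2.0, iters=200, rng=rng)
fine = gradient_descent(cost, coarse)
print("cost after random search:", cost(coarse))
print("cost after gradient descent:", cost(fine))
```

The appeal of a hybrid like this is that the random phase can cope with a bumpy landscape, while the gradient phase converges quickly once it is in the right neighborhood. A real tuning problem would be far less well-behaved than this smooth toy function.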
