Robots, as long portrayed both in science fiction and shipping-product documentation, promise to free human beings from dull, monotonous and otherwise undesirable tasks, as well as to improve the quality of those tasks' outcomes through high speed and high precision. Consider, for example, the initial wave of autonomous consumer robotics systems that tackle vacuuming, carpet scrubbing, and even gutter-cleaning chores. Or consider the ever-increasing prevalence of robots in a diversity of manufacturing line environments (Figure 1).
Figure 1: Autonomous consumer-tailored products (a) and industrial manufacturing systems (b) are among the many classes of robots that can be functionally enhanced by vision processing capabilities.
First-generation autonomous consumer robots, however, employ relatively crude schemes for learning about and navigating their surroundings. These elementary techniques include human-erected barriers composed of infrared transmitters, which coordinate with infrared sensors built into the robot to prevent it from tumbling down a set of stairs or wandering into another room.
A built-in shock sensor can inform the autonomous robot that it has collided with an object and shouldn't attempt to continue forward, and in more advanced mapping-capable designs, also shouldn’t revisit this location. And while first-generation manufacturing robots may work more tirelessly, faster, and more exactly than do their human forebears, their success is predicated on incoming parts arriving in fixed orientations and locations, thereby increasing the complexity of the manufacturing process. Any deviation in part position and/or orientation will result in assembly failures.
Humans use their eyes and other senses to discern the world around them and navigate through it. Theoretically, robotic systems should be able to do the same thing, leveraging camera assemblies, vision processors, and various software algorithms. Until recently, such technology has been found only in complex, expensive systems. However, cost, performance, and power consumption advances in digital integrated circuits are now paving the way for the proliferation of ‘vision’ into diverse and high-volume applications, including robot implementations. Challenges remain, but they're more easily, rapidly, and cost-effectively solved than has been possible before.
Developing robotic systems capable of adapting to their environments requires the use of computer vision algorithms that can convert the data from image sensors into actionable information about the environment. Two common tasks for robots are identifying external objects and their orientations, and determining the robot’s location and orientation. Many robots are designed to interact with one or more specific objects. For situation-adaptive robots, it's necessary for them to detect these objects when they are in unknown locations and orientations, as well as to comprehend that these objects might be moving.
Cameras produce millions of pixels of data per second, which creates a heavy processing burden. One way to resolve this challenge is to detect multi-pixel features, such as corners, blobs, edges, or lines, in each frame of video data (Figure 2).
Figure 2: Four primary stages are involved in fully processing the raw output of a 2D or 3D sensor for robotic vision, with each stage exhibiting unique characteristics and constraints in its processing requirements.
Such a pixel-to-feature transformation can lower the data processing requirement in this stage of the vision processing pipeline by a factor of a thousand or more; millions of pixels reduce to hundreds of features that a robot can then use to identify objects and determine their spatial characteristics (Figure 3).
Figure 3: Common feature detection algorithms include the MSER (Maximally Stable Extremal Regions) method (a), the SURF (Speeded Up Robust Features) algorithm (b), and the Shi-Tomasi technique for detecting corners (c) (courtesy MIT).
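As a rough illustration of this pixel-to-feature reduction, the following is a minimal pure-Python sketch of the Shi-Tomasi corner score, which is the smaller eigenvalue of the local gradient structure matrix. The function names, the 3×3 window, the threshold, and the tiny synthetic test image are assumptions chosen for illustration only; a real system would rely on an optimized vision library and would add steps such as non-maximum suppression.

```python
import math

def gradients(img):
    """Central-difference gradients; border pixels are left at zero."""
    h, w = len(img), len(img[0])
    gx = [[0.0] * w for _ in range(h)]
    gy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            gy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    return gx, gy

def shi_tomasi_score(gx, gy, y, x, radius=1):
    """Smaller eigenvalue of the 2x2 structure matrix summed over a window."""
    sxx = sxy = syy = 0.0
    for j in range(y - radius, y + radius + 1):
        for i in range(x - radius, x + radius + 1):
            sxx += gx[j][i] * gx[j][i]
            sxy += gx[j][i] * gy[j][i]
            syy += gy[j][i] * gy[j][i]
    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]]
    half_trace = (sxx + syy) / 2.0
    root = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    return half_trace - root

def detect_corners(img, threshold, radius=1):
    """Return (x, y) positions whose corner score exceeds the threshold."""
    gx, gy = gradients(img)
    h, w = len(img), len(img[0])
    corners = []
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            if shi_tomasi_score(gx, gy, y, x, radius) > threshold:
                corners.append((x, y))
    return corners

def make_square_image(size=12, lo=3, hi=8):
    """Hypothetical test image: a bright square on a dark background."""
    return [[1.0 if lo <= y <= hi and lo <= x <= hi else 0.0
             for x in range(size)] for y in range(size)]
```

On the synthetic 12×12 image, `detect_corners(make_square_image(), 0.6)` keeps only the four corners of the square: straight edges score zero because the gradient there varies in just one direction, so 144 pixels collapse to 4 features.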
Detecting objects via features first involves gathering features taken from already-captured images of each specified object at various angles and orientations. Then, this database of features can train a machine learning algorithm, also known as a classifier, to accurately detect and identify new objects. Sometimes this training occurs on the robot; other times, due to the high level of computation required, the training occurs off-line. This complexity, coupled with the large amount of training data needed, is a drawback of machine learning-based approaches. One of the best-known object detection algorithms is the Viola-Jones framework, which uses Haar-like features and a cascade of AdaBoost classifiers. This algorithm is particularly good at identifying faces, and can also be trained to identify other common objects.
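To make the notion of a Haar-like feature more concrete, here is a minimal sketch of how one can be evaluated with an integral image (summed-area table), the data structure the Viola-Jones framework uses to sum any rectangle with four lookups. The function names and the two-rectangle layout are illustrative assumptions; the sketch covers only the feature computation, not the AdaBoost training or the classifier cascade.

```python
def integral_image(img):
    """ii[y][x] holds the sum of img over all rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def haar_two_rect_horizontal(ii, x, y, w, h):
    """Right half minus left half of a w-by-h window (w should be even).

    A large magnitude suggests a vertical edge inside the window.
    """
    half = w // 2
    left = rect_sum(ii, x, y, half, h)
    right = rect_sum(ii, x + half, y, half, h)
    return right - left
```

Because each rectangle sum costs the same four lookups regardless of its size, a classifier can evaluate many such features per candidate window at a fixed, small cost.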
Determining object orientation via features requires an algorithm such as the statistics-based RANSAC (Random Sample Consensus). This algorithm uses a randomly chosen subset of features to model a potential object orientation, and then determines how many other features fit this model. The model with the largest number of matching features is taken as the recognized object orientation.
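The hypothesize-and-verify loop described above can be sketched as follows. This is a simplified illustration that assumes a 2D rigid motion (rotation plus translation, no scale) and a list of already-matched feature pairs `(model_point, observed_point)`; the function names, iteration count, tolerance, and fixed random seed are all assumptions for the example, not values from any particular library.

```python
import math
import random

def fit_rigid(p1, p2, q1, q2):
    """Rotation angle and translation mapping p1->q1 and p2->q2 (2D, no scale)."""
    angle = (math.atan2(q2[1] - q1[1], q2[0] - q1[0])
             - math.atan2(p2[1] - p1[1], p2[0] - p1[0]))
    angle = math.atan2(math.sin(angle), math.cos(angle))  # wrap to (-pi, pi]
    c, s = math.cos(angle), math.sin(angle)
    tx = q1[0] - (c * p1[0] - s * p1[1])
    ty = q1[1] - (s * p1[0] + c * p1[1])
    return angle, tx, ty

def apply_rigid(model, p):
    """Rotate point p by the model angle, then translate it."""
    angle, tx, ty = model
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)

def count_inliers(model, matches, tol):
    """Number of matches whose transformed model point lands near the observation."""
    n = 0
    for p, q in matches:
        px, py = apply_rigid(model, p)
        if math.hypot(px - q[0], py - q[1]) <= tol:
            n += 1
    return n

def ransac_orientation(matches, iterations=200, tol=0.05, seed=0):
    """Keep the candidate orientation supported by the most matches."""
    rng = random.Random(seed)
    best_model, best_count = None, -1
    for _ in range(iterations):
        (p1, q1), (p2, q2) = rng.sample(matches, 2)  # minimal random subset
        if p1 == p2:
            continue  # degenerate sample, cannot define a rotation
        model = fit_rigid(p1, p2, q1, q2)
        count = count_inliers(model, matches, tol)
        if count > best_count:
            best_model, best_count = model, count
    return best_model, best_count
```

Mismatched features (outliers) rarely agree with one another, so a model built from a bad sample gathers little support and is discarded, while a model built from two good matches is confirmed by the rest of the good matches.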
To detect moving objects, feature identification can be combined with tracking algorithms. Once a set of features has been used to correctly identify an object, algorithms such as the Kalman filter or the Kanade-Lucas-Tomasi (KLT) feature tracker can follow the movement of these features between video frames. Such techniques remain robust despite changes in orientation and partial occlusion because they need to track only a subset of the original features in order to succeed.
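As one possible illustration of the tracking step, the sketch below implements a constant-velocity Kalman filter for a single image coordinate of one feature (a second instance would track the other coordinate). The class name, the constant-velocity motion model, and the noise variances are assumptions made for the example; they would need tuning for a real camera and scene.

```python
class ConstantVelocityKalman1D:
    """Tracks position and velocity of one coordinate of a feature."""

    def __init__(self, pos, vel=0.0, pos_var=1.0, vel_var=1.0,
                 process_var=1e-3, meas_var=0.25):
        self.pos, self.vel = pos, vel
        # Covariance matrix P stored as four scalars
        self.p00, self.p01, self.p10, self.p11 = pos_var, 0.0, 0.0, vel_var
        self.q, self.r = process_var, meas_var

    def predict(self, dt=1.0):
        """Advance the state one frame; usable alone when a feature is occluded."""
        self.pos += self.vel * dt
        # P = F P F^T + Q with F = [[1, dt], [0, 1]] and Q = diag(q, q)
        p00 = self.p00 + dt * (self.p10 + self.p01) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11
        return self.pos

    def update(self, z):
        """Correct the prediction with a measured feature position z."""
        s = self.p00 + self.r                  # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s    # Kalman gain
        y = z - self.pos                       # innovation
        self.pos += k0 * y
        self.vel += k1 * y
        # P = (I - K H) P with H = [1, 0]
        p00 = (1 - k0) * self.p00
        p01 = (1 - k0) * self.p01
        p10 = self.p10 - k1 * self.p00
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11
        return self.pos
```

In use, `predict()` is called once per frame and `update()` only when the feature is actually found; during a brief occlusion the filter simply coasts on its velocity estimate.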
The above algorithms may be sufficient for stationary robots. For robots on the move, however, additional algorithms are needed in order for them to move safely within their surroundings. Simultaneous Localization and Mapping (SLAM) is one category of algorithms that enables a robot to build a map of its environment and keep track of its current location within it. Such algorithms require methods for mapping the environment in three dimensions. Many depth-sensing options exist; one common approach is to use a pair of 2D cameras configured as a “stereo” camera, acting similarly to the human visual system.
Stereo cameras rely on epipolar geometry to derive a 3D location for each point in a scene, using projections from a pair of 2D images. As previously discussed, features can also be used to detect useful locations within the 3D scene. For example, it is much easier for a robot to reliably detect the location of a corner of a table than the flat surface of a wall. At any given location and orientation, the robot can detect features that it can then compare to its internal map in order to locate itself and improve the map's quality. Given that the location of an object can change, a static map is often not useful for a robot attempting to adapt to its environment.
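For the common special case of a rectified stereo pair (parallel optical axes, matching rows), epipolar geometry reduces to the pinhole relationship Z = f·B/d, where f is the focal length in pixels, B is the baseline between the cameras, and d is the horizontal disparity of a feature between the left and right images. The sketch below applies that formula; the function and parameter names and the sample numbers are illustrative assumptions, and a real system would first calibrate and rectify the cameras.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters of a point seen with the given disparity (pixels)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity_px

def triangulate(x_left, y, x_right, focal_px, baseline_m, cx, cy):
    """3D point (X, Y, Z) in the left camera frame from a matched feature.

    (cx, cy) is the principal point; x_left and x_right are the feature's
    column in the left and right images, and y is its shared row.
    """
    z = depth_from_disparity(x_left - x_right, focal_px, baseline_m)
    x = (x_left - cx) * z / focal_px
    y3d = (y - cy) * z / focal_px
    return (x, y3d, z)
```

With an assumed 700-pixel focal length and a 0.1 m baseline, a feature with a disparity of 35 pixels lies about 2 m away. Because depth is inversely proportional to disparity, a one-pixel matching error matters far more for distant points than for nearby ones.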
To create an efficient implementation of robot vision, it is useful to divide the required processing steps into stages. The processing encompassed by the previously discussed algorithms can be divided into four stages, with each stage exhibiting unique characteristics and constraints in terms of its processing requirements. A wide variety of vision processor types exist, and different types may be better suited for each algorithm processing stage in terms of performance, power consumption, cost, function flexibility, and other factors. A vision processor chip may, in fact, integrate multiple types of processor cores to address the unique needs of multiple processing stages (Figure 4).
Figure 4: A vision processor may integrate multiple types of cores to address unique needs of multiple processing stages.
The first processing stage encompasses algorithms that handle sensor data pre-processing functions, such as:
- Color space conversion
- Image rotation and inversion
- Color adjustment and gamut mapping
- Gamma correction, and
- Contrast enhancement
Each pixel in each frame is processed in this stage, so the number of operations per second is tremendous. In the case of stereo image processing, the two image planes must be simultaneously processed. For these kinds of operations, one option is a dedicated hardware block, sometimes referred to as an IPU (image processing unit). Recently introduced vision processors containing IPUs are able to handle two simultaneous image planes, each with 2048×1536 pixel (3+ million pixel) resolution, at robust frame rates.
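To give a feel for this first stage, the following sketch shows two of the per-pixel operations listed above, gamma correction via a lookup table and RGB-to-grayscale color space conversion, along with a helper that estimates the raw pixel rate. The gamma value of 2.2, the BT.601 luma weights, the 30 frames-per-second figure, and the function names are assumptions for illustration; a dedicated IPU would perform equivalent operations in hardware.

```python
def gamma_lut(gamma=2.2):
    """256-entry lookup table applying out = 255 * (in / 255) ** (1 / gamma)."""
    return [round(255.0 * (v / 255.0) ** (1.0 / gamma)) for v in range(256)]

def apply_lut(plane, lut):
    """Apply a lookup table to every pixel of an 8-bit image plane."""
    return [[lut[v] for v in row] for row in plane]

def rgb_to_gray(rgb_rows):
    """Convert rows of (r, g, b) tuples to luma using BT.601 weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_rows]

def pixels_per_second(width, height, planes, fps):
    """Raw number of pixels the pre-processing stage must touch each second."""
    return width * height * planes * fps
```

For two 2048×1536 image planes at an assumed 30 frames per second, `pixels_per_second(2048, 1536, 2, 30)` comes to 188,743,680 pixels per second before any per-pixel arithmetic is counted, which is why a table lookup per pixel is preferable to recomputing a power function each time.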
The second processing stage handles feature detection, where corners, edges, and other significant image regions are extracted. This processing step still works on a pixel-by-pixel basis, so it is well suited for highly parallel architectures, this time capable of handling more complex mathematical functions such as first- and second-order derivatives.
DSPs (digital signal processors), FPGAs (field programmable gate arrays), GPUs (graphics processing units), IPUs and APUs (array processor units) are all common processing options. DSPs and FPGAs are highly flexible, therefore particularly appealing when applications (and algorithms used to implement them) are immature and evolving. This flexibility, however, can come with power consumption, performance, and cost tradeoffs versus alternative approaches.
At the other end of the flexibility-versus-focus spectrum is the dedicated-function IPU or APU developed specifically for vision processing tasks. It can process tens of billions of operations per second but, being application-optimized, it is not a candidate for broader reuse across other functions. An intermediate point between the extremes of the flexibility-versus-function-optimization spectrum is the GPU, historically found in computers but now also embedded within application processors used in smartphones, tablets, and other high-volume applications.
Floating-point calculations such as the least-squares function in optical flow algorithms, descriptor calculations in SURF (the Speeded Up Robust Features algorithm used for fast significant point detection), and point cloud processing are well suited for highly parallel GPU architectures. Such algorithms can alternatively run on SIMD (single-instruction multiple-data) vector processing engines such as ARM's NEON or the AltiVec function block found within Power Architecture CPUs.
In the third image processing stage, the system detects and classifies objects based on feature maps. In contrast to the pixel-based processing of previous stages, these object detection algorithms are highly non-linear in structure and in the ways they access data. However, strong processing ‘muscle’ is still required in order to evaluate many different features with a rich classification database.
Such requirements are ideal for single- and multi-core conventional processors, such as ARM- and Power Architecture-based RISC devices. This selection criterion is equally applicable for the fourth image processing stage, which tracks detected objects across multiple frames, implements a model of the environment, and assesses whether various situations should trigger actions.
Development environments, frameworks, and libraries such as OpenCL (the Open Computing Language), OpenCV (the Open Source Computer Vision Library), and MATLAB can simplify and speed software testing and development, enabling evaluation of sections of algorithms on different processing options, and including the ability to allocate portions of a task across multiple processing cores. Given the data-intensive nature of vision processing, when evaluating processors you should appraise not only the number of cores and the per-core speed but also each processor's data handling capabilities, such as its external memory bus bandwidth.
Industry alliance assistance
With the emergence of increasingly capable processors, image sensors, memories, and other semiconductor devices, along with robust algorithms, it's becoming practical to incorporate computer vision capabilities into a wide range of embedded systems. By ‘embedded system’, we mean any microprocessor-based system that isn’t a general-purpose computer. Embedded vision, therefore, refers to the implementation of computer vision technology in embedded systems, mobile devices, special-purpose PCs, and the cloud.
Embedded vision technology has the potential to enable a wide range of electronic products (such as the robotic systems discussed in this article) that are more intelligent and responsive than before, and thus more valuable to users. It can add helpful features to existing products. And it can provide significant new markets for hardware, software, and semiconductor manufacturers.
Transforming a robotics vision processing idea into a shipping product entails careful discernment and compromise. The Embedded Vision Alliance catalyzes conversations in a forum where tradeoffs can be understood and resolved, and where the effort to productize advanced robotic systems can be accelerated, enabling system developers to effectively harness various vision technologies. (For more information, send an email (info@Embedded-Vision.com) or call 925-954-1411.)
- “Embedded Low Power Vision Computing Platform for Automotive”, Michael Staudenmaier, Holger Gryska, Freescale Halbleiter Gmbh, Embedded World Nuremberg Conference, 2013.
Editor’s Note: Freescale and MathWorks are members of the Embedded Vision Alliance, which has as its aim to provide engineers with practical education, information, and insights to help them incorporate embedded vision capabilities into new and existing products.
Tutorial articles, videos, code downloads and a discussion forum staffed by a diversity of technology experts are available to registered visitors to the site. By registering, visitors will also receive the Alliance’s twice-monthly email newsletter.
The Alliance also sponsors the Embedded Vision Summit, including an upcoming day-long technical educational forum to be held on October 2nd in Boston, Mass. It is intended for engineers interested in incorporating visual intelligence into electronic systems and software.
The keynote presenter will be Mario Munich, Vice President of Advanced Development at iRobot. Munich's previous company, Evolution Robotics (acquired by iRobot), developed the Mint, a second-generation consumer robot with vision processing capabilities.
Brian Dipert is Editor-In-Chief of the Embedded Vision Alliance. He is also a Senior Analyst at BDTI (Berkeley Design Technology, Inc.), which provides analysis, advice, and engineering for embedded processing technology and applications, and Editor-In-Chief of InsideDSP, BDTI’s online newsletter dedicated to digital signal processing technology. Brian has a B.S. degree in Electrical Engineering from Purdue University in West Lafayette, IN. His professional career began at Magnavox Electronics Systems in Fort Wayne, IN; Brian subsequently spent eight years at Intel Corporation in Folsom, CA. He then spent 14 years at EDN Magazine.
Yves Legrand is the global vertical marketing director for Industrial Automation and Robotics at Freescale Semiconductor. He is based in France and has spent his professional career between Toulouse and the USA where he worked for Motorola Semiconductor and Freescale in Phoenix and Chicago. His marketing expertise ranges from wireless and consumer semiconductor markets and applications to wireless charging and industrial automation systems. He has a Masters degree in Electrical Engineering from Grenoble INPG in France, as well as a Masters degree in Industrial and System Engineering from San Jose State University, CA.
Bruce Tannenbaum leads the technical marketing team at MathWorks for image processing and computer vision applications. Earlier in his career, he was a product manager at imaging-related semiconductor companies such as SoundVision and Pixel Magic, and developed computer vision and wavelet-based image compression algorithms as an engineer at Sarnoff Corporation (SRI). He holds a BSEE degree from Penn State University and an MSEE degree from the University of Michigan.