Giving robotic systems spatial sensing with visual intelligence
Robots, as long portrayed both in science fiction and shipping-product documentation, promise to free human beings from dull, monotonous and otherwise undesirable tasks, as well as to improve the quality of those tasks' outcomes through high speed and high precision. Consider, for example, the initial wave of autonomous consumer robotics systems that tackle vacuuming, carpet scrubbing, and even gutter-cleaning chores. Or consider the ever-increasing prevalence of robots in a diversity of manufacturing line environments (Figure 1).
Figure 1: Autonomous consumer-tailored products (a) and industrial manufacturing systems (b) are among the many classes of robots that can be functionally enhanced by vision processing capabilities.
First-generation autonomous consumer robots, however, employ relatively crude schemes for learning about and navigating their surroundings. These elementary techniques include human-erected barriers composed of infrared transmitters, which coordinate with infrared sensors built into the robot to prevent it from tumbling down a set of stairs or wandering into another room.
A built-in shock sensor can inform the autonomous robot that it has collided with an object and shouldn't attempt to continue forward. In more advanced, mapping-capable designs, it can also tell the robot not to revisit this location. And while first-generation manufacturing robots may work more tirelessly, faster, and more exactly than do their human forebears, their success is predicated on incoming parts arriving in fixed orientations and locations, thereby increasing the complexity of the manufacturing process. Any deviation in part position or orientation will result in assembly failures.
Humans use their eyes and other senses to discern the world around them and navigate through it. Theoretically, robotic systems should be able to do the same thing, leveraging camera assemblies, vision processors, and various software algorithms. Until recently, such technology has been found only in complex, expensive systems. However, cost, performance, and power consumption advances in digital integrated circuits are now paving the way for the proliferation of ‘vision’ into diverse and high-volume applications, including robot implementations. Challenges remain, but they're more easily, rapidly, and cost-effectively solved than has been possible before.
Developing robotic systems capable of adapting to their environments requires the use of computer vision algorithms that can convert the data from image sensors into actionable information about the environment. Two common tasks for robots are identifying external objects and their orientations, and determining the robot's own location and orientation. Many robots are designed to interact with one or more specific objects. Situation-adaptive robots must be able to detect these objects when they are in unknown locations and orientations, and must also comprehend that these objects might be moving.
Cameras produce millions of pixels of data per second, which creates a heavy processing burden. One way to resolve this challenge is to detect multi-pixel features, such as corners, blobs, edges, or lines, in each frame of video data (Figure 2).
Figure 2: Four primary stages are involved in fully processing the raw output of a 2D or 3D sensor for robotic vision, with each stage exhibiting unique characteristics and constraints in its processing requirements.
Such a pixel-to-feature transformation can lower the data processing requirement in this stage of the vision processing pipeline by a factor of a thousand or more; millions of pixels reduce to hundreds of features that a robot can then use to identify objects and determine their spatial characteristics (Figure 3).
Figure 3: Common feature detection algorithms include the MSER (Maximally Stable Extremal Regions) method (a), the SURF (Speeded Up Robust Features) algorithm (b), and the Shi-Tomasi technique for detecting corners (c) (courtesy MIT).
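As an illustration of the pixel-to-feature reduction described above, the following is a minimal sketch of a Shi-Tomasi style corner score in plain Python. It assumes a grayscale image stored as a list of rows; the function names, the 3x3 window, and the threshold value are illustrative choices, not taken from any particular library. The score is the smaller eigenvalue of the local gradient structure tensor, which is large only where intensity changes in two directions at once.

```python
import math

def shi_tomasi_response(img, x, y, win=1):
    """Smaller eigenvalue of the gradient structure tensor around (x, y)."""
    sxx = sxy = syy = 0.0
    for j in range(y - win, y + win + 1):
        for i in range(x - win, x + win + 1):
            # Central-difference gradients (illustrative; real code often uses Sobel).
            gx = (img[j][i + 1] - img[j][i - 1]) / 2.0
            gy = (img[j + 1][i] - img[j - 1][i]) / 2.0
            sxx += gx * gx
            sxy += gx * gy
            syy += gy * gy
    # Closed-form smaller eigenvalue of [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2.0
    root = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    return half_trace - root

def detect_corners(img, threshold, win=1):
    """Return (x, y) positions whose corner score exceeds the threshold."""
    height, width = len(img), len(img[0])
    margin = win + 1  # keep the gradient stencil inside the image
    corners = []
    for y in range(margin, height - margin):
        for x in range(margin, width - margin):
            if shi_tomasi_response(img, x, y, win) > threshold:
                corners.append((x, y))
    return corners

# Synthetic 12x12 test image: a bright block filling the lower-right quadrant.
image = [[1.0 if (x >= 6 and y >= 6) else 0.0 for x in range(12)]
         for y in range(12)]
corners = detect_corners(image, threshold=0.6)
```

On this synthetic image the 144 pixels reduce to a single feature at the corner of the bright block. Points along the straight edges score zero because their gradients all point in one direction.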
Detecting objects via features first involves gathering features taken from already-captured images of each specified object at various angles and orientations. Then, this database of features can train a machine learning algorithm, also known as a classifier, to accurately detect and identify new objects. Sometimes this training occurs on the robot; other times, due to the high level of computation required, the training occurs off-line. This complexity, coupled with the large amount of training data needed, is a drawback of machine learning-based approaches. One of the best-known object detection algorithms is the Viola-Jones framework, which uses Haar-like features and a cascade of AdaBoost classifiers. This algorithm is particularly good at identifying faces, and can also be trained to identify other common objects.
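To make the Haar-like features mentioned above more concrete, here is a minimal sketch of how one such feature can be evaluated using an integral image (a summed-area table), the structure that lets any rectangle sum be read with four lookups. The function names and the two-rectangle layout are illustrative assumptions; a full Viola-Jones detector would evaluate many such features inside a trained classifier cascade, which is not shown here.

```python
def integral_image(img):
    """Summed-area table with one extra row and column of zeros."""
    height, width = len(img), len(img[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w-by-h rectangle with top-left corner (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def haar_two_rect_horizontal(ii, x, y, w, h):
    """Left half minus right half: responds to vertical light/dark edges."""
    half = w // 2
    left = rect_sum(ii, x, y, half, h)
    right = rect_sum(ii, x + half, y, half, h)
    return left - right

# A 4x4 patch whose left half is bright and right half is dark.
patch = [[10, 10, 0, 0] for _ in range(4)]
edge_response = haar_two_rect_horizontal(integral_image(patch), 0, 0, 4, 4)
```

Because the table is built once per frame, the cost of each feature is independent of its size.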
Determining object orientation via features requires an algorithm such as the statistics-based RANSAC (Random Sample Consensus). This algorithm uses a randomly chosen subset of features to model a potential object orientation, and then determines how many other features fit this model. After many such trials, the model with the largest number of matching features is taken to be the correctly recognized object orientation.
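The hypothesize-and-count loop described above can be sketched as follows. This is a simplified illustration that assumes a planar (2D) object, already-matched feature pairs between a stored model and the current scene, and a rigid rotation-plus-translation model; the function names, iteration count, and tolerance are hypothetical choices.

```python
import math
import random

def estimate_pose(p1, p2, q1, q2):
    """Rotation and translation mapping model points p1, p2 onto q1, q2."""
    theta = (math.atan2(q2[1] - q1[1], q2[0] - q1[0])
             - math.atan2(p2[1] - p1[1], p2[0] - p1[0]))
    theta = math.atan2(math.sin(theta), math.cos(theta))  # wrap to [-pi, pi]
    c, s = math.cos(theta), math.sin(theta)
    tx = q1[0] - (c * p1[0] - s * p1[1])
    ty = q1[1] - (s * p1[0] + c * p1[1])
    return theta, tx, ty

def apply_pose(pose, p):
    theta, tx, ty = pose
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)

def ransac_pose(model_pts, scene_pts, iterations=200, tol=0.05, seed=0):
    """Keep the pose hypothesis supported by the most feature matches."""
    rng = random.Random(seed)
    best_pose, best_inliers = None, []
    n = len(model_pts)
    for _ in range(iterations):
        # Minimal sample: two matches fix a 2D rotation and translation.
        i, j = rng.sample(range(n), 2)
        pose = estimate_pose(model_pts[i], model_pts[j],
                             scene_pts[i], scene_pts[j])
        inliers = []
        for k in range(n):
            px, py = apply_pose(pose, model_pts[k])
            if math.hypot(px - scene_pts[k][0], py - scene_pts[k][1]) < tol:
                inliers.append(k)
        if len(inliers) > len(best_inliers):
            best_pose, best_inliers = pose, inliers
    return best_pose, best_inliers
```

Mismatched features simply fail to vote for the winning hypothesis, which is why the approach tolerates a sizeable fraction of bad matches. A practical system would typically refine the final pose using all inliers.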
To detect moving objects, feature identification can be combined with tracking algorithms. Once a set of features has been used to correctly identify an object, tracking algorithms such as the Kalman filter or the Kanade-Lucas-Tomasi (KLT) feature tracker can follow the movement of these features between video frames. Such techniques are robust to changes in orientation and to occlusion because they need to track only a subset of the original features in order to be successful.
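As a rough illustration of the filtering approach, the sketch below tracks a single feature coordinate with a constant-velocity Kalman filter. It assumes one image axis, one position measurement per frame, and illustrative noise settings (`p0`, `q`, and `r` are hypothetical tuning values); a real tracker would typically filter both image axes and many features.

```python
class ConstantVelocityKalman1D:
    """Tracks position and velocity of one feature coordinate."""

    def __init__(self, pos=0.0, vel=0.0, p0=100.0, q=1e-3, r=1.0):
        self.pos, self.vel = pos, vel
        self.p = [[p0, 0.0], [0.0, p0]]  # state covariance
        self.q, self.r = q, r            # process / measurement noise

    def predict(self, dt=1.0):
        """Advance the state by one frame: P = F P F^T + Q."""
        p = self.p
        self.pos += self.vel * dt
        p00 = p[0][0] + dt * (p[0][1] + p[1][0]) + dt * dt * p[1][1] + self.q
        p01 = p[0][1] + dt * p[1][1]
        p10 = p[1][0] + dt * p[1][1]
        p11 = p[1][1] + self.q
        self.p = [[p00, p01], [p10, p11]]

    def update(self, z):
        """Correct the prediction with a measured feature position z."""
        p = self.p
        s = p[0][0] + self.r                  # innovation variance
        k0, k1 = p[0][0] / s, p[1][0] / s     # Kalman gain
        y = z - self.pos                      # innovation
        self.pos += k0 * y
        self.vel += k1 * y
        self.p = [[(1 - k0) * p[0][0], (1 - k0) * p[0][1]],
                  [p[1][0] - k1 * p[0][0], p[1][1] - k1 * p[0][1]]]

# Feature drifting 2 pixels per frame along one image axis.
tracker = ConstantVelocityKalman1D()
for frame in range(1, 31):
    tracker.predict(dt=1.0)
    tracker.update(2.0 * frame)
```

When a feature is briefly occluded, calling `predict` without `update` lets the estimate coast on the learned velocity until the feature reappears.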
The above algorithms may be sufficient for stationary robots. For robots on the move, however, additional algorithms are needed in order for them to move safely within their surroundings. Simultaneous Localization and Mapping (SLAM) is one category of algorithms that enables a robot to build a map of its environment and keep track of its current location within that map. Such algorithms require methods for mapping the environment in three dimensions. Many depth-sensing options exist; one common approach is to use a pair of 2D cameras configured as a “stereo” camera, acting similarly to the human visual system.
Stereo cameras rely on epipolar geometry to derive a 3D location for each point in a scene, using projections from a pair of 2D images. As previously discussed, features can also be used to detect useful locations within the 3D scene. For example, it is much easier for a robot to reliably detect the location of a corner of a table than the flat surface of a wall. At any given location and orientation, the robot can detect features that it can then compare to its internal map in order to locate itself and improve the map's quality. Given that the location of an object can change, a static map is often not useful for a robot attempting to adapt to its environment.
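For the stereo case described above, the geometry reduces to a simple relation when the two cameras are rectified, so that a scene point appears on the same image row in both views. The sketch below assumes that simplified setup with identical pinhole cameras; the function name and the example focal length, baseline, and principal point are hypothetical values chosen for illustration.

```python
def triangulate_rectified(u_left, u_right, v, focal_px, baseline_m, cx, cy):
    """3D point (in meters, left-camera frame) from a rectified stereo match.

    u_left, u_right: column of the same feature in the left and right image
    v:               shared row of the feature
    focal_px:        focal length in pixels
    baseline_m:      distance between the two camera centers in meters
    cx, cy:          principal point in pixels
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline_m / disparity   # depth shrinks as disparity grows
    x = (u_left - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return x, y, z

# A feature seen at column 370 (left) and 335 (right) on row 240.
point = triangulate_rectified(370, 335, 240, focal_px=700.0,
                              baseline_m=0.1, cx=320.0, cy=240.0)
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters far more for distant points than for nearby ones, which is one reason distinctive features such as corners make better map landmarks than flat walls.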