Using 3D sensors to bring depth discernment to embedded vision apps

Michael Brading, Kenneth Salsman and Manjunath Somayaji, Aptina Imaging; Tim Droz and Daniël Van Nieuwenhove, SoftKinetic; Pedro Gelabert, TI; and Brian Dipert, BDTI and Embedded Vision Alliance - April 13, 2013


The term "embedded vision" refers to the use of computer vision in embedded systems, mobile devices, PCs, and the cloud. Stated another way, "embedded vision" refers to systems that extract meaning from visual inputs. Historically, such image analysis technology has only been found in complex, expensive systems such as military equipment, industrial robots, and quality-control inspection systems for manufacturing.

However, cost, performance, and power consumption advances in digital integrated circuits such as processors, memory devices, and image sensors are now paving the way for the proliferation of embedded vision into high-volume applications.

With a few notable exceptions, such as Microsoft's Kinect game console and computer peripheral, the bulk of today's embedded vision system designs employ 2D image sensors. 2D sensors enable a tremendous breadth and depth of vision capabilities. However, the inability of 2D image sensors to discern an object's distance from the sensor can make it difficult or impossible to implement some vision functions. And clever workarounds, such as supplementing 2D sensed representations with already known 3D models of identified objects (human hands, bodies, or faces, for example) can be too constraining.

In what kinds of applications would full 3D sensing be of notable value versus the more limited 2D alternative? Consider, for example, a gesture interface implementation.

The ability to discern motion not only up-and-down and side-to-side but also front-to-back greatly expands the variety, richness, and precision of the suite of gestures that a system can decode. Or consider face recognition, a biometrics application.

Depth sensing is valuable in determining that the object being sensed is an actual person's face, versus a photograph of that person's face. Alternative means of accomplishing this objective, such as requiring the biometric subject to blink during the sensing cycle, are inelegant in comparison.

Automotive advanced driver assistance system (ADAS) applications that benefit from 3D sensors are abundant. You can easily imagine, for example, the added value of being able to determine not only that another vehicle or object is in the roadway ahead of or behind you, but also to accurately discern its distance from you. Precisely determining the distance between your vehicle and a speed-limit-change sign is equally valuable in ascertaining how much time you have to slow down in order to avoid getting a ticket.

The need for accurate three-dimensional, no-contact scanning of real-life objects, whether for a medical instrument, in conjunction with increasingly popular 3D printers, or for some other purpose, is also obvious. And plenty of other compelling applications exist such as 3D videoconferencing, manufacturing line "binning" and defect screening , etc.

Stereoscopic vision
Stereoscopic vision (combining two 2-D image sensors) is currently the most common 3D sensor approach. Passive (i.e. relying solely on ambient light) range determination via stereoscopic vision utilizes the disparity in viewpoints between a pair of near-identical cameras to measure the distance to a subject of interest. In this approach, the centers of perspective of the two cameras are separated by a baseline or IPD (inter-pupillary distance) to generate the parallax necessary for depth measurement (Figure 1). Typically, the cameras’ optical axes are parallel to each other and orthogonal to the plane containing their centers of perspective.


Figure 1: Relative parallax shift as a function of distance. Subject A (nearby) induces a greater parallax than subject B (farther out), against a common background.

For a given subject distance, the IPD determines the angular separation θ of the subject as seen by the camera pair, and thus plays an important role in parallax detection. It dictates the operating range within which effective depth discrimination is possible, and it also influences depth resolution limits at various subject distances.

A relatively small baseline (i.e. several millimeters) is generally sufficient for very close operation such as gesture recognition using a mobile phone. Conversely, tracking a person’s hand from across a room requires the cameras to be spaced further apart. Generally, it is quite feasible to achieve depth accuracies of less than an inch at distances of up to 10 feet.

Implementation issues that must be considered in stereoscopic vision-based designs include the fact that when the subject is in motion, accurate parallax information requires precise camera synchronization, often at fast frame rates (e.g., 120 fps). The cameras must be, at minimum, synchronized during the commencement of a frame capture sequence.

An even better approach involves using a mode called “genlock”, where the line-scan timings of the two imagers are synchronized. Camera providers have developed a variety of sync-mode (using a master/slave configuration) and genlock-mode sensors for numerous applications, including forward-looking cameras in automobiles.

Alignment is another critical factor in stereoscopic vision. The lens systems must be as close to identical as possible, including magnification factors and pitch-roll-yaw orientations. Otherwise, inaccurate parallax measurements will result. Likewise, misalignment of individual lens elements within a camera module could cause varying aberrations, particularly distortions, resulting in false registration along all spatial dimensions. Occlusion, which occurs when an object or portion of an object is visible to one sensor but not to the other, is another area of concern, especially at closer ranges, but this is a challenge common in most depth sensing techniques.

Structured light
Microsoft's Kinect is today's best known structured light-based 3D sensor. The structured light approach, like the time-of-flight technique to be discussed next, is one example of an active non-contact scanner; non-contact because scanning does not involve the sensor physically touching an object’s surface, and active because it generates its own electromagnetic radiation and analyzes the reflection of this radiation from the object. Typically, active non-contact scanners use lasers, LEDs, or lamps in the visible or infrared radiation range. Since these systems illuminate the object, they do not require separate controlled illumination of the object for accurate measurements. An optical sensor captures the reflected radiation.

Structured light is an optical 3D scanning method that projects a set of patterns onto an object, capturing the resulting image with an image sensor. The image sensor is offset from the projected patterns. Structured light replaces the previously discussed stereoscopic vision sensor's second imaging sensor with a projection component. Similar to stereoscopic vision techniques, this approach takes advantage of the known camera-to-projector separation to locate a specific point between them and compute the depth with triangulation algorithms. Thus, image processing and triangulation algorithms convert the distortion of the projected patterns, caused by surface roughness, into 3D information (Figure 2).


Figure 2: An example structured light implementation using a DLP-based modulator.

Three main types of scanners are used to implement structured light techniques: laser scanners, fixed-pattern scanners, and programmable-pattern scanners. Laser scanners typically utilize a laser in conjunction with a gyrating mirror to project a line on an object. This line is scanned at discrete steps across the object’s surface. An optical sensor, offset from the laser, captures each line scan on the surface of the object.

Fixed-pattern scanners utilize a laser or LED with a diffractive optical element to project a fixed pattern on the surface of the object. An optical sensor, offset from the laser, captures the projected pattern on the surface of the object. In contrast to a laser scanner, the optical sensor of a fixed-pattern scanner captures all of the projected patterns at once. Fixed-pattern scanners typically use pseudorandom binary patterns, such as those based on De Bruijn sequences or M-arrays. These pseudorandom patterns divide the acquired image into a set of sub-patterns that are easily identifiable, since each sub-pattern appears at most once in the image. Thus, this technique uses a spatial neighborhood codification approach.

Programmable-pattern scanners utilize laser, LED, or lamp illumination along with a digital spatial light modulator to project a series of patterns on the surface of the object. An optical sensor, offset from the projector, captures the projected pattern on the surface of the object. Similar to a fixed-pattern scanner, the optical sensor of the programmable-pattern scanner captures the entire projected pattern at once. The primary advantages of programmable-pattern structured light scanners versus fixed-pattern alternatives involve the ability to obtain greater depth accuracy via the use of multiple patterns, as well as to adapt the patterns in response to factors such as ambient light, the object’s surface, and the object’s optical reflection.

Since programmable-pattern structured light requires the projection of multiple patterns, a spatial light modulator provides a cost effective solution. Several spatial light modulation technologies exist in the market, including LCD (liquid crystal display), LCoS (liquid crystal on silicon), and DLP (digital light processing). DLP-based spatial light modulators' capabilities include fast and programmable pattern rates up to 20,000 frames per second, with 1-bit to 8-bit grey scale support, high contrast patterns, consistent and reliable performance over time and temperature, no motors or other fragile moving components, and available solutions with optical efficiency from 365 to 2500 nm wavelengths.

Structured light-based 3D sensor designs must optimize, and in some cases balance trade-offs between, multiple implementation factors. Sufficient illumination wavelength and power are needed to provide adequate dynamic range, based on ambient illumination and the scanned object's distance and reflectivity. Algorithms must be optimized for a particular application, taking into account the object's motion, topology, desired accuracy, and scanning speed. Adaptive object analysis decreases scanning speed, for example, but provides for a significant increase in accuracy. The resolution of the spatial light modulator and imaging sensor must be tailored to extract the desired accuracy from the system. This selection process primarily affects both cost and the amount of computation required.

Scanning speed is predominantly limited by image sensor performance; high-speed sensors can greatly increase system cost. Object occlusion can present problems, since the pattern projection might shadow a feature in the topology and thereby hide it from the captured image. Rotation of the scanned object, along with multiple analysis and stitching algorithms, provides a good solution for occlusion issues. Finally, system calibration must be comprehended in the design. It's possible to characterize and compensate for projection and imaging lens distortions, for example, since the measured data is based on code words, not on an image's disparity.


Page 2 of 2
Time of flight
An indirect ToF (time of flight) system obtains travel time information by measuring the delay or phase shift of a modulated optical signal for all pixels in the scene. Generally, this optical signal is situated in the near-infrared portion of the spectrum so as not to disturb human vision. The ToF sensor in the system consists of an array of pixels, where each pixel is capable of determining the distance to the scene.

Each pixel measures the delay of the received optical signal with respect to the sent signal (Figure 3). A correlation function is performed in each pixel, followed by averaging or integration. The resulting correlation value then represents the travel time or delay. Since all pixels obtain this value simultaneously, snap-shot 3D imaging is possible.


Figure 3: Varying sent-to-received delays correlate to varying distances between a ToF sensor and portions of an object or scene.

As with the other 3D sensor technologies discussed in this article, a number of challenges need to be addressed in implementing a practical ToF-based system. First, the depth resolution (or noise uncertainty) of the ToF system is linked directly to the modulation frequency, the efficiency of correlation, and the SNR (signal-to-noise ratio). These specifications are primarily determined by the quality of the pixels in the ToF sensor. Dynamic range must be maximized in order to accurately measure the depth of both close and far objects, particularly those with differing reflectivities.

Another technical challenge involves the suppression of any background ambient light present in the scene, in order to prevent sensor saturation and enable robust operation in both indoor and outdoor environments. Since more than one ToF system can be present, inter-camera crosstalk must also be eliminated. And all of these challenges must be addressed while keeping the pixel size small enough to obtain the required lateral resolution without compromising pixel accuracy.

Technology comparisons
No single 3D sensor technology can meet the needs of every application (Table A). Stereoscopic vision technology demands high software complexity in order to process and analyze highly precise 3D depth data in real time, thereby typically necessitating DSPs (digital signal processors) or multicore processors.

Stereoscopic vision sensors can be cost effective and fit in small form factors, making them a good choice for consumer electronics devices such smartphones and tablets. But they typically cannot deliver the high accuracy and fast response time possible with other 3D sensor technologies, so they may not be the optimal choice for manufacturing quality assurance systems, for example.


Table A: 3D vision sensor technology comparisons.

Structured light technology is an ideal solution for 3D object scanning, including integration with 3D CAD (computer-aided design) systems. And structured light systems are often superior at delivering high levels of accuracy with less depth noise in indoor environments. The highly complex algorithms associated with structured light sensors can be handled by hard-wired logic such as ASICs and FPGAs, but these approaches often involve expensive development and device costs (NRE and/or per-component). The high computation complexity can also result in slower response times.

ToF systems are tailored for device control in application areas that need fast response times, such as manufacturing and consumer electronics devices. ToF systems also typically have low software complexity. However, they integrate expensive illumination parts such as LEDs and laser diodes, as well as costly high-speed interface-related parts such as fast ADCs, fast serial/parallel interfaces, and fast PWM (pulse width modulation) drivers, all of which increase bill-of-materials costs.

Industry alliance assistance
Determining the optimum 3D sensor technology for your next embedded vision design is not a straightforward undertaking. The ability to tap into the collective knowledge and experiences of a community of your engineering peers can therefore be quite helpful, along with the ability to harness the knowledge of various potential technology suppliers.

These are among the many resources offered by the Embedded Vision Alliance, a worldwide organization of semiconductor, software, and services developers and providers, whose mission is to provide engineers with practical education, information, and insights to help them incorporate embedded vision capabilities into products.

Editor’s Note: For more hands-on and up-to-date information on 3D vision, interested developers should attend the Embedded Vision Summit, a free day-long technical educational forum to be held on April 25th in San Jose, Ca. The event agenda includes how-to presentations, seminars, demonstrations, and opportunities to interact with Alliance member companies.To register, go to the Summit’s online registration form.

Michael Brading is Chief Technical Officer of the Automotive Industrial and Medical business unit at Aptina Imaging. Prior to that, Mike was Vice President of Engineering at InVisage Technologies. Mike has more than 20 years of integrated circuit design experience, working with design teams all over the world. Michael was also previously the director of design and applications at Micron Technology, and the director of engineering for emerging markets. And before joining Micron Technology, he also held engineering management positions with LSI Logic. Michael has a B.S. in communication engineering from the University of Plymouth.

Kenneth Salsman is the Director of New Technology at Aptina Imaging. Kenneth has been a researcher and research manager for more than 30 years at companies such as Bell Laboratories and the Sarnoff Research Center. He was also Director of Technology Strategy for Compaq Research, and a Lead Scientist at both Compaq and Intel. Kenneth has a Masters degree in Nuclear Engineering, along with an extensive background in optical physics. He was also Chief Science Officer at Innurvation, where he developed a pill sized HD optical scanning system for imaging the gastrointestinal tract. He holds more than 48 patents.

Manjunath Somayaji is the Imaging Systems Group manager at Aptina Imaging, where he leads algorithm development efforts on novel multi-aperture/array-camera platforms. For the past ten years, he has worked on numerous computational imaging technologies such as multi-aperture cameras and extended depth of field systems. He received his M.S. degree and Ph.D. from Southern Methodist University (SMU) and his B.E. from the University of Mysore, all in Electrical Engineering. He was formerly a Research Assistant Professor in SMU's Electrical Engineering department. Prior to SMU, he worked at OmniVision-CDM Optics as a Senior Systems Engineer.

Tim Droz is the Vice President of US Operations at SoftKinetic. He joined SoftKinetic in 2011 after 10 years at Canesta, where he was Vice President of Platform Engineering and head of the Entertainment Solutions Business Unit. Before then, Tim was Senior Director of Engineering at Cylink. Tim also earlier led hardware development efforts in embedded and web-based signature capture payment terminals at pos.com, along with holding engineering positions at EDJ Enterprises and IBM. Tim earned a BSEE from the University of Virginia and a M.S. degree in Electrical and Computer Engineering from North Carolina State University.

Daniël Van Nieuwenhove is the Chief Technical Officer at SoftKinetic. He co-founded Optrima in 2009, and acted as the company's Chief Technical Officer and Vice President of Technology and Products. Optrima subsequently merged with SoftKinetic in 2010. He received an engineering degree in electronics with great distinction at the VUB (Free University of Brussels) in 2002. Daniël holds multiple patents and is the author of several scientific papers. In 2009, he obtained a Ph.D. degree on CMOS circuits and devices for 3D time-of-flight imagers. As co-founder of Optrima, he brought its proprietary 3D CMOS time-of-flight sensors and imagers to market.

Pedro Gelabert is a Senior Member of the Technical Staff and Systems Engineer at Texas Instruments. He has more than 20 years of experience in DSP algorithm development and implementation, parallel processing, ultra-low power DSP systems and architectures, DLP applications, and optical processing, along with architecting digital and mixed signal devices. Pedro received his B.S. degree and Ph.D. in electrical engineering from the Georgia Institute of Technology. He is a member of the Institute of Electrical and Electronics Engineers, holds four patents and has published more than 40 papers, articles, user guides, and application notes.

Brian Dipert is Editor-In-Chief of the Embedded Vision Alliance. He is also a Senior Analyst at Berkeley Design Technology, Inc. He has a B.S. degree in Electrical Engineering from Purdue University in West Lafayette, IN. His professional career began at Magnavox Electronics Systems in Fort Wayne, IN; Brian subsequently spent eight years at Intel Corporation in Folsom, CA. He then spent 14 years at EDN Magazine.