The term “embedded vision” refers to the use of computer vision in embedded systems, mobile devices, PCs, and the cloud. Stated another way, “embedded vision” refers to systems that extract meaning from visual inputs. Historically, such image analysis technology has only been found in complex, expensive systems such as military equipment, industrial robots, and quality-control inspection systems for manufacturing.
However, cost, performance, and power consumption advances in digital integrated circuits such as processors, memory devices, and image sensors are now paving the way for the proliferation of embedded vision into high-volume applications.
With a few notable exceptions, such as Microsoft's Kinect game console and computer peripheral, the bulk of today's embedded vision system designs employ 2D image sensors. 2D sensors enable a tremendous breadth and depth of vision capabilities. However, the inability of 2D image sensors to discern an object's distance from the sensor can make it difficult or impossible to implement some vision functions. And clever workarounds, such as supplementing 2D sensed representations with already known 3D models of identified objects (human hands, bodies, or faces, for example) can be too constraining.
In what kinds of applications would full 3D sensing be of notable value versus the more limited 2D alternative? Consider, for example, a gesture interface implementation.
The ability to discern motion not only up-and-down and side-to-side but also front-to-back greatly expands the variety, richness, and precision of the suite of gestures that a system can decode. Or consider face recognition, a biometrics application.
Depth sensing is valuable in determining that the object being sensed is an actual person's face, versus a photograph of that person's face. Alternative means of accomplishing this objective, such as requiring the biometric subject to blink during the sensing cycle, are inelegant in comparison.
Automotive advanced driver assistance system (ADAS) applications that benefit from 3D sensors are abundant. You can easily imagine, for example, the added value of being able to determine not only that another vehicle or object is in the roadway ahead of or behind you, but also to accurately discern its distance from you. Precisely determining the distance between your vehicle and a speed-limit-change sign is equally valuable in ascertaining how much time you have to slow down in order to avoid getting a ticket.
The need for accurate three-dimensional, no-contact scanning of real-life objects, whether for a medical instrument, in conjunction with increasingly popular 3D printers, or for some other purpose, is also obvious. And plenty of other compelling applications exist, such as 3D videoconferencing and manufacturing line “binning” and defect screening.
Stereoscopic vision (combining two 2D image sensors) is currently the most common 3D sensor approach. Passive (i.e., relying solely on ambient light) range determination via stereoscopic vision uses the disparity in viewpoints between a pair of near-identical cameras to measure the distance to a subject of interest. In this approach, the centers of perspective of the two cameras are separated by a baseline, or IPD (inter-pupillary distance), to generate the parallax necessary for depth measurement (Figure 1). Typically, the cameras’ optical axes are parallel to each other and orthogonal to the plane containing their centers of perspective.
Figure 1: Relative parallax shift as a function of distance. Subject A (nearby) induces a greater parallax than subject B (farther out), against a common background.
For a given subject distance, the IPD determines the angular separation θ of the subject as seen by the camera pair, and thus plays an important role in parallax detection. It dictates the operating range within which effective depth discrimination is possible, and it also influences depth resolution limits at various subject distances.
A relatively small baseline (i.e., several millimeters) is generally sufficient for very close operation such as gesture recognition using a mobile phone. Conversely, tracking a person’s hand from across a room requires the cameras to be spaced farther apart. Generally, it is quite feasible to achieve depth accuracy of better than an inch at distances of up to 10 feet.
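As a rough illustration of how baseline and subject distance interact, the sketch below uses the standard pinhole relation for parallel, rectified cameras, Z = f * B / d, where d is the disparity in pixels. The focal length (1,000 pixels), the two baselines, and the function names are illustrative assumptions, not values from any specific camera module.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth Z = f * B / d for a parallel, rectified camera pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_step(focal_px, baseline_m, depth_m, disparity_step_px=1.0):
    """Approximate depth change per disparity step at depth Z.

    Differentiating Z = f * B / d gives |dZ| ~= Z**2 / (f * B) * |dd|.
    """
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_step_px


# Hypothetical setup: 1,000-pixel focal length, subject 3 m away.
FOCAL_PX = 1000.0
narrow = depth_step(FOCAL_PX, baseline_m=0.06, depth_m=3.0)  # 6 cm baseline
wide = depth_step(FOCAL_PX, baseline_m=0.20, depth_m=3.0)    # 20 cm baseline
```

With these assumed numbers, a 6 cm baseline changes depth by about 15 cm per pixel of disparity at 3 m, whereas a 20 cm baseline cuts that step to about 4.5 cm, which is why longer-range operation favors wider camera spacing.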
Several implementation issues must be considered in stereoscopic vision-based designs. When the subject is in motion, accurate parallax information requires precise camera synchronization, often at fast frame rates (e.g., 120 fps). At minimum, the cameras must be synchronized at the commencement of a frame capture sequence.
An even better approach involves using a mode called “genlock”, where the line-scan timings of the two imagers are synchronized. Camera providers have developed a variety of sync-mode (using a master/slave configuration) and genlock-mode sensors for numerous applications, including forward-looking cameras in automobiles.
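To see why synchronization matters, consider a simplified model (an assumption for illustration): two parallel cameras, with the second camera triggered slightly later than the first, and a subject that moves sideways, parallel to the baseline. The subject shifts between the two exposures, and that shift is indistinguishable from parallax. The function name and numbers below are hypothetical.

```python
def apparent_depth(true_depth_m, baseline_m, lateral_speed_mps, sync_offset_s):
    """Depth reported by a stereo pair whose second camera fires late.

    Model: the second camera sits at +baseline along x and captures
    sync_offset_s after the first; the subject moves along +x at
    lateral_speed_mps. The measured disparity becomes
    f * (B - v * dt) / Z, so the reported depth is Z * B / (B - v * dt).
    The focal length cancels out of the result.
    """
    shift_m = lateral_speed_mps * sync_offset_s
    if shift_m >= baseline_m:
        raise ValueError("motion between exposures exceeds the baseline")
    return true_depth_m * baseline_m / (baseline_m - shift_m)


# Hypothetical case: 6 cm baseline, subject 2 m away moving at 1 m/s,
# cameras skewed by one frame period at 120 fps.
skewed = apparent_depth(2.0, 0.06, 1.0, 1.0 / 120.0)
synced = apparent_depth(2.0, 0.06, 1.0, 0.0)
```

Under this model, the skewed pair reports roughly 2.3 m for a subject that is actually 2 m away, an error of about 16 percent, while the synchronized pair reports the true distance.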
Alignment is another critical factor in stereoscopic vision. The lens systems must be as close to identical as possible, including magnification factors and pitch-roll-yaw orientations. Otherwise, inaccurate parallax measurements will result. Likewise, misalignment of individual lens elements within a camera module could cause varying aberrations, particularly distortions, resulting in false registration along all spatial dimensions. Occlusion, which occurs when an object or portion of an object is visible to one sensor but not to the other, is another area of concern, especially at closer ranges, but this is a challenge common in most depth sensing techniques.
Microsoft's Kinect is today's best known structured light-based 3D sensor. The structured light approach, like the time-of-flight technique to be discussed next, is one example of an active non-contact scanner; non-contact because scanning does not involve the sensor physically touching an object’s surface, and active because it generates its own electromagnetic radiation and analyzes the reflection of this radiation from the object. Typically, active non-contact scanners use lasers, LEDs, or lamps in the visible or infrared radiation range. Since these systems illuminate the object, they do not require separate controlled illumination of the object for accurate measurements. An optical sensor captures the reflected radiation.
Structured light is an optical 3D scanning method that projects a set of patterns onto an object and captures the resulting image with an image sensor that is offset from the projector. In effect, structured light replaces the second imaging sensor of the previously discussed stereoscopic vision approach with a projection component. Similar to stereoscopic vision techniques, this approach takes advantage of the known camera-to-projector separation to locate a specific point between them and compute the depth with triangulation algorithms. Thus, image processing and triangulation algorithms convert the distortion of the projected patterns, caused by the shape of the object's surface, into 3D information (Figure 2).
Three main types of scanners are used to implement structured light techniques: laser scanners, fixed-pattern scanners, and programmable-pattern scanners. Laser scanners typically utilize a laser in conjunction with a gyrating mirror to project a line on an object. This line is scanned at discrete steps across the object’s surface. An optical sensor, offset from the laser, captures each line scan on the surface of the object.
Fixed-pattern scanners utilize a laser or LED with a diffractive optical element to project a fixed pattern on the surface of the object. An optical sensor, offset from the laser, captures the projected pattern on the surface of the object. In contrast to a laser scanner, the optical sensor of a fixed-pattern scanner captures all of the projected patterns at once. Fixed-pattern scanners typically use pseudorandom binary patterns, such as those based on De Bruijn sequences or M-arrays. These pseudorandom patterns divide the acquired image into a set of sub-patterns that are easily identifiable, since each sub-pattern appears at most once in the image. Thus, this technique uses a spatial neighborhood codification approach.
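The uniqueness property can be illustrated with a one-dimensional binary De Bruijn sequence. The sketch below is a generic textbook construction, not the pattern of any particular scanner; real fixed-pattern projectors typically use two-dimensional codes, and the function names are illustrative. Every run of n consecutive symbols occurs only once, so observing any n-symbol window is enough to recover its position in the projected pattern.

```python
def de_bruijn(k, n):
    """Return a De Bruijn sequence over the alphabet 0..k-1 with window n."""
    a = [0] * (k * n)
    sequence = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return sequence


def window_index(sequence, n):
    """Map every n-symbol window of the sequence to its start position."""
    return {tuple(sequence[i:i + n]): i for i in range(len(sequence) - n + 1)}


pattern = de_bruijn(2, 4)           # 16 binary symbols
lookup = window_index(pattern, 4)   # each 4-symbol window -> position
observed = tuple(pattern[5:9])      # window seen by the camera at some pixel
position = lookup[observed]         # recovers the start position, 5
```

Once the position within the pattern is known for a camera pixel, the camera-to-projector offset supplies the second viewpoint needed for triangulation.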
Programmable-pattern scanners utilize laser, LED, or lamp illumination along with a digital spatial light modulator to project a series of patterns on the surface of the object. An optical sensor, offset from the projector, captures the projected pattern on the surface of the object. Similar to a fixed-pattern scanner, the optical sensor of the programmable-pattern scanner captures the entire projected pattern at once. The primary advantages of programmable-pattern structured light scanners versus fixed-pattern alternatives involve the ability to obtain greater depth accuracy via the use of multiple patterns, as well as to adapt the patterns in response to factors such as ambient light, the object’s surface, and the object’s optical reflection.
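One common way to exploit multiple patterns is temporal coding with binary-reflected Gray code stripes; this is an assumed example, since the technique is not specific to any scanner discussed here. Each projected pattern contributes one bit per projector column, and the bit sequence a camera pixel observes over time identifies the column that illuminated it. The function names are illustrative, and the sketch assumes the captured images have already been thresholded to 0/1.

```python
def gray_code_patterns(num_columns, num_bits):
    """Return num_bits stripe patterns, most significant bit first.

    patterns[b][x] is 0 or 1: whether projector column x is lit in
    pattern b.
    """
    if num_columns > 2 ** num_bits:
        raise ValueError("not enough bits to label every column")
    patterns = []
    for bit in range(num_bits - 1, -1, -1):
        patterns.append(
            [((x ^ (x >> 1)) >> bit) & 1 for x in range(num_columns)]
        )
    return patterns


def decode_column(bits):
    """Convert the MSB-first bits seen at one camera pixel to a column."""
    gray = 0
    for b in bits:
        gray = (gray << 1) | b
    column = 0
    while gray:          # Gray-to-binary: XOR of all right shifts
        column ^= gray
        gray >>= 1
    return column


patterns = gray_code_patterns(num_columns=1024, num_bits=10)
seen = [p[513] for p in patterns]   # bits a pixel lit by column 513 would see
column = decode_column(seen)        # recovers 513
```

Ten patterns are enough to label 1,024 projector columns, and because adjacent Gray codes differ in a single bit, a thresholding error at a stripe boundary displaces the decoded column by at most one.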
Since programmable-pattern structured light requires the projection of multiple patterns, a spatial light modulator provides a cost effective solution. Several spatial light modulation technologies exist in the market, including LCD (liquid crystal display), LCoS (liquid crystal on silicon), and DLP (digital light processing). DLP-based spatial light modulators' capabilities include fast and programmable pattern rates up to 20,000 frames per second, with 1-bit to 8-bit grey scale support, high contrast patterns, consistent and reliable performance over time and temperature, no motors or other fragile moving components, and available solutions with optical efficiency from 365 to 2500 nm wavelengths.
Structured light-based 3D sensor designs must optimize, and in some cases balance trade-offs between, multiple implementation factors. Sufficient illumination wavelength and power are needed to provide adequate dynamic range, based on ambient illumination and the scanned object's distance and reflectivity. Algorithms must be optimized for a particular application, taking into account the object's motion, topology, desired accuracy, and scanning speed. Adaptive object analysis decreases scanning speed, for example, but provides for a significant increase in accuracy. The resolution of the spatial light modulator and imaging sensor must be tailored to extract the desired accuracy from the system. This selection process primarily affects both cost and the amount of computation required.
Scanning speed is predominantly limited by image sensor performance; high-speed sensors can greatly increase system cost. Object occlusion can present problems, since the pattern projection might shadow a feature in the topology and thereby hide it from the captured image. Rotation of the scanned object, along with multiple analysis and stitching algorithms, provides a good solution for occlusion issues. Finally, system calibration must be accounted for in the design. It's possible to characterize and compensate for projection and imaging lens distortions, for example, since the measured data is based on code words, not on an image's disparity.
Time of flight
An indirect ToF (time of flight) system obtains travel time information by measuring the delay or phase shift of a modulated optical signal for all pixels in the scene. Generally, this optical signal is situated in the near-infrared portion of the spectrum so as not to disturb human vision. The ToF sensor in the system consists of an array of pixels, where each pixel is capable of determining the distance to the scene.
Each pixel measures the delay of the received optical signal with respect to the sent signal (Figure 3). A correlation function is performed in each pixel, followed by averaging or integration. The resulting correlation value then represents the travel time or delay. Since all pixels obtain this value simultaneously, snapshot 3D imaging is possible.
Figure 3: Varying sent-to-received delays correlate to varying distances between a ToF sensor and portions of an object or scene.
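A common way to turn the per-pixel correlation into a distance is four-phase sampling of a continuous-wave modulated signal. The sketch below simulates that scheme for a single pixel under an idealized sinusoidal model; the 20 MHz modulation frequency, the amplitude and offset values, and the function names are assumptions for illustration, not parameters of any particular sensor.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def simulate_samples(distance_m, mod_freq_hz, amplitude=1.0, offset=2.0):
    """Ideal correlation samples at 0, 90, 180, and 270 degrees.

    The round trip delays the signal by a phase of 4 * pi * f * d / c.
    The offset stands in for constant background light.
    """
    phase = 4.0 * math.pi * mod_freq_hz * distance_m / SPEED_OF_LIGHT
    return [
        offset + amplitude * math.cos(phase - k * math.pi / 2.0)
        for k in range(4)
    ]


def distance_from_samples(samples, mod_freq_hz):
    """Recover distance from the four correlation samples of one pixel."""
    c0, c90, c180, c270 = samples
    phase = math.atan2(c90 - c270, c0 - c180) % (2.0 * math.pi)
    return SPEED_OF_LIGHT * phase / (4.0 * math.pi * mod_freq_hz)


MOD_FREQ_HZ = 20e6  # assumed modulation frequency
samples = simulate_samples(1.5, MOD_FREQ_HZ)
recovered = distance_from_samples(samples, MOD_FREQ_HZ)  # close to 1.5 m
max_range = SPEED_OF_LIGHT / (2.0 * MOD_FREQ_HZ)         # phase wraps here
```

Because the phase is computed from differences of samples, a constant offset such as steady background light cancels in this idealized model. The phase wraps every 2π, so distances are unambiguous only up to c/(2f), about 7.5 m at the assumed 20 MHz.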
As with the other 3D sensor technologies discussed in this article, a number of challenges need to be addressed in implementing a practical ToF-based system. First, the depth resolution (or noise uncertainty) of the ToF system is linked directly to the modulation frequency, the efficiency of correlation, and the SNR (signal-to-noise ratio). These specifications are primarily determined by the quality of the pixels in the ToF sensor. Dynamic range must be maximized in order to accurately measure the depth of both close and far objects, particularly those with differing reflectivities.
Another technical challenge involves the suppression of any background ambient light present in the scene, in order to prevent sensor saturation and enable robust operation in both indoor and outdoor environments. Since more than one ToF system can be present, inter-camera crosstalk must also be eliminated. And all of these challenges must be addressed while keeping the pixel size small enough to obtain the required lateral resolution without compromising pixel accuracy.
No single 3D sensor technology can meet the needs of every application (Table A). Stereoscopic vision technology demands high software complexity in order to process and analyze highly precise 3D depth data in real time, thereby typically necessitating DSPs (digital signal processors) or multicore processors.
Stereoscopic vision sensors can be cost effective and fit in small form factors, making them a good choice for consumer electronics devices such as smartphones and tablets. But they typically cannot deliver the high accuracy and fast response time possible with other 3D sensor technologies, so they may not be the optimal choice for manufacturing quality assurance systems, for example.
Structured light technology is an ideal solution for 3D object scanning, including integration with 3D CAD (computer-aided design) systems. And structured light systems are often superior at delivering high levels of accuracy with less depth noise in indoor environments. The highly complex algorithms associated with structured light sensors can be handled by hard-wired logic such as ASICs and FPGAs, but these approaches often involve expensive development and device costs (NRE and/or per-component). The high computation complexity can also result in slower response times.
ToF systems are tailored for device control in application areas that need fast response times, such as manufacturing and consumer electronics devices. ToF systems also typically have low software complexity. However, they integrate expensive illumination parts such as LEDs and laser diodes, as well as costly high-speed interface-related parts such as fast ADCs, fast serial/parallel interfaces, and fast PWM (pulse width modulation) drivers, all of which increase bill-of-materials costs.
Industry alliance assistance
Determining the optimum 3D sensor technology for your next embedded vision design is not a straightforward undertaking. The ability to tap into the collective knowledge and experiences of a community of your engineering peers can therefore be quite helpful, along with the ability to harness the knowledge of various potential technology suppliers.
These are among the many resources offered by the Embedded Vision Alliance, a worldwide organization of semiconductor, software, and services developers and providers, whose mission is to provide engineers with practical education, information, and insights to help them incorporate embedded vision capabilities into products.
Editor’s Note: For more hands-on and up-to-date information on 3D vision, interested developers should attend the Embedded Vision Summit, a free day-long technical educational forum to be held on April 25th in San Jose, CA. The event agenda includes how-to presentations, seminars, demonstrations, and opportunities to interact with Alliance member companies. To register, go to the Summit’s online registration form.
Michael Brading is Chief Technical Officer of the Automotive Industrial and Medical business unit at Aptina Imaging. Prior to that, Mike was Vice President of Engineering at InVisage Technologies. Mike has more than 20 years of integrated circuit design experience, working with design teams all over the world. Michael was also previously the director of design and applications at Micron Technology, and the director of engineering for emerging markets. And before joining Micron Technology, he also held engineering management positions with LSI Logic. Michael has a B.S. in communication engineering from the University of Plymouth.
Kenneth Salsman is the Director of New Technology at Aptina Imaging. Kenneth has been a researcher and research manager for more than 30 years at companies such as Bell Laboratories and the Sarnoff Research Center. He was also Director of Technology Strategy for Compaq Research, and a Lead Scientist at both Compaq and Intel. Kenneth has a master's degree in Nuclear Engineering, along with an extensive background in optical physics. He was also Chief Science Officer at Innurvation, where he developed a pill-sized HD optical scanning system for imaging the gastrointestinal tract. He holds more than 48 patents.
Manjunath Somayaji is the Imaging Systems Group manager at Aptina Imaging, where he leads algorithm development efforts on novel multi-aperture/array-camera platforms. For the past ten years, he has worked on numerous computational imaging technologies such as multi-aperture cameras and extended depth of field systems. He received his M.S. degree and Ph.D. from Southern Methodist University (SMU) and his B.E. from the University of Mysore, all in Electrical Engineering. He was formerly a Research Assistant Professor in SMU's Electrical Engineering department. Prior to SMU, he worked at OmniVision-CDM Optics as a Senior Systems Engineer.
Tim Droz is the Vice President of US Operations at SoftKinetic. He joined SoftKinetic in 2011 after 10 years at Canesta, where he was Vice President of Platform Engineering and head of the Entertainment Solutions Business Unit. Before then, Tim was Senior Director of Engineering at Cylink. Tim also earlier led hardware development efforts in embedded and web-based signature capture payment terminals at pos.com, along with holding engineering positions at EDJ Enterprises and IBM. Tim earned a BSEE from the University of Virginia and an M.S. degree in Electrical and Computer Engineering from North Carolina State University.
Daniël Van Nieuwenhove is the Chief Technical Officer at SoftKinetic. He co-founded Optrima in 2009, and acted as the company's Chief Technical Officer and Vice President of Technology and Products. Optrima subsequently merged with SoftKinetic in 2010. He received an engineering degree in electronics with great distinction at the VUB (Free University of Brussels) in 2002. Daniël holds multiple patents and is the author of several scientific papers. In 2009, he obtained a Ph.D. degree on CMOS circuits and devices for 3D time-of-flight imagers. As co-founder of Optrima, he brought its proprietary 3D CMOS time-of-flight sensors and imagers to market.
Pedro Gelabert is a Senior Member of the Technical Staff and Systems Engineer at Texas Instruments. He has more than 20 years of experience in DSP algorithm development and implementation, parallel processing, ultra-low power DSP systems and architectures, DLP applications, and optical processing, along with architecting digital and mixed signal devices. Pedro received his B.S. degree and Ph.D. in electrical engineering from the Georgia Institute of Technology. He is a member of the Institute of Electrical and Electronics Engineers, holds four patents, and has published more than 40 papers, articles, user guides, and application notes.
Brian Dipert is Editor-In-Chief of the Embedded Vision Alliance. He is also a Senior Analyst at Berkeley Design Technology, Inc. He has a B.S. degree in Electrical Engineering from Purdue University in West Lafayette, IN. His professional career began at Magnavox Electronics Systems in Fort Wayne, IN; Brian subsequently spent eight years at Intel Corporation in Folsom, CA. He then spent 14 years at EDN Magazine.