Once-hot technology markets such as computers and smartphones are beginning to cool off; analyst firm IDC, for example, forecast earlier this year that smartphone sales will increase only 19% this year, down from 39% in 2013. IDC also believes that beginning in 2018, annual smartphone sales increases will diminish to single-digit rates. Semiconductor, software, and electronic systems suppliers are therefore searching for the next growth opportunities, and wearable devices are likely candidates.
Analyst firm Canalys, for example, recently forecast shipments of more than 17 million ‘smart band’ wearables this year, which it predicts will be the year the product category becomes a key consumer technology. Canalys expects shipments to expand to more than 23 million units by 2015 and over 45 million by 2017. In the near term, the bulk of wearable shipments will consist of activity trackers and other smart bands, point of view cameras, and smart watches, but other wearable product types will also become more common, including smart glasses and ‘life recorder’ devices.
These wearable products can be greatly enhanced (and in some cases are fundamentally enabled) by their ability to process incoming still and video image information. Vision processing is more than just capturing snapshots and video clips for subsequent playback and sharing; it involves automated analysis of the scene and its constituent elements, along with appropriate device responses based on the analysis results. Historically known as ‘computer vision’, vision processing has traditionally been the bailiwick of large, heavy, expensive, and power-hungry PCs and servers.
Now, however, although ‘cloud’-based processing may be used in some cases, the combination of fast, high-quality, inexpensive, and energy-efficient processors, image sensors, and software is enabling robust vision processing to take place right on your wrist (or your face, or elsewhere on your person), at price points that enable adoption by the masses. And an industry alliance composed of leading technology and service suppliers is a key factor in this burgeoning technology success story.
Form factor alternatives
The product category known as ‘wearables’ comprises a number of specific product types in which vision processing is a compelling fit. Perhaps the best known of these, by virtue of Google’s advocacy, are the ‘smart glasses’ exemplified by Google Glass (Figure 1). The current Glass design contains a single camera capable of capturing 5 Mpixel images and 720p streams. Its base functionality encompasses both conventional still and video photography. But Google Glass is capable of much more, as both Google's and third-party developers' initial applications are making clear.
Figure 1: Google Glass has singlehandedly created the smart glasses market (a), for which vision processing-enabled gestures offer a compelling alternative to clumsy button presses for user interface control purposes (b).
Consider that object recognition enables you to comparison-shop, displaying a list of prices offered by online and brick-and-mortar merchants for a product that you're currently looking at. Consider that this same object recognition capability, in ‘sensor fusion’ combination with GPS, compass, barometer/altimeter, accelerometer, gyroscope, and other facilities, enables those same smart glasses to provide you with augmented reality information about your vacation sight-seeing scenes. And consider that facial recognition will someday provide augmented data about the person standing in front of you, whose name you may or may not already recall.
Trendsetting current products suggest that these concepts will all become mainstream in the near future. Amazon's Fire Phone, for example, offers Firefly vision processing technology, which enables a user to “quickly identify printed web and email addresses, phone numbers, QR and bar codes, plus over 100 million items, including movies, TV episodes, songs, and products.”
OrCam's smart camera accessory for glasses operates similarly; intended for the visually impaired, it recognizes text and products, and speaks to the user via a bone-conduction earpiece. And although real-time individual identification via facial analysis may not yet be feasible in a wearable device, a system developed by the Fraunhofer Institute already enables accurate discernment of the age, gender, and emotional state of the person your Google Glass set is looking at.
While a single camera is capable of implementing such features, speed and accuracy can be improved when a depth sensor is employed. Smart glasses' dual-lens arrangement is a natural fit for a dual-camera stereoscopic depth-discerning setup. Other 3D sensor technologies such as time-of-flight and structured light are also possibilities.
And, versus a smartphone or tablet, smart glasses' thicker form factors are amenable to the inclusion of deeper-dimensioned, higher quality optics. 3D sensors are also beneficial in accurately discerning finely detailed gestures used to control the glasses' various functions, in addition to (or instead of) button presses, voice commands, and Tourette Syndrome-reminiscent head twitches.
Point of view (POV) cameras are another wearable product category that can benefit from vision processing-enabled capabilities (Figure 2). Currently, they're most commonly used to capture the wearer's experiences during challenging activities such as bicycling, motorcycling, snowboarding, surfing, and the like. In such cases, a gesture-based interface to start and stop recording may be preferable to button presses, which are difficult or impossible when wearing thick gloves or in other situations where using fingers is clumsy.
Figure 2: The point of view (POV) camera is increasingly “hot”, as GoPro's recent initial public offering and subsequent stock-price doubling exemplify (a). With both the POV camera and the related (and more embryonic) ‘life camera’, which has experienced rapid product evolution (b), intelligent image post-processing to cull uninteresting portions of the content is a valuable capability.
POV cameras are also increasingly being used in situations where wearer control isn't an option, such as when they're strapped to pets or mounted on drones. And the constantly recording, so-called ‘life camera’ is beginning to transition from a research oddity to an early-adopter, trendsetter device. In all of these examples, computational photography intelligence can render in final form only those images and video frame sequences whose content is of greatest interest to potential viewers, versus generating content containing a high percentage of boring or otherwise unappealing material (analogies to snooze-inducing slide shows many of us have been forced to endure by friends and family members are apt).
Vision processing that's 'handy'
A wrist-strapped companion (or competitor) to smart glasses is the ‘smart watch’, intended to deliver a basic level of functionality when used standalone, along with an enhanced set of capabilities in conjunction with a wirelessly tethered smartphone, tablet, or other device (Figure 3). While a camera-inclusive smart watch could act as a still or video image capture device, its wrist-located viewpoint might prove to be inconvenient for all but occasional use.
Figure 3: The Android-based Moto 360 is a popular example of a first-generation smart watch, a product category that will benefit from the inclusion of visual intelligence in future iterations.
However, in smart watches, gesture interface support for flipping through multiple diminutive screens of information is more obviously appealing, particularly given that the touch screen alternative is frequently hampered by sweat and other moisture sources, not to mention non-conductive gloves. Consider, too, that a periodically polling camera, driving facial detection software, could keep the smart watch's display disabled, thereby saving battery life, unless you're looking at the watch.
Finally, let's consider the other commonly mentioned wrist wearable, the activity tracker, also referred to as the fitness band or smart band when worn on the wrist (Figure 4). A recently announced smartphone-based application gives an early glimpse into how vision may evolve in this product category. The app works in conjunction with Jawbone's Up fitness band to supplement the band's calorie-expenditure measurements by tracking food (i.e. calorie) intake in order to get a fuller picture of fitness, weight loss, and other related trends.
Figure 4: Fitness bands (a) and other activity trackers (b) are increasingly popular wearable products for which vision processing is also a likely near-future feature fit.
However, Jawbone's software currently requires that the user truthfully and consistently enter meal and snack information manually. What if, instead, object recognition algorithms were used to automatically identify the items on the user's plate and, after also assessing their portion sizes, estimate their calorie counts? And what if, instead of running on a separate mobile device as is currently the case, those algorithms were to leverage a camera built directly into the activity tracker?
Let's look first at the ability of various vision-processing functions to detect and recognize objects in the field of view of the wearable device. In addition to the already-mentioned applications of the technology, this function may also be used to automatically tag images in real time while doing video recording in a POV or life camera. Such a feature can be useful in generating metadata associated with detected and recognized objects to make the resulting video much more search-friendly. Object recognition can also be combined with gaze tracking in a smart glasses implementation so that only those objects specifically being looked at are detected and classified.
Object detection and recognition will also be a key component of augmented reality (AR) across a number of applications for wearables, ranging from gaming to social media, advertising, and navigation. Natural feature recognition for AR applications uses a feature matching approach, recognizing objects by matching features in a query image to a source image. The result is a flexible ability to train applications on images and use natural feature recognition for AR, employing wirelessly delivered, augmented information coming from social media, encyclopedias, or other online sources, and displayed using graphic overlays.
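As a rough illustration of the query-to-source matching step, the following Python sketch compares binary feature descriptors by Hamming distance and keeps only unambiguous matches. It assumes that descriptors have already been extracted and are stored as plain integers; the function names, the 0.8 ratio threshold, and the toy 8-bit descriptors are illustrative assumptions, not part of any toolset named in this article.

```python
from typing import List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def match_descriptors(
    query: List[int], source: List[int], max_ratio: float = 0.8
) -> List[Tuple[int, int, int]]:
    """Return (query_index, source_index, distance) for confident matches.

    A match is kept only if the best distance is clearly smaller than the
    second-best distance (a ratio test), which rejects ambiguous features.
    """
    if not source:
        return []
    matches = []
    for qi, q in enumerate(query):
        # Distance from this query feature to every source feature.
        dists = sorted((hamming(q, s), si) for si, s in enumerate(source))
        best_d, best_i = dists[0]
        second_d = dists[1][0] if len(dists) > 1 else None
        if second_d is None or best_d < max_ratio * second_d:
            matches.append((qi, best_i, best_d))
    return matches


# Toy 8-bit descriptors (real binary descriptors are much longer).
source_features = [0b11110000, 0b00001111, 0b10101010]
query_features = [0b11110001, 0b00111100]
matches = match_descriptors(query_features, source_features)
```

In this toy case, the first query feature differs from one source feature by a single bit and is matched, while the second is equally distant from all three source features and is discarded as ambiguous. A production system would typically add geometric verification on top of this step.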
Natural feature detection and tracking avoids the need to use more restrictive marker-based approaches, wherein pre-defined fiducial markers are required to trigger the AR experience, although it's more challenging than marker-based AR from a processing standpoint. Trendsetting feature-tracking applications can be built today using toolsets such as Catchoom's CraftAR, and as the approach becomes more pervasive, it will allow for real-time recognition of objects in users' surroundings, with an associated AR experience.
Adding depth sensing to the AR experience brings surfaces and rooms to life in retail and other markets. The IKEA Catalog AR application, for example, gives you the ability to place virtual furniture in your own home using a mobile electronics device containing a conventional camera. You start by scanning a piece of furniture in an IKEA catalog page, and then “use the catalog itself to judge the approximate scale of the furnishings – measuring the size of the catalog itself (laid on the floor) in the camera and creating an augmented reality image of the furnishings so it appears correctly in the room.”
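The reference-object idea behind this approach can be sketched with simple proportionality. The Python fragment below is a minimal illustration, not IKEA's actual method: it assumes that the virtual object sits at roughly the same distance from the camera as the reference object and that perspective distortion is negligible. The function names and all numeric values are hypothetical.

```python
def meters_per_pixel(reference_width_m: float, reference_width_px: float) -> float:
    """Image scale implied by a reference object of known physical width."""
    if reference_width_px <= 0:
        raise ValueError("reference must span a positive number of pixels")
    return reference_width_m / reference_width_px


def overlay_width_px(
    object_width_m: float, reference_width_m: float, reference_width_px: float
) -> float:
    """On-screen width at which to draw a virtual object of known size."""
    return object_width_m / meters_per_pixel(reference_width_m, reference_width_px)


# Hypothetical numbers: a 0.21 m wide catalog spans 140 pixels in the image,
# and the virtual sofa is 2.1 m wide.
sofa_px = overlay_width_px(2.1, 0.21, 140)
```

With these assumed numbers the sofa would be drawn ten times as wide as the catalog appears, because it is ten times as wide physically. The scale only holds near the reference object, which is one reason a direct depth measurement is more robust.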
With the addition of a depth sensor in a tablet or cellphone, such as one of Google's prototype Project Tango devices, the need for the physical catalog as a measuring device is eliminated, as the physical dimensions of the room are measured directly, and the furnishings in the catalog can be accurately placed to scale in the scene.
Not just hand waving
Wearable devices can include various types of human/machine interfaces (HMIs). These interfaces can be classified into two main categories: behavior analysis and intention analysis. Behavior analysis uses the vision-enabled wearable device for functions such as sign language translation and lip reading, along with behavior interpretation for various security and surveillance applications. Intention analysis for device control includes such vision-based functions as gesture recognition, gaze tracking, and emotion detection, along with voice commands. By means of intention analysis, a user can control the wearable device and transfer relevant information to it for various activities such as games and AR applications.
Intention analysis use cases can also involve wake-up mechanisms for the wearable. For example, a smart watch with a camera that is otherwise in sleep mode may keep a small amount of power allocated to the image sensor and a vision-processing core to enable a vision-based wake-up system. The implementation might involve a simple gesture (like a hand wave) in combination with face detection (to confirm that the discerned object motion was human-sourced) to activate the device. Such vision processing needs to occur at ~1 mA current draw levels in order to not adversely impact battery life.
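The gesture-plus-face wake-up logic described above can be sketched as a small state machine. This Python example is only an illustration under stated assumptions: the two boolean inputs stand in for the outputs of real gesture and face detectors running on low-rate frames, and the class name and the five-frame confirmation window are hypothetical.

```python
class WakeGate:
    """Wake only when a gesture is confirmed by a face shortly afterwards."""

    def __init__(self, confirm_window: int = 5) -> None:
        # Number of frames (counting the gesture frame itself) in which a
        # face must be seen for the gesture to be accepted.
        self.confirm_window = confirm_window
        self._frames_left = 0

    def update(self, gesture_seen: bool, face_seen: bool) -> bool:
        """Feed one frame's detector outputs; return True to wake the device."""
        if gesture_seen:
            # Every gesture (re)opens the confirmation window.
            self._frames_left = self.confirm_window
        if self._frames_left > 0:
            if face_seen:
                self._frames_left = 0
                return True
            self._frames_left -= 1
        return False


gate = WakeGate(confirm_window=3)
frames = [(False, True), (True, False), (False, False), (False, True), (False, True)]
decisions = [gate.update(gesture, face) for gesture, face in frames]
```

Here a face on its own never wakes the device, and a gesture that is not followed by a face within the window is ignored. Running the inexpensive gesture check first and the face check only while the window is open is one way to keep the average processing load low.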
Wearable devices will drive computational photography forward by enabling more advanced camera subsystems and in general presenting new opportunities for image capture and vision processing. For example, smart glasses' deeper form factor compared to smartphones allows for a thicker camera module, which enables the use of a higher quality optical zoom function along with (or instead of) pixel-interpolating digital zoom capabilities. The ~6″ baseline distance between glasses' temples also inherently enables wider stereoscopic camera-to-camera spacing than is possible in a smartphone or tablet form factor, thereby allowing for accurate use over a wider depth range.
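The link between baseline and usable depth range follows from the standard rectified-stereo relation Z = f·B/d, where Z is depth, f is focal length in pixels, B is baseline, and d is disparity in pixels. The Python sketch below evaluates the resulting first-order depth uncertainty for two baselines. The 700-pixel focal length, the 2 cm smartphone baseline, and the quarter-pixel disparity error are assumed values chosen for illustration only.

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Depth of a point seen by a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error(
    depth_m: float, focal_px: float, baseline_m: float, disparity_error_px: float = 0.25
) -> float:
    """First-order depth uncertainty: dZ ~= Z^2 * dd / (f * B)."""
    return depth_m ** 2 * disparity_error_px / (focal_px * baseline_m)


FOCAL_PX = 700.0             # assumed focal length, in pixels
GLASSES_BASELINE_M = 0.1524  # 6 inches, per the article
PHONE_BASELINE_M = 0.02      # assumed smartphone camera spacing

error_glasses = depth_error(3.0, FOCAL_PX, GLASSES_BASELINE_M)
error_phone = depth_error(3.0, FOCAL_PX, PHONE_BASELINE_M)
```

Because the error is inversely proportional to the baseline, the wider spacing reduces depth uncertainty at any given distance by the ratio of the two baselines, regardless of the focal length assumed.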
One important function needed for a wearable device is stabilization for both still and video images. While the human body (especially the head) naturally provides some stabilization, wearable devices will still experience significant oscillation and will therefore require robust digital stabilization facilities. Furthermore, wearable devices will frequently be used outdoors and will therefore benefit from algorithms that compensate for environmental variables such as changing light and weather conditions.
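One common way to approach digital stabilization is to smooth the estimated camera trajectory and shift each frame onto the smoothed path. The Python sketch below shows that idea in one dimension. It assumes that frame-to-frame shifts (in pixels) have already been estimated by some motion-estimation step that is not shown, and the function names and window radius are illustrative choices rather than a description of any particular product.

```python
from typing import List


def smooth_path(path: List[float], radius: int = 2) -> List[float]:
    """Centered moving average; the window shrinks at the sequence ends."""
    smoothed = []
    for i in range(len(path)):
        lo = max(0, i - radius)
        hi = min(len(path), i + radius + 1)
        window = path[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def stabilizing_offsets(frame_shifts: List[float], radius: int = 2) -> List[float]:
    """Per-frame correction that moves each frame onto the smoothed path."""
    # Accumulate frame-to-frame shifts into an absolute camera trajectory.
    path, position = [], 0.0
    for shift in frame_shifts:
        position += shift
        path.append(position)
    smoothed = smooth_path(path, radius)
    return [s - p for s, p in zip(smoothed, path)]


steady_pan = stabilizing_offsets([1, 1, 1, 1, 1], radius=1)
jitter = stabilizing_offsets([0, 3, -3, 3, -3], radius=1)
```

A steady pan needs no correction away from the ends of the sequence, so intentional camera motion is preserved, whereas an oscillating input produces corrections that counteract the shake.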
These challenges to image quality will require strong image enhancement filters for noise removal, night-shot capabilities, dust handling, and more. Image quality becomes even more important with applications such as image mosaic, which builds up a panoramic view by capturing multiple frames of a scene. Precise computational photography to even out frame-to-frame exposure and stabilization differences is critical to generating a high quality mosaic.
Depth-discerning sensors have already been mentioned as beneficial in object recognition and gesture interface applications. They're applicable to computational photography as well, in supporting capabilities such as high dynamic range (HDR) and super-resolution (an advanced implementation of pixel interpolation).
Plus, they support plenoptic camera features that allow for post-image-capture selective refocus on a portion of a scene, and other capabilities. All of these functions are compute-intensive, and wearable devices are especially challenging in this regard, given their constraints on size, weight, cost, power consumption, and heat dissipation.
Processing locations and allocations
One key advantage of using smart glasses for image capture and processing is ease of use: the user just records what he or she is looking at, hands-free. In combination with the ability to use higher quality cameras with smart glasses, vision processing in wearable devices makes a lot of sense. However, the batteries in today's wearable devices are much smaller than those in other mobile electronics devices: 570 mAh with Google Glass, for example, vs ~2000 mAh for high-end smartphones.
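A back-of-the-envelope calculation shows why these capacity figures matter. The Python sketch below divides battery capacity by average current draw to estimate runtime. It is an idealized model that ignores conversion losses and battery aging, and the 300 mA figure for continuous on-device vision processing is a purely hypothetical value used for comparison.

```python
def runtime_hours(capacity_mah: float, average_current_ma: float) -> float:
    """Idealized runtime: capacity divided by average current draw."""
    if average_current_ma <= 0:
        raise ValueError("average current must be positive")
    return capacity_mah / average_current_ma


GLASS_MAH = 570.0    # Google Glass battery capacity, per the article
PHONE_MAH = 2000.0   # approximate high-end smartphone capacity, per the article
ACTIVE_MA = 300.0    # hypothetical draw for continuous vision processing

standby_hours = runtime_hours(GLASS_MAH, 1.0)       # ~1 mA always-on budget
glass_active_hours = runtime_hours(GLASS_MAH, ACTIVE_MA)
phone_active_hours = runtime_hours(PHONE_MAH, ACTIVE_MA)
```

Under these assumptions, an always-on function drawing about 1 mA could run for weeks on the smaller battery, while a sustained heavy processing load would exhaust it in a couple of hours, several times sooner than the same load would drain a smartphone.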
Hence, it is currently difficult to do all of the necessary vision processing in a wearable device, due to power consumption limitations. Evolutions and revolutions in vision processors will make a completely resident processing scenario increasingly likely in the future. Meanwhile, in the near term, a portion of the processing may instead be done on a locally tethered device such as a smartphone or tablet, and/or at cloud-based servers. Note that the decision to do local vs. remote processing doesn't involve battery life exclusively; thermal issues are also at play. The heat generated by compute-intensive processing can produce discomfort, as has been noted with existing smart glasses even during prolonged video recording sessions where no post-processing occurs.
When doing video analysis, feature detection and extraction can today be performed directly on the wearable device, with the generated metadata transmitted to a locally tethered device for object matching either there or, via the local device, in the cloud. Similarly, when using the wearable device for video recording with associated image tagging, vision processing to generate the image tag metadata can currently be done on the wearable device, with post-processing then continuing on an external device for power savings.
For 3-D discernment, a depth map can be generated on the wearable device (at varying processing load requirements depending on the specific depth camera technology chosen), with the point cloud map then sent to an external device to be used for classification or (for AR) camera pose estimation. Regardless of whether post-processing occurs on a locally tethered device or in the cloud, some amount of pre-processing directly on the wearable device is still desirable in order to reduce data transfer bandwidth locally over Bluetooth or Wi-Fi (therefore saving battery life) or over a cellular wireless broadband connection to the Internet.
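To give a sense of how much on-device pre-processing can shrink the data to be transferred, the Python sketch below compares the size of an uncompressed 720p grayscale frame with the size of a feature payload extracted from it. The keypoint count, the 32-byte descriptor size, and the 4 bytes assumed for each keypoint's coordinates are illustrative assumptions, and real systems would also apply compression to either kind of payload.

```python
def raw_frame_bytes(width: int, height: int, bytes_per_pixel: int = 1) -> int:
    """Size of one uncompressed frame."""
    return width * height * bytes_per_pixel


def feature_payload_bytes(
    num_keypoints: int, descriptor_bytes: int = 32, coord_bytes: int = 4
) -> int:
    """Size of keypoint coordinates plus descriptors for one frame."""
    return num_keypoints * (descriptor_bytes + coord_bytes)


frame_size = raw_frame_bytes(1280, 720)       # 720p, 8-bit grayscale
payload_size = feature_payload_bytes(500)     # assumed 500 keypoints per frame
reduction = frame_size / payload_size
```

With these assumed values, sending features instead of raw pixels cuts the per-frame transfer by a factor of roughly 50, which translates into less radio activity and therefore lower power consumption.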
Even in cases like these, where vision processing is split between the wearable device and other devices, the computer vision algorithms running on the wearable device require significant computation. Feature detection and matching typically uses algorithms like SURF (Speeded Up Robust Features) or SIFT (the Scale-Invariant Feature Transform), which are notably challenging to execute in real time with conventional processor architectures.
While some feature matching algorithms such as BRIEF (Binary Robust Independent Elementary Features) combined with a lightweight feature detector are providing lighter processing loads with reliable matching, a significant challenge still exists in delivering real-time performance at the required power consumption levels. Disparity mapping for stereo matching to produce a 3D depth map is also compute-intensive, particularly when high quality results are needed. Therefore, the vision processing requirements of various wearable applications will continue to stimulate demand for optimized vision processor architectures.
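To show where the computational cost of disparity mapping comes from, here is a minimal Python sketch of block matching on a single rectified image row, using the sum of absolute differences (SAD) as the matching cost. It is a deliberately simplified illustration: the function name, block size, and search range are assumptions, and practical implementations work on two-dimensional blocks with sub-pixel refinement and consistency checks.

```python
from typing import List


def scanline_disparity(
    left: List[int], right: List[int], block: int = 3, max_disp: int = 4
) -> List[int]:
    """Per-pixel disparity for one rectified row using SAD block matching.

    A point at column x in the left row is searched for at column x - d in
    the right row, for d = 0..max_disp.
    """
    half = block // 2
    n = len(left)
    disparities = [0] * n
    for x in range(half, n - half):
        best_cost, best_d = None, 0
        for d in range(max_disp + 1):
            if x - d - half < 0:
                break  # candidate block would fall off the left edge
            cost = sum(
                abs(left[x + k] - right[x - d + k]) for k in range(-half, half + 1)
            )
            if best_cost is None or cost < best_cost:
                best_cost, best_d = cost, d
        disparities[x] = best_d
    return disparities


# A bright feature that appears two pixels further right in the left row.
right_row = [10, 10, 80, 200, 90, 10, 10, 10, 10, 10]
left_row = [10, 10, 10, 10, 80, 200, 90, 10, 10, 10]
disparity = scanline_disparity(left_row, right_row)
```

Every pixel requires a block comparison for every candidate disparity, so the work grows with image size, block size, and search range together. Note also that textureless regions match equally well at many disparities, which is one reason high quality results need considerably more processing than this sketch performs.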
The opportunity for vision technology to expand the capabilities of wearable devices is part of a much larger trend. From consumer electronics to automotive safety systems, vision technology is enabling a wide range of products that are more intelligent and responsive than before, and thus more valuable to users. The Embedded Vision Alliance uses the term ‘embedded vision’ to refer to this growing use of practical computer vision technology in embedded systems, mobile devices, special-purpose PCs, and the cloud, with wearable devices being one showcase application.
Vision processing can add valuable capabilities to existing products, such as the vision-enhanced wearables discussed in this article. And it can provide significant new markets for hardware, software, and semiconductor suppliers. The Embedded Vision Alliance, a worldwide organization of technology developers and providers, is working to empower product creators to transform this potential into reality. CEVA, CogniVue, and SoftKinetic, the co-authors of this article, are members of the Embedded Vision Alliance.
Brian Dipert is Editor-In-Chief of the Embedded Vision Alliance. He is also a Senior Analyst at BDTI (Berkeley Design Technology, Inc.), which provides analysis, advice, and engineering for embedded processing technology and applications, and Editor-In-Chief of InsideDSP, the company's online newsletter dedicated to digital signal processing technology. Brian has a B.S. degree in Electrical Engineering from Purdue University in West Lafayette, IN. His professional career began at Magnavox Electronics Systems in Fort Wayne, IN; Brian subsequently spent eight years at Intel Corporation in Folsom, CA. He then spent 14 years at EDN Magazine.
Ron Shalom is the Marketing Manager for Multimedia Applications at CEVA DSP. He holds an MBA from Tel Aviv University's Recanati Business School. Ron has over 15 years of experience in the embedded world: 9 years in software development and R&D management roles, and 6 years as a marketing manager. He has worked at CEVA for 10 years: 4 years as a team leader in software codecs, and 6 years as a product marketing manager.
Tom Wilson is Vice President of Business Development at CogniVue Corporation, with more than 20 years of experience in various applications such as consumer, automotive, and telecommunications. He has held leadership roles in engineering, sales and product management, and has a Bachelor’s of Science and PhD in Science from Carleton University, Ottawa, Canada.
Tim Droz is Senior Vice President and General Manager of SoftKinetic North America, delivering 3D time-of-flight (TOF) image sensors, 3D cameras, and gesture recognition and other depth-based software solutions. Prior to SoftKinetic, he was Vice President of Platform Engineering and head of the Entertainment Solutions Business Unit at Canesta, acquired by Microsoft. Tim earned a BSEE from the University of Virginia, and a M.S. degree in Electrical and Computer Engineering from North Carolina State University.
For more information on the Embedded Vision Alliance:
The Embedded Vision Alliance offers a free online training facility for vision-based product creators: the Embedded Vision Academy. This area of the Alliance website provides in-depth technical training and other resources to help product creators integrate visual intelligence into next-generation software and systems.
Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCV. Access is free to all through a simple registration process.
The Alliance also holds Embedded Vision Summit conferences in Silicon Valley. Embedded Vision Summits are technical educational forums for product creators interested in incorporating visual intelligence into electronic systems and software. They provide how-to presentations, inspiring keynote talks, demonstrations, and opportunities to interact with technical experts from Alliance member companies.
The most recent Embedded Vision Summit was held in May 2014, and a comprehensive archive of keynote, technical tutorial and product demonstration videos, along with presentation slide sets, is available on the Alliance website. The next Embedded Vision Summit will take place on May 12, 2015 in Santa Clara, California.