Next time you take a “selfie” or a shot of that great meatball entrée you're about to consume at the corner Italian restaurant, you will be contributing to the collection of 880 billion photos that Yahoo expects will be taken in 2014. Every day, Facebook users upload 350 million pictures and Snapchat users upload more than 400 million images. Video is also increasingly popular, with sites like YouTube receiving 100 hours of video every minute.
These statistics, forecast to increase further in the coming years, are indicative of two fundamental realities: people like to take pictures and shoot video, and the increasingly ubiquitous camera phone makes it easy to do so. Cell phone manufacturers have recognized this opportunity, and their cameras' capabilities are becoming a significant differentiator between models and therefore a notable investment target.
However, image sensor technology is quickly approaching some fundamental limits. The geometries of the sensor pixels are approaching the wavelengths of visible light, making it increasingly difficult to shrink their dimensions further. For example, latest-generation image sensors are constructed using 1,100 nm pixels, leaving little spare room to capture red-spectrum (~700 nm wavelength) light. In addition, as each pixel's silicon footprint shrinks, the amount of light it is capable of capturing and converting to a proportional electrical charge decreases. This decrease in sensitivity increases noise in low-light conditions and decreases the dynamic range – the ability to see details in both the shadows and the bright areas of an image. Since smaller pixels capture fewer photons, each photon has a more pronounced impact on each pixel's brightness.
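The link between pixel size and low-light noise can be made concrete with a rough calculation. The sketch below assumes an idealized sensor limited only by photon shot noise, in which the signal-to-noise ratio equals the square root of the number of photons collected. The photon density and the function names are illustrative assumptions, not figures from any particular sensor.

```python
import math

def photons_collected(pixel_pitch_um, photons_per_um2):
    """Photons gathered by a square pixel, assuming its light-gathering area is pitch squared."""
    return photons_per_um2 * pixel_pitch_um ** 2

def shot_noise_snr(pixel_pitch_um, photons_per_um2):
    """Idealized SNR when photon shot noise (Poisson statistics) is the only noise source."""
    return math.sqrt(photons_collected(pixel_pitch_um, photons_per_um2))

# Hypothetical low-light scene: 100 photons per square micron per exposure.
for pitch in (2.2, 1.4, 1.1):
    print(f"{pitch} um pixel: SNR ~ {shot_noise_snr(pitch, 100):.1f}")
```

Under these assumptions, halving the pixel pitch quarters the collected light and halves the SNR, which is one way to see why shrinking pixels hurts low-light performance.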
Resolution ceilings prompt refocus on other features
Given the challenges of increasing image sensor resolution, camera phone manufacturers appear reluctant to further promote the feature. As a case in point, Apple’s advertising for the latest iPhone 5s doesn’t even mention resolution, instead focusing generically on image quality and other camera features. Many of these features leverage computational photography – using increasing processing power and sophisticated vision algorithms to make better photographs. After taking pictures, such advanced camera phones can edit them in such a way that image flaws – blur, low-light noise, poor color fidelity, etc. – are eliminated. In addition, computational photography enables brand new applications, such as reproducing a photographed object on a 3D printer, automatically labeling pictures so that you can easily find them in the future, or easily removing that person who walked in front of you while you were taking an otherwise perfect picture.
High Dynamic Range (HDR) is an example of computational photography that is now found on many camera phones. A camera without this capability may be hampered by images with under- and/or over-exposed regions. It can be difficult, for example, to capture the detail found in a shadow without making the sky look pure white. Conversely, capturing detail in the sky can make shadows pitch black. With HDR, multiple pictures are taken at different exposure settings, some optimized for bright regions of the image (such as the sky in the example), while others are optimized for dark areas (such as the shadows). HDR algorithms then select and combine the best details of these multiple pictures, using them to synthesize a new image that captures the nuances of the clouds in the sky and the details in the shadows. This merging also needs to account and compensate for between-image camera and subject movement, along with determining the optimum amount of light for different portions of the image (Figure 1).
Figure 1: An example of a high dynamic range image created by combining images captured using different exposure times
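The "select and combine the best details" step can be illustrated with a simplified form of exposure fusion. This is a minimal sketch rather than any phone vendor's HDR pipeline: it assumes the frames are already aligned, uses grayscale values between 0.0 and 1.0, and weights each pixel only by how close it is to mid-gray. The function names and the `target` and `sigma` parameters are assumptions made for illustration.

```python
import math

def well_exposedness(value, target=0.5, sigma=0.2):
    """Weight a pixel (0.0 to 1.0) by how close it is to mid-gray."""
    return math.exp(-((value - target) ** 2) / (2 * sigma ** 2))

def fuse_exposures(exposures):
    """Blend same-sized grayscale images, favoring the well-exposed pixels in each."""
    height, width = len(exposures[0]), len(exposures[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weights = [well_exposedness(img[y][x]) for img in exposures]
            total = sum(weights)
            fused[y][x] = sum(w * img[y][x] for w, img in zip(weights, exposures)) / total
    return fused

# Toy scene: column 0 is a shadow, column 1 is bright sky.
short_exposure = [[0.02, 0.55]]  # sky detail kept, shadow crushed
long_exposure = [[0.45, 1.00]]   # shadow detail kept, sky blown out
print(fuse_exposures([short_exposure, long_exposure]))
```

In the toy scene, the fused shadow pixel comes almost entirely from the long exposure and the fused sky pixel from the short one. A real implementation would also need the motion compensation described above.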
Another example of computational photography is Super Resolution, wherein multiple images of a given scene are algorithmically combined, resulting in a final image that delivers finer detail than that present in any of the originals. A similar technique can be used to transform multiple poorly lit images into a higher-quality well-illuminated image. Movement between image frames, either caused by the camera or subject, increases the Super Resolution implementation challenge, since the resultant motion blur must be correctly differentiated from image noise and appropriately compensated for. In such cases, more intelligent (i.e. more computationally intensive) processing, such as object tracking, path prediction, and action identification, is required in order to combine pixels that might be in different locations of each image frame.
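The low-light variant of this idea is the easiest to demonstrate: averaging several noisy captures of the same scene suppresses random noise. The sketch below is a deliberately simplified assumption of the best case, with perfectly aligned frames, a static scene, and Gaussian noise. It does not attempt true super resolution or the motion handling described above, and all names and values are hypothetical.

```python
import random

def capture_noisy_frame(scene, noise_sigma, rng):
    """Simulate one low-light capture by adding Gaussian noise to every pixel."""
    return [value + rng.gauss(0.0, noise_sigma) for value in scene]

def stack_frames(frames):
    """Average aligned frames pixel by pixel."""
    count = len(frames)
    return [sum(pixels) / count for pixels in zip(*frames)]

def rms_error(image, reference):
    """Root-mean-square difference between an image and the true scene."""
    return (sum((a - b) ** 2 for a, b in zip(image, reference)) / len(reference)) ** 0.5

rng = random.Random(0)
scene = [0.2] * 400  # hypothetical dim, flat scene stored as a 1D pixel list
frames = [capture_noisy_frame(scene, 0.05, rng) for _ in range(16)]
stacked = stack_frames(frames)
print("single frame error:", rms_error(frames[0], scene))
print("16-frame stack error:", rms_error(stacked, scene))
```

With independent noise, averaging N frames is expected to reduce the noise by roughly the square root of N, so 16 frames should give about a fourfold improvement.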
Multi-image combination and content subtraction
Users can now automatically “paint” a panorama image simply by capturing sequential frames of the entire scene from top to bottom and left to right, which are subsequently “stitched” together by means of computational photography algorithms. With this technique, the resolution of the resultant panorama picture will far exceed the native resolution of the camera phone's image sensor. As an example, a number of highly zoomed-in still pictures can be aggregated into a single image that shows a detailed city skyline, with the viewer then being able to zoom in and inspect the rooftop of a specific building (Figure 2). Microsoft (with its PhotoSynth application), CloudBurst Research, Apple (with the panorama feature built into latest-generation iPhones), and GigaPan are examples of companies and products that expose the potential of these sophisticated panorama algorithms, which do pattern matching and aspect ratio conversion as part of the “stitching” process.
Figure 2: The original version of this “stitched” panorama image is 8 gigapixels in size, roughly 1000x the resolution of a normal camera, and allows you to clearly view the people walking on the street when you zoom into it.
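The pattern matching at the heart of stitching can be sketched in a few lines. This toy example assumes two equal-height grayscale strips that differ only by a horizontal shift, and finds the overlap width that minimizes the mean squared difference between them. Production stitchers also handle rotation, lens distortion, and exposure differences; the function names here are hypothetical.

```python
def overlap_cost(left, right, width):
    """Mean squared difference between the right edge of `left` and the left edge of `right`."""
    total = 0.0
    for left_row, right_row in zip(left, right):
        for x in range(width):
            diff = left_row[len(left_row) - width + x] - right_row[x]
            total += diff * diff
    return total / (len(left) * width)

def find_overlap(left, right):
    """Try every candidate overlap width and keep the best-matching one."""
    max_width = min(len(left[0]), len(right[0]))
    return min(range(1, max_width + 1), key=lambda w: overlap_cost(left, right, w))

def stitch(left, right):
    """Join two strips, dropping the columns of `right` that duplicate `left`."""
    width = find_overlap(left, right)
    return [left_row + right_row[width:] for left_row, right_row in zip(left, right)]

# Synthetic 10-column scene, photographed as two strips that share columns 3 to 5.
scene = [[x * x + y for x in range(10)] for y in range(3)]
left = [row[:6] for row in scene]
right = [row[3:] for row in scene]
print(find_overlap(left, right))
print(stitch(left, right) == scene)
```

The stitched result is wider than either input strip, which is the sense in which a panorama's resolution can exceed that of the sensor.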
Revolutionary capabilities derived from inpainting – the process of reconstructing lost or deteriorated parts of images and videos – involve taking multiple pictures from slightly different perspectives and comparing them in order to differentiate between objects. Undesirable objects such as the proverbial “photobomb,” identified via their location changes from frame to frame, can then be removed easily (Figure 3). Some replacement schemes sample the (ideally uniform) area surrounding the removed object (such as a grassy field or blue sky) and use it to fill in the region containing the removed object. Other approaches use pattern matching and change detection techniques to fill in the resulting image “hole”. Alternatively, and similar to the green-screen techniques used in making movies, you can use computational photography algorithms to automatically and seamlessly insert a person or object into a still image or real-time video stream. The latter approach is beneficial in advanced videoconferencing setups, for example.
Figure 3: Object replacement enables the removal of unwanted items in a scene, such as this “photobombing” seal.
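One simple way to exploit "location changes from frame to frame" is a per-pixel median across several shots. The sketch below is one possible approach rather than the method used by any specific product. It assumes the frames are already aligned and that the unwanted object covers any given pixel in fewer than half of the frames.

```python
from statistics import median

def remove_moving_objects(frames):
    """Per-pixel median over aligned frames; transient objects are voted out."""
    height, width = len(frames[0]), len(frames[0][0])
    return [[median(frame[y][x] for frame in frames) for x in range(width)]
            for y in range(height)]

# Toy example: background value 5, with a "photobomber" (value 9) that moves
# one pixel to the right in each successive frame.
frames = [
    [[9, 5, 5, 5]],
    [[5, 9, 5, 5]],
    [[5, 5, 9, 5]],
]
print(remove_moving_objects(frames))
```

Because the intruder occupies each pixel in only one of the three frames, the median at every position is the background value.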
Extrapolating the third dimension
The ability to use computational photography to obtain “perfect focus” anywhere in a picture is being pursued by companies such as Lytro, Pelican Imaging, and Qualcomm. A technique known as plenoptic imaging involves taking multiple simultaneous pictures of the same scene, with the focus point for each picture set to a different distance. The images are then combined, placing every part of the final image in focus – or not – as desired (Figure 4). As a byproduct of this computational photography process, the user is also able to obtain a complete depth map for 3D image generation purposes, useful for 3D printing and other applications.
Figure 4: Plenoptic cameras enable you to post-capture refocus on near (a), mid (b), and far (c) objects, all within the same image.
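Combining images focused at different distances can be sketched as focus stacking: for each pixel, keep the value from whichever frame is locally sharpest. The index of the winning frame then serves as a coarse depth map, since each frame corresponds to a known focus distance. This is a minimal sketch with hypothetical names, using a simple Laplacian as the sharpness measure. It is not a light-field reconstruction and not the algorithm used by the companies named above.

```python
def sharpness(image, y, x):
    """Absolute Laplacian response at (y, x), with neighbors clamped at the image edges."""
    height, width = len(image), len(image[0])
    up = image[max(y - 1, 0)][x]
    down = image[min(y + 1, height - 1)][x]
    left = image[y][max(x - 1, 0)]
    right = image[y][min(x + 1, width - 1)]
    return abs(4 * image[y][x] - up - down - left - right)

def focus_stack(frames):
    """Return (all-in-focus image, index map of which frame was sharpest per pixel)."""
    height, width = len(frames[0]), len(frames[0][0])
    fused, index_map = [], []
    for y in range(height):
        fused_row, index_row = [], []
        for x in range(width):
            best = max(range(len(frames)), key=lambda i: sharpness(frames[i], y, x))
            index_row.append(best)
            fused_row.append(frames[best][y][x])
        fused.append(fused_row)
        index_map.append(index_row)
    return fused, index_map

# Toy frames: crisp texture alternates 0/10, out-of-focus regions blur toward 5.
near_focus = [[0, 10, 0, 5, 5, 5]]   # left half sharp
far_focus = [[5, 5, 5, 10, 0, 10]]   # right half sharp
fused, index_map = focus_stack([near_focus, far_focus])
print(fused)
print(index_map)
```

Rendering a chosen focal plane after capture would then amount to keeping sharp pixels where the index map matches the desired distance and blurred pixels elsewhere.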
Homography is another way of building up a 3D image with a 2D camera using computational photography. In this process, a user moves the camera around and sequentially takes a series of shots of an object and/or environment from multiple angles and perspectives. The subsequent processing of the various captured perspectives is an extrapolation of the stereoscopic processing done by our eyes, and is used to determine depth data. The coordinates of the different viewpoints can be known with high precision thanks to the inertial (accelerometer, gyroscope) and location (GPS, Wi-Fi, cellular triangulation, magnetometer, barometer) sensors now built into mobile phones. By capturing multiple photos at various locations, you can assemble a 3D model of a room you're in, with the subsequent ability to virtually place yourself anywhere in that room and see things from that perspective. Google's recently unveiled Project Tango smartphone exemplifies the 3D model concept. The resultant 3D image of an object can also feed a 3D printer for duplication purposes.
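The stereoscopic principle behind this depth recovery reduces, in the simplest two-view case, to a single relationship: depth equals focal length times baseline divided by disparity. The sketch below assumes two viewpoints with parallel optical axes separated by a known baseline (for instance, as estimated from the phone's sensors). The function name and the example values are hypothetical.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Distance to a point seen from two parallel viewpoints.

    focal_length_px: focal length expressed in pixels
    baseline_m:      distance between the two camera positions, in meters
    disparity_px:    horizontal shift of the point between the two images, in pixels
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive; zero means the point is at infinity")
    return focal_length_px * baseline_m / disparity_px

# Hypothetical values: 1000-pixel focal length, camera moved 10 cm between shots.
for disparity in (50, 20, 5):
    print(f"disparity {disparity} px -> depth {depth_from_disparity(1000, 0.1, disparity):.1f} m")
```

Nearby objects shift more between views than distant ones, so larger disparities map to smaller depths. Building a full 3D model generalizes this to many views and arbitrary camera poses.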
Capturing and compensating for action
Every home user will soon have the ability to shoot a movie that's as smooth as that produced by a Hollywood cameraman (Figure 5). Computational photography can be used to create stable photos and videos, free of the blur artifacts that come from not holding the camera steady. Leveraging the movement information coming from the previously mentioned sensors, already integrated into cell phones, enables motion compensation in the final image or video. Furthermore, promising new research is showing that video images can be stabilized in all three dimensions, with the net effect of making it seem like the camera is smoothly moving on rails, for example, even if the photographer is in fact running while shooting the video.
Figure 5: Motion compensation via vision algorithms enables steady shots with far less equipment than previously required with conventional schemes.
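A common way to think about stabilization is as path smoothing: estimate the camera's position for each frame, smooth that path, and shift each frame by the difference. The sketch below works in one dimension and assumes the per-frame camera positions are already known (for example, from the phone's motion sensors). The moving-average filter and all names are illustrative choices, not a description of any shipping stabilizer.

```python
def smooth_path(path, radius=1):
    """Moving average of the camera path, using a shorter window at the ends."""
    smoothed = []
    for i in range(len(path)):
        window = path[max(i - radius, 0): i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def stabilizing_offsets(path, radius=1):
    """Per-frame shift that moves each frame from the shaky path onto the smooth one."""
    return [s - p for s, p in zip(smooth_path(path, radius), path)]

def roughness(path):
    """Total frame-to-frame movement along a path."""
    return sum(abs(b - a) for a, b in zip(path, path[1:]))

shaky = [0, 2, 0, 2, 0, 2, 0]  # hypothetical vertical camera position per frame, in pixels
print("raw roughness:", roughness(shaky))
print("smoothed roughness:", roughness(smooth_path(shaky)))
print("offsets to apply:", stabilizing_offsets(shaky))
```

A larger `radius` yields a smoother virtual camera path at the cost of larger shifts, and therefore more cropping of each frame.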
Cinemagraphs are an emerging art medium that blends otherwise still photographs with minute movements. Think of a portrait where a person's hair is blowing or the subject's eye periodically winks, or a landscape shot that captures wind effects.
The Cinemagraphs website offers some great examples of this new type of photography. Action, a Google Auto-Awesome feature, also enables multiple points of time to be combined and displayed in one image. In this way, the full range of motion of a person jumping, for example – lift-off, in flight, and landing – or a bird flying, or a horse running can be captured in a single shot (Figure 6).
Figure 6: This Google-provided example shows how you can combine multiple shots (a) into a single image (b) to show motion over time.
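A composite of this kind can be approximated by estimating the static background and then pasting in whatever differs from it in each frame. The sketch below is one plausible approach under simplifying assumptions (aligned frames, a subject that never overlaps itself between shots). It is not Google's algorithm, and the names and threshold are hypothetical.

```python
from statistics import median

def action_composite(frames, threshold=2):
    """Overlay the moving subject from every frame onto a shared background."""
    height, width = len(frames[0]), len(frames[0][0])
    # The per-pixel median approximates the scene without the moving subject.
    background = [[median(f[y][x] for f in frames) for x in range(width)]
                  for y in range(height)]
    composite = [row[:] for row in background]
    for frame in frames:
        for y in range(height):
            for x in range(width):
                # Pixels far from the background are assumed to belong to the subject.
                if abs(frame[y][x] - background[y][x]) > threshold:
                    composite[y][x] = frame[y][x]
    return composite

# Toy sequence: a subject (value 9) crossing a background of value 1.
frames = [
    [[9, 1, 1, 1, 1]],
    [[1, 1, 9, 1, 1]],
    [[1, 1, 1, 1, 9]],
]
print(action_composite(frames))
```

The output shows the subject at all three positions in a single image.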
Other advanced features found in the wildly popular GoPro and its competitors are finding their way to your cell phone, as well. Leading-edge camera phones such as Apple's iPhone 5s offer the ability to manipulate time in video; you can watch a person soar off a ski jump in slow motion, or peruse an afternoon's worth of skiing footage in just a few minutes, and even do both in the same video. Slow-motion capture in particular requires faster capture frame rates and more image processing, as well as significantly more storage (Figure 7). As an alternative to consuming local resources, these needs align well with the increasingly robust ability to wirelessly stream high-resolution captured video to the “cloud” for remote processing and storage.
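The processing and storage burden of time manipulation follows from simple arithmetic. The sketch below compares raw pixel throughput for two capture modes and shows how slow motion and time-lapse relate to frame rates. The specific resolutions and frame rates are example values chosen for illustration, and the helper names are hypothetical.

```python
def pixel_rate(width, height, frames_per_second):
    """Raw pixels per second the sensor and image pipeline must handle."""
    return width * height * frames_per_second

def slowdown_factor(capture_fps, playback_fps):
    """How many times slower the action appears when played back."""
    return capture_fps / playback_fps

def timelapse_indices(frame_count, speedup):
    """Frames kept when speeding footage up by dropping all but every `speedup`-th frame."""
    return list(range(0, frame_count, speedup))

# Example values: 1080p at 30 fps versus 720p at 120 fps.
normal = pixel_rate(1920, 1080, 30)
slow_motion = pixel_rate(1280, 720, 120)
print(f"normal: {normal:,} pixels/s, slow motion: {slow_motion:,} pixels/s")
print("slowdown at 30 fps playback:", slowdown_factor(120, 30))
print("time-lapse frames kept:", timelapse_indices(10, 4))
```

Even at a lower resolution, the high-frame-rate mode in this example pushes noticeably more pixels per second through the pipeline, before compression is taken into account.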
The ability to stream high quality video to the cloud is also enabling improved time-lapse photography and “life logging”. In addition to the commercial applications of these concepts, such as a law enforcement officer or emergency professional recording a response situation, personal applications also exist, such as an Alzheimer's patient using video to improve memory recall or a consumer enhancing the appeal of an otherwise mundane home movie. The challenge with recording over extended periods of time is to intelligently and efficiently identify significant and relevant events to include in the video, labeling them for easier subsequent search. Video surveillance companies have already created robust analytics algorithms to initiate video recording based on object movement, face detection, and other “triggers”. These same techniques will soon find their way into your personal cell phone.
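The simplest movement "trigger" is frame differencing: flag a frame when enough of its pixels have changed since the previous one. The sketch below is a minimal illustration with hypothetical names and thresholds. Commercial analytics are considerably more robust to lighting changes, camera shake, and noise.

```python
def changed_fraction(previous, current, pixel_threshold=10):
    """Fraction of pixels whose brightness changed by more than `pixel_threshold`."""
    changed = total = 0
    for prev_row, cur_row in zip(previous, current):
        for prev_px, cur_px in zip(prev_row, cur_row):
            total += 1
            if abs(cur_px - prev_px) > pixel_threshold:
                changed += 1
    return changed / total

def select_events(frames, area_threshold=0.2):
    """Indices of frames that differ enough from their predecessor to be worth keeping."""
    return [i for i in range(1, len(frames))
            if changed_fraction(frames[i - 1], frames[i]) > area_threshold]

quiet = [[10, 10], [10, 10]]
moved = [[10, 200], [200, 10]]
footage = [quiet, quiet, moved, moved, quiet]
print(select_events(footage))
```

Here the trigger fires when something enters the scene and again when it leaves, and those indices could be used to label segments for later search.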
Object identification and reality augmentation
The ubiquity of cell phone cameras is an enabler for new cell phone applications. For example, iOnRoad – acquired in 2013 by Harman – created a cell phone camera-based application to enable safer driving. With the cell phone mounted to the vehicle dashboard, it recognizes and issues alerts for unsafe driving conditions – following too closely, leaving your lane, etc. – after analyzing the captured video stream. Conversely, when not driving, a user might enjoy another application called Vivino, which recognizes wine labels using image recognition algorithms licensed from a company called Kooaba.
Plenty of other object identification applications are coming to market. In early February, for example, Amazon added the “Flow” feature to its mobile app, which identifies objects (for sale in stores, for example) you point the cell phone's camera at, tells you how much Amazon is selling the item for, and enables you to place an order then and there. This is one of myriad ways that the camera in a cell phone (or integrated into your glasses or watch, for that matter) will be able to identify objects and present the user with additional information about them, a feature known as augmented reality. Applications are already emerging that identify and translate signs written in foreign languages or enable a child to blend imaginary characters with real surroundings to create exciting new games, among countless other examples (Figure 8).
Figure 8: This augmented reality example shows how the technology provides additional information about objects in an image.
And then of course there's perhaps the most important object of all, the human face. Face detection, face recognition, and other facial analysis algorithms represent one of the hottest areas of industry investment and development. Just in the past several years, Apple acquired Polar Rose; Google acquired Neven Vision, PittPatt, and Viewdle; Facebook acquired Face.com; and Yahoo acquired IQ Engines. While some of these solutions implement facial analysis “in the cloud”, other mobile-based solutions will eventually automatically tag individuals as their pictures are taken. Other facial analysis applications detect that a person is smiling and/or that their eyes are open, triggering the shutter action at that precise moment (Figure 9).
Figure 9: Face detection and recognition and other facial analysis capabilities are now available on many camera phones.
The compute performance required to enable computational photography and its underlying vision processing algorithms is quite high. Historically, these functions and solutions have been exclusively deployed on more powerful desktop systems; mobile architectures had insufficient performance or a power budget that limited how much they were able to do. However, with the increased interest and proliferation of these vision and computational photography functions, silicon architectures are being optimized to run these algorithms more efficiently and with lower power. The next article in this series will take a look at some of the emerging changes in processors and sensors that will make computational photography capabilities increasingly prevalent in the future.
Computational photography is one of the key applications being enabled and accelerated by the Embedded Vision Alliance, a worldwide organization of technology developers and providers. Embedded vision refers to the implementation of vision technology in mobile devices, embedded systems, special-purpose PCs, and the “cloud”. To read more about the Alliance, go to “Expanding resources streamline the creation of Machines that See”. The Alliance also sponsors regular Embedded Vision Summits, the next of which will be held May 29, 2014 in Santa Clara, Calif.
Michael McDonald is President of Skylane Technology Consulting, which provides marketing and consulting services to companies and startups primarily in vision, computational photography, and ADAS markets. He has over 20 years of experience working for technology leaders including Broadcom, Marvell, LSI, and AMD. He is hands-on and has helped define everything from hardware to software, and silicon to systems. Michael has previously managed $200+M businesses, served as a general manager, managed both small and large teams, and executed numerous partnerships and acquisitions. He has a BSEE from Rice University.