Mobile photography's developing image

Michael McDonald, Skylane Technology, consultant, Embedded Vision Alliance

April 23, 2014

Next time you take a "selfie" or a shot of that great meatball entrée you're about to consume at the corner Italian restaurant, you will be contributing to the collection of 880 billion photos that Yahoo expects will be taken in 2014. Every day, Facebook users upload 350 million pictures and Snapchat users upload more than 400 million images. Video is also increasingly popular, with sites like YouTube receiving 100 hours of video every minute.

These statistics, forecast to increase further in the coming years, are indicative of two fundamental realities: people like to take pictures and shoot video, and the increasingly ubiquitous camera phone makes it easy to do so. Cell phone manufacturers have recognized this opportunity: their cameras' capabilities are becoming a significant differentiator between models, and therefore a notable investment target.

However, image sensor technology is quickly approaching some fundamental limits. The geometries of the sensor pixels are approaching the wavelengths of visible light, making it increasingly difficult to shrink their dimensions further. For example, latest-generation image sensors are constructed using 1,100 nm pixels, leaving little spare room to capture red-spectrum (~700 nm wavelength) light. In addition, as each pixel's silicon footprint shrinks, the amount of light it can capture and convert to a proportional electrical charge decreases. This decrease in sensitivity increases noise in low-light conditions and decreases the dynamic range – the ability to see details in shadows and bright areas of images. Since smaller pixels capture fewer photons, each photon has a more pronounced impact on each pixel’s brightness.
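The link between pixel size and noise can be sketched numerically. The following is a minimal illustration, assuming that photon shot noise (Poisson statistics) is the only noise source and that the pixel collects light over its full area; the flux value and function names are hypothetical and chosen only for illustration.

```python
import math

def photons_collected(pitch_um, flux_per_um2):
    """Photons gathered by a square pixel of the given pitch (ideal fill factor assumed)."""
    return flux_per_um2 * pitch_um ** 2

def shot_noise_snr(photons):
    """Signal-to-noise ratio under Poisson statistics: N / sqrt(N) = sqrt(N)."""
    return math.sqrt(photons)

FLUX = 1000.0  # hypothetical photons per square micron during one exposure

for pitch in (2.2, 1.1):
    n = photons_collected(pitch, FLUX)
    print(f"{pitch} um pixel: {n:.0f} photons, SNR ~ {shot_noise_snr(n):.1f}")
```

Under these assumptions, halving the pixel pitch cuts the collected light to a quarter and the signal-to-noise ratio to a half.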

Resolution ceilings prompt refocus on other features
Given the challenges of increasing image sensor resolution, camera phone manufacturers appear reluctant to further promote the feature. As a case in point, Apple’s advertising for the latest iPhone 5s doesn’t even mention resolution, instead focusing generically on image quality and other camera features. Many of these features leverage computational photography, which uses increasing processing power and sophisticated vision algorithms to make better photographs. After taking pictures, such advanced camera phones can edit them so that image flaws – blur, low light, poor color fidelity, etc. – are corrected. In addition, computational photography enables brand new applications, such as reproducing a photographed object on a 3D printer, automatically labeling pictures so that you can easily find them in the future, or easily removing that person who walked in front of you while you were taking an otherwise perfect picture.

High Dynamic Range (HDR) is an example of computational photography that is now found on many camera phones. A camera without this capability may be hampered by images with under- and/or over-exposed regions. It can be difficult, for example, to capture the detail found in a shadow without making the sky look pure white. Conversely, capturing detail in the sky can make shadows pitch black. With HDR, multiple pictures are taken at different exposure settings, some optimized for bright regions of the image (such as the sky in the example), while others are optimized for dark areas (i.e. shadows). HDR algorithms then select and combine the best details of these multiple pictures, using them to synthesize a new image that captures the nuances of the clouds in the sky and the details in the shadows. This merging also needs to comprehend and compensate for between-image camera and subject movement, along with determining the optimum amount of light for different portions of the image (Figure 1).

Figure 1: An example of a high dynamic range image created by combining images captured using different exposure times
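The merging step described above can be sketched as a weighted average of per-exposure brightness estimates. This is a simplified illustration, not any particular phone's algorithm: it assumes a linear sensor response, 8-bit pixel values, and frames that are already aligned (no camera or subject movement). The function names and sample values are hypothetical.

```python
def hat_weight(z):
    """Trust mid-tones most; fully clipped values (0 or 255) get zero weight."""
    return min(z, 255 - z)

def merge_hdr(exposures):
    """exposures: list of (pixels, exposure_time) with equal-length pixel lists.
    Returns one relative radiance estimate per pixel."""
    radiance = []
    for i in range(len(exposures[0][0])):
        num = den = 0.0
        for pixels, t in exposures:
            w = hat_weight(pixels[i])
            num += w * pixels[i] / t   # pixel value / time estimates scene radiance
            den += w
        if den == 0:                   # clipped in every frame: plain average instead
            num = sum(p[i] / t for p, t in exposures)
            den = len(exposures)
        radiance.append(num / den)
    return radiance

# Two pixels: a shadow and a bright sky, captured with a short and a long exposure.
short = ([2, 100], 1 / 64)    # sky is well exposed, shadow is nearly black
long_ = ([128, 255], 1.0)     # shadow is well exposed, sky is clipped to white
print(merge_hdr([short, long_]))
```

The clipped sky pixel in the long exposure receives zero weight, so its value comes entirely from the short exposure, while the shadow value is dominated by the long exposure.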

Another example of computational photography is Super Resolution, wherein multiple images of a given scene are algorithmically combined, resulting in a final image that delivers finer detail than that present in any of the originals. A similar technique can be used to transform multiple poorly lit images into a higher-quality well-illuminated image. Movement between image frames, caused by either the camera or the subject, increases the Super Resolution implementation challenge, since the resultant motion blur must be correctly differentiated from image noise and appropriately compensated for. In such cases, more intelligent (i.e. more computationally intensive) processing, such as object tracking, path prediction, and action identification, is required in order to combine pixels that might be in different locations of each image frame.
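One classic way to combine frames is "shift-and-add", sketched below on a one-dimensional row of pixels. This is an idealized illustration only: it assumes the sub-pixel shift of each frame is already known exactly and that there is no blur or noise, which is precisely the part that real implementations must estimate. All names and values are hypothetical.

```python
def shift_and_add(frames, scale):
    """frames: list of (samples, offset), where offset in [0, scale) is the
    frame's known shift on the fine grid. Returns the fine-grid reconstruction."""
    length = len(frames[0][0]) * scale
    total = [0.0] * length
    count = [0] * length
    for samples, offset in frames:
        for k, value in enumerate(samples):
            total[k * scale + offset] += value
            count[k * scale + offset] += 1
    # Fine-grid cells that no frame landed on stay None
    # (a real system would interpolate them).
    return [t / c if c else None for t, c in zip(total, count)]

scene = [10, 20, 30, 40, 50, 60, 70, 80]   # the "true" fine detail
frame_a = (scene[0::2], 0)                 # coarse frame, no shift
frame_b = (scene[1::2], 1)                 # coarse frame, shifted by half a coarse pixel
print(shift_and_add([frame_a, frame_b], scale=2))
```

Each coarse frame holds only four samples, yet together they recover all eight values of the fine row, because the half-pixel shift makes the second frame sample the gaps left by the first.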

Multi-image combination and content subtraction
Users can now automatically "paint" a panorama image simply by capturing sequential frames of the entire scene from top to bottom and left to right, which are subsequently "stitched" together by computational photography algorithms. With this technique, the resolution of the resultant panorama picture will far exceed the native resolution of the camera phone's image sensor. As an example, a number of highly zoomed-in still pictures can be aggregated into a single image that shows a detailed city skyline, with the viewer then being able to zoom in and inspect the rooftop of a specific building (Figure 2). Microsoft (with its PhotoSynth application), CloudBurst Research, Apple (with the panorama feature built into latest-generation iPhones), and GigaPan are examples of companies and products that expose the potential of these sophisticated panorama algorithms, which perform pattern matching and aspect ratio conversion as part of the "stitching" process.

Figure 2: The original version of this "stitched" panorama image is 8 gigapixels in size, roughly 1000x the resolution of a normal camera, and allows you to clearly view the people walking on the street when you zoom into it.
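The pattern-matching part of stitching can be illustrated on two one-dimensional strips of pixels. The sketch below is a deliberately simplified stand-in for a real stitcher: it assumes the two strips differ only by a sideways shift (no rotation, lens distortion, or exposure difference), and it finds the overlap by brute force. The function names are hypothetical.

```python
def best_overlap(left, right, min_overlap=2):
    """Return the overlap length with the smallest mean squared difference."""
    best_len, best_err = min_overlap, float("inf")
    for n in range(min_overlap, min(len(left), len(right)) + 1):
        err = sum((a - b) ** 2 for a, b in zip(left[-n:], right[:n])) / n
        if err < best_err:
            best_len, best_err = n, err
    return best_len

def stitch(left, right):
    """Join two strips, averaging the pixels in the region where they overlap."""
    n = best_overlap(left, right)
    blended = [(a + b) / 2 for a, b in zip(left[-n:], right[:n])]
    return left[:-n] + blended + right[n:]

left_strip = [1, 2, 3, 4, 5, 6]
right_strip = [4, 5, 6, 7, 8, 9]
print(stitch(left_strip, right_strip))
```

The stitched strip is longer than either input, which is the one-dimensional analogue of a panorama exceeding the sensor's native resolution.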

Inpainting, the process of reconstructing lost or deteriorated parts of images and videos, enables further revolutionary capabilities. One approach involves taking multiple pictures from slightly different perspectives and comparing them in order to differentiate between objects. Undesirable objects such as the proverbial "photobomb," identified via their location changes from frame to frame, can then be removed easily (Figure 3). Some replacement schemes sample the (ideally uniform) area surrounding the removed object (such as a grassy field or blue sky) and use it to fill in the region containing the removed object. Other approaches use pattern matching and change detection techniques to fill in the resulting image "hole". Alternatively, and similar to the green-screen techniques used in making movies, you can use computational photography algorithms to automatically and seamlessly insert a person or object into a still image or real-time video stream. The latter approach is beneficial in advanced videoconferencing setups, for example.

Figure 3: Object replacement enables the removal of unwanted items in a scene, such as this "photobombing" seal.
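A simple way to exploit the fact that an intruder changes location from frame to frame is a per-pixel median across several shots. Note that this is a different and much simpler technique than the inpainting schemes described above; it is offered only as a sketch, and it assumes the frames are perfectly aligned and that the background is visible at each pixel in more than half of them. The names and pixel values are hypothetical.

```python
from statistics import median

def remove_transients(frames):
    """Per-pixel median across aligned frames of an otherwise static scene."""
    return [median(values) for values in zip(*frames)]

background = [50, 50, 50, 50, 50, 50]
frames = [
    [50, 200, 200, 50, 50, 50],   # intruder covers positions 1-2
    [50, 50, 50, 200, 200, 50],   # intruder has moved to positions 3-4
    [50, 50, 50, 50, 50, 50],     # intruder has left the scene
]
print(remove_transients(frames) == background)
```

Because the bright intruder occupies any given pixel in only one of the three frames, the median at every position falls back to the background value.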

Extrapolating the third dimension
The ability to use computational photography to obtain "perfect focus" anywhere in a picture is being pursued by companies such as Lytro, Pelican Imaging, and Qualcomm. A technique known as plenoptic imaging involves taking multiple simultaneous pictures of the same scene, with the focus point for each picture set to a different distance. The images are then combined, placing every part of the final image in focus – or not – as desired (Figure 4). As a byproduct of this computational photography process, the user is also able to obtain a complete depth map for 3D image generation purposes, useful for 3D printing and other applications.


Figure 4: Plenoptic cameras enable you to post-capture refocus on near (a), mid (b), and far (c) objects, all within the same image.
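The combination step described above, with several pictures focused at different distances, can be sketched as a focus stack: for each pixel, keep the value from whichever picture is locally sharpest. This is an illustrative simplification on one row of pixels rather than a decoder for any actual plenoptic camera; the sharpness measure and all names are assumptions.

```python
def sharpness(row, i):
    """Local contrast: magnitude of the discrete second derivative at index i."""
    left = row[max(i - 1, 0)]
    right = row[min(i + 1, len(row) - 1)]
    return abs(left - 2 * row[i] + right)

def focus_stack(stack):
    """stack: rows showing the same scene, each focused at a different distance.
    Returns (all_in_focus_row, index_of_sharpest_frame_per_pixel)."""
    fused, depth = [], []
    for i in range(len(stack[0])):
        best = max(range(len(stack)), key=lambda k: sharpness(stack[k], i))
        fused.append(stack[best][i])
        depth.append(best)
    return fused, depth

near_focus = [0, 90, 0, 90, 45, 45, 45, 45]   # left half crisp, right half blurred
far_focus = [45, 45, 45, 45, 0, 90, 0, 90]    # left half blurred, right half crisp
fused, depth = focus_stack([near_focus, far_focus])
print(fused)
print(depth)
```

The second result is a crude depth map: it records, for every pixel, which focus distance rendered it most sharply, which is the byproduct the paragraph above refers to.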

Homography is another way of building up a 3D image with a 2D camera using computational photography. In this process, a user moves the camera around and sequentially takes a series of shots of an object and/or environment from multiple angles and perspectives. The subsequent processing of the various captured perspectives is an extrapolation of the stereoscopic processing done by our eyes, and is used to determine depth data. The coordinates of the different viewpoints can be known with high precision thanks to the inertial (accelerometer, gyroscope) and location (GPS, Wi-Fi, cellular triangulation, magnetometer, barometer) sensors now built into mobile phones. By capturing multiple photos at various locations, you can assemble a 3D model of a room you're in, with the subsequent ability to virtually place yourself anywhere in that room and see things from that perspective. Google's recently unveiled Project Tango smartphone exemplifies the 3D model concept. The resultant 3D image of an object can also feed a 3D printer for duplication purposes.
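The stereoscopic principle behind this depth estimation can be sketched with the standard two-view relation: the farther away an object is, the less it appears to shift between two viewpoints. The example below assumes an ideal pinhole camera that moved purely sideways between two shots, and the focal length and baseline values are hypothetical.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Pinhole stereo relation: depth = focal length * baseline / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

FOCAL_PX = 1000.0   # hypothetical focal length, expressed in pixels
BASELINE_M = 0.10   # hypothetical sideways camera movement between shots, in meters

for disparity in (50.0, 20.0, 5.0):
    z = depth_from_disparity(FOCAL_PX, BASELINE_M, disparity)
    print(f"shift of {disparity:.0f} px -> about {z:.1f} m away")
```

This is why accurate knowledge of the camera positions matters: any error in the baseline translates directly into a proportional error in the computed depth.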
