The previous article in this series addressed the rapid growth in photography and computer vision (the ability to extract information from an image) and examined how computational solutions are addressing limitations in emerging camera sensors and optics. In addition to creating better photographs under increasingly difficult conditions, these same solutions are also enabling new user interfaces and experiences, creating amazing pictures that were previously impossible, and extracting information from the images to enable better management of the many photographs we are taking.
These promising new features are being added despite some major challenges. The sensor pixel size is rapidly approaching the wavelength of light, leaving limited opportunity to reduce costs by further shrinking pixels, the fundamental building block of the image sensor. In addition, the increasing performance requirements of video and vision provide challenges for mobile phones and embedded solutions that are also being called upon to run more and more applications. This article looks at some of the emerging silicon architectures in the form of optimized and innovative processors and sensors that are enabling these advanced features.
Quantifying the challenge
Images are a rich source of information, but extracting the relevant information from them is complex. Some neuroscientists have estimated that the human brain uses over 60% of its capacity for vision processing. Numerous elements contribute to vision's complexity, including the volume of data, the image pre-processing and cleanup of that data, image analysis, and the resulting decision-making.
The sheer amount of data acquired in each still image and video stream is significant. Within the next two years, the outward-facing cell phone camera will collect on average over 12 million pixels per image, according to industry experts (Figure 1). Many of these cameras also capture video at a rate of 15 frames per second at 12 million pixels per frame, or 30 frames per second at high definition resolutions (approximately 2.1 million pixels per frame). At the higher of these rates, 15 frames of 12 million pixels each, the camera generates 180 million pixels per second, or nearly 200 million.
Figure 1: Nearly half of all cell phones manufactured in 2014 are forecasted to have camera sensors that are at least 8 Mpixels in resolution (courtesy OmniVision Technologies).
Emerging mobile phones will soon have the ability to capture 4K UHD (ultra high definition) video, which is over 8 million pixels per frame at 30 or more frames per second, or at least 240 million pixels per second. Interest in higher frame rates to capture rapid movements and generate slow-motion video clips will almost certainly increase this data rate even further.
As these pixel values are collected, a fixed-pipeline ISP (image signal processor), either located within the image sensor or alongside it, performs image processing that is largely tied to resolving image quality issues related to the sensor and lens. A Bayer filter-pattern sensor's output data, for example, is interpolated in order to generate a full RGB data set for each pixel. The ISP also handles initial color and brightness adjustments (fine-tuned for each image sensor device), noise removal, and focus adjustment.
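The interpolation step described above can be illustrated with a minimal sketch of bilinear demosaicing. It assumes an RGGB filter layout and a raw frame stored as a list of rows; both are assumptions made for illustration, and the function names are hypothetical. Production ISPs use considerably more sophisticated, edge-aware methods in fixed-function hardware.

```python
def bayer_channel(row, col):
    """Color sampled at (row, col), assuming an RGGB Bayer layout."""
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"


def demosaic(raw):
    """Bilinear demosaic: return a full (R, G, B) triple for every pixel.

    Each missing color is the average of the nearest samples of that
    color in the surrounding 3x3 window (clipped at the image border).
    The color the pixel actually sampled is kept unchanged.
    """
    height, width = len(raw), len(raw[0])
    if height < 2 or width < 2:
        raise ValueError("need at least one full 2x2 Bayer cell")

    out = []
    for r in range(height):
        out_row = []
        for c in range(width):
            sums = {"R": 0.0, "G": 0.0, "B": 0.0}
            counts = {"R": 0, "G": 0, "B": 0}
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        ch = bayer_channel(rr, cc)
                        sums[ch] += raw[rr][cc]
                        counts[ch] += 1
            rgb = {ch: sums[ch] / counts[ch] for ch in "RGB"}
            # Keep the directly measured sample for this pixel's own color.
            rgb[bayer_channel(r, c)] = float(raw[r][c])
            out_row.append((rgb["R"], rgb["G"], rgb["B"]))
        out.append(out_row)
    return out


# A flat-colored 4x4 scene: red sites read 200, green 100, blue 50.
raw_frame = [
    [200, 100, 200, 100],
    [100, 50, 100, 50],
    [200, 100, 200, 100],
    [100, 50, 100, 50],
]
full_color = demosaic(raw_frame)
print(full_color[1][1])
```

Even this simple scheme visits up to nine neighbors per pixel, which hints at why the per-pixel operation count grows quickly once noise removal and color correction are added.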
Once basic image processing is complete, vision processing algorithms then extract data from the image in order to perform functions such as computational photography, object tracking, facial recognition, depth processing, and augmented reality. As the functional requirements vary widely, so too will their algorithms and resulting computational load. Even a relatively simple algorithm could involve over a hundred calculations per pixel; at nearly 200 million pixels coming through the system each second, that is over 20 billion operations per second.
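The arithmetic behind these figures is easy to reproduce. The short sketch below recomputes the pixel rates and the operation count from the numbers quoted in the text; the 3840 x 2160 frame size for 4K UHD is an assumption added here, since the article only states "over 8 million pixels per frame", and the function names are hypothetical.

```python
# Pixel and operation rates implied by the figures quoted in the article.
MPIXEL = 1_000_000


def pixel_rate(pixels_per_frame, frames_per_second):
    """Pixels per second produced by a sensor streaming video."""
    return pixels_per_frame * frames_per_second


def operation_rate(pixels_per_second, ops_per_pixel):
    """Operations per second for an algorithm with a fixed per-pixel cost."""
    return pixels_per_second * ops_per_pixel


# 12 Mpixel frames at 15 frames per second (figures from the article).
rate_12mp = pixel_rate(12 * MPIXEL, 15)

# 4K UHD at 30 frames per second, assuming 3840 x 2160 pixels per frame.
rate_uhd = pixel_rate(3840 * 2160, 30)

# "Nearly 200 million" pixels per second at 100 operations per pixel.
ops_per_second = operation_rate(200 * MPIXEL, 100)

print(f"12 Mpixel @ 15 fps: {rate_12mp / 1e6:.0f} Mpixels/s")
print(f"4K UHD @ 30 fps:    {rate_uhd / 1e6:.1f} Mpixels/s")
print(f"Compute load:       {ops_per_second / 1e9:.0f} billion ops/s")
```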
While powerful desktop and notebook processors operating at multi-gigahertz speeds can shoulder such a computing load, mobile phone processors are challenged to meet these vision performance requirements due to their slower clock speeds. In addition, many processor architectures frequently shuttle data in and out of cache and external memory; at the data rates required of video, this interface is a performance bottleneck and consumes significant power. In order to address these limitations, mobile processor architectures are taking advantage of acceleration engines and adding new engines specifically optimized for vision. At the same time, algorithm developers are optimizing and re-writing algorithms to run on these new engines.
New vision processing architectures
New vision acceleration engines are coming in the form of GPUs, DSPs, and specialized vision processors that are capable of significantly higher levels of parallel processing. These are often SIMD (single instruction multiple data) architectures that leverage the fact that many vision algorithms perform the same functions on groups of pixels.
Rather than performing the function serially on each pixel, such architectures process them in parallel, which reduces the required clock speed and dynamic power consumption (Figure 2). Additionally, these architectures are fine-tuned to minimize external memory accesses, enabling them to alleviate this performance bottleneck, achieve lower power consumption, and potentially lower the price of the chip package via reduced memory bus size requirements.
Figure 2: In an SIMD architecture, a single instruction is capable of working on multiple pieces of data in parallel, while a typical CPU handles data operations serially.
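To see why parallel lanes relax the clock requirement, consider the idealized sketch below. It assumes perfect parallelism and one operation per lane per cycle, which real hardware rarely achieves, and the lane counts are hypothetical examples rather than figures for any particular processor. The 20 billion operations per second workload is the estimate used earlier in the article.

```python
def required_clock_hz(ops_per_second, lanes, ops_per_lane_per_cycle=1):
    """Clock frequency needed to sustain a workload on `lanes` parallel lanes.

    Idealized model: every lane is busy on every cycle and there are no
    memory stalls, so this is a lower bound on the real requirement.
    """
    if lanes < 1 or ops_per_lane_per_cycle < 1:
        raise ValueError("lanes and ops_per_lane_per_cycle must be >= 1")
    return ops_per_second / (lanes * ops_per_lane_per_cycle)


WORKLOAD_OPS_PER_SECOND = 20e9  # estimate from the article

for lanes in (1, 4, 16, 64):  # hypothetical SIMD widths
    clock = required_clock_hz(WORKLOAD_OPS_PER_SECOND, lanes)
    print(f"{lanes:>3} lanes -> {clock / 1e9:.3f} GHz")
```

Under these assumptions a purely serial engine would need a 20 GHz clock, while a 16-lane engine needs 1.25 GHz for the same throughput.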
The increasing importance of the GPU core is reflected in its increasing percentage of the overall silicon area of a mobile application processor. Chipworks’ examination of Apple’s latest “A7” SoC, for example, reveals that a larger portion of the chip is dedicated to the GPU than to the CPU (Figure 3).
Figure 3: The GPU core and associated logic consume more silicon space than the CPU core in Apple's A7 SoC.
Similarly, in TechInsights' examination of Samsung’s Exynos Octa mobile processor, the GPU core was larger than the combination of the quad core ARM Cortex-A15 CPU and its surrounding L2 cache memory (Figure 4). While GPUs arguably exist to support the robust gaming and other graphics capabilities of today's mobile devices, these same cores are emerging as powerful engines for computational photography and other vision applications.
Figure 4: The GPU, ISP, camera and video logic take up nearly as much area as the ARM Cortex-A7 and Cortex-A15 CPUs and associated cache in Samsung’s Exynos Octa application processor.
As these applications become commonplace, silicon blocks optimized for them are emerging. These vision-optimized architectures have many efficient small processors that enable them to dissect an image and then process each of the resulting blocks in parallel. The companies creating these architectures also recognize the performance bottlenecks and increased power consumption that come with moving images in and out of memory and have developed approaches that eliminate unnecessary data movement.
NVIDIA, for example, introduced a computational photography acceleration engine called Chimera in the Tegra 4 SoC. Apple's advertising similarly claims that the A7 SoC contains a “new image sensor processor” enabling faster image capture, focus, and video frame rates.
Qualcomm, aggressively pushing camera features such as HDR, face detection, augmented reality and other vision capabilities in the company's Snapdragon processors, integrates a “Hexagon” DSP core to offload some vision processing functions from the CPU and ISP.
And the startup Movidius provides the vision and imaging coprocessor chip used in Google’s Project Tango as well as some other vision consumer products in development.
Companies whose business model involves licensing IP processor cores also recognize the parallel processing needs of these burgeoning vision applications and are responding. Many of ARM's CPU cores include the NEON SIMD processing “engine”, which is often used for vision processing.
Canadian company CogniVue provides licensable silicon IP core vision processor products for the automotive safety space. Core providers such as Apical, Tensilica (now part of Cadence), CEVA, Imagination Technologies, and videantis all now offer optimized cores for embedded vision that enable heavy vision processing while still fitting within the tight power budgets demanded by mobile system designs.
Modern SoC architectures typically combine GPUs and SIMD processors with traditional CPUs to create powerful vision processing platforms. Such hybrid architectures typically use more specialized cores (such as GPUs and vision coprocessors) for parallel processing of video to extract relevant objects and then use the CPU for identifying and ascertaining the meaning of those objects, making complex decisions, and acting on those decisions.
The Amazon Firefly shopping application is a good conceptual example (actual implementation unknown) of how image recognition could take advantage of parallel processing engines in a GPU or SIMD architecture, while the actual shopping process may be best suited for a traditional CPU. This partitioning of the total vision task into core elements of video/vision processing and higher-level cognitive decisions helps transform a compute- and power-intensive problem into a more economical and efficient lower-power solution.
New image sensor approaches
The rate of image sensor pixel size reduction is arguably slowing as the pixel size approaches the wavelength of visible light. What is not slowing down, however, are sensor technology innovations for vision. For example, emerging image sensors are moving beyond Bayer filter patterns and incorporating clear pixels that enable better image capture in low-light conditions (Figure 5).
This improved low-light capability comes at a price, however: the initially captured color information is less precise. Specifically, the green filters that dominate Bayer arrays, and whose portion of the visible spectrum the human eye is particularly sensitive to, are absent in some of these leading-edge approaches. Additional computation is therefore required to resolve accurate per-pixel color detail.
Figure 5: A traditional Bayer pattern image sensor focuses its data precision on the critical green frequency spectrum band, but doesn't offer the low-light performance of emerging alternative filter schemes (courtesy Aptina Imaging).
Other image sensor architectures are being explored that augment or replace red/green/blue filters with polarized filters; this can reduce glare or improve contrast. Still other architectures are being explored that add time-of-flight pixels in addition to the red/green/blue Bayer pixels; time-of-flight pixels are used to determine the distance of an object to the camera, which is extremely useful for identifying objects and understanding their shape, size, and location, as well as to assist in focusing the camera.
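The principle behind a time-of-flight measurement can be sketched in a few lines. The example assumes a direct pulsed measurement, in which the sensor times a light pulse's round trip to the object and back; many practical time-of-flight sensors instead infer the delay from the phase shift of modulated light. The function name and the 10 ns sample value are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre


def distance_from_round_trip(round_trip_seconds):
    """Distance to an object given the round-trip time of a light pulse.

    The pulse travels to the object and back, so the one-way distance is
    half of the total path length.
    """
    if round_trip_seconds < 0:
        raise ValueError("round-trip time cannot be negative")
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2


# A hypothetical 10 ns round trip corresponds to roughly 1.5 m.
print(f"{distance_from_round_trip(10e-9):.3f} m")
```

The numbers show why this is demanding for the sensor: resolving distances to within a few centimeters requires timing resolution on the order of a hundred picoseconds.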
Innovations are also underway to increase the frame capture rate to enable slow-motion photography and effective tracking of rapidly moving objects. This greater capture rate coupled with higher per-frame resolution is rapidly increasing the amount of data that sensors must transfer, which can decrease battery life and create noisier electrical environments that degrade image quality. To address these concerns, manufacturers are developing new bus interfaces, such as CSI-3 (the third-generation Camera Serial Interface) from the MIPI Alliance, which promises to increase data rates while simultaneously decreasing power consumption and not adversely impacting image quality.
As pixel sizes shrink, image sensor manufacturers are responding by including considerably more logic on their devices, in order to create better pictures through on-chip digital signal processing. For example, in addition to managing the challenges of the previously discussed “clear pixel” and other emerging filter array structures, image sensor suppliers are exploring techniques to adjust exposure times at finer-grained levels than an entire frame.
Specifically, added sensor “intelligence” discerns when additional or decreased exposure time may be needed for groups of pixels within the frame, thereby giving each area of the picture an optimized exposure. The desired end result is an overall higher dynamic range for the image, without need for the multi-image capture and post-processing used in traditional high dynamic range implementations.
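One simple way to picture per-region exposure control is sketched below. It is not a description of any vendor's implementation: it assumes a sensor whose response is linear in exposure time, a previous frame summarized as one mean brightness value per block of pixels, and hypothetical target and clamp values.

```python
def region_exposure_gains(block_means, target=128.0, min_gain=0.25, max_gain=4.0):
    """Exposure-time multiplier for each block of pixels.

    A block darker than `target` gets a longer exposure (gain > 1) and a
    brighter block gets a shorter one (gain < 1). Gains are clamped so a
    single very dark or very bright block cannot demand an extreme setting.
    """
    gains = []
    for mean in block_means:
        gain = target / max(mean, 1.0)  # avoid dividing by zero on black blocks
        gains.append(min(max_gain, max(min_gain, gain)))
    return gains


# Mean 8-bit brightness of four hypothetical blocks: deep shadow, shade,
# well exposed, and near-saturated sky.
previous_frame_means = [16, 64, 128, 250]
print(region_exposure_gains(previous_frame_means))
```

Because each block receives its own multiplier, shadows and highlights can both land near mid-scale in a single capture, which is the effect the multi-image approach otherwise achieves by merging several frames.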
Other vision sensor architectures are being examined that won’t send the actual image, but rather will send the image's metadata. In such an architecture, object recognition would occur on the chip and only the extracted data would be sent off the sensor.
This processed data could, for example, be a compact histogram of the image. While this histogram data might have limited value for a person, it is used extensively in some vision algorithms for things like image matching. Sending only the processed data reduces data rates, which pays dividends in the form of lower processor speeds, slower memory buses, and lower network data rates when image information is sent into the cloud or across a network.
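The data reduction is easy to see in a small sketch. The code below builds a 16-bin brightness histogram from 8-bit pixel values and compares two histograms using histogram intersection, one common similarity measure for image matching. The bin count, the function names, and the sample pixel values are assumptions chosen for illustration.

```python
def luminance_histogram(pixels, bins=16, max_value=255):
    """Count how many pixel values fall into each of `bins` equal ranges."""
    hist = [0] * bins
    for value in pixels:
        if not 0 <= value <= max_value:
            raise ValueError(f"pixel value {value} outside 0..{max_value}")
        hist[value * bins // (max_value + 1)] += 1
    return hist


def histogram_intersection(hist_a, hist_b):
    """Similarity in [0, 1]: the fraction of hist_a's mass shared with hist_b."""
    total = sum(hist_a)
    if total == 0:
        return 0.0
    return sum(min(a, b) for a, b in zip(hist_a, hist_b)) / total


# Two tiny hypothetical "images", flattened to lists of 8-bit values.
image_a = [0, 15, 16, 255]
image_b = [0, 0, 0, 0]

hist_a = luminance_histogram(image_a)
hist_b = luminance_histogram(image_b)
print(hist_a)
print(histogram_intersection(hist_a, hist_b))
```

A 12-megapixel frame summarized this way shrinks from 12 million values to 16 counters, regardless of the image resolution.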
Adding logic to a sensor can increase heat generation and operating noise, however, which often degrades image quality. One workaround being evaluated is to use stacked die, where the bottom die contains the high-speed digital processing elements and the top die encompasses the pixels that collect light (Figure 6).
Figure 6: A stacked die scheme could allow for the cost-effective and otherwise attractive combination of a conventional image sensor and substantial amounts of processing logic and/or local storage memory (courtesy Sony Semiconductor).
Such “stacked” technology is complementary to today's increasingly common backside illumination approach, where the sensor is inverted and light is captured by the chip's “backside”. In order to enable the light to penetrate the image sensor die, the wafer is thinned prior to being diced into chips, and another wafer is attached to it to provide additional structural integrity. This additional piece of silicon can also implement logic for digital signal processing capabilities.
The stacked silicon approach also offers the potential to create smaller die size and lower cost “global shutter” sensors. A global shutter image sensor collects the values for all pixels and simultaneously transfers their data at a common moment in time; the global shutter differs from a “rolling shutter” where data is serially transferred off the sensor over time, rather than all at once. While the rolling shutter is a simpler architecture and enables a smaller, low-cost sensor, it distorts rapidly moving objects (Figure 7).
Figure 7: The “rolling shutter” artifacts often found when capturing images containing fast-moving subjects using conventional CMOS image sensors are not encountered with the alternative “global shutter” approach (courtesy Aptina Imaging).
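The skew introduced by a rolling shutter can be reproduced with a toy model. The sketch below tracks the leading edge of a vertical bar moving horizontally at constant speed and records the column where that edge appears on each row. It assumes each row is sampled one fixed line time after the previous one, and it models a global shutter as a line time of zero; the speed and line time are hypothetical values.

```python
def captured_edge_columns(rows, start_col, speed_cols_per_s, line_time_s):
    """Column where a moving vertical edge appears on each captured row.

    Row r is sampled at time r * line_time_s, by which point the edge has
    moved speed_cols_per_s * r * line_time_s columns from its start.
    """
    return [
        round(start_col + speed_cols_per_s * r * line_time_s)
        for r in range(rows)
    ]


# Global shutter: every row is sampled at the same instant.
global_shutter = captured_edge_columns(5, start_col=10, speed_cols_per_s=1000, line_time_s=0.0)

# Rolling shutter: each row is sampled 1 ms after the previous one.
rolling_shutter = captured_edge_columns(5, start_col=10, speed_cols_per_s=1000, line_time_s=0.001)

print("global: ", global_shutter)
print("rolling:", rolling_shutter)
```

With the global shutter the edge stays in one column on every row, so the bar remains vertical. With the rolling shutter the edge drifts by one column per row, so the same bar is recorded as a slanted line.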
In vision applications where a core requirement is identifying objects, global shutter sensors are often necessary to remove or totally avoid these distortions and artifacts. Existing global shutter sensors add local memory adjacent to the image sensor to store the pixel data at a common point in time until it's ready to download. With “stacked” image sensors, local memory for global shutter sensors can be moved under the image sensor, reducing the overall die size and cost of the sensor.
Closing thoughts and additional insights
The “perfect storm” of robust vision processor and image sensor technology innovations, tremendous processor performance, and strong market interest provides an exciting environment to drive tremendous growth of computational photography and computer vision in the coming years. Numerous startups are emerging from academia, where specialized vision technology has been incubating over the last twenty-plus years, and elsewhere.
With the emergence of new processing architectures, both within mobile devices as well as in the “cloud”, powerful vision-processing platforms now exist. They are enabling new algorithms and applications that will drive new capabilities and new growth, in a sustaining cycle that is likely to continue for quite some time.
Michael McDonald is President of Skylane Technology Consulting, which provides marketing and consulting services to companies and startups primarily in vision, computational photography, and ADAS markets. He has over 20 years of experience working for technology leaders including Broadcom, Marvell, LSI, and AMD. He has a BSEE from Rice University.
Notes on the Embedded Vision Alliance: To learn more about computational photography, computer vision, or the many applications they enable, use the various resources provided by the Embedded Vision Alliance, which has the mission to provide engineers with practical education, information, and insights to help them incorporate vision capabilities into new and existing products in the form of tutorial articles, videos, code downloads and a discussion forum. Registered website users can also receive the Alliance’s email newsletter, among other benefits.
In addition, the Embedded Vision Alliance offers a free online training facility for embedded vision product developers: the Embedded Vision Academy, which provides in-depth technical training and other resources to help engineers. Course material in the Embedded Vision Academy spans a wide range of vision-related subjects, from basic vision algorithms to image pre-processing, image sensor interfaces, and software development techniques and tools such as OpenCV. Access is free to all who register.
The Alliance also holds Embedded Vision Summit conferences, which are technical educational forums for product creators interested in incorporating visual intelligence into electronic systems and software. The most recent Embedded Vision Summit was held in May 2014, and an archive of keynote, technical tutorial and product demonstration videos and presentation slide sets is available on the Alliance website. The next Embedded Vision Summit will take place on April 30, 2015 in Santa Clara, California.