Creating video that mimics human visual perception

Recent breakthroughs in core video processing techniques have brought video technology to the point where it can begin to rival the capabilities of the human visual system. For one, the last two decades have seen a dramatic increase in the number of pixels that display systems can accommodate, enabling the transition from standard-definition (SD) to high-definition (HD) video.

Another notable development is the marked improvement in pixel quality, exemplified by high dynamic range (HDR) systems, which are displacing their low dynamic range (LDR) equivalents.

Moreover, approaches to image understanding that aim to replicate the perceptual abilities of the human brain have met with encouraging success, as have 3D video systems in their drive to supersede their 2D counterparts.

These advanced techniques converge on a common purpose: to erase the boundaries between the real and digital worlds by capturing video that mimics the various aspects of human visual perception. These aspects relate to video processing research in four fields: video capture, display technologies, data compression and video content understanding.

Video capture in 3D, HD and HDR
The two distinct technologies used to capture digital video are the charge-coupled device (CCD) and the complementary metal-oxide-semiconductor (CMOS) image sensor, both of which convert light intensity into electric charge that is later processed as an electronic signal.

Building on a remarkable half-century of continued development, these technologies enable the capture of HD video of exceptional quality. For HDR video, however, they fall well short of a typical human eye, which has a dynamic range (the ratio of the brightest to the darkest visible parts of a scene) of about 10000:1.
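To put the 10000:1 ratio in photographic terms, it can be converted into "stops," where each stop is a doubling of light. The sketch below does that conversion. The comparison with a linear 8-bit encoding (codes 1 to 255) is an assumption added for illustration and is not a figure from this article.

```python
import math

def dynamic_range_in_stops(contrast_ratio):
    """Number of doublings of light between the darkest and brightest level."""
    return math.log2(contrast_ratio)

# The 10000:1 ratio quoted for the human eye: roughly 13 stops.
eye_stops = dynamic_range_in_stops(10000)

# Assumption for comparison: a linear 8-bit signal spans codes 1..255,
# i.e. a ratio of 255:1, which is just under 8 stops.
ldr_stops = dynamic_range_in_stops(255)

print(f"eye: {eye_stops:.1f} stops, linear 8-bit: {ldr_stops:.1f} stops")
```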

Existing digital camcorders can capture either the brighter portions of a scene, using short exposures, or the darker portions, using longer exposures, but not both at once.

In practice, this shortcoming can be circumvented by using multiple camcorders with one or two beam splitters, so that several video sequences are captured concurrently under different exposure settings.

Beam splitters allow the same scene to be captured simultaneously as several LDR sequences, the best-exposed portions of which are then used to synthesize HDR video. From a research perspective, the challenge is to achieve this higher dynamic range with a single camcorder, accepting an unavoidable but barely perceptible reduction in quality.
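The idea of combining the best portions of differently exposed frames can be sketched as a weighted average. This is a minimal illustration, not the method used by any particular camera: it assumes a linear sensor response, 8-bit pixel values, and a simple "hat" weighting that distrusts pixels close to black or to saturation. The function name and the weighting are hypothetical choices.

```python
def merge_exposures(frames, exposure_times, max_value=255):
    """Merge LDR frames of one scene into relative radiance estimates.

    frames: list of frames, each a flat list of pixel values (0..max_value).
    exposure_times: exposure duration of each frame, in the same order.
    Assumes a linear sensor response, which real cameras only approximate.
    """
    def weight(value):
        # Hat function: mid-range pixels are trusted, clipped ones get weight 0.
        return min(value, max_value - value)

    merged = []
    for pixels in zip(*frames):
        weights = [weight(v) for v in pixels]
        total = sum(weights)
        if total == 0:
            # Every exposure is clipped here: fall back to the shortest exposure.
            i = exposure_times.index(min(exposure_times))
            merged.append(pixels[i] / exposure_times[i])
        else:
            # Dividing by exposure time puts all frames on a common scale.
            estimate = sum(w * v / t for w, v, t in zip(weights, pixels, exposure_times))
            merged.append(estimate / total)
    return merged

short_exposure = [10, 200]   # dark pixel is noisy, bright pixel is well exposed
long_exposure = [80, 255]    # dark pixel is well exposed, bright pixel is saturated
radiance = merge_exposures([short_exposure, long_exposure], [1, 8])
```

In this example the saturated value in the long exposure receives zero weight, so the bright pixel is taken entirely from the short exposure.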

Moreover, it is envisioned that HDR camcorders equipped with advanced image sensors may serve this purpose in the near future.

3D capture technologies widely employ stereoscopic techniques, obtaining stereo pairs with a two-view setup. The cameras are mounted side by side, with a separation typically equal to the distance between a person's pupils.

Light from distant objects arrives at each eye along nearly the same line of sight, while light from closer objects arrives at different angles. Exploiting this difference, realistic 3D images can be obtained from the stereoscopic image pair.
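The relationship between viewing angle and distance described above is usually expressed through disparity, the horizontal shift of a point between the two views. The sketch below assumes an idealized setup not specified in this article: two identical pinhole cameras with parallel optical axes. The 65 mm baseline and the focal length in pixels are illustrative values only.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Distance to a point, from its horizontal shift between the two views.

    Assumes rectified, parallel pinhole cameras: depth = f * B / d.
    A disparity of zero corresponds to a point at infinity.
    """
    if disparity_px <= 0:
        return float("inf")
    return focal_length_px * baseline_m / disparity_px

# Illustrative values: 65 mm camera separation, focal length of 1000 pixels.
near = depth_from_disparity(100, 1000, 0.065)  # large shift, close object
far = depth_from_disparity(10, 1000, 0.065)    # small shift, distant object
```

Nearby objects produce large disparities and distant ones produce small disparities, which is the cue a stereoscopic display reproduces.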

Multiview technology, an alternative to stereoscopy, captures 3D scenes by recording several independent video streams using an array of cameras. Plenoptic cameras, which capture the light field of a scene, can also be used for multiview capture with a single main lens. The resulting views can then either be shown on multiview displays or stored for further processing.

Compressing 3D, HD and HDR videos
Transmitting 3D, HD and HDR videos as they are captured imposes impractical bandwidth requirements on transmission systems. For example, an uncompressed 2-megapixel 2D video captured at 60 frames per second (fps) would require nearly 2 Gbps, twice the highest bandwidth currently available on OpenNet.

For HDR video, each pixel can be represented by a 96-bit floating-point value; an uncompressed 2-megapixel 2D HDR video at 60 fps would therefore require nearly 12 Gbps. Captured video data must thus be compressed efficiently to make transmission practical, a problem investigated in the field of video coding.
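The two bandwidth figures above follow from multiplying pixels per frame, frames per second and bits per pixel. The article does not state the pixel format of the LDR case; the 16 bits per pixel used below is an assumption (8-bit samples with 4:2:2 chroma subsampling average 16 bits per pixel) chosen because it reproduces the "nearly 2 Gbps" figure.

```python
def raw_bitrate_gbps(pixels_per_frame, frames_per_second, bits_per_pixel):
    """Uncompressed video bitrate in gigabits per second."""
    return pixels_per_frame * frames_per_second * bits_per_pixel / 1e9

# Assumed 16 bits per pixel for the LDR case (not stated in the article).
ldr_gbps = raw_bitrate_gbps(2_000_000, 60, 16)

# 96 bits per pixel for the HDR case, as stated in the text.
hdr_gbps = raw_bitrate_gbps(2_000_000, 60, 96)

print(f"LDR: {ldr_gbps:.2f} Gbps, HDR: {hdr_gbps:.2f} Gbps")
```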

Video compression hinges primarily on two intuitive concepts. First, successive raw video frames are highly similar, so a large amount of redundant information exists between them. Second, redundancy exists even within a single frame: in real-life scenes, pixels in the vicinity of one another are likely to have similar values.

Removing these redundancies by means of advanced coding techniques gives rise to video formats that greatly reduce the amount of data needed for transmission and storage. Moreover, advanced predictive techniques can be used during decompression to preserve high video fidelity despite substantial compression.
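Both kinds of redundancy can be illustrated with simple prediction: instead of storing a pixel, store its difference from a prediction, which is usually small. The sketch below is a deliberately simplified illustration with hypothetical function names. Real standards use motion-compensated block prediction, transforms and entropy coding, none of which is shown here.

```python
def temporal_residual(previous_frame, current_frame):
    """Predict each pixel from the same position in the previous frame."""
    return [cur - prev for prev, cur in zip(previous_frame, current_frame)]

def spatial_residual(row):
    """Predict each pixel from its left neighbour (the first pixel from 0)."""
    return [row[0]] + [row[i] - row[i - 1] for i in range(1, len(row))]

def reconstruct_row(residual):
    """Decoder side: repeat the same prediction and add the residual back."""
    row, previous = [], 0
    for r in residual:
        previous += r
        row.append(previous)
    return row

frame_a = [100, 101, 101, 103]
frame_b = [100, 101, 102, 103]   # the next frame, almost identical

between_frames = temporal_residual(frame_a, frame_b)
within_frame = spatial_residual(frame_a)
restored = reconstruct_row(within_frame)
```

The residuals consist mostly of zeros and small numbers, which an entropy coder can represent with far fewer bits than the original values, and the decoder recovers the row exactly by repeating the prediction.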

Research on improving the compression efficiency of 2D video coding has been conducted extensively over the last few decades. As a measure of progress, the bandwidth required to transmit video of the same quality and resolution has steadily decreased.

Much of this is attributed to the progressive development of video coding standards such as MPEG-1, MPEG-2, MPEG-4 and H.264/AVC, which devices adhere to so that interoperability between them is guaranteed. For example, MPEG-2 is used to code videos in the DVD format, ensuring that any DVD can be viewed on any standard-compliant DVD player.

One of the state-of-the-art video coding standards widely used today is H.264/AVC. The development of this standard, which included technical contributions by the Institute for Infocomm Research (I2R), culminated in its endorsement in 2003.

Thereafter, the standard quickly gained global acceptance for deployment in numerous consumer devices as well as in internet video. Beyond this achievement, I2R remains committed to developing and commercializing further industry solutions based on this contribution.

In response to consumers' continued demand for video content of ever higher resolution, the standards bodies ISO/IEC and ITU-T have identified the need for a new video coding standard that further improves compression efficiency to enable the transmission and storage of high-resolution content.

The Joint Collaborative Team on Video Coding (JCT-VC) was formed to coordinate the “high-efficiency video coding” (HEVC) effort, which aims to realize a video coding standard that provides a 50 percent reduction in the required bitrate relative to H.264/AVC. The inaugural meeting was held in April 2010, and I2R's involvement reflects a keen desire to contribute actively to this initiative.

For HDR videos, compression requires codecs that accept input pixels of higher precision. Traditionally, video compression techniques have considered only 8 bits-per-pixel (bpp) input videos, yet HDR videos require at least 10 to 14 bpp.

Consequently, the H.264/AVC standard was amended to allow for the input of HDR videos of up to 14 bpp. The HEVC standard currently being designed is also expected to provide the necessary capabilities to handle HDR videos of up to 14 bpp.

In the 3D realm, however, the choice of compression technique hinges on the way the videos are represented. One approach represents a 3D video as a collection of multiple 2D views, just as a stereoscopic video consists of a left and a right view.

Here, an obvious compression scheme is to compress each view independently, an approach known as “simulcast.” However, significant overlap may exist between the views, and this redundancy can be exploited for further compression.
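One way to exploit the overlap between views is to predict one view from the other after shifting it horizontally. The sketch below illustrates the idea on a single row of pixels. It is an assumption-laden toy, not the scheme of any standard: it uses one global shift per row, hypothetical function names, and a sum-of-absolute-differences match.

```python
def mean_abs_difference(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def best_disparity(left_row, right_row, max_shift):
    """Find the shift d for which right_row[i] best matches left_row[i + d]."""
    n = len(right_row)
    best_d, best_cost = 0, None
    for d in range(min(max_shift, n - 1) + 1):
        cost = mean_abs_difference(left_row[d:], right_row[:n - d])
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

def inter_view_residual(left_row, right_row, d):
    """Right view minus its prediction from the shifted left view."""
    return [right_row[i] - left_row[i + d] for i in range(len(right_row) - d)]

left = [10, 10, 50, 90, 50, 10, 10, 10]
right = [50, 90, 50, 10, 10, 10, 10, 10]   # same content, shifted by two pixels

d = best_disparity(left, right, max_shift=3)
residual = inter_view_residual(left, right, d)
```

With simulcast the right row would be coded in full. Here, once the shift is known, the overlapping part of the right row is predicted perfectly and its residual is all zeros.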

The display of 3D, HD and HDR videos
Most existing display devices can already display HD videos, but technologies for 3D and HDR videos have not enjoyed similar success despite the increasing availability of 3D television sets in the market.

In HDR technology, the gap between the dynamic range of display devices and that of real scenes prevents such videos from being displayed faithfully on existing monitors. One remedy is to compress the dynamic range of HDR videos while attempting to preserve the “feel” of the scenes; alternatively, devices can be designed to represent HDR videos directly. As with HDR capture technologies, the foreseeable future is likely to witness the emergence of HDR monitors, printers and other output devices.
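Compressing the dynamic range for a conventional display is commonly called tone mapping. The article does not name a specific operator; the sketch below uses a simple global curve of the form L / (1 + L), in the style of Reinhard's operator, purely as an illustration. The input is assumed to be relative scene luminance with no upper bound.

```python
def tone_map(luminances, max_display=255):
    """Map unbounded scene luminance onto a display range of 0..max_display.

    Uses the global curve L / (1 + L): dark values stay nearly linear,
    while very bright values are compressed smoothly toward the maximum.
    """
    return [round(max_display * lum / (1.0 + lum)) for lum in luminances]

scene = [0.0, 0.05, 3.0, 100.0, 10000.0]   # spans far more than 255:1
display = tone_map(scene)
```

Every output fits within the display range and the ordering of brightness is preserved, at the cost of reduced contrast in the highlights.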

For 3D videos, however, display involves presenting a separate stereoscopic image to each eye. To this end, two strategies have been adopted. First, 3D glasses are used to direct the offset images to the respective eyes.

For example, anaglyphic 3D uses passive red-cyan lenses, while polarization 3D uses passive polarized lenses. Second, instead of the viewer having to wear glasses, the display device itself directs the appropriate stereoscopic images to the viewer's eyes, a display technology termed auto-stereoscopy.
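The red-cyan anaglyph mentioned above can be composed from a stereo pair with very little processing. The sketch below assumes the common convention of a red filter over the left eye and a cyan filter over the right, and represents pixels as (red, green, blue) tuples. The function name is hypothetical.

```python
def make_anaglyph(left_pixels, right_pixels):
    """Combine a stereo pair into one red-cyan anaglyph image.

    The red channel comes from the left view; green and blue (cyan)
    come from the right view, so each lens passes only one view.
    """
    return [
        (left[0], right[1], right[2])
        for left, right in zip(left_pixels, right_pixels)
    ]

left_view = [(200, 10, 20), (0, 0, 0)]
right_view = [(30, 120, 130), (255, 255, 255)]
anaglyph = make_anaglyph(left_view, right_view)
```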

Additionally, while single-view displays present only one stereo pair at a time, multiview displays use head-tracking devices to change the view depending on the viewer's head position and viewing angle. In the special case of auto-multiscopic displays, multiple independent views of a scene are projected to a number of viewers.

These views are created instantly using the above-mentioned “2D views with depth information” approach. Various other display techniques such as holography, volumetric display and the Pulfrich effect also exist.

While 3D video provides a more realistic visual experience, this enhancement comes at a price: viewers may experience anything from minor fatigue to severe headaches. Many scientists and researchers are currently working on this problem, and solutions that eliminate or reduce the discomfort are expected to follow.

Understanding 3D, HD and HDR videos
The technology of video content search allows viewers to identify segments of interest within a video. Typically, good video search engines consist of two major modules, namely video content analysis (VCA) and query optimization.

VCA refers to the ability to automatically identify objects and events in videos. One key component of VCA is concept detection, essentially a classification task to predict the presence of certain concepts in various portions of the video.

Here, video shots or key frames are annotated automatically with semantic concepts. These include objects and scenes, which serve as good intermediate features for video content indexing and understanding.
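Treating concept detection as classification can be sketched with a very simple classifier. The example below is not the method described in this article: it assumes each key frame has already been reduced to a short numeric feature vector (the feature values and concept labels are made up), and it assigns the concept whose average training example is nearest.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def train(examples):
    """examples: dict mapping a concept label to its example feature vectors."""
    return {label: centroid(vectors) for label, vectors in examples.items()}

def annotate(key_frame_features, centroids):
    """Return the concept whose centroid is closest to the key frame."""
    return min(
        centroids,
        key=lambda label: math.dist(key_frame_features, centroids[label]),
    )

# Hypothetical two-dimensional features for two concepts.
labelled = {
    "face": [[0.9, 0.1], [0.8, 0.2]],
    "landmark": [[0.1, 0.9], [0.2, 0.8]],
}
model = train(labelled)
label = annotate([0.7, 0.3], model)
```

A detector of this kind can only recognize concepts it has seen examples of, and it always returns one of them even for unrelated input.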

While the overall accuracy for concept detection remains somewhat unsatisfactory, significant breakthroughs have been made in the detection of several concepts including faces and prominent landmarks.

Notably, current concept detection methods are effective only for a limited set of concepts. Future directions therefore include the design of universal concept detectors able to recognize situations for which no prior exemplars exist.

The other module, termed “query optimization,” aims to understand the search intent of users from textual or multimedia queries. To satisfy user intent, multiple query plans are examined until an appropriate one is identified.

The multitude of ways of identifying suitable query plans makes it impossible to single out one best strategy. In fact, the time spent devising the best plan must be traded off against the time required to run the plan itself.
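The tradeoff between planning time and execution time can be made concrete with a stopping rule. The rule below is a hypothetical illustration, not I2R's method: it keeps examining candidate plans until the time already spent planning is at least the estimated run time of the best plan found so far. The plan names, costs and fixed per-plan planning cost are all made up.

```python
def choose_plan(candidate_plans, planning_cost_per_plan):
    """Pick a query plan while limiting the time spent searching for one.

    candidate_plans: iterable of (name, estimated_run_time) pairs.
    Returns the chosen plan and the total planning time spent.
    """
    best = None
    spent = 0.0
    for name, run_time in candidate_plans:
        spent += planning_cost_per_plan
        if best is None or run_time < best[1]:
            best = (name, run_time)
        if spent >= best[1]:
            # Further searching would cost more than simply running this plan.
            break
    return best, spent

plans = [("a", 10.0), ("b", 3.0), ("c", 1.0), ("d", 0.5)]
chosen, planning_time = choose_plan(plans, planning_cost_per_plan=1.0)
```

In this example plan "d" would run fastest, but it is never examined: by the time plan "c" is found, more time has been spent planning than "c" takes to run.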

Despite the many challenges in this increasingly complex field, I2R has developed a state-of-the-art query optimization method that uses a set of features, including surface and syntactic patterns, to map search queries to distinct video modalities.

The enabling technologies of video capture, display, compression and understanding have undergone astounding developments over the last few decades, and their ultimate aim, realism in the perception of digital images, has been well defined.

With these in place, together with the breakthroughs expected in the foreseeable future, it might not be long before we begin using machines that can perfectly reproduce everything we see and, at the same time, understand its semantics.

Susanto Rahardja is Deputy Executive Director, Agency for Science, Technology and Research in Singapore.
