Achieving Optimized DSP Encoding for Video Applications
As digital video continues to extend visual communication to an ever-larger range of applications, more developers are becoming involved in creating new video systems or enhancing the capabilities of existing ones.
Among the basic design considerations video developers face is that the high degree of compression involved demands a high level of performance from the processor. In addition, the wide range of video applications requires performance to be optimized to meet system requirements that can vary widely in terms of transmission bandwidth, storage, image specifications and quality requirements.
Among the available solutions, programmable digital signal processors (DSPs) offer the high level of real-time performance required for compression, as well as flexibility that enables systems engineers to adapt the encoding software readily to individual applications.
The goal for video compression is to encode digital video using as few bits as possible while maintaining acceptable visual quality. While encoding algorithms are based on the mathematical principles of information theory, they often require implementation trade-offs that approach being an art form.
Well designed encoders can help developers make these trade-offs through innovative techniques and support of the options offered by advanced compression standards. A configurable video encoder that is designed to leverage the performance and flexibility of DSPs through a straightforward system interface can help systems engineers optimize their products easily and effectively.
Key factors in compression
Like JPEG for still images, the widely used ITU and MPEG video encoding algorithms can employ a combination of discrete cosine transform (DCT) coding, quantization and variable-length coding to compress macro-blocks within a frame (intra-frame compression).
Once the algorithm has established a baseline intra-coded (I) frame, a number of subsequent predicted (P) frames are created by coding only the difference in visual content, or residual, between successive frames. This inter-frame compression is achieved using a technique called motion compensation: the algorithm first estimates where the macro-blocks of an earlier reference frame have moved in the current frame, then subtracts the prediction and compresses the residual.
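The inter-frame step can be sketched as follows. This is a minimal illustration of the subtract-and-reconstruct idea, not any particular codec's implementation, and the function names are hypothetical:

```python
import numpy as np

def residual(current_block: np.ndarray, predicted_block: np.ndarray) -> np.ndarray:
    """Encoder side: subtract the motion-compensated prediction from the
    current block. Only this residual is transformed, quantized and
    variable-length coded."""
    return current_block.astype(np.int16) - predicted_block.astype(np.int16)

def reconstruct(residual_block: np.ndarray, predicted_block: np.ndarray) -> np.ndarray:
    """Decoder side: add the decoded residual back onto the prediction."""
    return (residual_block + predicted_block.astype(np.int16)).astype(np.uint8)

# A static region produces an all-zero residual, which compresses to almost nothing.
ref = np.full((8, 8), 128, dtype=np.uint8)
assert not residual(ref.copy(), ref).any()
```

The residual for unchanged content is all zeros, which is why a well-predicted P frame costs far fewer bits than an I frame.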
|Figure 1. Motion Compensation-Based Video Encoding|
Figure 1 above shows the flow of a generic motion compensation-based video encoder. The macro-blocks typically contain four 8×8-pixel luminance blocks and two 8×8-pixel chrominance blocks (YCbCr 4:2:0). The motion estimation stage, which creates motion vector (MV) data that describe where each of the blocks has moved, is usually the most computation-intensive stage of the algorithm.
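A minimal full-search block-matching sketch shows why motion estimation dominates the computation: every candidate offset in the search window requires a full block comparison. The sum-of-absolute-differences (SAD) cost and the function names are conventional illustrative choices, not mandated by any standard:

```python
import numpy as np

def sad(a: np.ndarray, b: np.ndarray) -> int:
    """Sum of absolute differences: the usual block-matching cost."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def full_search(cur_block, ref_frame, top, left, radius=4):
    """Exhaustively test every offset in a (2*radius+1)^2 window of the
    reference frame and return the motion vector (dy, dx) with the lowest
    SAD. Real encoders use faster hierarchical or predictive searches."""
    h, w = cur_block.shape
    best, best_cost = (0, 0), None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + h > ref_frame.shape[0] or x + w > ref_frame.shape[1]:
                continue
            cost = sad(cur_block, ref_frame[y:y + h, x:x + w])
            if best_cost is None or cost < best_cost:
                best_cost, best = cost, (dy, dx)
    return best, best_cost
```

Even this tiny example performs 81 block comparisons per macro-block; at broadcast resolutions and frame rates the arithmetic load is what makes this stage the natural target for DSP acceleration.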
Video compression standards specify only the bit-stream syntax and the decoding process, leaving a large scope for innovation within the encoders. For example, in the motion estimation stage, the ways that the motion vectors describe block movement are standardized, but there are no constraints on what techniques an encoder can use to determine the vectors.
Rate control is another area with significant latitude for innovation, allowing the encoder to assign quantization parameters and thus "shape" the noise in the video signal in appropriate ways.
In addition, the advanced H.264/MPEG-4 AVC standard adds flexibility and functionality by providing multiple options for macro-block size, quarter-pel (pixel) resolution for motion compensation, multiple-reference frames, bi-directional frame prediction (B frames) and adaptive in-loop de-blocking. Thus there is a large potential for trading off the various encoder options in order to balance complexity, delay and other real-time constraints.
|Figure 2. Frame Motion Vectors and Residual. A P frame (right) and its reference (left). Below the P frame, the residual (black) shows how little encoding remains once the motion vectors (blue) have been calculated.|
Surveillance and storage
Encoding must be optimized in order to meet requirements that can vary enormously among applications. For instance, in surveillance, a central problem is determining how to store the vast amount of visual information generated by networked cameras.
One solution is to keep only the frames in which significant or suspicious activity occurs, such as someone entering or leaving through a secure door. The software that checks for this kind of detail relies on the compression algorithm for motion information. A high magnitude of the motion vectors indicates significant activity, and the frame is stored. Some, but not all, encoders provide access to motion vectors.
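A storage decision driven by motion vector magnitude might look like the following sketch; the threshold and fraction are tunable assumptions, not values from any standard or product:

```python
def significant_activity(motion_vectors, threshold=4.0, min_fraction=0.05):
    """Flag a frame for storage when enough macro-blocks have motion
    vectors whose magnitude exceeds a threshold. Both the magnitude
    threshold and the fraction of moving blocks are illustrative
    tuning parameters.

    motion_vectors: list of (dy, dx) tuples, one per macro-block.
    """
    if not motion_vectors:
        return False
    moving = sum(1 for dy, dx in motion_vectors
                 if (dy * dy + dx * dx) ** 0.5 > threshold)
    return moving / len(motion_vectors) >= min_fraction
```

This kind of analysis is only possible if the encoder exposes its motion vectors, which is why that access matters for surveillance systems.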
Differential encoding, the ability to encode at two different rates simultaneously, enhances the functionality of surveillance systems by allowing the system to display video at one rate on a monitor while storing it on disk at another rate. Also useful is an encoder that can dynamically trade off two low-quality channels against a single high-quality channel, enabling the system to select one camera feed over another when significant activity occurs in the frame.
Video conferencing and bandwidth
In video conferencing the most important issue is usually the transmission bandwidth, which can range from tens of kilobits per second up to multi-megabits per second. With some links the bit rate is guaranteed, but with the Internet bit rates are highly variable.
Video conferencing encoders may thus have to address the delivery requirements of different types of links and adapt in real time to changing bandwidth availability. The transmitting system, when it is advised of reception conditions via a reverse channel or RTCP acknowledgement, should be able to adjust its encoded output continually so that the best possible video is delivered with minimal interruption.
When delivery is poor, the encoder may respond by reducing its average bit rate, skipping frames, or changing the group of pictures (GoP), the mix of I and P frames. I frames are not as heavily compressed as P frames, so a GoP with fewer I frames requires less bandwidth overall. Since the visual content of a video conference does not change frequently, it is usually acceptable to send fewer I frames than would be needed for entertainment applications.
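A crude version of this feedback loop can be sketched as follows; the interval and frame-rate limits are illustrative assumptions, not values from any standard:

```python
def adapt_to_bandwidth(available_kbps, target_kbps, i_interval, frame_rate):
    """Return an adjusted (i_interval, frame_rate) when the reported
    channel rate falls below the target. Widening the I-frame interval
    saves the most bits, since I frames are the least compressed;
    frame skipping is the fallback. The caps are illustrative."""
    if available_kbps >= target_kbps:
        return i_interval, frame_rate
    # First send I frames less often (here capped at one every 300 frames).
    if i_interval < 300:
        return min(i_interval * 2, 300), frame_rate
    # Then skip frames, keeping at least 10 fps for conversational video.
    return i_interval, max(frame_rate // 2, 10)
```

In practice an encoder would also lower its average bit rate via the quantizer, but the ordering shown, stretch the GoP first, skip frames second, matches the relatively static content of a video conference.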
H.264 specifies an adaptive in-loop de-blocking filter that smooths block edges in the reconstructed frame, giving motion compensation in subsequent frames a cleaner reference. The filter significantly improves the subjective quality of encoded video, especially at low bit rates.
On the other hand, turning off the filter can increase the amount of visual data at a given bit rate, as can changing the motion estimation resolution from quarter-pel to half-pel or more. In some cases, it may be necessary to sacrifice the higher quality of de-blocking and fine resolution in order to reduce the complexity of encoding.
|Figure 3. Intra-Coded Strips in P Frames|
Since packet delivery via the Internet is not guaranteed, video conferencing often benefits from encoding mechanisms that increase error resilience. For instance, progressive strips of P frames can be intra-coded (I strips), as shown in Figure 3, above.
This technique eliminates the need for complete I frames (after the initial frame), reducing the risk that an entire I frame will be dropped and the picture broken up. Also, without the bursts created by I frames, the data flow is steadier.
There is a trade-off in compression, though, since the presence of I strips reduces the encoder's ability to exploit spatial redundancy. About two to five percent is lost in bit rate, so it is useful if the encoder can switch this capability on or off as needed for coping with network delivery conditions.
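The strip-rotation idea reduces to cycling an intra-coded strip index through successive P frames; in this sketch (hypothetical helper, illustrative strip count) the entire picture is refreshed once per cycle without any full I frame:

```python
def i_strip_index(frame_number, strips_per_frame):
    """Which horizontal strip of macro-block rows to intra-code in this
    P frame. Cycling through all strips refreshes the whole picture
    every `strips_per_frame` frames, so a lost packet corrupts at most
    one strip-cycle of content rather than a whole I frame."""
    return frame_number % strips_per_frame
```

Because each frame carries one strip's worth of intra data instead of an occasional full I frame, the bit-rate profile is much flatter, at the cost of the two-to-five percent compression loss noted above.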
Mobile video requirements
In wireless phones with video capabilities, bandwidth is at a premium, even with 3G channels. Processing may also be more limited in these systems, since handsets are designed to trade off performance for low power consumption.
The encoder at the transmitting end has to take the receiving limitations into account, at the very least by adjusting the resolution to the small display. For video streaming, a lower frame rate with fewer I frames or I strips is likely, since the picture degradation is not as apparent as it would be on a larger display, while the bandwidth savings are considerable.
Video conferencing is an extreme case, since the handset needs to encode as well as decode. Because the background is usually static and there is little motion in the foreground, the configuration may be a single I frame at the beginning of the call followed by P frames only.
Buffering for recording
For digital video recorders (DVRs), achieving the best trade-off of storage with picture quality can be difficult. Compression for video recording is delay-tolerant to some extent, so the output buffer can be designed for handling enough frames to keep a steady flow of data to the disk.
Under certain conditions, however, the buffer may become congested because the visual information is changing quickly and the algorithm is creating a large amount of P frame data. In this case, the encoder will trade off picture quality for a lower bit rate, and when the congestion has been resolved, increase the quality again.
A mechanism for performing this trade-off effectively is through rate control, such as changing the quantization parameter (Qp) on the fly. Quantization, as shown in Figure 1, is one of the last steps in the algorithm for compressing data.
Increased quantization reduces the bit rate output of the algorithm but creates picture distortion in direct proportion to the square of Qp. Thus the individual frames lose detail, but the likelihood of frame skips and picture break-up is reduced.
Also, since the visual content is changing rapidly, lower quality is likely to be less noticeable than it is when the content changes slowly. When the visual content returns to a lower bit rate and the buffer clears, Qp can be reset to its normal value.
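A buffer-driven Qp adjustment along these lines can be sketched as follows; the thresholds and step sizes are illustrative assumptions, and the upper clamp reflects H.264's Qp range of 0 to 51:

```python
def adjust_qp(qp, buffer_fullness, qp_min=10, qp_max=51, high=0.8, low=0.3):
    """Raise Qp (coarser quantization, fewer output bits, distortion
    growing roughly with the square of Qp) when the output buffer nears
    overflow; lower it again once the buffer drains. The fullness
    thresholds and step sizes are illustrative tuning choices.

    buffer_fullness: fraction of the output buffer in use, 0.0 to 1.0.
    """
    if buffer_fullness > high:
        return min(qp + 2, qp_max)   # congested: trade quality for rate
    if buffer_fullness < low:
        return max(qp - 1, qp_min)   # drained: restore quality
    return qp
```

Stepping Qp up faster than it steps down biases the controller toward avoiding frame skips, which viewers notice more than a brief loss of detail.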
Sweet spots
An important consideration for optimization is the existence of "sweet spots" that offer the best trade-offs of bit rates, resolutions and frames per second (fps) for a given application.
For instance, for H.264 compression of DVD content, a typical encoder's optimal compression for bit rates below 128 kbps is 15-fps QCIF. Doubling the bit rate allows twice as many frames with the same resolution. At rates up to 1 Mbps, 30-fps CIF or QVGA may be optimal, and at higher rates 30-fps VGA or D1.
Video conferencing and surveillance, with their more static visual content, require less bandwidth for optimization. An encoder may optimize compression for these applications at the same resolutions and frame rates listed above but with bit rates some 30 percent lower.
Similarly, MPEG-4 encoding may push the bit rates some 30 percent higher in order to take advantage of mechanisms for higher quality such as quarter-pel resolution. A versatile encoder supports a number of such sweet spots in order to optimize video for many applications.
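The H.264 sweet spots quoted above for DVD content reduce to a simple lookup. The boundaries below are taken from the text, and returning only the first of each resolution pair (CIF rather than QVGA, VGA rather than D1) is a simplification:

```python
def h264_sweet_spot(bit_rate_kbps):
    """Map a target bit rate to a (resolution, fps) sweet spot for
    H.264 encoding of DVD-type content, per the figures in the text.
    Boundaries are illustrative, not mandated by any standard."""
    if bit_rate_kbps < 128:
        return ("QCIF", 15)
    if bit_rate_kbps < 256:
        return ("QCIF", 30)   # doubled rate buys doubled frame rate
    if bit_rate_kbps <= 1000:
        return ("CIF", 30)    # QVGA is an equivalent alternative
    return ("VGA", 30)        # D1 is an equivalent alternative
```

For video conferencing or surveillance, the text suggests shifting each boundary down by roughly 30 percent; for MPEG-4 with quarter-pel resolution, up by roughly the same amount.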
Since DSPs are used in a wide range of video applications, DSP encoders should be designed to take advantage of the flexibility inherent in compression standards.
An example can be found in the encoders that operate on Texas Instruments' OMAP media processors for mobile applications, TMS320C64x+ DSPs or processors based on DaVinci technology for digital video applications.
In order to maximize compression performance, each of the encoders is designed to leverage the DSP architecture of its platform, including the video and imaging coprocessor (VICP) that is designed into some of the processors.
Use of the encoders is straightforward, with a uniform set of application programming interfaces (APIs) for all versions. By default, API parameters are preset for high quality and another preset option is available for high speed encoding. Extended parameters adapt the application to either H.264 or MPEG-4.
The encoders support several options including YUV 4:2:2 and YUV 4:2:0 input formats, motion resolution down to a quarter-pel, I frame intervals ranging from every frame to none after the first frame, Qp bit rate control, access to motion vectors, de-blocking filter control, simultaneous encoding of two or more channels, I strips and other options.
By default, the encoders determine the motion vector search range dynamically and without a fixed limit, a technique that improves on fixed-range searches.
System engineers need to be aware of the many differences that exist among entertainment, video conferencing, surveillance, mobile and other video applications.
Among these, widely varying requirements for transmission, storage, display and content force trade-offs between bit rate and picture quality, and the different ways of achieving these trade-offs make system optimization something of an art form. H.264/MPEG-4 AVC provides a number of options that can affect these trade-offs, and the standard does not specify how the encoder should determine data such as motion vectors.
Well designed encoders use this latitude to give systems engineers performance and flexibility in adapting compression to the requirements of applications in the rapidly expanding world of digital video.
Dr. Ajit Rao is the manager for multimedia codecs at Texas Instruments. In this role he is responsible for development and technical direction for TI's multimedia codec products. Prior to joining TI three years ago, he was a lead codec developer at Microsoft and SignalCom. With more than seven years of codec experience, Ajit holds four U.S. patents.