Video codecs, part 3: H.264 & video over networks
Part 3 details the operation of H.264/AVC and discusses issues involved in transmitting video over networks.
By John W. Woods
DSP DesignLine
(07/23/08, 12:00:00 PM EDT)
This series is excerpted from "Multidimensional Signal, Image, and Video Processing and Coding."

Part 2 looks at interframe coding, MPEG-2 & MPEG-4. Part 4 looks at Wavelet codecs.


11.3.5 Video Processing of MPEG-Coded Bitstreams
Various video processing tasks have been investigated for coded data and MPEG 2 in particular. In video editing of MPEG bitstreams, the question is how to take two or more MPEG input bitstreams and make one composite MPEG output bitstream. The problem with decoding/recoding is that it introduces artifacts and is computationally demanding. Staying as much in the MPEG 2 compressed domain as possible and reusing the editing mode decisions and motion vectors have been found essential. Since the GOP boundaries may not align, it is necessary to recode the one GOP where the edit point lies. Even then original bitrates may make the output video buffer verifier (VBV) overflow. The solution is to requantize this GOP and perhaps neighboring GOPs to reduce the likelihood of such buffer overflow. The recent introduction of the HDV format, featuring MPEG 2 compression of HD video in camera, brings the MPEG bitstream editing problem to the forefront. Many software products are emerging and promise to edit HDV video in its so-called native mode.

The transcoding problem for MPEG is how to go from a high-quality level of MPEG to a lower one without decoding and recoding at the desired output rate. Transcoding is of interest for video-on-demand (VoD) applications. It is also of interest in video production work, where a short GOP (IPIP) may be used internally for editing and program creation, while the longer IBBPBBP··· GOP structure is necessary for program distribution. Finally, there is the worldwide distribution problem, where 525 and 625 line systems (see footnote 3) continue to coexist. Here a motion-compensated transcoding of the MPEG bitstream is required. For more on transcoding, see Chapter 6.3 in Bovik's handbook [48].

11.3.6 H.263 Coder for Visual Conferencing
The H.263 coder from the ITU evolved from the earlier H.261, or p×64, coder. As the original name implies, it is targeted at rates that are a multiple of 64 Kbps. To achieve such low bitrates, it resorts to a small QCIF frame size and a variable, low frame rate, with bitrate control based on buffer fullness. If there is a lot of motion in detailed areas, which generates a lot of bits, then the buffer fills and the frame rate is reduced, i.e., frames are dropped at the encoder. The user can specify a target frame rate, but often the H.263 coder at, say, 64 Kbps will not achieve a target frame rate of 10 fps. The H.263 coder features a group of blocks (GOB) structure, rather than a GOP, with I blocks inserted randomly for refresh. While there are no B frames, there is the option of so-called PB frames. The coder has half-pixel accurate motion vectors, like MPEG 2, and can use overlapped motion vectors from neighboring blocks to achieve a smoother motion field and reduced blocking artifacts. Also, there is an advanced prediction mode option and an arithmetic coder option.

The reason for adopting the GOB structure rather than the GOP structure is the need to avoid the I frames of the GOP structure, because they require a lot of bits to transmit, a difficulty in videophone, which H.263 targets as a main application. In videophone, low bitrates and a short latency requirement (≤ 200 msec) militate against the bit-hungry I frames. As a result, in H.263, slices or GOBs are updated by I slices randomly, thus reducing the variance in coded frame sizes that occurs with the GOP structure. High variance of bits/frame is not a problem in MPEG 2 because of its targeted entertainment applications, such as video streaming, including multicasting, digital broadcasting, and DVD. Some low bitrate H.263 coding results are contained on the enclosed CD-ROM.
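The buffer-driven frame dropping described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the rate control of any H.263 test model: the function name, the per-frame bit costs, and the skip threshold are all hypothetical, and the buffer is assumed to drain at a constant channel rate.

```python
def simulate_frame_skipping(frame_bits, channel_bps, source_fps, skip_threshold_bits):
    """Return the indices of the source frames that are actually coded.

    frame_bits[i] is a hypothetical bit cost for coding source frame i.
    The encoder buffer drains by channel_bps / source_fps bits during each
    source-frame interval. A frame is dropped (not coded) whenever the
    buffer already holds more than skip_threshold_bits when it arrives.
    """
    drain_per_interval = channel_bps / source_fps
    fullness = 0.0
    coded = []
    for index, bits in enumerate(frame_bits):
        if fullness <= skip_threshold_bits:
            fullness += bits          # code this frame into the buffer
            coded.append(index)
        # the channel removes bits during this frame interval
        fullness = max(0.0, fullness - drain_per_interval)
    return coded


# Three quiet frames followed by six busy ones at 64 Kbps (made-up costs).
costs = [2000] * 3 + [12000] * 6
print(simulate_frame_skipping(costs, channel_bps=64000, source_fps=30,
                              skip_threshold_bits=6400))
```

With these made-up numbers the quiet frames are all coded, while roughly two of every three busy frames are skipped, which is the drop in achieved frame rate the text describes.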

11.3.7 H.264/AVC
Research on increasing coding efficiency continued through the late 1990s, and it was found that up to a 50% increase in coding efficiency could be obtained by various improvements to the basic hybrid coding approach of MPEG 2. Instead of using one hypothesis for the motion estimation, multiple hypotheses from multiple locations in past frames could be used [39], together with an optimization approach to allocate the bits. By 2001, the video standards groups at ITU, the Video Coding Experts Group (VCEG), and ISO MPEG convened a joint video team (JVT) to work on the new standard, to be called H.264 by ITU and MPEG 4, part 10 by the ISO. The common name is Advanced Video Coding (AVC). With reference to Figure 11-17, we see that more frame memory to store past frames has been added to the basic hybrid coder of Figure 11-9. We also see the addition of a loop filter that serves to smooth out blocking artifacts. Further, before the intra transform, there is intra, or directional spatial, prediction (explained below), hence the need to switch between intra and inter prediction modes, as seen in Figure 11-17.



Figure 11-17. System diagram of the H.264/AVC coder.

Footnotes
3. The reader should note that the 525 line system only has 486 visible lines, i.e., it is D1, which is 720 × 486. A similar statement is true for the 625 line system. The remaining lines are hidden!

There are many new ideas in H.264/AVC that allow it to perform at almost twice the efficiency of the MPEG 2 standard, also known as H.262. There is a new variable blocksize motion estimation, with blocksizes ranging from 16 × 16 down to 4 × 4, and motion vector accuracy raised to one-quarter pixel from the half-pixel accuracy of MPEG 2. The permitted blocksize choices are shown in Figure 11-18. The 16 × 16 macroblock can be split in three ways to get 16 × 8, 8 × 16, or 8 × 8, as shown. If the accuracy of 8 × 8 blocks is not enough, one more round of such splitting finally results in the sub-macroblocks 8 × 4, 4 × 8, or 4 × 4. Note that in addition to what we would get by quadtree splitting (cf. Chapter 10), we get the possible rectangular blocks, which can be thought of as a level inserted between two quadtree square block levels.


Figure 11-18. Allowed motion vector blocksizes in H.264/AVC.
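To make the two-level splitting concrete, the following sketch counts how many motion-compensated blocks one macroblock contains for a given choice of partitions. The names are hypothetical, blocksizes are written as (width, height), and the code assumes, as in the standard, that each 8 × 8 quadrant may pick its own sub-macroblock split independently.

```python
# Blocksizes are (width, height) in pixels.
MACROBLOCK_PARTITIONS = [(16, 16), (16, 8), (8, 16), (8, 8)]
SUB_MACROBLOCK_PARTITIONS = [(8, 8), (8, 4), (4, 8), (4, 4)]


def motion_block_count(mb_partition, sub_partitions=None):
    """Count the motion-compensated blocks in one 16 x 16 macroblock.

    mb_partition is the first-level split. When it is (8, 8), sub_partitions
    lists the split chosen for each of the four 8 x 8 quadrants; it defaults
    to leaving all four quadrants unsplit.
    """
    if mb_partition not in MACROBLOCK_PARTITIONS:
        raise ValueError("unsupported macroblock partition: %r" % (mb_partition,))
    width, height = mb_partition
    if mb_partition != (8, 8):
        return (16 // width) * (16 // height)
    if sub_partitions is None:
        sub_partitions = [(8, 8)] * 4
    if len(sub_partitions) != 4:
        raise ValueError("need one sub-partition per 8 x 8 quadrant")
    total = 0
    for sub in sub_partitions:
        if sub not in SUB_MACROBLOCK_PARTITIONS:
            raise ValueError("unsupported sub-macroblock partition: %r" % (sub,))
        sub_width, sub_height = sub
        total += (8 // sub_width) * (8 // sub_height)
    return total


print(motion_block_count((16, 8)))                # two 16 x 8 blocks
print(motion_block_count((8, 8), [(4, 4)] * 4))   # sixteen 4 x 4 blocks
```

Each counted block carries its own motion vector in a P macroblock, so the count also indicates how the motion vector bit cost grows as the partition gets finer.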

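To match this smallest blocksize, a 4 × 4 integer-based transform is introduced that is DCT-like and separable, using the 1-D four-point transform

\[
H = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}.
\]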
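The separable 4 × 4 integer transform can be tried out numerically. The sketch below uses the H.264/AVC core transform matrix and applies it to rows and columns as Y = H X Hᵀ using integer arithmetic only. The helper names are hypothetical, and the scaling that the standard folds into the quantizer is deliberately left out.

```python
# 1-D four-point integer transform of H.264/AVC (core transform matrix).
H = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def matmul(a, b):
    """Plain integer matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_transform_4x4(block):
    """Separable 2-D transform Y = H X H^T; quantizer scaling is omitted."""
    return matmul(matmul(H, block), transpose(H))


flat = [[3] * 4 for _ in range(4)]
print(forward_transform_4x4(flat))   # only the DC coefficient is nonzero
print(matmul(H, transpose(H)))       # diagonal: rows are orthogonal, not unit norm
```

Because the rows are orthogonal but have unequal norms, the transform is only DCT-like; the norm correction is what an encoder would absorb into quantization.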


The H.264/AVC coder is based on slices, with I, B, and P slices, as well as two new switching slices SP and SI. The slices are in turn made up of 16 × 16 macroblocks. The P slice can have I or P macroblocks. The B slice can have I, B, or P macroblocks. There is no mention of group of pictures, but there are I pictures, needed at the start to initialize this hybrid coder. There is nothing to prohibit a slice from being the size of a whole frame, so that there can effectively be P and B pictures as well.

In H.264/AVC, an I slice is defined as one whose macroblocks are all intracoded. A P slice has macroblocks that can be predictively coded with up to one motion vector per block. A B slice has macroblocks predictively (interpolatively) coded using up to two motion vectors per block. Additionally, there are new switching slices SP and SI that permit efficient jumping from place to place within or across bitstreams (cf. Section 12.3, Chapter 12).

Within an I slice, there is intrapicture prediction, done blockwise based on previously coded blocks. The intra prediction block error is then subject to the 4 × 4 integer-based transform and then quantization. For 4 × 4 blocks, the intra prediction is adaptive and direction based, as indicated in Figure 11-19. A prespecified fixed blending of the available boundary values is tried in each of the eight prediction directions. Also available is a so-called DC option that predicts the whole block as a constant. For 16 × 16 blocks, only four intra prediction modes are possible: vertical, horizontal, DC, and plane, the last one coding the values of a best-fit plane for the macroblock. The motion compensation can use multiple references, so, for example, a block in a P slice can be predicted by one to four reference blocks in earlier frames. The H.264/AVC standard specifies the amount of reference frame storage that must be available at the decoder to store these past pictures, and five past frames is common.
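The directional intra prediction for 4 × 4 blocks can be sketched with three of the nine modes: vertical, horizontal, and DC. This is a simplified illustration, not an encoder implementation. The function names are hypothetical, the diagonal modes are omitted, and picking the mode by the smallest sum of absolute differences (SAD) is an assumed encoder strategy that the text does not prescribe.

```python
def predict_vertical(top, left):
    """Copy the four pixels above the block down every column."""
    return [list(top) for _ in range(4)]


def predict_horizontal(top, left):
    """Copy the four pixels to the left of the block across every row."""
    return [[left[row]] * 4 for row in range(4)]


def predict_dc(top, left):
    """Predict the whole block as the rounded mean of the eight neighbors."""
    dc = (sum(top) + sum(left) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]


PREDICTORS = {
    "vertical": predict_vertical,
    "horizontal": predict_horizontal,
    "dc": predict_dc,
}


def sad(block, prediction):
    """Sum of absolute differences between a block and its prediction."""
    return sum(abs(block[r][c] - prediction[r][c])
               for r in range(4) for c in range(4))


def choose_intra_mode(block, top, left):
    """Return (mode name, residual block) for the lowest-SAD predictor."""
    best_name, best_pred, best_cost = None, None, None
    for name, predictor in PREDICTORS.items():
        prediction = predictor(top, left)
        cost = sad(block, prediction)
        if best_cost is None or cost < best_cost:
            best_name, best_pred, best_cost = name, prediction, cost
    residual = [[block[r][c] - best_pred[r][c] for c in range(4)]
                for r in range(4)]
    return best_name, residual


top = [10, 20, 30, 40]
left = [12, 12, 12, 12]
block = [list(top) for _ in range(4)]   # vertical stripes continue from above
print(choose_intra_mode(block, top, left)[0])
```

The residual returned here is the intra prediction block error that would then go through the 4 × 4 integer transform and quantization.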



Figure 11-19. Illustration of directional prediction modes of H.264/AVC in the case of 4 × 4 blocks.

Figure 11-20 illustrates the comparative PSNR versus bitrate performance of the verification models of H.264/AVC on the 15-fps CIF test clip Tempete (HLP, High-Latency Profile; ASP, Advanced Simple Profile; MP, Main Profile). The figure [40] shows considerable improvement over MPEG 2, of about a factor of 2 in increased compression at fixed PSNR. This improvement in compression efficiency is largely due to the greater exploitation of motion information made possible by the revolutionary increases in affordable computational power of the past 10 years. More information on the new H.264/AVC standard is available in the review article by Wiegand et al. [43], which introduces a special issue on this topic [38].


Figure 11-20. PSNR vs. bitrate for 15-fps CIF test clip Tempete. Reprinted with permission from Sullivan and Wiegand [40].

11.3.8 Video Coder Mode Control
A video coder like H.264/AVC has many coder modes to be determined. For each macroblock, there is the decision of inter or intra, and quantizer step-size, as well as motion vector blocksize. Each choice (mode) translates into a different number of coded bits for the macroblock. We can write the relevant Lagrangian form as

\[
J_k(m_k, Q) = D_k(m_k, Q) + \lambda_{mode}\, R_k(m_k, Q), \qquad \text{(11.3-2)}
\]

where Dk and Rk are the corresponding mean-square distortion and rate for block bk when coding in mode mk, and Q is the quantizer step-size parameter. Here Rk must include the bits for the transformed block plus bits for the motion vector(s) and mode decision, the latter usually being negligible for 16 × 16 macroblocks. For the moment assume that the motion vectors have been determined prior to this optimization. Then we sum over all the blocks in the frame to get the total distortion and rate for that frame,

\[
D = \sum_k D_k(m_k, Q), \qquad R = \sum_k R_k(m_k, Q),
\]

where D and R are the frame's total distortion and rate. Normally the blocks in a frame are scanned in the 2-D raster manner, i.e., left to right and then top to bottom. The joint optimization of all the modes and Q values for these blocks is a daunting problem. In practical coders, what optimization there is, is often done macroblock by macroblock, wherein the best choice of mode mk and Q is made for block bk conditional on the past of the frame in the NSHP sense, resulting in at most a stepwise optimal result. We can achieve this stepwise or greedy optimal result by evaluating Jk in (11.3-2) for all the modes and Q and then choosing the lowest value. This point will then be on the optimized D–R curve for some rate. To generate the entire curve, we sweep through the parameter λmode. Now, in the test model for H.263, an experimentally observed relation is used that has been theoretically motivated in the high-rate Gaussian case [39],

\[
\lambda_{mode} = c\, Q^2,
\]

with the value c = 0.85, approximately on the experimental D–R curve. Therefore, the blockwise optimization in the test model for H.263 then becomes: for each value of quantization parameter Q, choose the mode mk such that

\[
J_k(m_k, Q) = D_k(m_k, Q) + c\, Q^2 R_k(m_k, Q)
\]

is minimized.

A somewhat different approximation for λmode = f(Q) is used in the H.264/AVC test model. In both cases, this then results in a sequence of macroblocks that is optimized in the so-called constant-slope sense. To actually get a CBR coded sequence, some kind of further rate control has to be applied to decide what value of Q to use for each macroblock in each frame. An easy VBR case results from the choice of constant Q, but then the total bitrate is unconstrained. Also, in practice, different Q values are used for I, P, and B slices or frames, with step-size increasing in a fixed manner. This choice also is usually fixed and not optimized over. Extension of this method to include the needed optimization of the motion vector bitrate in a variable blocksize motion field is contained in Sullivan and Wiegand [39].
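The greedy, macroblock-by-macroblock mode decision described in this section can be sketched as follows. It is a minimal illustration of minimizing D + λR with λ = cQ² and c = 0.85, as in the H.263 test model relation. The function name, the mode labels, and the distortion and rate figures are hypothetical; a real encoder would measure them by trial coding the block in each mode.

```python
def choose_mode(candidates, q, c=0.85):
    """Pick the mode with the smallest Lagrangian cost D + c * q**2 * R.

    candidates maps a mode name to a (distortion, rate_in_bits) pair that
    was measured for this block at quantizer parameter q.
    Returns (best mode name, its Lagrangian cost).
    """
    lagrange_multiplier = c * q * q
    costs = {mode: distortion + lagrange_multiplier * rate
             for mode, (distortion, rate) in candidates.items()}
    best_mode = min(costs, key=costs.get)
    return best_mode, costs[best_mode]


# Made-up (distortion, rate) measurements for one macroblock.
measurements = {
    "intra": (100.0, 200),
    "inter16x16": (400.0, 40),
    "skip": (2500.0, 1),
}
print(choose_mode(measurements, q=2))    # small multiplier: bits are cheap
print(choose_mode(measurements, q=10))   # large multiplier: bits are expensive
```

Reusing the same measurements at two values of q is a simplification that isolates the effect of the multiplier; in practice the distortion and rate of every mode change with Q as well.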

11.3.9 Network Adaptation

In H.264/AVC, there is a separation of the video coding layer (VCL), which we have been discussing, from the actual transport bitstream. As with MPEG 4 video, H.264/AVC is intended to be carried on many different networks with different packet or cell sizes and qualities of service. So the standard simply provides a network abstraction layer (NAL) specification of how to adapt the output of the VCL to any of several transport streams, including transport on MPEG 2, ATM, and IP networks. (In Chapter 12, we discuss further some basic issues in network video.)



Printed with permission from Newnes Press, a division of Elsevier. Copyright 2006. "Multidimensional Signal, Image, and Video Processing and Coding" By John Woods. For more information about this title and other similar books, please visit www.elsevierdirect.com.