How to map the H.264/AVC video standard onto an FPGA fabric

Despite its promise of improved coding efficiency over existing standards, H.264/AVC still presents engineering challenges.

It incorporates the most significant changes and algorithmic discontinuities in the evolution of video coding standards since the introduction of H.261. As a result, a real-time H.264/AVC encoding solution often requires multiple FPGAs and programmable DSPs.

To illustrate the computational complexity required, let's explore the typical runtime-cycle requirements of the H.264/AVC encoder based on the software model provided by the Joint Video Team (JVT). Profiling with Intel VTune software running on a 1GHz Pentium III CPU with 512Mbytes of memory indicates that an H.264/AVC SD main-profile encoding solution would require about 1,600 billion operations per second.

Figure 1. Data locality is primarily dictated by the physical interfaces between the data unit and processing engine.

However, computational complexity alone does not determine if a functional module should be mapped to hardware or remain in software. To evaluate the viability of software and hardware partitioning, we need to look at a number of architectural issues that influence the overall design decision, including:

Data locality. In a synchronous design, it is very important to be able to access memory in a particular order and granularity while minimizing the clock cycles lost to latency, bus contention, alignment, DMA transfer rate and the types of memory used. The data locality issue (Figure 1, above) is primarily dictated by the physical interfaces between the data unit and the arithmetic unit (or the processing engine).

Data parallelism. Most signal processing algorithms operate on data that is highly parallelizable. Single instruction multiple data (SIMD) and vector processors are particularly efficient for data that can be parallelized or made into a vector format (or long data width).

Signal processing algorithm parallelism. In a typical programmable DSP or a general-purpose processor, this is often referred to as instruction-level parallelism (ILP). A VLIW processor is an example of a machine that exploits ILP by grouping multiple instructions (ADD, MULT and BRA) to be executed in a single cycle. A heavily pipelined execution unit in the processor is another example.

Computational complexity. Programmable DSPs are bounded in computational complexity, as measured by the clock rate of the processor. Signal processing algorithms implemented in the FPGA fabric are typically computationally intensive.

By mapping these modules onto the FPGA fabric, the host processor or the programmable DSP gains extra cycles for other algorithms.

Furthermore, FPGAs can have multiple clock domains in the fabric, so selected hardware blocks can run at separate clock speeds based on their computational requirements.

Theoretic optimality in quality. A theoretically optimal solution based on the rate-distortion curve can be achieved if and only if the complexity is unbounded. In a programmable DSP or general-purpose processor, the computational complexity is always bounded by the clock cycles available.

FPGAs, on the other hand, offer much more flexibility by exploiting data and algorithm parallelism by means of multiple instantiations of the hardware engines, or increased use of block RAM and register banks in the fabric.

Figure 2. The H.264/AVC standard can predict values of the content of a picture to be encoded by exploiting pixel redundancy.

Improved prediction
Some of the main features of the H.264/AVC video coding standard design (Figure 2, above) that enable its enhanced coding efficiency are:

Quarter-pixel-accurate motion compensation. Prior standards use half-pixel motion vector accuracy. The new design improves on this by providing quarter-pixel motion vector accuracy. Prediction values at quarter-pixel positions are generated by averaging samples at the full- and half-pixel positions.

These sub-sample interpolation operations can be efficiently implemented in hardware inside the FPGA fabric.
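The averaging step described above can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the six-tap weights (1, -5, 20, 20, -5, 1) used for the half-pixel sample are the ones commonly cited for H.264 luma and are not stated in this article, and the function names and sample values are hypothetical.

```python
def clip255(v):
    """Clamp a value to the 8-bit sample range."""
    return max(0, min(255, v))

def half_pel(samples):
    """Half-pixel sample from six neighboring full-pixel samples.

    Assumed six-tap weights (1, -5, 20, 20, -5, 1) with rounding and a
    divide by 32, written as a shift as hardware would do it.
    """
    taps = (1, -5, 20, 20, -5, 1)
    acc = sum(t * s for t, s in zip(taps, samples))
    return clip255((acc + 16) >> 5)

def quarter_pel(a, b):
    """Quarter-pixel sample: rounded average of two neighboring samples."""
    return (a + b + 1) >> 1

# Hypothetical luma row containing a sharp edge.
row = [10, 10, 10, 200, 200, 200]
h = half_pel(row)            # half-pixel position between row[2] and row[3]
q = quarter_pel(row[2], h)   # quarter-pixel position between row[2] and h
print(h, q)
```

Every operation here is an add, a shift or a multiply by a small constant, which is why this kind of interpolation fits naturally into fabric logic.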

Variable block-sized motion compensation with small block size. The standard provides more flexibility for the tiling structure in a macroblock size of 16 x 16 pixels. It allows the use of 16 x 16, 16 x 8, 8 x 16, 8 x 8, 8 x 4, 4 x 8 and 4 x 4 sub-macroblock sizes.

Because the number of possible tiling geometries within a given 16 x 16 macroblock grows quickly, finding a rate-distortion-optimal tiling solution is computationally intensive.

This additional feature places a burden on the computational engines used in motion estimation, refinement and the mode decision process.
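To get a feel for how quickly the tiling combinations grow, the sketch below enumerates them. It assumes the usual two-level structure: the macroblock is split as 16 x 16, 16 x 8, 8 x 16 or four 8 x 8 blocks, and each 8 x 8 block may be split independently into 8 x 8, 8 x 4, 4 x 8 or 4 x 4. The names are hypothetical and the count ignores the motion search that each candidate partition would also need.

```python
from itertools import product

# Assumed macroblock-level and 8 x 8 sub-block partition choices.
MB_MODES = ["16x16", "16x8", "8x16", "8x8"]
SUB_MODES = ["8x8", "8x4", "4x8", "4x4"]

def enumerate_tilings():
    """List every distinct tiling of one 16 x 16 macroblock."""
    tilings = []
    for mode in MB_MODES:
        if mode != "8x8":
            tilings.append((mode,))
        else:
            # Each of the four 8 x 8 quadrants chooses its own sub-mode.
            for subs in product(SUB_MODES, repeat=4):
                tilings.append((mode,) + subs)
    return tilings

print(len(enumerate_tilings()))
```

An exhaustive rate-distortion search would have to cost every one of these tilings per macroblock, which is the kind of workload that benefits from parallel engines in the fabric.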

In-the-loop adaptive deblocking filtering. The deblocking filter has been successfully applied in H.263+ and MPEG-4 Part 2 implementations as a post-processing filter.

In H.264/AVC, the deblocking filter is moved inside the motion-compensated loop to filter block edges resulting from the prediction and residual difference coding stages of the decoding process.

The filtering is applied on both 4 x 4 block and 16 x 16 macroblock boundaries, in which two pixels on either side of the boundary may be updated using a three-tap filter. The filter coefficients or "strength" are governed by a content-adaptive nonlinear filtering scheme.
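The content-adaptive behavior can be illustrated with a simplified sketch. It is not the normative filter: it updates only the sample nearest the edge on each side, and the thresholds `alpha`, `beta` and the clipping bound `tc` are passed in as plain hypothetical parameters rather than derived the way a real codec would derive them.

```python
def clip3(lo, hi, v):
    """Clamp v to the range [lo, hi]."""
    return max(lo, min(hi, v))

def filter_edge(p1, p0, q0, q1, alpha, beta, tc):
    """Smooth one block edge lying between p0 and q0 (layout: p1 p0 | q0 q1).

    Returns the new (p0, q0). A large step across the edge is treated as
    real picture content and left untouched; a small step is treated as a
    blocking artifact and softened by a clipped correction.
    """
    is_artifact = (abs(p0 - q0) < alpha and
                   abs(p1 - p0) < beta and
                   abs(q1 - q0) < beta)
    if not is_artifact:
        return p0, q0
    delta = clip3(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    return clip3(0, 255, p0 + delta), clip3(0, 255, q0 - delta)

print(filter_edge(100, 100, 108, 108, alpha=20, beta=6, tc=4))  # small step
print(filter_edge(50, 50, 200, 200, alpha=20, beta=6, tc=4))    # real edge
```

The nonlinearity comes from the threshold tests and the clipping, while the arithmetic itself stays within shifts and adds.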

Directional spatial prediction for intracoding. In cases where motion estimation cannot be exploited, intradirectional spatial prediction is used to eliminate spatial redundancies.

This technique attempts to predict the current block by extrapolating the neighboring pixels from adjacent blocks in a defined set of directions.

The difference between the predicted block and the actual block is then coded. This approach is particularly useful in flat backgrounds where spatial redundancies exist.
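As a rough illustration of predicting a block from its neighbors and coding only the difference, the sketch below tries three of the possible directions (vertical, horizontal and DC) on a 4 x 4 block and keeps the one with the lowest sum of absolute differences. The function names, the choice of SAD as the cost and the sample data are assumptions made for this example.

```python
def predict_vertical(top, left):
    """Copy the row of pixels above the block down every row."""
    return [list(top) for _ in range(4)]

def predict_horizontal(top, left):
    """Copy the column of pixels left of the block across every column."""
    return [[left[r]] * 4 for r in range(4)]

def predict_dc(top, left):
    """Fill the block with the rounded mean of the eight neighbors."""
    dc = (sum(top) + sum(left) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]

def sad(block, pred):
    """Sum of absolute differences between a block and its prediction."""
    return sum(abs(block[r][c] - pred[r][c])
               for r in range(4) for c in range(4))

def best_mode(block, top, left):
    """Return (mode name, residual) for the cheapest candidate direction."""
    candidates = {"vertical": predict_vertical,
                  "horizontal": predict_horizontal,
                  "dc": predict_dc}
    preds = {name: fn(top, left) for name, fn in candidates.items()}
    costs = {name: sad(block, pred) for name, pred in preds.items()}
    name = min(costs, key=costs.get)
    residual = [[block[r][c] - preds[name][r][c] for c in range(4)]
                for r in range(4)]
    return name, residual

# Hypothetical block of vertical stripes that continue from the row above.
block = [[10, 20, 30, 40] for _ in range(4)]
mode, residual = best_mode(block, top=[10, 20, 30, 40], left=[10, 10, 10, 10])
print(mode, residual)
```

Because each candidate direction is independent, a hardware implementation can evaluate them in parallel and compare the costs in a single step.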

Multiple reference picture motion compensation. The H.264/AVC standard offers the option for multiple reference frames in the interframe coding. Unless the number of the referenced pictures is one, the index at which the reference picture is located inside the multipicture buffer has to be signaled.

The multipicture buffer size determines the memory usage in the encoder and decoder. These reference frame buffers must be addressed correspondingly during the motion estimation and compensation stages in the encoder.

Weighted prediction. The JVT recognized that when encoding video scenes that involve fades, a weighted motion-compensated prediction dramatically improves the coding efficiency.
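A small numeric sketch shows why weighting helps during a fade. It assumes a fixed-point form in which the reference sample is scaled by a weight, rounded, shifted and offset; the parameter names (`weight`, `offset`, `log_wd`) and the sample values are hypothetical, and the scene is an idealized fade to half brightness.

```python
def clip255(v):
    """Clamp a value to the 8-bit sample range."""
    return max(0, min(255, v))

def weighted_pred(ref_sample, weight, offset, log_wd):
    """Scale a reference sample in fixed point (log_wd must be >= 1).

    A weight of 1 << log_wd leaves the sample unchanged.
    """
    rounding = 1 << (log_wd - 1)
    return clip255(((ref_sample * weight + rounding) >> log_wd) + offset)

# Hypothetical reference row and the same row after fading to half brightness.
reference = [200, 180, 160, 140]
current = [100, 90, 80, 70]

plain = [c - r for c, r in zip(current, reference)]
weighted = [c - weighted_pred(r, weight=16, offset=0, log_wd=5)
            for c, r in zip(current, reference)]
print(plain, weighted)
```

With the weight matched to the fade, the residual left to transform and code shrinks to zero in this idealized case, whereas the unweighted prediction leaves a large residual on every sample.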

Enhanced coding efficiency
In addition to improved prediction methods, other parts of the standard design were also enhanced, including the following:

Small block-size, hierarchical, exact-match inverse and short word-length transform. H.264/AVC, like other standards, applies transform coding to the motion-compensated prediction residual. But unlike previous standards that use an 8 x 8 discrete cosine transform (DCT), this transform is applied to 4 x 4 blocks, and is exactly invertible in a 16-bit integer format.

The small block helps reduce blocking and ringing artifacts, while the precise integer specification eliminates any mismatch issues between the encoder and decoder in the inverse transform.

Furthermore, an additional transform based on the Hadamard matrix is also used to exploit the redundancy of 16 DC coefficients of the already transformed blocks.

Compared to a DCT, all applied integer transforms have only integer numbers ranging from -2 to 2 in the transform matrix. This allows you to compute the transform and the inverse transform in 16-bit arithmetic using only low-complexity shifters and adders.
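The shift-and-add claim can be checked directly. The sketch below uses the 4 x 4 forward core matrix commonly given for H.264/AVC (an assumption here, since the article does not print the matrix) and compares a butterfly built only from adds, subtracts and one-bit shifts against a plain matrix product. The scaling that is normally folded into quantization is left out.

```python
# Assumed 4 x 4 forward core transform matrix (entries between -2 and 2).
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

def transform_1d(x):
    """Multiply a 4-vector by CF using only adds, subtracts and shifts."""
    s03, d03 = x[0] + x[3], x[0] - x[3]
    s12, d12 = x[1] + x[2], x[1] - x[2]
    return [s03 + s12, (d03 << 1) + d12, s03 - s12, d03 - (d12 << 1)]

def forward_4x4(block):
    """2D transform: butterflies along rows, then along columns."""
    rows = [transform_1d(r) for r in block]
    cols = [transform_1d([rows[r][c] for r in range(4)]) for c in range(4)]
    return [[cols[c][r] for c in range(4)] for r in range(4)]

def matmul(a, b):
    """Reference 4 x 4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Hypothetical residual block.
block = [[5, 11, 8, 10],
         [9, 8, 4, 12],
         [1, 10, 11, 4],
         [19, 6, 15, 7]]
reference = matmul(matmul(CF, block), transpose(CF))
print(forward_4x4(block) == reference)
```

Each one-dimensional pass costs eight additions and two shifts with no multipliers, so the transform maps onto simple adder logic in the fabric.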

Arithmetic and context-adaptive entropy coding. Two methods of entropy coding exist: a low-complexity technique based on context-adaptive variable-length coding (CAVLC) and the computationally more demanding context-adaptive binary arithmetic coding (CABAC).

CAVLC is the baseline entropy coding method of H.264/AVC. Its basic coding tool consists of a single VLC of structured Exp-Golomb codes, which by means of individually customized mappings is applied to all syntax elements except those related to the quantized transform coefficients.
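The structured Exp-Golomb code is simple enough to show in full. The sketch below builds the unsigned code as a bit string and adds one example of a customized mapping, the usual signed mapping in which positive values take odd code numbers and non-positive values take even ones. The function names are hypothetical, and bit strings are used for readability rather than efficiency.

```python
def ue(code_num):
    """Unsigned Exp-Golomb code word for code_num >= 0, as a bit string.

    The word is M zeros followed by the (M + 1)-bit binary form of
    code_num + 1.
    """
    bits = bin(code_num + 1)[2:]
    return "0" * (len(bits) - 1) + bits

def se_to_code_num(k):
    """Map a signed value onto an unsigned code number."""
    return 2 * k - 1 if k > 0 else -2 * k

def se(k):
    """Signed Exp-Golomb code word, as a bit string."""
    return ue(se_to_code_num(k))

def decode_ue(word):
    """Decode one unsigned Exp-Golomb code word back to its code number."""
    zeros = len(word) - len(word.lstrip("0"))
    return int(word[zeros:2 * zeros + 1], 2) - 1

for n in range(5):
    print(n, ue(n))
```

Because the count of leading zeros gives the length of the rest of the word, a decoder needs no lookup table, only a leading-zero counter and a shifter.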

For CABAC, a more sophisticated coding scheme is applied. The transform coefficients are first mapped into a 1D array based on a predefined scan pattern. After quantization, a block contains only a few significant non-zero coefficients.
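The mapping into a 1D array can be sketched as follows. The zigzag order used is the common one for 4 x 4 blocks; the article only says "predefined scan pattern", so treat the table, the function names and the sample coefficients as assumptions.

```python
# Assumed 4 x 4 zigzag order, as raster indices (row * 4 + column).
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]

def scan(block):
    """Flatten a 4 x 4 block into a 1D list in zigzag order."""
    flat = [v for row in block for v in row]
    return [flat[i] for i in ZIGZAG_4X4]

def trailing_zero_run(coeffs):
    """Number of zeros after the last non-zero coefficient."""
    run = 0
    for v in reversed(coeffs):
        if v != 0:
            break
        run += 1
    return run

# Hypothetical quantized block: energy concentrated in the top-left corner.
quantized = [[7, -2, 0, 0],
             [3, 0, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
coeffs = scan(quantized)
print(coeffs, trailing_zero_run(coeffs))
```

The scan places the few significant coefficients at the front of the array and leaves a long run of zeros at the end, which is what the entropy coder exploits.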

Wilson Chung is Senior Staff Video and Image Processing Engineer at Xilinx Inc.
