Implementing H.264 video compression algorithms on a software configurable processor - Embedded.com

Implementing H.264 video compression algorithms on a software configurable processor

The significant coding efficiency of H.264 video compression technologyopens a wide range of new applications for streaming video over avariety ofmedia.

This international standard has risen in importance with its recentadoptionby 3G, DVD Forum, and DVB, joining MPEG-2 as one of the world’s mostcommon digital video formats. The H.264 standard provides advancedalgorithms for motion estimation, inter-prediction, spatial intraprediction and transforms. The H.264 standard supports motionestimation on blocks from 16×16 to 4×4 pixels.

Residual data transforms are executed on 4×4 blocks with modifiedinteger discrete cosine transform (DCT) which avoids rounding errors.In common with other standards, such as MPEG, the H.264 codecimplementation is not explicitly defined.

The standard defines the syntax of the encoded bit stream and themethod for decoding the bitstream. Developers of H.264 require adevelopment methodology that enables experimentation and refinements oftheir algorithms and the ability to deliver a real-time encoding of thevideo.

Keeping pace with Moore’s law, today’s CPUs continue to be massivelypowerful, but their architectures are not well suited for videoprocessing. One approach has been to augment the CPU with hardwareacceleration units called intrinsic instructions (such as Intel’sMMX/SSE2 and AMD’s 3DNow extensions). Acceleration hardware can bedesigned to support theblock-based and pixel-level processing tasks that are not efficientlyhandled by the CPU architecture.

However, many of the core encoding tasks, such as motion estimationetc., which consume many CPU cycles, are also very dataflow intensive,thus requiring deep register pipelines with fast memory accesses.Traditionally this has been best met with a purely hardware approach.This paper willdescribe the design methodology and the results of using a single300MHz software configurable processor to achieve H.264 encoding ofStandard Definition (SD) video at 30 fps.

An essential element of the recipe for success for real-time videoencoding applications is delivering the best image quality that isfeasible for a particular screen resolution, given real-world operatingconstraints. For example, an uncompressed D1 video stream has an imagesize of 720 x480 pixels and requires 1.5bytes for color per pixel. Such a stream has518kilobytes per frame, and at 30 frames per second consumes animpressive — and equally impractical — 15.5 megabytes per secondstorage and bandwidth.

For network-based applications, bandwidth is only one of thelimiting factors for video. The larger a video stream, the moreopportunities there are for packets to be delayed and therefore becomeunusable. Additionally, available network pipes need to peacefullycoexist with other real-timestreams such as voice data.

Developers can reduce the bandwidth of a stream by reducing screenresolution, but in many cases, the loss of image quality isunacceptable in the marketplace.

Lossy compression algorithms provide the most cost-effective meansfor lowering bandwidth while retaining quality. Successive versions ofthe MPEG video standard have continued to lower bit rate whileretaining more image quality through increasingly complex algorithms.

H.264, also known as MPEG4-part 10, was designed to facilitate videotransport over IP networks, and provides a substantial compressionratio over MPEG-2 while delivering equivalent or better image quality.In order to get this additional compression requires additional tools,new algorithms and more computations.

From a technical standpoint, the primary difference between H.264and its predecessors is the use of multiple reference frames, widersearch ranges, and smaller macro blocks for motion estimation whichultimately translates to computational intensiveness.

Figure1: H.264 encoder architecture

Figure 1 above shows the architecture overview of an H.264 encoder.Theefficiency of the encoder can be found in the implementation of thefollowing functions:

(1) Intra-prediction utilizing Forward and
(2) Inverse Discrete Cosine Transforms (DCT),
as well as Forward and Inverse Quantization
(3) De-blocking filtering
(4) Motion Estimation employing interframe
comparisons to _ PEL and _ PEL
accuracy.

Each of these functions requires extensive processing and must beperformed in real-time to be useful. The remainder of this paper willfocus on how software-configurable processors provide a developmentmethodology for creating cost-effective H.264 implementations thatenable experimentationand refinement of algorithms.

The need for hardware acceleration
Clearly, a task as compute intensive as H.264 requires hardwarecceleration.OEMs provide significant value in how they implement functions such asmotion estimation, and a programmable platform is required to enabledevelopers to continue to refine their algorithms and sharpen theircompetitive edge.

A single traditional programmable processor has not provided thenecessary performance or compute resources to implement real-timeencoding at the highest level. And for base implementations, thedevelopment requires hand-optimized assembly language, a timeconsuming, architecturespecific implementation.

ASIC implementations, at this stage of H.264 adoption, areinflexible. Implementing proprietary algorithms in programmable logicappears to provide the flexibility to address the evolving algorithm.These types of implementations typically have a programmable processorto handle application-level tasks and manage the flow of data to theprogrammable logic devices.

The challenges of these implementations are many. First, there are avariety of ways to interconnect the processor and logic devices, suchas an FPGA. The choice of the interface mechanics affects thethroughput and thus the efficiency of the multi-chip solution.

The interfaces can be defined to operate synchronous orasynchronous. Synchronous operation burdens the hardware designer tomeet timing requirements, while asynchronous operation introducessignificant latency forcing stalls in the processor to complete thedata interchange.

The asynchronous interface can often expose limitation in thehardware handshake interfaces which may further reduce the effectivedata bandwidth. Using discrete devices as illustrated in Figure 2,below, the CPU must prepare data and hand it off to the FPGAcoprocessor unit.

Figure2: Stretch software-configurable processor

The CPU must then wait for the FPGA unit to make the computation.Eventhough the units may appear to operate in parallel, the interdependencyweakens the effect of pipelining of operations. Second, the CPU plusFPGA architecture has two development environments, furthercomplicatingthe design. Often the FPGA architecture is based on a softwareimplementation of a working solution failing to meet the performancerequirements.

The hardware team accepts the software algorithm then recodes thecomputational blocks in a hardware description language (HDL) and usesa separate verification methodology. Finally, in order to change thesoftwarealgorithm or the context of the application, the FPGA coprocessor mustalso be changed. For an evolving algorithm, this issue cansignificantly impact the flexibility to evolve the algorithm and thedelivery of a timely solution to the market place.

Software-configurable processors address these issues by integratingprogrammable logic within the processor datapath (Figure 2, above),enabling the efficient handoff of data, a single development languageand efficient changes to the software algorithm under control of thealgorithm developer (Figure 3, below). With the Stretchsoftware-configurablearchitecture, entire functions written in C/C++ are compiled by theStretch C compiler to form extension instructions that reside in theprogrammable fabric.

Figure3: Development flow for a software-configurable processor

Using a single design environment, accelerating the computeintensive functions of an algorithm is accomplished at softwarecompile-time. There is no need to recode the function in assemblylanguage nor hand-off the functioning C/C++ code to a hardware engineerto redesign, rewriteand add a complex processor interface to the logic.

The Stretch S5000 family of software configurable processors isbased on the S5 compute engine that combines a RISC processor withprogrammable logic resources, or Instruction Set Extension Fabric(ISEF), within the processor’s datapath.

These instructions are accessed through the processor’s instructionpipeline just like any other instruction, making custom instructionsequivalent to the processor’s native instruction set. A 128-bit wideregister set facilitates efficient passing of large amounts of data andcontext to the ISEF so that multiple data can be processed in parallel.

The integration of extension instructions provides a substantialincrease in the performance of complex algorithms. The S5 computeengine, operating at 300MHz, without extension instructions operateslike a normal 300MHz RISC processor; however, with the addition ofextensions the processor outperforms multi-GHz processors.

As an example, the Stretch software-configurable processorout-of-the-boxperforms the EEMBC Telemark benchmarks with a score of 4.6. With theaddition of extension instructions, taken from the original C/C++, thesoftware-configurable processor score climbs to 877, outperformingassembly language optimized multi-GHz processors.

Implementing H.264
The algorithms used in H.264 are prime candidates for flexible hardwareacceleration because of the amount of computation required. Additionalacceleration can be achieved by taking advantage of the inherentparallelism in these algorithms.

Consider the processing of 4×4 blocks of luma pixels through a 2-Ddiscrete cosine transform (DCT) and quantization step. The H.264 DCTand quantization matrices are shown in Figure 4, below.

Figure 4: 4×4 DCT and quantization

The computations for the DCT portion of the matrix computation canbe reduced to 64 add and subtract operations by taking advantage ofsymmetry and common sub-expressions.

All 64 operations can be combined into a single ISEF instruction.Quantization (Q) follows the DCT. Costly division is avoided byimplementing quantization as a simple multiply and shift operation.Total processing required for luma encode and decode using DCT + Q +IDCT + IQ is about 594 additions, 16 multiplies, and 288 muxes(decisions).

The wide 128-bit bus to the ISEF can load a row of eight 16-bitprediction data in a single cycle. These instructions are performed asa single ISEF instruction where the optimizing compiler has exploitedthe inherent parallelism in the function.

The total number of cycles to perform these operations on a 4x4block using a standard processor is over 1000 cycles. The sameprocessing can be done in a software configurable processor in 105cycles, offering more than a 10X acceleration. In application terms,this means that a 720 x 480 @ 30 fps video stream will only require14.2% of the software-configurable processor.Note that additional acceleration is possible by operating on largersub-blocksizes to reduce overall cycle count through increased parallelism.

As an example, operating on two 4×4 blocks in parallel reducesexecution time in half, dropping to 7.1% utilization. Acceleratingde-blocking requires developers to minimize conditional code. Insteadof consuming cycles to determine which values to calculate — this wouldentail selecting which data to send to the ISEF and then creating aninstruction to perform the appropriate operation — it is actually moreefficient to create a singleISEF instruction that calculates all the results in hardware and thenselect the appropriate result.

Re-ordering the 128-bit result from the IDCT stage simplifiespacking of sixteen 8-bit Edge Data Pixels into a single 128-bit wide tofeed the de-blocking custom instruction. Pre-calculating macro blockparameters — including bS, a, B, tco and chromaEdgeFlag — is anothereffective optimization option supported by the state registers insidethe ISEF and the instruction.

The filter’s inner loop loads the 128-bit register and executes thedeblockFilter() custom instruction, computing two edges perinstruction. Because the same custom instruction can be used for bothhorizontaland vertical filtering, there is zero overhead.

This inner loop takes three cycles and is executed twice (horizontaland vertical), with about 20 cycles for loop overhead. With 64 edges ineach MB of data, there are approximately 416 (64 / 4 * 26) cyclesrequired per MB. For a 720 x 480 @ 30 fps stream, this results in 16.8Mcycles/sec, orapproximately 5.2% processor utilization.

Motion estimation calculation
Motion estimation is known to consume most of the CPU budget (50-60%),but it is also very dataflow intensive. The key computationrequirements are the repeated Sum of Absolute Differences (SAD)calculations used in determining the best motion vector match.

The data calculations and comparisons are very repetitive with manyof the intermediate results needing to be reused. These large data setsdo notfit well within the limited register space of the traditional CPU andDSP architectures. Also these CPU and DSP implementations struggle tofeed the fixed arithmetic and multiplier units from the data cache.

With the Stretch software- configurable processor, the ISEF iscapable of performing computations in parallel and holding theintermediate results in the state registers while executing fullypipelined SAD instructions.

Motion estimation consists of potentially 41 SAD and 41 motionvectors (MV) calculations per macro block. A full motion search on asingle macro block requires 262K operations for D1 @ 30 fps, for atotal of 10.6 GOPS.

By using heuristic algorithms, which is the “secret sauce” for manyimplementations, the developer can minimize the computations to meettarget image quality and/or bit rate requirements.

Custom algorithms optimized to perform estimates across differentsearch areas, numbers of frames, or the number of motion vectorsneeding to be calculated can easily be converted to ISEF instructions.A single custom ISEF instruction can replace multiple computations, aswell as pipeline many of the computations using intermediate results.

For example, a single custom instruction can perform 64 SADcalculations.However, the 64 partial sums could be maintained in the ISEF to reducethenumber of data transfers and reuse the results in the next instruction.By pipelining ISEF instructions the compute capacity increases.

Motion estimation also involves _- and _-pixel predictions thatrequire 9 SADs computed in 9 directions around the pixel. Although astandard formula, it is computationally intense as well.

By using custom instructions, a 16×16 SAD calculation with quarterpixel precision takes 133 cycles, and a 4×4 SAD calculation withquarter pixel precision takes 50 cycles. H.264 contains a rich set oftools for efficient video compression, making it adaptable to multipleapplications and markets.However, it is computationally intense and requires an architecturethat enables developers to utilize hardware acceleration.

Software-configurable processors abstract hardware to the degreethat developers can write application and algorithmic code in C so thata single development environment results in hardware and softwareoptimizedtogether.

Custom instructions significantly reduce cycle counts per function,enablingdevelopers to implement H.264 in real time at a cost-point thoughtpreviously impossible. An advantage the software-configurable logic inthe processor datapath is the continual improvement in performance ofthedevices at a relatively low cost of processing.

As a result, the hardware costs associated with a compact platformcapable of billions of operations per second is economicallydeployable. The dynamic nature of a software-configurable architectureis that optimized custom instructions can be loaded into the ISEFduring run-time. In this way, as the need for different instructionschanges, so does the configuration.

Overhead for configuring the ISEF can be effectively reduced to zeroby 'ping-ponging' between configurations; that is, the developerconfigures one of the ISEF banks before the instructions are needed,then switches over to a new configuration. In this way, the newinstructions are available immediately.

Thus, developers are able to create specific implementations thatare not burdened by the parsing overhead imposed when algorithms areimplemented in a generalized fashion. This also allows developers tore-use ISEF resources through programmability, further extending thepotential capabilities of an application.

Scalability
Because the custom instructions are created by the compiler based onC-code written by developers, adapting and updating an algorithm is astraightforward process that avoids the extensive rewriting andperformanceprofiling required when code is hand-written in assembly. As aconsequence, applications developed on software-configurablearchitectures scale easily.

For example, to convert a hand-coded assembly implementation ofH.264 to a different screen resolution requires re-engineering of thealgorithm. With a software configurable architecture, the compiler doesall of the heavy lifting.

This flexibility is critical for rapidly evolving algorithms such asH.264. If changing an algorithm requires too much re-engineering, it isa barrier to algorithm development. A flexible architecture freesdevelopers to experiment and discover, as well as cost-effectivelyimplement, innovation.

Frank Lee is a Member of Technical Staff at Stretch, Inc.

For a PDF version of this article go to “ImplementingH.264 algorithms,” at.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.