Implementing the right audio/video transcoding scheme in consumer SoC devices


Recently, the way consumers access audio/video (A/V) media content has shifted away from using separate data (PC via modem), voice (telephone), and video (DVD player) devices, towards newer integrated devices that provide all three services and support applications such as multiplayer gaming and true video-on-demand (VOD).

Furthermore, many portable devices, such as portable media players and personal digital assistants (PDAs), now support a range of compression formats, including MPEG-2, MPEG-4 Simple Profile (SP), H.264, VC-1, On2, and DivX.

Internet and cellular networks are also being used to access and view A/V media all over the home. The consumer wants to be able to move A/V content easily from device to device and location to location, cost-effectively and in real time or faster.

This requires the exchange of stored or viewed content between devices within the home, or even outside the home, so it can be viewed on alternative devices, at different times and even by different consumers.

To achieve this, there is a need for adaptive devices that can provide three primary functions:

1. Adaptive container format capability – creating the desired container format supported by the client device performing the video display/audio playback;

2. Adaptive network protocol capability – allowing compatibility in transmission/reception protocols and modes among different devices to ensure accurate/reliable delivery of A/V media content between devices; and

3. Transcoding – resolving incompatibilities in compression formats, display resolutions, memory capacity and processing power of different devices.

This article discusses these important functions and how they can be implemented with a single system-on-a-chip (SoC), such as Texas Instruments' (TI) DaVinci family of processors. The processes involved in a typical A/V media client and server application are either non-deterministic (Ethernet, USB, FLASH, A/V demux, A/V mux) or deterministic (video capture/display, audio capture/playback, video/audio decode/encode).

The non-deterministic processes are aperiodic, and the deterministic processes periodic. From an architectural standpoint, non-deterministic and deterministic processes should not mix. Ideally, a host processor should handle the non-deterministic processes, with a digital signal processor (DSP) serving as the coprocessor and performing the more computationally intensive deterministic operations.

With this architecture, overall system throughput is high since the non-deterministic processes are prevented from disrupting the DSP processes.
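The decoupling described above can be sketched as a simple single-producer/single-consumer ring buffer: the host enqueues frame descriptors at whatever rate its bursty I/O allows, and the DSP drains them at its own periodic rate. This is a minimal illustrative sketch, not TI's actual inter-processor communication API; the structure and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical handoff between the ARM host and the DSP: the host
 * pushes frame descriptors into a fixed ring; the DSP pops them each
 * period, so aperiodic host activity cannot stall codec processing. */
#define RING_SLOTS 8

typedef struct { uint32_t frame_addr; uint32_t len; } frame_desc_t;

typedef struct {
    frame_desc_t slot[RING_SLOTS];
    volatile uint32_t head;  /* written only by the ARM producer */
    volatile uint32_t tail;  /* written only by the DSP consumer */
} frame_ring_t;

/* Returns 0 on success, -1 if the ring is full (host backs off). */
int ring_push(frame_ring_t *r, frame_desc_t d) {
    uint32_t next = (r->head + 1) % RING_SLOTS;
    if (next == r->tail) return -1;      /* full */
    r->slot[r->head] = d;
    r->head = next;
    return 0;
}

/* Returns 0 on success, -1 if empty (DSP idles this period). */
int ring_pop(frame_ring_t *r, frame_desc_t *out) {
    if (r->tail == r->head) return -1;   /* empty */
    *out = r->slot[r->tail];
    r->tail = (r->tail + 1) % RING_SLOTS;
    return 0;
}
```

A real implementation would add cache-coherency and interrupt signaling between the two cores; the ring itself is only the queueing discipline.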

This architecture becomes even more appealing as the amount of processing increases, specifically for HD applications or when video encoding is needed. To support this architecture, TI has developed the DaVinci family of processors. Figure 1 below shows the DM6446 in the DaVinci family, which has an ARM926 and a C64x DSP.

Figure 1: DaVinci (DM6446) device block diagram

In this SoC, the internal DSP performs the deterministic processes, freeing up the ARM to handle the non-deterministic processes. Here the flexible DSP architecture implements the transcoder operation and provides a very good price/performance trade-off. With the DSP implementing the algorithmically complex task of supporting different compression formats, the device has the programmability needed to adapt to future compression formats and software upgrades.

Adaptive networking protocols
Although HTTP is the most popular networking protocol for the Internet, it is not well suited to transfer temporal content referenced by a time or frame index. In contrast, the Real Time Streaming Protocol (RTSP) has several states of operation and can process A/V content with time and frame indexes.

Once the A/V media connection is established, the RTSP protocol moves through several states as the client makes requests to PLAY, PAUSE, STOP, and CLOSE the session. Although RTSP is more capable for streaming A/V media content than HTTP, it is more complex, especially considering all the different modes it can support.
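The state progression above can be sketched as a small transition table. RFC 2326's canonical method names are SETUP, PLAY, PAUSE and TEARDOWN (TEARDOWN corresponding to closing the session); this toy table is illustrative only, not a full RTSP implementation.

```c
#include <string.h>

/* Minimal sketch of the RTSP session state machine (RFC 2326):
 * Init -> Ready (SETUP), Ready <-> Playing (PLAY/PAUSE),
 * any state -> Init (TEARDOWN). Invalid requests leave the
 * state unchanged. */
typedef enum { RTSP_INIT, RTSP_READY, RTSP_PLAYING } rtsp_state_t;

rtsp_state_t rtsp_next(rtsp_state_t s, const char *method) {
    if (!strcmp(method, "SETUP"))    return RTSP_READY;
    if (!strcmp(method, "PLAY"))     return (s == RTSP_READY)   ? RTSP_PLAYING : s;
    if (!strcmp(method, "PAUSE"))    return (s == RTSP_PLAYING) ? RTSP_READY   : s;
    if (!strcmp(method, "TEARDOWN")) return RTSP_INIT;
    return s;  /* unknown method: no transition */
}
```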

One RTSP mode in particular is the interleaved mode of operation, whereby the container file can be sent without having to be parsed and without requiring any RTP/RTCP sockets to be opened.
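In interleaved mode, the binary media data shares the RTSP TCP connection, with each chunk prefixed by a 4-byte header: an ASCII '$', a channel identifier, and a 16-bit big-endian payload length (RFC 2326, section 10.12). A minimal header parser might look like this:

```c
#include <stdint.h>
#include <stddef.h>

/* Parse the 4-byte RTSP interleaved-frame header:
 * '$' | channel id | 16-bit big-endian payload length.
 * Returns 0 on success, -1 if the buffer is too short or the
 * magic byte is missing. */
int parse_interleaved_hdr(const uint8_t *buf, size_t len,
                          int *channel, uint16_t *payload_len) {
    if (len < 4 || buf[0] != '$') return -1;  /* not an interleaved frame */
    *channel     = buf[1];
    *payload_len = (uint16_t)((buf[2] << 8) | buf[3]);
    return 0;
}
```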

To help solve incompatibility problems in networking protocols, the Digital Living Network Alliance (DLNA) has established a common set of guidelines.

For instance, all DLNA-compliant products must support HTTP and a certain set of extensions so there is a common protocol to transfer A/V media between clients and servers. Furthermore, the DLNA has created extensions to HTTP to allow for an RTSP type of operation without the complexity of a full RTSP stack implementation.
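One such extension is the DLNA time-seek header, which lets a client request playback from a time offset over plain HTTP, much as RTSP's range-based PLAY would. The sketch below builds such a request; the host and path are hypothetical, and a production client would of course handle the response and chunked transfer as well.

```c
#include <stdio.h>
#include <string.h>

/* Build an HTTP GET carrying DLNA's TimeSeekRange.dlna.org header
 * (normal play time, "npt") to start playback at start_sec.
 * Returns the number of characters written, as snprintf does. */
int build_dlna_get(char *out, size_t outlen,
                   const char *path, double start_sec) {
    int h = (int)(start_sec / 3600);
    int m = (int)(start_sec / 60) % 60;
    double s = start_sec - 3600 * h - 60 * m;
    return snprintf(out, outlen,
        "GET %s HTTP/1.1\r\n"
        "Host: mediaserver.local\r\n"                      /* hypothetical host */
        "TimeSeekRange.dlna.org: npt=%d:%02d:%06.3f-\r\n"  /* open-ended range  */
        "transferMode.dlna.org: Streaming\r\n\r\n",
        path, h, m, s);
}
```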

Adaptive container formats
There are many different container formats available for streaming A/V media. The most popular, such as Microsoft's Audio Video Interleaved (AVI) and Advanced Systems Format (ASF), the MPEG-2 Transport Stream (TS) and Program Stream (PS), and the MPEG-4 File Format (MP4), all provide some level of metadata support and the ability to store A/V media content.

One characteristic distinguishing container formats is the handling of metadata. The MP4 format is good at separating compressed data from metadata, allowing for very efficient, reliable transmission where the critical timing information and bitstream parameters, typically found in the headers of the elementary stream (ES), are sent separately from the compressed frames.
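That separation falls out of MP4's structure: the file is a sequence of "boxes", each a 32-bit big-endian size followed by a four-character type code, with metadata gathered under 'moov' and the compressed frames in 'mdat'. A minimal box-header reader (ignoring the 64-bit largesize variant) is enough to show the layout:

```c
#include <stdint.h>
#include <string.h>

/* Read one MP4 box header: 32-bit big-endian size, then a
 * 4-character type code such as "moov" (metadata) or "mdat"
 * (compressed media). Returns the box size in bytes. */
uint32_t read_box(const uint8_t *p, char type[5]) {
    uint32_t size = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
                  | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    memcpy(type, p + 4, 4);
    type[4] = '\0';
    return size;
}
```

A server can therefore walk the top-level boxes, transmit the compact 'moov' box first, and stream the bulky 'mdat' payload separately.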

Generally, container formats used in PC applications tend to be proprietary, such as AVI, ASF, and Adobe's Flash Video (FLV), while STBs and DVD players are more standards-based, relying on TS and PS. The MPEG-4 Internet streaming methodology and the MP4 file format are popular among portable devices.

Transcoding
Transcoding is the most computationally intensive function. For high-definition (HD) content, transcoding is beyond the reach of a typical host processor, owing to the huge computational requirements associated with decoding and encoding an HD bit stream. In addition, considerable algorithmic work is involved in the transcoding process itself; done naively, it adds computational complexity and results in an overall loss of image quality.

The computational complexity results from the encoder having to perform a full motion search in its motion estimation process, since it has no knowledge of the motion vectors used in the decoding process.

Loss of image quality primarily results from improper allocation of frame and macroblock types between the decoder and encoder; in particular, if the encoder performs an intra (I) frame encode on a frame that was decoded as a bidirectional (B) frame, that B frame will then be used as a reference, or predictive (P), frame for past and future frames of the encoder, propagating its errors into those frames.

Implementing Transcoding
During transcoding, passing information about each frame type (I, P, B), macroblock modes and motion vectors from the decoder to the encoder can be very beneficial. If there is a reduction in frame rate from 30 fps to 15 fps, for example, knowing where the B frames are in the original source can result in a substantial reduction in processing.

Because B frames are not used as reference frames, they can simply be removed from the bit stream to arrive at the lower frame rate without requiring additional processing by the encoder.
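Reduced to its essence, this frame-rate reduction is a filter over the frame-type sequence: copy I and P frames through and discard the B frames, with no re-encoding at all. A toy sketch over a string of frame-type codes:

```c
#include <stddef.h>

/* Drop all B frames from a sequence of frame-type codes
 * ('I', 'P', 'B'); since B frames are never references, the
 * surviving I/P sequence is still a decodable stream.
 * Returns the number of frames kept. */
size_t drop_b_frames(const char *in_types, size_t n, char *out_types) {
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (in_types[i] != 'B')
            out_types[kept++] = in_types[i];
    return kept;
}
```

For a typical IBBPBB... group of pictures, dropping the B frames takes 30 fps material to roughly a third of the rate, which is why knowing B-frame positions up front is so valuable.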

Knowing the motion vector of each macroblock in the HD bit stream can result in a huge processing reduction in the encode stage of a transcoder. The motion search is the most computationally intensive process of any encoder. Knowing the general vicinity of each macroblock's motion vector from the original source substantially reduces the amount of computation required.

The transcoder's motion search becomes more of a refinement around a known reference point, instead of a massive search in a very large image plane. The point here is that an HD broadcast encoder spends a tremendous amount of processing and memory to arrive at an optimal set of parameters for each frame and macroblock in the sequence, at times performing multiple passes of the encode process for each frame.
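A refinement search of this kind can be sketched as an exhaustive scan of a small window centered on the decoded motion vector, rather than over the whole search range. The cost callback stands in for a real sum-of-absolute-differences (SAD) over macroblock pixels; both it and the function names here are illustrative.

```c
#include <stdlib.h>
#include <limits.h>

/* Evaluate a (2*radius+1)^2 window around the motion vector recovered
 * from the decoded bit stream, keeping the lowest-cost candidate.
 * cost() is a placeholder for a SAD computation over the macroblock. */
typedef int (*mv_cost_fn)(int x, int y, void *ctx);

void refine_mv(int mv_x, int mv_y, int radius,
               mv_cost_fn cost, void *ctx, int *best_x, int *best_y) {
    int best = INT_MAX;
    for (int dy = -radius; dy <= radius; dy++)
        for (int dx = -radius; dx <= radius; dx++) {
            int c = cost(mv_x + dx, mv_y + dy, ctx);
            if (c < best) { best = c; *best_x = mv_x + dx; *best_y = mv_y + dy; }
        }
}

/* Toy cost with a known minimum at (3, -1), standing in for a real
 * SAD surface. */
int toy_cost(int x, int y, void *ctx) {
    (void)ctx;
    return abs(x - 3) + abs(y + 1);
}
```

With a +/-2 refinement window, each macroblock needs only 25 cost evaluations, versus thousands for a full-range HD search, which is where the bulk of the transcoder's savings comes from.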

Figure 2: Transcoder block diagram

Disregarding this information is very unwise and adds a huge amount of unnecessary transcoder processing, at times resulting in increased artefacts and overall reduced image quality. Having the decoder tightly coupled with the encoder reduces the overall complexity and results in a more optimal solution. An example of a coupled solution is shown in Figure 2 above.

The decoder provides frame type, macroblock types and modes, motion vectors, quantization levels and bit rate parameters to the encoder to ensure better decisions on the transcoded bit stream that it creates. Although the encoder has to perform a full encode on the data, many of its modules have reduced complexity owing to the a priori information provided by the decoder.
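The parameters listed above could be packaged as a per-macroblock side-information record passed from decoder to encoder. This struct is a hypothetical illustration of that interface, not a TI codec API; field names and widths are assumptions.

```c
#include <stdint.h>

/* Hypothetical side-information record a coupled decoder hands the
 * encoder for each macroblock of the source stream. */
typedef struct {
    char    frame_type;    /* 'I', 'P', or 'B' for the containing frame */
    uint8_t mb_mode;       /* intra / inter / skip decision             */
    int16_t mv_x, mv_y;    /* decoded motion vector components          */
    uint8_t quant;         /* quantizer level used by the source stream */
} mb_sideinfo_t;
```

The encoder seeds its mode decision and motion refinement from these fields instead of computing them from scratch.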

Figure 3: STB plus transcoder block diagram

This is especially true in the motion estimation process, which becomes more of a refinement than a massive search. If there is an image size change, image scaling can create the proper frame sizes prior to passing the frames off to the encoder, decreasing the memory required in the system and further reducing the overall cost of the transcoder.

Given the many different network protocols and container formats, a coprocessor is clearly needed to transcode among the many different codecs currently available, such as MPEG-2, H.264 and VC-1.

The DaVinci architecture is well suited for such a task. Figure 3 above provides an example of a DVD recorder with a transcoder receiving compressed content from the STB and providing transcoded content back, as well as being able to decode the video to be displayed by the STB hardware using picture-in-picture (PIP).

This application primarily needs just the transcode functionality; however, with other networked devices in the home, the DaVinci host processor can be used to handle the network protocols and container formats, while the DSP works as a coprocessor to provide the transcode operation. This reduces the overall system cost while providing a much-needed adaptive media solution in the home.

Tim Simerly is a Video Systems Architect at Texas Instruments Inc.
