How do you implement audio and video software for consumer products? Two experts offer some guidelines to consider based on years of practical experience.
Audio and video software for consumer media products can be quite complex, combining real-time signal processing, network protocols, complex I/O, and sophisticated user interfaces. In this article, we examine some of the key aspects of implementing audio and video (A/V) software for consumer products, particularly COder-DECoder (codec) software. The ideas presented in this article are based on years of hands-on experience in the development of audio and video software for set-top boxes, cell phones, PDAs, Internet appliances, and a variety of other products.
Figure 1: Simplified software architecture for consumer media product
Media device software components
The typical consumer media device is a complex system, composed of several software and hardware subsystems. As shown in Figure 1, the major software subsystems include the player, the compression algorithms (shown here as encoders and decoders, and often referred to as codecs), I/O modules, and the real-time operating system (RTOS), which provides task scheduling and switching services critical to multi-tasking, real-time applications. Of these components, codecs are usually the most computationally demanding and require the bulk of the software optimization effort. Thus, our discussion of software optimization focuses on codecs, although the techniques we describe can be used on other types of A/V software as well.
Codecs compress audio and video content for transfer or storage, and decompress the content for playback. Many different codec algorithms are used in consumer products today; Table 1 lists a few popular A/V codecs.
Table 1: Common compression/decompression algorithms
|Licensing/ Standardization Body||Video||Audio|
|ISO/IEC||MPEG-1/2/4||MPEG-1/2 Layers I/II/III|
|Microsoft||Windows Media Video 9||Windows Media Audio 9|
|RealNetworks||RealVideo 10||RealAudio 10|
|*H.264 was jointly developed by the ITU and ISO. It is also known as MPEG-4 part 10.|
MPEG-2 video is the most popular video compression algorithm in use today. The MPEG-2 standards, first released in 1994, are used for Digital Versatile Disk (DVD) movies, set-top boxes, and satellite TV (DSB/DVB), among other applications. Other video compression algorithms in use today include Microsoft Windows Media Video and RealNetworks RealVideo.
The MPEG format commonly used for audio compression is MPEG-1/2 layer III, which is the well-known “MP3” format. Windows Media Audio (WMA) and Dolby AC-3 (also known as Dolby Digital) are also popular in audio compression. Many portable digital music players support WMA; almost all DVDs use AC-3 audio compression.
Depending on the algorithm selected, codec software may be available from the codec creator, the chip vendor, or an independent developer. Codec software can be provided in a variety of forms, from a high-level reference implementation designed for easy reading (but not efficiency), to a highly optimized implementation targeted to a particular processor. Although some optimized codec implementations are available for popular processors, often a system developer needs to implement, optimize, and possibly customize a codec for a specific product.
Audio and video codecs are typically the most resource-hungry components of the software in a consumer media product. The combination of high data rates, demanding algorithms, and the need to use inexpensive (and possibly low-power) processors means that careful optimization of codec software is almost always required in order to create a competitive consumer media product. Even in cases where such optimization may not be strictly necessary, it's often beneficialit can lower power consumption, free up processor resources to permit the addition of more features, or facilitate the use of a less expensive processor.
The codec software optimization process can focus on a number of different performance metrics. These include execution speed, memory use, energy consumption, and audio/video quality. In some cases optimizing for one metric also optimizes another, while in other cases optimizations conflict. Most often, developers optimize primarily for execution speed, memory use, and energy consumption, while maintaining sufficient video and audio quality.
As suggested earlier, extensive execution-speed optimizations are usually required to achieve real-time performance goals. The need for such high levels of optimization is primarily driven by the combination of two factors: the high computational demands of compression and decompression algorithms (which execute complex mathematical operations at high sample rates) and the use of low-cost processors with limited performance.
Figure 2: Iterative optimization process
The optimization process, shown in Figure 2, is iterative, beginning with profiling, followed by analysis, then implementation of specific optimizations. This cycle is repeated until performance targets are met.
The first step is to profile the codec software at the function level. This yields information about the percentage of the total execution time attributable to each sub-function or the number of times each sub-function is called. Typically, the results loosely adhere to an 80/20 rule: 20% of the software is responsible for 80% of the execution time and vice-versa. Of the three general classes of operations shown in Table 2, the S-rate operations are the ones that typically dominate processor loading.
Table 2: Categorization of codec sub-functions based on frequency of invocation
|I-rate: initialization rate||~1 time per second|
|K-rate: control rate-parameter updates, etc.||~10 to 1,000 times per second|
|S-rate: sample rate||~10,000 to 100,000 times per second|
In the classification of Table 2, the frequency of occurrence differs by two to three orders of magnitude between adjacent levels. Thus, execution cost of operations at each level also differs by two to three orders of magnitude. Optimizing I-rate and K-rate operations typically has little impact on overall execution time because these functions occur relatively infrequently; however, S-rate operations, which have high frequencies of occurrence, must be thoroughly optimized for maximum efficiency.
Signal processing functions typically have outer loops at the K-rate and inner loops at the S-rate. The 80/20 rule suggests that codec software contains a relatively small number of functions with S-rate loops. This attribute can greatly simplify the optimization process, since only a small fraction of the total functions in the codec typically require careful optimization.
There are many optimization techniques that can be useful for media applications. Here we divide these into two categories: processor-independent software optimizations and processor-specific software optimizations.
Beginning with the most time-consuming function, as identified by profiling, we analyze the function and begin to develop an optimization strategy. For example, a common strategy is to first consider processor-independent optimizations that maintain software portability. If additional optimizations are needed, we proceed with processor-specific high-level language optimizations, such as algorithmic transformations. These optimizations, while geared towards a specific processor, do not eliminate portabilitybut they may not yield good performance on another processor. Finally, additional processor-specific optimizations may be implemented at the assembly language level. These optimizations offer the potential to deliver the greatest gains in efficiency, at the expense of portability.
The first class of optimization techniques is processor-independent software optimization. Most A/V compression algorithms are first implemented in C. In many cases, the first implementations of a codec are not written with maximum efficiency in mind. For example, reference implementations are often written with the goal for clarity of documentation rather than efficiency. Therefore, it's often possible to reduce the processing and memory requirements of a codec significantly by modifying or rewriting the high-level language code. For example, most compilers employ an optimization technique called strength reduction, in which costly operations are replaced by simple ones where possible. There are some situations where the compiler isn't smart enough or lacks sufficient information about the algorithm to apply strength reduction, but a good programmer is able to rewrite the code to avoid costly operations. If the costly operations are in S-rate loops, the savings can be substantial.
Other processor-independent optimizations include function in-lining and recycling of memory buffers. Flattening the function call hierarchy by in-lining functions can yield significant speed-ups when low-level, frequently used operations have been placed in their own functions. Recycling memory buffers primarily decreases memory use, but can sometimes also result in speed-ups due to improved memory system performance (improved cache hit rates, for example).
The second class of optimization techniques is processor-specific optimization. Such optimizations may apply at the high-level language or assembly code level.
- Algorithmic modifications and transformations:
As the name suggests, algorithm modifications involve changing the internal operations of an algorithm in order to better match the available hardware resources. Many algorithm modifications involve trade-offs between memory use and computation. For example, many algorithms use large look-up tables. These look-up tables may occupy more memory than is available in the target system or than can be efficiently accommodated by the processor's level-1 data memory (the memory closest to the processor core). In such cases, a typical strategy is to replace the look-up table with algorithmic steps that compute the values formerly held in the table. A compromise approach is to reduce the size of the look-up table and to interpolate values that have been omitted from the table. This compromise typically requires less memory than the purely table-based approach and less execution time than the purely algorithmic approach.
Another classic example is a fast Fourier transform implemented with a radix-4 vs. radix-2 butterfly. The total number of mathematical operations (the sum of additions and multiplications) is similar for either implementation. However, the ratio of additions to multiplications differs between these implementations: the radix-4 implementation requires more additions and fewer multiplications than the radix-2 implementation. If the available hardware resources favor addition over multiplication, then the radix-4 implementation may offer better performance.Similarly, the FIR (finite impulse response) filter algorithm can sometimes be modified to better match the hardware resources. For example, the coefficients and inputs can use a single shared index variable if the coefficients are rearranged in memory. In some cases branch testing against a zero or negative loop-counter register is directly supported in hardware, so a count-down loop is more efficient than a count-up loop.
- Assembly language programming
Optimization at the assembly level can yield the greatest increases in software efficiency. After carefully analyzing the most critical sections of the algorithm and the details of the processor architecture, the assembly language programmer can often gain impressive speed-ups relative to the performance of compiler-generated code. It is often the case, for example, that compilers do not make use of the processor's full instruction set and almost never make use of specialized processor features such as single-instruction multiple-data (SIMD) instructions, which allow multiple operations to be performed in parallel. By working in assembly language, developers gain access to the full range of the processor's capabilities.
Memory access optimizations
Due to the large amounts of data handled by A/V applications, the processor's on-chip and off-chip memory system is often a critically important determinant of performance in these applications. For example, many low-cost embedded processors have small on-chip level-1 data memoriesfar too small to hold critical blocks of data and coefficients for audio and video decompression algorithms. In such cases, the level-1 data memory tends to be the performance bottleneck and the processor is unlikely to achieve anything near the level of performance suggested by its peak MIPS rating. This problem can be compounded by the fact that processing one video frame often requires accessing data from two or three adjacent frames. The instruction memory can become a bottleneck too, since video algorithms in particular can require a large amount of code.
Memory access optimizations can greatly improve overall performance by minimizing memory access overhead. For example, the default unit for processing in video algorithms is a frame. That is, the algorithm performs one operation after another on a frame of video until the entire processing sequence is done and then begins again with the next frame. Frames, however, are typically much larger then the processor's level-1 memory. Thus, processing an entire frame requires pulling in new portions of the frame and overwriting old portions of the frame at each processing step, over and over again through the entire processing sequence. An optimization to this method is to change the fundamental unit of operation from a frame to a subset of a frame that can fit entirely in the level-1 memory. Now the entire processing sequence is performed on the frame subset that stays resident in the level-1 memory. This type of optimization improves execution time and can reduce power consumption by reducing external memory accesses.
Thorough testing of consumer media products is made challenging by the large amounts of data required for testing, the subjective nature of audio and video quality, and the large number of operating modes typically supported by audio and video codecs. Some of the most challenging testing issues, however, involve the interoperability of all of the hardware and software subsystems and how this affects real-time performance. The combination of underpowered processors and high degrees of software optimization may lead to a state of unstable equilibrium where real-time performance is concerned. That is, careful optimizations of the codec(s) may have pulled the system up to real-time performance levels, but just barely; a slight perturbation may then cause a failure. It's cases like these for which thorough testing becomes especially important.
High data I/O requirements
Audio and video devices consume and produce vast amounts of data. Thus testing requires the ability to inject large input test vectors and to capture corresponding output streams. While the end user typically will access output streams in analog form after digital-to-analog conversion, the developer will want to capture the output digitally for testing. One key question is how to test a compression algorithm implementation prior to the availability of hardware prototypes. Because of the large amounts of data needed for testing, a simulation model of the processor is usually far too slow; real hardware is needed. When testing audio and video software with a development board, it's critical to ensure that there's a means for streaming large amounts of compressed digital data into the processor and large amounts of decompressed digital audio and video out of the processor, or vice versa, as appropriate. For developers, this means that carefully choosing a development platform with appropriate I/O capabilities can save time and reduce headaches when the testing phase begins.
Audio, video quality
Audio and video compression algorithms like those listed in Table 1 are “lossy” or “perceptual” algorithms: they exploit weaknesses in human visual and aural perception to reduce storage and bandwidth requirements. A consequence of this is that after compression and decompression, the reconstructed output is not an exact reproduction of the original signal. Lossy compression techniques complicate testing, since traditional test metrics like signal-to-noise ratio of the reconstructed output vs. the original input will indicate significant degradation of the signal, but the viewer or listener generally doesn't perceive those differences. Because traditional test metrics like signal-to-noise ratio don't provide meaningful information in the context of a lossy compression algorithm, alternative testing methods are typically employed.
While a compression algorithm is being developed, structured listening and viewing tests provide developers with feedback on the perceived quality that the algorithm achieves. Once the compression algorithm has been finalized, the developers determine how various implementations of the algorithm will be tested. Implementation testing typically relies on a set of special test vectors and an official “reference” version of the codec. The test vectors and reference codecs are usually supplied by the vendor of a compression algorithm or by a standardization body. The tolerable error between the implementation and reference codec is codec-specific but is generally very low and may be zero.
Be warned, however, that reference audio codecs generally use floating-point arithmetic, while embedded media devices that implement these codecs typically use fixed-point arithmetic for cost reasons. Ideally, the output of the fixed-point codec will match that of the floating-point reference codec, but this depends on the numeric fidelity of the fixed-point implementation. Lesser quality fixed-point implementations may perform reasonably well with benign audio content, but fail when faced with more challenging content. The supplied test vectors may have been designed with floating-point arithmetic in mind and may not adequately stress a fixed-point implementation. Thus, to ensure thorough testing of a fixed-point implementation it's essential to obtain or create tests that exercise the full potential dynamic range of the algorithm.
Real-time: not optional
The need for consistent real-time performance is clear; without it audio stutters and coughs and video freezes and jumps. To stress the limits of the system's real-time behavior, developers typically test the most demanding operating mode, such as a mode with the highest bit rate, the highest sample rate, the highest video resolution, and the greatest number of audio channels. Although this certainly is a good start, there's no guarantee that this approach actually tests the worst case. This is because compression algorithms typically have data-dependent execution paths. That is, the path that the processor takes through the software depends heavily on characteristics of the input data. Further complicating matters, the number of cycles required to execute basic operations such as multiplications is data-dependent in some processors. Thus, to successfully test worst-case real-time performance, the most demanding operating mode must be combined with an input stream that ensures that the worst-case execution path is taken and that any operations with data-dependent timing are presented with worst-case inputs.
Strategies for success
Within a consumer media product, audio and video codecs often pose the greatest software development challenge. Demanding algorithms, stringent audio and video quality requirements, and the limitations of inexpensive processors all contribute to the difficulty of developing a successful media product. Careful software optimization and testingalong with well-considered component selectioncan help developers to meet the challenge.
Jeff Bier is general manager of Berkeley Design Technology, Inc. (BDTI). Jeff oversees BDTI's consulting services, which provide companies with guidance for technology and business decisions involving digital signal processing technology. He also oversees BDTI's publications and technology seminars. Jeff holds degrees from Princeton and UC Berkeley and can be reached at .
Bjorn Hori is a DSP engineer at BDTI. Bjorn is involved in benchmarking and analysis of processors, tools, and other technology for signal processing applications. Bjorn holds a degree from UC Berkeley and can be reached at .