Optimizing Video Encoding using Threads and Parallelism
Parallelization using threads on multiple logical processors is an attractive and effective way to optimize software. As technologies that simulate multiple processors (such as Hyper-Threading) and processors containing multiple cores become standard even for consumer-level computing, the importance of parallelization becomes apparent.
To properly parallelize software, however, it is important to understand the algorithm well enough to determine if data or functional decomposition would be better suited. An excellent example showing the benefits of parallelization is the encoding of video using the H.264 encoder.
As the emerging codec standard becomes more complex, the encoding and decoding processes require more computation. The H.264 standard includes a number of new features and requires much more computation power than most existing standards, such as MPEG-2 and MPEG-4.
Even after media instruction optimization, the H.264 encoder at CIF resolution still is not fast enough to meet the expectation of real-time video processing. Thus, exploiting thread-level parallelism to improve the performance of H.264 encoders is becoming more attractive.
As shown in this two part case study on optimizing the design of an H.264 video encoder using threads and parallelism, multithreading based on the OpenMP programming model is a simple, yet effective way to exploit parallelism that only requires a few additional pragmas in the serial code.
Developers add OpenMP pragmas and rely on the compiler to convert the serial code to multithreaded code automatically. The performance results show that the code generated by the Intel compiler runs considerably faster than the well-optimized sequential code on an architecture with Hyper-Threading Technology (HT Technology). In this case the native parallel speedup is about 4x without HT, and HT often adds roughly 20 percent on top of that at very little additional cost.
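To illustrate how little the serial code has to change, here is a minimal sketch of a loop parallelized with a single standard OpenMP pragma. It is not taken from the encoder discussed in this article: the function name `frame_sad` and the choice of a sum-of-absolute-differences loop are assumptions made for illustration, and it uses the standard `parallel for` construct rather than the taskqueuing model.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not from the article): sum of absolute
   differences between two equally sized 8-bit luma planes. */
long frame_sad(const unsigned char *cur, const unsigned char *ref,
               int width, int height)
{
    assert(cur != NULL && ref != NULL && width > 0 && height > 0);
    long total = 0;

    /* The only change from the serial version is this pragma.
       reduction(+:total) gives each thread a private partial sum
       and adds the partial sums together when the loop ends. */
    #pragma omp parallel for reduction(+:total)
    for (int y = 0; y < height; y++) {
        long row_sum = 0;
        for (int x = 0; x < width; x++) {
            row_sum += abs((int)cur[y * width + x] - (int)ref[y * width + x]);
        }
        total += row_sum;
    }
    return total;
}
```

When the compiler's OpenMP option is not enabled, the pragma is ignored and the loop runs serially with the same result, which makes it easy to compare the two builds.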
Threading a Video Codec
Exploiting thread-level parallelism is an attractive approach to improving the performance of multimedia applications that are running on multithreading general-purpose processors. Given the new dual-core and emerging multi-core processors, the earlier you start to design for multithreading, the better.
As you will see, one implementation, which uses the taskqueuing model, is slightly slower than the other, but its application program is easier to read. The other implementation goes for speed. The results show speed increases ranging from 3.97x to 4.69x over the well-optimized sequential code on a system of four Intel Xeon processors with HT Technology.
H.264 (ISO/IEC 2002) is an emerging standard for video coding, which has been proposed by the Joint Video Team (JVT). The new standard is aimed at high-quality coding of video content at very low bit-rates. H.264 uses the same model of hybrid block-based motion compensation and transform coding that is used by existing standards, such as H.263 and MPEG-4 (ISO/IEC 1998).
Moreover, a number of new features and capabilities in H.264 improve the performance of the code. As the standard becomes more complex, the encoding process requires much greater computation power than most existing standards. Hence, you need a number of mechanisms to improve the speed of the encoder.
One way to improve the application's speed is to process tasks in parallel. Zhou and Chen demonstrated that using MMX/SSE/SSE2 technology increased the H.264 decoder's performance by a factor ranging from two to four (Zhou 2003). Intel has applied the same technique to the H.264 reference encoder, achieving the results in Table 1 below using only SIMD optimization.
Table 1. Speedup of Key H.264 Encoder Modules with SIMD Only
Although the encoder is two to three times faster with SIMD optimization, it is still not fast enough to meet the expectations of real-time video processing.
Furthermore, the optimized sequential code cannot take advantage of Hyper-Threading Technology and multiprocessor load-sharing, two key performance boosters that are supported by the Intel architecture. In other words, you can still improve the performance of the H.264 encoder considerably by exploiting thread-level parallelism.
Parallelization of the H.264 Encoder
By exploiting thread-level parallelism at different levels, you can take advantage of potential opportunities to increase performance. To achieve the greatest speed increases over well-tuned sequential code on a processor with HT Technology, you should consider the following characteristics as you re-design the H.264 encoder for parallel programming:
* The criteria for choosing data or task partitions
* How to judge thread granularity
* How the first implementation uses two slice queues
* How the second implementation uses one task queue
Task and Data Domain Decomposition
You can divide the H.264 encoding process into multiple threads using either functional decomposition or data-domain decomposition.
Functional decomposition. Each frame passes through a number of functional steps: motion estimation, motion compensation, integer transform, quantization, and entropy coding. The reference frames also need inverse quantization, inverse integer transform, and filtering. You could explore these functions for opportunities to run them in parallel.
Data domain decomposition. As shown in Figure 1 below, the H.264 encoder treats a video sequence as many groups of pictures (GOPs). Each GOP includes a number of frames. Each frame is divided into slices. Each slice is an encoding unit and is independent of other slices in the same frame.
A slice can be further decomposed into macroblocks, which are the units of motion estimation and entropy coding. Finally, a macroblock can be separated into blocks and sub-blocks. All of these levels are possible places to parallelize the encoder.
Figure 1. Hierarchy of Data Domain Decomposition in the H.264 Encoder
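To give a feel for how many units the lower levels of this hierarchy offer, the following sketch counts the macroblocks in a frame. It relies on two facts: an H.264 macroblock covers 16x16 luma samples, and a CIF frame is 352x288 pixels. The type `FrameLayout` and the function `frame_layout` are hypothetical names introduced only for this example.

```c
#include <assert.h>

enum { MB_SIZE = 16 };  /* an H.264 macroblock covers 16x16 luma samples */

/* Hypothetical helper type: how a frame breaks down into macroblocks. */
typedef struct {
    int mb_cols;   /* macroblocks per macroblock row */
    int mb_rows;   /* macroblock rows per frame */
    int mb_total;  /* macroblocks per frame */
} FrameLayout;

FrameLayout frame_layout(int width, int height)
{
    assert(width > 0 && height > 0);
    FrameLayout layout;
    /* Round up so that partially covered macroblocks are counted. */
    layout.mb_cols = (width + MB_SIZE - 1) / MB_SIZE;
    layout.mb_rows = (height + MB_SIZE - 1) / MB_SIZE;
    layout.mb_total = layout.mb_cols * layout.mb_rows;
    return layout;
}
```

For a CIF frame, `frame_layout(352, 288)` yields 22 columns and 18 rows, or 396 macroblocks, so even a single frame contains far more independent-looking work items than there are logical processors.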
To choose the optimal task or data partition scheme, compare the advantages and disadvantages of the two schemes below:
* Scalability. In the data-domain decomposition, to increase the number of threads, you can decrease the size of the processing unit of each thread. Because of the hierarchical structure of GOPs, frames, slices, macroblocks, and blocks, you have many choices for the size of the processing unit, thereby achieving good scalability.
In functional decomposition, each thread has a different function. To increase the number of threads, you must partition a function into two or more threads, which is not possible when the function cannot be broken down further.
* Load balance. In the data domain decomposition, each thread performs the same operation on a different data block of the same dimensions. In theory, without cache misses or other nondeterministic factors, all threads should have the same processing time. On the other hand, it is difficult to achieve good load balance among functions because the chosen algorithm determines the execution time of each function.
Furthermore, any attempt to functionally decompose the video encoder to achieve a good load balance depends on the algorithms, too. As the standard keeps improving, the algorithms are sure to change over time, and with them the load balance of any functional partition.
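The load-balance argument can be made concrete with a small calculation. The sketch below assumes an idealized pipeline in which every functional stage runs on its own thread and there is no synchronization overhead; under that assumption the slowest stage limits throughput, so the speedup over serial execution cannot exceed the total stage time divided by the slowest stage time. The function name `pipeline_speedup_bound` is hypothetical, and any stage times passed to it are illustrative values, not measurements from the encoder.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on the speedup of an idealized pipeline with one thread
   per stage: total serial time divided by the slowest stage time.
   Assumption: no communication or synchronization cost. */
double pipeline_speedup_bound(const double *stage_ms, size_t num_stages)
{
    assert(stage_ms != NULL && num_stages > 0);
    double total = 0.0;
    double slowest = 0.0;
    for (size_t i = 0; i < num_stages; i++) {
        assert(stage_ms[i] > 0.0);
        total += stage_ms[i];
        if (stage_ms[i] > slowest) {
            slowest = stage_ms[i];
        }
    }
    return total / slowest;
}
```

With five stages of equal cost the bound is 5, but if one stage takes 6 ms and the other four take 1 ms each, the bound drops to 10/6, or about 1.67, even though five threads are in use. Data-domain decomposition avoids this because every thread runs the same code on equally sized data.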
Considering these factors, you could use the data-domain decomposition as your multithreading scheme. Details are described in the following two sub-sections.
When you have decided on the functional decomposition or data domain decomposition scheme, the next step is to decide the granularity for each thread. One possible scheme of data domain decomposition is to divide a frame into small slices.
Parallelizing the slices has both advantages and disadvantages. The advantage lies in the independence of slices in a frame. Since they are independent, you can simultaneously encode all slices in any order. On the other hand, the disadvantage is the resulting increase in the bit rate.
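Because the slices of a frame do not depend on one another, a loop over slices can be handed to OpenMP directly. The following is a minimal sketch under stated assumptions, not the implementation described in this article: `encode_slice_stub` is a hypothetical stand-in that merely sums the samples of its slice, every slice is assumed to have the same number of rows, and the standard `parallel for` construct is used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a real slice encoder. It only sums the
   samples of its slice so that the example stays self-contained. */
static long encode_slice_stub(const unsigned char *pixels, size_t count)
{
    long sum = 0;
    for (size_t i = 0; i < count; i++) {
        sum += pixels[i];
    }
    return sum;
}

/* Process all slices of one frame. Each iteration reads only its own
   slice and writes only results[s], so iterations can run in any order. */
void encode_frame_slices(const unsigned char *frame, int width,
                         int rows_per_slice, int num_slices, long *results)
{
    assert(frame != NULL && results != NULL);
    assert(width > 0 && rows_per_slice > 0 && num_slices > 0);
    size_t slice_len = (size_t)width * (size_t)rows_per_slice;

    /* schedule(dynamic) lets an idle thread pick up the next slice,
       which helps when slices take different amounts of time. */
    #pragma omp parallel for schedule(dynamic)
    for (int s = 0; s < num_slices; s++) {
        results[s] = encode_slice_stub(frame + (size_t)s * slice_len,
                                       slice_len);
    }
}
```

The loop body shares no mutable state between iterations, which is exactly the property that slice independence provides; a real encoder would replace the stub with motion estimation, transform, and entropy coding for the slice.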
Figure 2 below shows the rate-distortion performance of the video encoder when you divide a frame into varying numbers of slices. When a frame is divided into nine slices and quality is held at the same level, the bit-rate increases by about 15 to 20 percent because slices break the dependence between macroblocks.
Figure 2. Encoded Picture Quality Versus the Number of Slices in a Picture
The compression efficiency decreases when a macroblock in one slice cannot exploit a macroblock in another slice for compression. To avoid increasing the bit-rate at the same video quality, you should exploit other areas of parallelism in the video encoder.