Parallelization using threads on multiple logical processors is an attractive and effective way to optimize software. As technologies to simulate multiple processors (such as Hyper Threading ) and processors containing multiple cores become the standard for even consumer level computing, the importance of parallelization becomes apparent.
To properly parallelize software, however, it is important to understand the algorithm well enough to determine if data or functional decomposition would be better suited. An excellent example showing the benefits of parallelization is the encoding of video using the H.264 encoder.
As the emerging codec standard becomes more complex, the encoding and decoding processes require much more computation power than most existing standards. The H.264 standard includes a number of new features and requires much more computation than most existing standards, such as MPEG-2 and MPEG-4.
Even after media instruction optimization, the H.264 encoder at CIF resolution still is not fast enough to meet the expectation of real-time video processing. Thus, exploiting thread-level parallelism to improve the performance of H.264 encoders is becoming more attractive.
As shown in this two part case study on optimizing the design of an H.264 video encoder using threads and parallelism, multithreading based on the OpenMP programming model is a simple, yet effective way to exploit parallelism that only requires a few additional pragmas in the serial code.
Developers can rely on the compiler to convert the serial code to multithreaded code automatically via adding OpenMP pragmas. The performance results have shown that the code generated by the Intel compiler delivers optimally increased speed over the well-optimized sequential code on the architecture with Hyper-Threading Technology, often boosting performance by 20 percent on top of native parallel speedups, ~4x without HT in this case, with very little additional cost..
Threading a Video Codec
Exploiting thread-level parallelism is an attractive approach to improving the performance of multimedia applications that are running on multithreading general-purpose processors. Given the new dual-core and emerging multi-core processors, the earlier you start to design for multithreading, the better.
As you will see, one implementation that uses the taskqueuing model is slightly slower than optimal performance, but the application program easier to read. The other method goes for speed. The results have shown speed increases ranging from 3.97x to 4.69x over the well-optimized sequential code performance on a system of four Intel Xeon processors with HT Technology.
H.264 (ISO/IEC 2002) is an emerging standard for video coding, which has been proposed by the Joint Video Team (JVT). The new standard is aimed at high-quality coding of video contents at very low bit-rates. H.264 uses the same model for hybrid block-based motion compensation and transform coding that is used by existing standards, such as those for H.263 and MPEG-4 (ISO/IEC 1998).
Moreover, a number of new features and capabilities in H.264 improve the performance of the code. As the standard becomes more complex, the encoding process requires much greater computation power than most existing standards. Hence, you need a number of mechanisms to improve the speed of the encoder.
One way to improve the application's speed is to process tasks in parallel. Zhou and Chen demonstrated that using MMX/SSE/SSE2 technology increased the H.264 decoder's performance by a factor ranging from two to four (Zhou 2003). Intel has applied the same technique to the H.264 reference encoder, achieving the results in Table 1 below using only SIMD optimization.
|Table 1. Speedup of Key H.264 Encoder Modules with SIMD Only|
Although the encoder is two-to-three times faster with SIMD optimization, the speed is still not fast enough to meet the expectations of real-time video processing.
Furthermore, the optimized sequential code cannot take advantage of Hyper-Threading Technology and multiprocessor load-sharing, two key performance boosters that are supported by the Intel architecture. In other words, you still can improve the performance of the H.264 encoder a lot by exploiting thread-level parallelism.
Parallelization of the H.264 Encoder
By exploiting thread-level parallelism at different levels, you can take advantage of potential opportunities to increase performance. To achieve the greatest speed increases over well-tuned sequential code on a processor with HT Technology, you should consider the following characteristics as you re-design the H.264 encoder for parallel programming:
* The criteria of choosing data or task partitions
* The judgments of thread granularity
* How the first implementation uses two slice queues
* How the second implementation uses one task queue
Task and Data Domain Decomposition
You can divide the H.264 encoding process into multiple threads using either functional decomposition or data-domain decomposition.
Functional decomposition. Each frame should experience a number of functional steps: motion estimation, motion compensation, integral transformation, quantization and entropy coding. The reference frames also need inverse qualification, inverse integral transformation, and filtering. These functions could be explored for opportunities to make these tasks parallel.
Data domain decomposition. As shown Figure 1 below , the H.264 encoder treats a video sequence as many groups of pictures (GOP). Each GOP includes a number of frames. Each frame is divided into slices. Each slice is an encoding unit and is independent of other slices in the same frame.
The slice can be further decomposed into a macroblock, which is the unit of motion estimation and entropy coding. Finally, the macroblock can be separated into block and sub-block units. All are possible places to parallelize the encoder.
|Figure 1. Hierarchy of Data Domain Decomposition in the H.264 Encoder|
To choose the optimal task or data partition scheme, compare the advantages and disadvantages of two schemes below:
* Scalability. In the data-domain decomposition, to increase the number of threads, you can decrease the size of the processing unit of each thread. Because of the hierarchical structure in GOPs, frames, slice, macroblocks, and blocks, you have many choices for the size of processing unit, thereby achieving good scalability.
In functional decomposition, each thread has different function. To increase the number of threads, partition a function into two or more threads, unless the function is unbreakable.
* Load balance. In the data domain decomposition, each thread performs the same operation on different data block that has the same dimension. In theory, without cache misses or other nondeterministic factors, all threads should have the same processing time. On the other hand, it is difficult to achieve good load balance among functions because the chosen algorithm determines the execution time of each function.
Furthermore, any attempt to functionally decompose the video encoder to achieve a good load balance depends on algorithms, too. As the standard keeps improving, the algorithms are sure to change over time to exploiting thread-level parallelism at multiple levels to achieve a good load balance.
Considering these factors, you could use the data-domain decomposition as your multithreading scheme. Details are described in the following two sub-sections.
When you have decided on the functional decomposition or data domain decomposition scheme, the next step is to decide the granularity for each thread. One possible scheme of data domain decomposition is to divide a frame into small slices.
Parallelizing the slices has both advantages and disadvantages. The advantage lies in the independence of slices in a frame. Since they are independent, you can simultaneously encode all slices in any order. On the other hand, the disadvantage is the resulting increase in the bit rate.
Figure 2 below shows the video encoder rate-distortion when you divide a frame into varying numbers of slices. When a frame is divided into nine slices but quality is held at the same level, the bit-rate increases about 15 to 20 percent because slices break the dependence between macroblocks.
|Figure 2. Encoded Picture Quality Versus the Number of Slices in a Picture|
The compression efficiency decreases when a macroblock in one slice cannot exploit a macroblock in another slice for compression. To avoid increasing the bit-rate at the same video quality, you should exploit other areas of parallelism in the video encoder.
Another possible scheme for exploiting parallelism is to identify independent frames. Normally, you would encode a sequence of frames using an IBBPBBP structure.
( A note on frame notation definitions: I frame in video codecs stands for intra frames, which can be encoded or decoded independently. Normally, there is an I frame per 15~60 frames. P frame stands for predicted frames, each of which is predicted from a previously encoded I frame or P frame.
Because a P frame is predicted from the previously encoded I/P frame, the dependency makes it harder to encode two P frames simultaneously. B frame stands for bi-directional predicted frames, which are predicted from a two previously encoded I/P frames. No frame depends on B frames.
In each sequence, you have two B frames between each of the P frames. P frames are reference frames, on which other P or B frames depend, but B frames are not. The dependence among the frames is shown in Figure 3 below . In this PBB encoding structure, the completion of encoding a particular P frame makes the subsequent P frame and its two B frames ready for encoding.
The more frames encoded simultaneously, the more parallelism you can exploit. Therefore, P frames are the critical point in the encoder. Accelerating P-frame encoding makes more frames ready for encoding and avoids the possibility of idle threads. In your implementation, encode I or P frames first, then encode the B frames.
Unlike dividing a frame into slices, utilizing parallelism among frames does not increase the bit rate. However, the dependence among them does limit the threads' scalability. The trade-off is to combine the two approaches, data-domain decomposition and functional decomposition, into one implementation.
First explore the parallelism among frames to gain performance from it without bit-rate increase. At some point, you are going to reach the upper limit of the number of threads to which you can apply frame-level parallelism. When you do, explore the parallelism among slices. As a final result, the application utilizes processor resources as much as possible, keeping the compression ratio as high as possible and the bit-rate as low as possible.
|Figure 3. Data Dependences among Frames|
Implementation Using Two Slice Queues
The H.264 encoder is divided into three parts: input pre-processing, encoding, and output post-processing. Input pre-processing reads uncompressed images, performs some preliminary processes, and then issues the images to encoding threads.
The pre-processed images are placed in a buffer, called the image buffer. Output processing checks the encoding status of each frame and commits the encoded result to the output bit-stream sequentially. After that, the entries in the image buffer are reused to prepare the image for encoding.
Although the input and output processes of the encoder must be sequential due to the inherent parallelism of the encoder, the computation complexity of input and output processes is insignificant compared to the encode process. Therefore, you can use one thread to handle the input and output processes. This thread becomes the master thread in charge of checking all the data dependency.
You would use another buffer, called slice buffer, to exploit the parallelism among slices. After each image is pre-processed, the slices of the image go into the slice buffer. The slices placed in the slice buffer are independent and ready for encoding; the readiness of reference frames is checked during the input process.
In this case, you can encode these slices out of order. To distinguish the priority differences between the slices of B frames and the slices of I or P frames, use two separate slice queues to handle them. The pseudocode in Example 1 below implements this two-slice model.
|Example 1. Slice-Queue Model for Parallelism in the H.264 Encoder|
Figure 4 below shows how the video stream is processed by the final multithreading implementation of a parallelized H.264 encoder. In the code segment, one thread processes both the input and the output, in order, and other threads encode slices out of order.
|Figure 4. Implementation with Image and Slice Buffers|
Implementation Using Task Queuing Model
The implementation in Example 1 uses the OpenMP pragma, making the structure of the parallel code very different from that of a sequential code. A second proposed implementation uses the taskqueuing model that is supported by the Intel C++ Compiler..
Essentially, for any given program with taskqueuing constructs, a team of threads is created by the run-time library when the main thread encounters a parallel region. Figure 5 below shows the taskqueuing execution model.
The run-time thread scheduler chooses one thread (TK) to execute initially from all the threads that encounter a taskq pragma. All the other threads wait for work to be put on the work queue. Conceptually, the taskq pragma triggers this sequence of actions:
1. Causes an empty queue to be created by the chosen thread TK
2. Enqueues each task that it encounters
3. Executes the code inside the taskq block as a single thread
The task pragma specifies a unit of work, potentially to be executed by a different thread. When a task pragma is encountered lexically within a taskq block, the code inside the task block is placed on the queue associated with the taskq pragma. The conceptual queue is disbanded when all work enqueued on it finishes and the end of the taskq block is reached.
|Figure 5. Taskqueuing Execution Model|
The first proposed multithreaded H.264 scheme uses two FIFO buffers: an image buffer and a slice buffer. The main thread is in charge of three activities:
1. Moving raw images into the image buffer when the image buffer has space
2. Moving slices of the image buffer into slice buffers when the slice buffer has space and the image is not yet dispatched
3. Moving the encoded images out the image buffer when the image is encoded
The working threads are in charge of encoding new slices when a slice is waiting in the slice buffer to be encoded. All these operations are synchronized through the image buffers. Hence, you would find it natural to use the task queuing model supported by the Intel compiler.
The code segment in Example 2 below shows the pseudo-code of the multithreading of the H.264 encoder using the taskqueuing model. This multithreaded source code is closer to the way you would write single thread code. The only difference is the pragma, which is a key characteristic of OpenMP. Furthermore, in this scheme, you no longer have a control thread, only a number of working threads in total.
|Example 2. Task-Queue Model for Theading the H.264 Encoder|
Next in Part 2: Tradeoffs: Increased Speed vs. Effective Compression
Copyright 2008 Intel Corporation. All rights reserved. This article is based on material The Software Optimization Cookbook, Second Edition by Richard Gerber, Aart J.C. Bik, Kevin B. Smith, and Xinmin Tian and used with the permission of Intel Press.
Richard Gerber has worked on numerous multimedia projects, 3D libraries, and computer games for Intel. As a software engineer, he worked on the Intel VTune Performance Analyzer and led training sessions on optimization techniques. Richard is the original author of The Software Optimization Cookbook and co-author of Programming with Hyper-Threading Technology.
Aart J.C. Bik holds a PhD in computer science and is a Principal Engineer at Intel Corporation, working on the development of high performance Intel' C++ and Fortran compilers. Aart received an Intel Achievement Award, the company's highest award, for making the Intel Streaming SIMD Extensions easier to use through automatic vectorization. Aart is the author of The Software Vectorization Handbook.
Kevin B. Smith is a software architect for Intel's C and FORTRAN compilers. Since 1981 he has worked on optimizing compilers for Intel 8086, 80186, i960, Pentium, Pentium Pro, Pentium III, Pentium 4, and Pentium M processors.
Xinmin Tian holds a PhD in computer science and leads an Intel development group working on exploiting thread-level parallelism in high-performance Intel' C++ and Fortran compilers for Intel Itanium, IA-32, Intel EM64T, and multi-core architectures.