To evaluate the techniques described in Part 1 of this series, we measured the performance of the multithreaded encoder on the following systems:
* A Dell Precision 530 system, built with dual Intel Xeon processors (four logical processors) running at 2.0 GHz with HT Technology, a 512 KB L2 Cache, and 1 GB of memory
* An IBM eServer xSeries 360 system, built with quad Intel Xeon processors (eight logical processors) running at 1.5 GHz with HT Technology, a 256 KB L2 Cache, a 512 KB L3 Cache, and 2 GB of memory.
Unless specified otherwise, the resolution of the input video is 352×288 in pixels or 22×18 in macroblocks. To be sure to provide enough slices for eight threads, the program takes the slice as the basic encoding unit for each thread.
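The macroblock dimensions quoted above follow from the 16×16-sample macroblock size used by H.264. A minimal sketch of that arithmetic, with a hypothetical helper name that is not part of the encoder:

```python
MACROBLOCK_SIZE = 16  # H.264 macroblocks cover 16x16 luma samples


def macroblock_grid(width, height, mb_size=MACROBLOCK_SIZE):
    """Return (columns, rows) of macroblocks, rounding partial blocks up."""
    cols = -(-width // mb_size)   # ceiling division
    rows = -(-height // mb_size)
    return cols, rows


print(macroblock_grid(352, 288))  # CIF resolution used in the experiments
```

For the 352×288 input this gives 22 columns and 18 rows of macroblocks.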
Framing the problem
A frame can be partitioned into a maximum of 18 slices. Taking a slice as the base encoding unit for a thread can reduce the synchronization overhead because no data dependency among slices occurs within a single frame during the encoding process.
As mentioned earlier, partitioning the frame into multiple slices can increase the degree of parallelism, but it also increases the bit-rate. One of the challenges is to increase execution speed while keeping the bit-rate low and without sacrificing any image quality. Therefore, you should choose the slicing threshold carefully.
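The limit of 18 slices suggests that a slice is made of whole macroblock rows, which is an assumption here rather than something this section states. Under that assumption, splitting a frame into near-equal slices could look like the following sketch (the function name is hypothetical):

```python
def partition_rows(num_rows, num_slices):
    """Split macroblock rows 0..num_rows-1 into contiguous, near-equal slices."""
    if not 1 <= num_slices <= num_rows:
        raise ValueError("need 1 <= num_slices <= num_rows")
    base, extra = divmod(num_rows, num_slices)
    slices, start = [], 0
    for i in range(num_slices):
        size = base + (1 if i < extra else 0)  # first `extra` slices get one more row
        slices.append(range(start, start + size))
        start += size
    return slices


for rows in partition_rows(18, 4):
    print(list(rows))
```

Because the slices of one frame do not depend on each other, each range can be handed to a different thread without further synchronization inside the frame.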
|Figure 6. Speed-up and Bit-rate Versus the Number of Slices in a Frame w/o HT Technology|
Figure 6 above and Figure 7 below show the encoding speedup and the associated bit-rate as the number of slices per frame varies, without and with HT Technology respectively. In Figure 6, the number of slices ranges from 1 to 18, while maintaining a constant quality level for the encoded frames.
Speed increases when the number of slices per frame goes from 1 to 2 on the Dell 530 platform, and the speedup is almost flat when the number of slices per frame ranges from 2 to 18. Meanwhile, the bit-rate increase is small when the number of slices is less than 3, but it grows as the frames go from 3 slices to 18 slices. One important observation is that partitioning a frame into 2 or 3 slices is the best tradeoff, one that achieves a higher speedup at a lower bit-rate.
|Figure 7. Speed-up and Bit-rate Versus the Number of Slices in a Frame with HT Technology|
Figure 7 shows that we need more than three slices to keep eight logical processors busy on the IBM x360 platform. Essentially, we need nine threads to achieve an optimal performance level for four physical processors with HT Technology enabled.
You want to keep the number of slices roughly the same as the number of logical processors. This simple approach achieves higher performance: you maintain good image quality while generating enough slices to keep the encoding threads busy.
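One way to turn this guideline into a rule, offered as an illustration and not as the authors' method, is to pick the smallest slice count that is at least the number of logical processors and that divides the macroblock rows evenly. The function name and the equal-size requirement are assumptions:

```python
def slices_per_frame(logical_processors, mb_rows=18):
    """Smallest divisor of mb_rows that is >= logical_processors (hypothetical rule)."""
    target = max(1, min(logical_processors, mb_rows))
    # mb_rows always divides itself, so the search is guaranteed to succeed.
    return next(c for c in range(target, mb_rows + 1) if mb_rows % c == 0)


for n in (2, 4, 8):
    print(n, "logical processors ->", slices_per_frame(n), "slices")
```

For eight logical processors and 18 macroblock rows this rule yields nine slices, which stays close to the processor count while keeping every slice the same size.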
Multiprocessor Performance with Hyper-Threading
Table 2 below shows the speed increase for the threaded encoder on the IBM x360 quad-processor system with Hyper-Threading. In this implementation, a picture frame was partitioned into nine slices. Across five different input video sequences, the multithreaded H.264 encoder increased its execution speed by 1.9x to 2.01x on a two-processor system, 3.61x to 3.99x on a four-processor system, and 3.97x to 4.69x on a four-processor system with HT Technology enabled.
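A common way to read such ranges is parallel efficiency, meaning speedup divided by the number of physical processors. The sketch below applies that standard definition to the ranges quoted above. The dictionary layout and function name are illustrative only:

```python
def parallel_efficiency(speedup, physical_processors):
    """Fraction of ideal linear speedup that was achieved."""
    return speedup / physical_processors


# (physical processors, lowest speedup, highest speedup) as quoted in the text
reported = {
    "2 processors": (2, 1.90, 2.01),
    "4 processors": (4, 3.61, 3.99),
    "4 processors + HT": (4, 3.97, 4.69),
}

for label, (procs, low, high) in reported.items():
    lo_eff = parallel_efficiency(low, procs)
    hi_eff = parallel_efficiency(high, procs)
    print(f"{label}: {lo_eff:.2f} to {hi_eff:.2f}")
```

Values above 1.0 in the HT row arise because the speedup is divided by physical processors, while HT Technology exposes two logical processors per package.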
|Table 2. Speedups on Different Video Sequences Using Two Slice Queues|
You can see some performance differences between the first implementation, with two slice queues, and the second implementation, with only one task queue, shown in Table 3 below.
The performance gap is larger when the system contains more processors. Because the implementation uses two queues to accelerate the encoding of I or P frames, it can make more slices ready for encoding, especially when a large number of processors is available to do the work.
On the other hand, the taskqueuing model in OpenMP maintains only one queue. In this case, all slices are treated equally. Therefore, the execution threads spend more time in an idle state when the system has a lot of processors.
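The difference between the two designs can be sketched as two small schedulers. This is a simplified illustration and not the encoder's code: the class names, the assignment of I and P slices to a priority queue and B slices to a second queue, and the strict priority rule are all assumptions drawn from the description above.

```python
from collections import deque


class TwoQueueScheduler:
    """Serve slices of I/P frames before slices of B frames (illustrative)."""

    def __init__(self):
        self.ip_queue = deque()  # slices of I and P frames, served first
        self.b_queue = deque()   # slices of B frames, served when ip_queue is empty

    def submit(self, frame_type, slice_id):
        queue = self.ip_queue if frame_type in ("I", "P") else self.b_queue
        queue.append((frame_type, slice_id))

    def next_slice(self):
        if self.ip_queue:
            return self.ip_queue.popleft()
        if self.b_queue:
            return self.b_queue.popleft()
        return None


class OneQueueScheduler:
    """Serve all slices strictly in submission order."""

    def __init__(self):
        self.queue = deque()

    def submit(self, frame_type, slice_id):
        self.queue.append((frame_type, slice_id))

    def next_slice(self):
        return self.queue.popleft() if self.queue else None


def drain(scheduler, work):
    """Submit all work items, then return the order in which they are served."""
    for frame_type, slice_id in work:
        scheduler.submit(frame_type, slice_id)
    order = []
    while (item := scheduler.next_slice()) is not None:
        order.append(item)
    return order


work = [("B", 0), ("B", 1), ("P", 0), ("P", 1)]
print(drain(TwoQueueScheduler(), work))
print(drain(OneQueueScheduler(), work))
```

With two queues the P slices are served ahead of the B slices that were submitted earlier, so frames that later frames may depend on finish sooner. With one queue the submission order is preserved.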
|Table 3. Speedups on Different Video Sequences using One Task Queue|
With HT Technology enabled, the program achieved a 1.2x speed increase. The explanation for this improvement lies in the microarchitecture metrics in the next section.
Understanding the Performance
Table 4 below shows the distribution of the number of instructions retired per cycle on a Dell Precision 530 dual-processor system with the second processor disabled. Although no instruction is retired for almost half of the execution time, the probability of retiring more instructions is higher with HT Technology. This statistic indicates that higher processor utilization is achieved with HT Technology.
|Table 4. Percentage Breakdown of Instructions Retired Using VTune Analyzer|
Table 5 and Table 6 below show mixed results. Without HT Technology, the trace cache spends about 80 percent of the time under the deliver mode, which is good for performance, and about 18 percent of the time under the build mode, which is bad for performance.
|Table 5. Microarchitecture Metric on Dell Precision 530 System|
However, when HT Technology is enabled, the deliver mode percentage drops to 70 percent while the build mode percentage increases to 25 percent. This performance drop indicates that the front end of the system with HT Technology cannot provide enough micro-ops to the execution unit.
The first-level cache load miss rate shows a similar degradation. You see a 50-percent increase in the number of first-level cache misses when HT Technology is enabled.
|Table 6. Microarchitecture Metric on IBM x360 System|
This 6-to-9-percent increase in the miss rate results from the two logical processors in one physical package sharing the first-level cache of only 8 kilobytes. In short, performance gains for HT Technology are limited by the trace cache and the L1 cache for our multithreaded H.264 encoder.
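The 50-percent figure and the 6-to-9-percent figure are consistent if the latter is read as the miss rate rising from 6 percent to 9 percent. That reading is an assumption about the wording, since the tables are not reproduced here. The arithmetic is:

```python
def relative_increase(before, after):
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


miss_rate_without_ht = 6.0  # percent, assumed reading of the text
miss_rate_with_ht = 9.0     # percent, assumed reading of the text

change = relative_increase(miss_rate_without_ht, miss_rate_with_ht)
print(f"{change:.0%}")
```

A rise of three percentage points on a base of six is a relative increase of one half.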
The front-side-bus utilization rate is the only microarchitecture metric noticeably affected by the multiprocessor configuration. The number of bus activities does not increase significantly as the number of threads increases. The execution time, however, is reduced because exploiting enough thread-level parallelism makes better use of processor resources. The result is an increased front-side-bus utilization rate.
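The reasoning can be made concrete with a toy calculation. The numbers below are made up for illustration and are not measurements from the article. Utilization is taken here as bus-busy time divided by elapsed time:

```python
def bus_utilization(bus_busy_time, elapsed_time):
    """Fraction of the run during which the bus is busy."""
    return bus_busy_time / elapsed_time


bus_busy_time = 10.0   # hypothetical seconds of bus activity, roughly constant
serial_time = 100.0    # hypothetical single-threaded execution time in seconds

for speedup in (1.0, 2.0, 4.0):
    elapsed = serial_time / speedup
    print(f"speedup {speedup:.0f}x: utilization {bus_utilization(bus_busy_time, elapsed):.2f}")
```

When the same amount of bus activity is compressed into a shorter run, the utilization rate rises in proportion to the speedup.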
Table 3 earlier also shows that the execution time is even longer on a quad-processor with HT Technology (QP+HT) than a quad-processor (QP) in the case of a smaller slice number. This increase can be explained from the profile of threads. Figure 8 below shows the profile when a frame contains only one slice. The encoder thread is waiting about 61.8 percent of the execution time due to insufficient parallelism.
|Figure 8. Execution Time Profile for Two Slice Queues with One Slice in a Frame with the Intel Thread Profiler|
Figure 9 below shows the profile when 18 slices are in a frame. The eight encoder threads are all busy except during the set-up time, and they wait only 1.4 percent of the execution time. In this case, all processor resources are used fully.
|Figure 9. Execution Time Profile for Two Slice Queues with 18 Slices in a Frame with the Intel Thread Profiler|
Therefore, during the process of doing trade-off analysis, you should choose carefully the best way to balance the slices in a frame. The criterion is to keep the number of slices low while providing enough slices to keep all encoder threads busy. If the number of slices is smaller than the number of threads, the execution speed decreases.
Figure 10 below shows the execution time profile of the second implementation using one task queue. As mentioned earlier, all slices are treated equally because the taskqueuing model in OpenMP maintains only one queue. Therefore, the system could have too few ready-to-encode slices, as you can see from the amount of idle time in the execution threads. Compared to Figure 9, Figure 10 shows that the processors are utilized less efficiently.
|Figure 10. Execution Time Profile Using One Task Queue with the Intel Thread Profiler|
In summary, having the number of threads equal to the number of logical processors strikes the best balance between speed-up and parallelism. But, what happens to the performance when the number of threads is greater or less than the number of logical processors?
Figure 11 below shows how the speedup changes with the number of threads for an implementation using two slice queues. The speed increases as the number of threads increases, reaching peak performance when the number of threads equals the number of logical processors.
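Sizing a worker pool to the number of logical processors can be sketched as follows. The encoder in this article is C code threaded with OpenMP, so this Python version only illustrates the sizing rule. The function names and the placeholder encoding work are hypothetical, and CPython threads would not deliver the speedups reported here for compute-bound work.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_thread_count(logical_processors=None):
    """Use the caller's count, else what the OS reports, else 1."""
    return logical_processors or os.cpu_count() or 1


def encode_slice(rows):
    # Placeholder for real slice encoding: just count the macroblock rows.
    return len(rows)


def encode_frame(slices, num_threads=None):
    """Encode independent slices with one worker per logical processor."""
    workers = default_thread_count(num_threads)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_slice, slices))


nine_slices = [range(i, i + 2) for i in range(0, 18, 2)]  # nine two-row slices
print(encode_frame(nine_slices))
```

Passing an explicit `num_threads` makes it easy to repeat the experiment with fewer or more threads than logical processors.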
|Figure 11. Speedups Versus Number of Threads on a 2-way Dell System|
An interesting observation is that the speedup is essentially flat, or it drops only slightly when the number of threads is greater than the number of logical processors.
Thus, the overhead due to threading is minor. In other words, the multithreaded code generated by the compiler exploits effective parallelism efficiently, and the overhead of the multithreaded run-time library is small.
Furthermore, the multithreaded H.264 encoder should have good scalability for medium-scale multiprocessor systems, such as the one shown in Figure 12 below, because the performance is not sensitive to the number of threads.
|Figure 12. Speedups Versus Number of Threads on 4-way IBM x360 System|
Further Performance Tuning
Even when the expected performance gain is achieved, one can always find some further work to do. In this case, you could analyze the performance impact of different image resolutions. While the resolution of the source image can scale from QCIF and CIF to SD and HDTV, most of our current analysis focused on the CIF resolution.
Figure 5 shows that the speedup for the SD (720×480) format is slightly less than that for the CIF (352×288) format. The speedup is determined by factors such as synchronization and degree of parallelism, yet Figure 13 below shows that encoding SD video incurs fewer synchronizations per second than encoding CIF video.
Furthermore, SD has a higher degree of parallelism. More work is needed to understand why the speedup for higher-resolution video is less than that for lower-resolution video.
|Figure 13. Synchronizations per Second during Encoding|
As emerging codec standards become more complex, the encoding and decoding processes require much more computation power. The H.264 standard includes a number of new features and requires much more computation than most existing standards, such as MPEG-2 and MPEG-4.
Even after media instruction optimization, the H.264 encoder at CIF resolution still is not fast enough to meet the expectation of real-time video processing. Thus, exploiting thread-level parallelism to improve the performance of H.264 encoders is becoming more attractive.
The case study presented here shows that multithreading based on the OpenMP programming model is a simple yet effective way to exploit parallelism, one that requires only a few additional pragmas in the serial code. Once the OpenMP pragmas are added, developers can rely on the compiler to convert the serial code to multithreaded code automatically.
The performance results show that the code generated by the Intel compiler delivers close-to-optimal speedups over the well-optimized sequential code on architectures with Hyper-Threading Technology. HT Technology often boosts performance by about 20 percent on top of the native parallel speedup (roughly 4x without HT Technology in this case), at very little additional cost.
To read Part 1 go to Threading a video codec.
Copyright 2008 Intel Corporation. All rights reserved. This article is based on material from The Software Optimization Cookbook, Second Edition, by Richard Gerber, Aart J.C. Bik, Kevin B. Smith, and Xinmin Tian, and is used with the permission of Intel Press.
Richard Gerber has worked on numerous multimedia projects, 3D libraries, and computer games for Intel. As a software engineer, he worked on the Intel VTune Performance Analyzer and led training sessions on optimization techniques. Richard is the original author of The Software Optimization Cookbook and co-author of Programming with Hyper-Threading Technology.
Aart J.C. Bik holds a PhD in computer science and is a Principal Engineer at Intel Corporation, working on the development of high performance Intel C++ and Fortran compilers. Aart received an Intel Achievement Award, the company's highest award, for making the Intel Streaming SIMD Extensions easier to use through automatic vectorization. Aart is the author of The Software Vectorization Handbook.
Kevin B. Smith is a software architect for Intel's C and FORTRAN compilers. Since 1981 he has worked on optimizing compilers for Intel 8086, 80186, i960, Pentium, Pentium Pro, Pentium III, Pentium 4, and Pentium M processors.
Xinmin Tian holds a PhD in computer science and leads an Intel development group working on exploiting thread-level parallelism in high-performance Intel C++ and Fortran compilers for Intel Itanium, IA-32, Intel EM64T, and multi-core architectures.