CMP EMBEDDED.COM


Using the new Intel Streaming SIMD Extensions (SSE4) for audio, video and image apps



Embedded.com
A wide range of new applications is entering the mainstream of desktop, server and portable/mobile computing, including data mining, database management, complex search and pattern matching algorithms, as well as compression algorithms for audio, video, images and data.

To perform such complex operations more efficiently, the Intel Streaming SIMD Extensions 2 (Intel SSE2) instruction set has been extended. The additions provide a much more efficient means of completing tasks such as motion vector search in video processing and packed dword multiplies generated by compilers, and they improve throughput between the CPU and graphics processor to benefit a wide scope of applications. This improved instruction set is being released as Intel SSE4.

This article provides a quick overview of the new extensions to the instruction set architecture (ISA) and includes examples of how you can use the new instructions to optimize video encoding functions. We will also look at the new streaming load instruction, which can significantly improve the performance of applications that share data between the graphics processor and the CPU.

The 45 nanometer next generation Intel Core 2 processor family (Penryn) includes support for the Intel SSE4 instruction set as an extension to the Intel 64 Instruction Set Architecture (ISA). These new instructions deliver performance gains for SIMD (single instruction, multiple data) software and will enable the new family of microprocessors to deliver improved performance and energy efficiency with a broad range of 32-bit and 64-bit software.

54 new instructions
Intel SSE4 is a set of 54 new instructions designed to improve performance and energy efficiency of media, 3D and gaming applications. The Penryn microarchitecture supports 47 of these new instructions, with the remainder being introduced on future processors. We can group them into three areas:

1) Video accelerators include instructions to accelerate 4x4 sum of absolute differences (SAD) calculations, sub-pixel filtering and horizontal minimum search. The Intel SSE4 video accelerator instructions include a SAD engine that can perform eight SAD calculations at once.

They also include a horizontal minimum search instruction that examines eight values and returns the minimum value together with its index, as well as instructions for converting packed integers into wider data types, allowing for faster integer-to-float conversions in 3D applications.

2) Graphics building blocks include common graphics primitives generalized for compiler auto-vectorization, such as packed dword multiply and floating point dot products.

3) The Streaming Load instruction enables faster reads from Uncacheable Speculative Write Combining (USWC) memory. When used in conjunction with Intel SSE2 streaming writes, Streaming Load allows for faster Memory Mapped I/O (MMIO).
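
To make the graphics building blocks in item 2 more concrete, here is a minimal scalar sketch in portable C of what a packed dword multiply and a four-wide floating point dot product compute. The function names are hypothetical; the models are meant to mirror the behavior of PMULLD and DPPS (exposed by compilers as `_mm_mullo_epi32` and `_mm_dp_ps` in `<smmintrin.h>`), not to replace them.

```c
#include <stdint.h>

/* Scalar model of a packed dword multiply (PMULLD): each of the four
   32-bit lanes keeps the low 32 bits of its product. */
static void packed_mul_lo32(const int32_t a[4], const int32_t b[4],
                            int32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        /* Multiply as unsigned so that wraparound is well defined in C. */
        out[i] = (int32_t)((uint32_t)a[i] * (uint32_t)b[i]);
    }
}

/* Scalar model of a four-wide float dot product (DPPS).
   Bits 4-7 of imm select which lane products enter the sum;
   bits 0-3 select which output lanes receive the sum (others get 0). */
static void dot_product_ps(const float a[4], const float b[4],
                           unsigned imm, float out[4])
{
    float p[4];
    for (int i = 0; i < 4; i++)
        p[i] = (imm & (0x10u << i)) ? a[i] * b[i] : 0.0f;

    /* Pairwise summation order, as in the instruction's definition. */
    float sum = (p[0] + p[1]) + (p[2] + p[3]);

    for (int i = 0; i < 4; i++)
        out[i] = (imm & (1u << i)) ? sum : 0.0f;
}
```

With an immediate of 0xF1, for instance, all four products are summed and the result is written to lane 0 only.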

Now we will take a closer look at how you can use Intel SSE4 to improve performance and energy efficiency using new instructions for Motion Vector Search and Streaming Loads.

Optimizing motion vector search
Motion estimation is one of the main bottlenecks in video encoders. It involves searching reference frames for best matches, and it can consume as much as 40 percent of the total CPU cycles used by the encoder. Search quality is an important determinant of the quality of the encoded video.

For this reason, algorithmic and SIMD optimizations designed to improve encoding speed often target the search operation. SIMD instructions provide an ideal means of optimizing motion estimation because the required arithmetic operations are performed on blocks of pixels with a high degree of parallelism.

The Intel SSE2 instruction PSADBW is widely used to optimize this operation. PSADBW computes sums of absolute differences between two operands of 16 unsigned byte integers each. One sum is the result of the eight lower differences, while the other sum is the result of the eight upper differences.
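
The behavior just described can be sketched as a scalar reference model in portable C. This is an illustration of what the instruction computes, not production code; the function name `psadbw_model` is hypothetical, and the real instruction is available through the `_mm_sad_epu8` intrinsic in `<emmintrin.h>`.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of PSADBW on 128-bit operands.
   sums[0] is the SAD of bytes 0..7, sums[1] is the SAD of bytes 8..15. */
static void psadbw_model(const uint8_t a[16], const uint8_t b[16],
                         uint16_t sums[2])
{
    for (int half = 0; half < 2; half++) {
        unsigned total = 0;
        for (int i = 0; i < 8; i++) {
            int d = (int)a[half * 8 + i] - (int)b[half * 8 + i];
            total += (unsigned)abs(d);
        }
        /* At most 8 * 255 = 2040, so the sum always fits in 16 bits. */
        sums[half] = (uint16_t)total;
    }
}
```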

A block-matching function built on PSADBW can find the matching blocks for four 4x4 blocks in each call. Because a 4x4 block is only 4 bytes wide, two rows are first unpacked so that two 4-byte data sets are concatenated into 8 bytes before the instruction is applied.

Since each load gets 16 consecutive bytes, data is loaded from four consecutive blocks in one load. Therefore, it makes sense to write this function to find the matching block for four blocks in each call.
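
The row-packing step can be illustrated for a single 4x4 block with a short scalar sketch. It assumes row-major pixel buffers with caller-supplied strides; the helper names (`sad8`, `pack_two_rows`, `sad4x4_row_pairs`) are hypothetical and stand in for the unpack and PSADBW operations a SIMD version would use.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* SAD over 8 bytes: the work done by one half of PSADBW. */
static unsigned sad8(const uint8_t a[8], const uint8_t b[8])
{
    unsigned total = 0;
    for (int i = 0; i < 8; i++)
        total += (unsigned)abs((int)a[i] - (int)b[i]);
    return total;
}

/* Concatenate two 4-byte rows into 8 bytes, mimicking the unpack step. */
static void pack_two_rows(const uint8_t *row0, const uint8_t *row1,
                          uint8_t out[8])
{
    memcpy(out, row0, 4);
    memcpy(out + 4, row1, 4);
}

/* SAD of one 4x4 block, processed as two packed row pairs. */
static unsigned sad4x4_row_pairs(const uint8_t *cur, int cur_stride,
                                 const uint8_t *ref, int ref_stride)
{
    unsigned total = 0;
    for (int y = 0; y < 4; y += 2) {
        uint8_t c[8], r[8];
        pack_two_rows(cur + y * cur_stride, cur + (y + 1) * cur_stride, c);
        pack_two_rows(ref + y * ref_stride, ref + (y + 1) * ref_stride, r);
        total += sad8(c, r);
    }
    return total;
}
```

In a SIMD version, the same packing would be applied to four neighboring blocks at once, since one 16-byte load covers a row of each.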

The new MPSADBW instruction
The new MPSADBW instruction improves performance by computing eight sums of absolute differences in a single instruction. Each sum is computed from the absolute differences between two groups of four unsigned byte integers.

Figure 1 below shows how the eight sums are computed using MPSADBW.

MPSADBW takes an immediate as a third operand:

* Bits 0 and 1 of the immediate value are used to select one of the four groups of 4 bytes from the source operand.

* Bit 2 of the immediate value is used to select one of the two groups of 11 bytes from the destination operand.

Figure 1: Computing eight sums using MPSADBW[1]

In Figure 1 above, the box with the darkened solid outline is the selected block. The box with the darkened broken outline represents other blocks that could be selected by setting the corresponding bits in the immediate. While the ideal block size for this instruction is 4x4, other block sizes, such as 8x4, or 8x8, can also benefit from MPSADBW.
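
The selection rules for the immediate can be summarized in a scalar reference model. This is a sketch in portable C with a hypothetical function name; the instruction itself is exposed as the `_mm_mpsadbw_epu8` intrinsic in `<smmintrin.h>`.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of MPSADBW.
   imm bits 0-1: which 4-byte group of src is the block to match.
   imm bit 2:    which 11-byte window of dst is searched (offset 0 or 4).
   out[i] is the SAD of the src group against dst bytes at offset i. */
static void mpsadbw_model(const uint8_t dst[16], const uint8_t src[16],
                          unsigned imm, uint16_t out[8])
{
    unsigned src_off = 4u * (imm & 3u);
    unsigned dst_off = 4u * ((imm >> 2) & 1u);

    for (unsigned i = 0; i < 8; i++) {
        unsigned total = 0;
        for (unsigned j = 0; j < 4; j++) {
            int d = (int)dst[dst_off + i + j] - (int)src[src_off + j];
            total += (unsigned)abs(d);
        }
        out[i] = (uint16_t)total; /* at most 4 * 255 = 1020 */
    }
}
```

The highest destination byte read is at index 4 + 7 + 3 = 14, which is why each window spans 11 bytes.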

As shown in Figure 1, bits 0 and 1 of the immediate value are used to select a different 4-pixel group for the computation. To compute sums of absolute differences for block sizes that are multiples of 4x4, we repeat the MPSADBW operation using a different immediate value each time, and then add the results from the multiple MPSADBW operations using PADDUSW to yield the final results.
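
One way to read this for an 8-pixel-wide block is sketched below in portable C: each row is covered by two MPSADBW operations, one with immediate 0 (pixels 0 to 3 against window offset 0) and one with immediate 5 (pixels 4 to 7 against window offset 4), and the partial sums are accumulated with a saturating add. All function names are hypothetical, the choice of immediates is an assumption made for this illustration, and the reference buffer is assumed to have 16 readable bytes per row.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Scalar model of MPSADBW (imm bits 0-1: src group, bit 2: dst window). */
static void mpsadbw_model(const uint8_t dst[16], const uint8_t src[16],
                          unsigned imm, uint16_t out[8])
{
    unsigned src_off = 4u * (imm & 3u);
    unsigned dst_off = 4u * ((imm >> 2) & 1u);
    for (unsigned i = 0; i < 8; i++) {
        unsigned total = 0;
        for (unsigned j = 0; j < 4; j++)
            total += (unsigned)abs((int)dst[dst_off + i + j]
                                   - (int)src[src_off + j]);
        out[i] = (uint16_t)total;
    }
}

/* Scalar model of PADDUSW: per-lane unsigned add saturating at 65535. */
static void paddusw_model(uint16_t acc[8], const uint16_t x[8])
{
    for (int i = 0; i < 8; i++) {
        unsigned s = (unsigned)acc[i] + (unsigned)x[i];
        acc[i] = (uint16_t)(s > 0xFFFFu ? 0xFFFFu : s);
    }
}

/* SADs of an 8-wide, `rows`-tall block against eight horizontally
   adjacent candidate positions in the reference frame. */
static void sad8xN_eight_candidates(const uint8_t *cur, int cur_stride,
                                    const uint8_t *ref, int ref_stride,
                                    int rows, uint16_t sads[8])
{
    memset(sads, 0, 8 * sizeof sads[0]);
    for (int y = 0; y < rows; y++) {
        uint8_t c[16] = {0}; /* current-block row, upper 8 bytes unused */
        uint8_t r[16];       /* reference row holding the search window */
        uint16_t part[8];

        memcpy(c, cur + y * cur_stride, 8);
        memcpy(r, ref + y * ref_stride, 16);

        mpsadbw_model(r, c, 0x0u, part); /* pixels 0..3 */
        paddusw_model(sads, part);
        mpsadbw_model(r, c, 0x5u, part); /* pixels 4..7 */
        paddusw_model(sads, part);
    }
}
```

Calling the function with `rows` set to 4 or 8 covers the 8x4 and 8x8 block sizes mentioned earlier.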

Example: horizontal minimum search
After the sums are computed, you use the PHMINPOSUW instruction to locate the minimum from the computed SADs, as shown in the code example in Figure 2 below.

Figure 2: Code example for the Optimized Integer Block Matching function, finding the matching block for four 4x4 blocks in each call.[2]
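
Independently of the listing in Figure 2, the horizontal minimum search step can be sketched as a scalar reference model in portable C. The type and function names are hypothetical; the instruction is exposed as the `_mm_minpos_epu16` intrinsic in `<smmintrin.h>`, which packs the minimum and its index into the low bits of the result register rather than returning a struct.

```c
#include <stdint.h>

/* Result of a horizontal minimum search over eight unsigned words. */
typedef struct {
    uint16_t value; /* smallest of the eight inputs */
    unsigned index; /* position (0..7) of that value */
} min_pos;

/* Scalar model of PHMINPOSUW. On ties, the lowest index is reported. */
static min_pos phminposuw_model(const uint16_t v[8])
{
    min_pos best = { v[0], 0 };
    for (unsigned i = 1; i < 8; i++) {
        if (v[i] < best.value) {
            best.value = v[i];
            best.index = i;
        }
    }
    return best;
}
```

Applied to eight SADs from a motion search, `index` identifies the best of the eight candidate positions and `value` is its matching cost.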

