Using the new Intel Streaming SIMD Extensions (SSE4) for audio, video and image apps
A wide range of new applications is entering the mainstream of
desktop, server and portable/mobile computing, including data mining,
database management, complex search and pattern matching algorithms,
as well as compression algorithms for audio, video, images and data.
To perform such complex operations more efficiently, Intel has
extended the Intel Streaming SIMD Extensions 2 (Intel SSE2)
instruction set. The extensions provide a much more efficient means of
completing tasks such as motion vector search in video processing and
packed dword multiplies generated by compilers, and they improve
throughput between the CPU and graphics processor to benefit a wide
scope of applications. The extended instruction set is being released
as Intel SSE4.
This article provides a quick overview of the new extensions to the instruction set
architecture (ISA) and includes examples of how you can use the new
instructions to optimize video encoding functions. We will also look at
the new streaming load instruction, which can significantly improve the
performance of applications that share data between the graphics
processor and the CPU.
The 45-nanometer next-generation Intel Core 2 processor family (Penryn) includes
support for the Intel SSE4 instruction set as an extension to the Intel
64 Instruction Set Architecture (ISA). These new instructions deliver
performance gains for SIMD (single instruction, multiple data)
software and will enable the new family of microprocessors to deliver
improved performance and energy efficiency with a broad range of 32-bit
and 64-bit software.
54 new instructions
Intel SSE4 is a set of 54 new instructions designed to improve
performance and energy efficiency of media, 3D and gaming applications.
The Penryn microarchitecture supports 47 of these new instructions,
with the remainder being introduced on future processors. We can group
them into three areas:
1) Video accelerators include instructions to accelerate 4x4 sum of
absolute differences (SAD) calculations, sub-pixel filtering and
horizontal minimum search. The Intel SSE4 video accelerator
instructions include a SAD engine that can perform eight SAD
calculations at once. They also include an instruction for horizontal
minimum search that examines eight values and returns the minimum
value together with its index, as well as instructions for converting
packed integers into wider data types, allowing for faster
integer-to-float conversions in 3D applications.
2) Graphics building blocks include common graphics primitives
generalized for compiler auto-vectorization, such as packed dword
multiply and floating point dot products.
3) The Streaming Load instruction enables faster reads from
Uncacheable Speculative Write Combining (USWC) memory. When used in
conjunction with Intel SSE2 streaming writes, Streaming Load allows
for faster Memory Mapped I/O (MMIO).
Now we will take a closer look at how you can use Intel SSE4 to
improve performance and energy efficiency using new instructions for
Motion Vector Search and Streaming Loads.
Optimizing motion vector search
Motion estimation is one of the main bottlenecks in video encoders. It
involves searching reference frames for best matches, and it can
consume as much as 40 percent of the total CPU cycles used by the
encoder. Search quality is an important determinant of the quality of
the encoded video.
For this reason, algorithmic and SIMD optimizations designed to
improve encoding speed often target the search operation. SIMD
instructions provide an ideal means of optimizing motion estimation
because the required arithmetic operations are performed on blocks of
pixels with a high degree of parallelism.
The Intel SSE2 instruction PSADBW is widely used to optimize this
operation. PSADBW computes two sums of absolute differences from a
pair of operands, each holding 16 unsigned byte integers. One sum is
the result from the eight lower differences, while the other sum is
the result from the eight upper differences.
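To make the two-sum behavior concrete, here is a minimal scalar sketch of what the 128-bit PSADBW computes. It is a plain C model rather than the instruction itself, and the name psadbw_model is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar model (illustrative, hypothetical name) of the 128-bit PSADBW:
   sums[0] is the sum of absolute differences over bytes 0-7,
   sums[1] is the sum of absolute differences over bytes 8-15. */
static void psadbw_model(const uint8_t a[16], const uint8_t b[16],
                         uint16_t sums[2])
{
    assert(a != NULL && b != NULL && sums != NULL);
    for (int half = 0; half < 2; ++half) {
        uint16_t acc = 0;
        for (int i = 0; i < 8; ++i) {
            int diff = (int)a[8 * half + i] - (int)b[8 * half + i];
            acc = (uint16_t)(acc + abs(diff)); /* at most 8 * 255, no overflow */
        }
        sums[half] = acc;
    }
}
```

Each sum covers 8 bytes, which is why a 4-byte-wide row cannot fill one sum on its own.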
Because the width of a 4x4 block is only 4 bytes, a block-matching
function that uses PSADBW first unpacks two rows to concatenate two
4-byte data sets into 8 bytes. Each load fetches 16 consecutive bytes,
which covers one row from four consecutive blocks. It therefore makes
sense to write the function so that it finds the matching block for
four 4x4 blocks in each call.
The new MPSADBW instruction
The new MPSADBW instruction improves performance by computing eight
sums of absolute differences in a single instruction. Each sum is
computed from the absolute differences between two groups of four
unsigned byte integers.
Figure 1 below shows how the eight sums are computed using MPSADBW.
MPSADBW takes an immediate as a third operand:
* Bits 0 and 1 of the immediate value are used to select one of the
four groups of 4 bytes from the source operand.
* Bit 2 of the immediate value is used to select one of the two
groups of 11 consecutive bytes (starting at byte 0 or at byte 4) from
the destination operand.
Figure 1: Computing eight sums using MPSADBW [1]
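The selection rules for the immediate can also be written out as a scalar sketch. The C model below is only an illustration of the 128-bit behavior described above. The name mpsadbw_model is an assumption, and the check that the immediate fits in three bits is a restriction of this model, not a statement about the hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar model (illustrative, hypothetical name) of the 128-bit MPSADBW.
   dst: 11 consecutive bytes are used, starting at byte 0 or byte 4.
   src: one group of 4 bytes is used.
   imm: bits 0-1 pick the src group, bit 2 picks the dst starting byte. */
static void mpsadbw_model(const uint8_t dst[16], const uint8_t src[16],
                          unsigned imm, uint16_t sums[8])
{
    assert(imm < 8u); /* this model only looks at the low three bits */
    unsigned src_off = (imm & 3u) * 4u;        /* 0, 4, 8 or 12 */
    unsigned dst_off = ((imm >> 2) & 1u) * 4u; /* 0 or 4 */

    for (unsigned k = 0; k < 8; ++k) {   /* eight sliding positions */
        uint16_t acc = 0;
        for (unsigned i = 0; i < 4; ++i) {   /* four bytes per sum */
            int diff = (int)dst[dst_off + k + i] - (int)src[src_off + i];
            acc = (uint16_t)(acc + abs(diff));
        }
        sums[k] = acc;
    }
}
```

The highest destination byte read is 4 + 7 + 3 = 14, which matches the 11-byte window: eight starting positions, each four bytes wide.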
In Figure 1 above, the box
with the darkened solid outline is the selected block. The box with the
darkened broken outline represents other blocks that could be selected
by setting the corresponding bits in the immediate. While the ideal
block size for this instruction is 4x4, other block sizes, such as 8x4,
or 8x8, can also benefit from MPSADBW.
As shown in Figure 1, Bits 0 and 1 of the immediate value are used
to select a different 4-pixel group for the computation. To compute
sums of absolute difference for block sizes that are multiples of 4x4,
we repeat the MPSADBW operation using a different immediate value each
time, and then add the results from the multiple MPSADBW operations
using PADDUSW to yield the final results.
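As a sketch of this accumulation step, the scalar C model below computes the SADs of one 8-pixel row against eight candidate positions by combining two MPSADBW-style results with a saturating add. The helper names (mpsadbw_ref, paddusw_ref, sad_row8) and the choice of immediates 0 and 5 are assumptions made for illustration. A real encoder would repeat this for every row of the block.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Compact scalar stand-in for MPSADBW (hypothetical helper):
   imm bits 0-1 pick the 4-byte src group, bit 2 picks dst byte 0 or 4. */
static void mpsadbw_ref(const uint8_t dst[16], const uint8_t src[16],
                        unsigned imm, uint16_t out[8])
{
    assert(imm < 8u); /* model restriction: low three bits only */
    unsigned s = (imm & 3u) * 4u, d = ((imm >> 2) & 1u) * 4u;
    for (unsigned k = 0; k < 8; ++k) {
        out[k] = 0;
        for (unsigned i = 0; i < 4; ++i)
            out[k] = (uint16_t)(out[k] +
                     abs((int)dst[d + k + i] - (int)src[s + i]));
    }
}

/* Scalar stand-in for PADDUSW: unsigned 16-bit add with saturation. */
static void paddusw_ref(uint16_t acc[8], const uint16_t add[8])
{
    for (int k = 0; k < 8; ++k) {
        uint32_t sum = (uint32_t)acc[k] + add[k];
        acc[k] = (uint16_t)(sum > 0xFFFFu ? 0xFFFFu : sum);
    }
}

/* SADs of one 8-pixel row of the current block (cur[0..7]) against the
   eight candidate positions ref[k..k+7], k = 0..7. */
static void sad_row8(const uint8_t ref[16], const uint8_t cur[16],
                     uint16_t sads[8])
{
    uint16_t right[8];
    mpsadbw_ref(ref, cur, 0u, sads);  /* pixels 0-3: group 0, ref byte 0 */
    mpsadbw_ref(ref, cur, 5u, right); /* pixels 4-7: group 1, ref byte 4 */
    paddusw_ref(sads, right);         /* combine the two partial sums */
}
```

The second call sets bit 2 so that the reference window moves by four bytes in step with the source group, keeping both partial sums aligned to the same candidate position.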
Example: horizontal minimum search
After the sums are computed, you use the PHMINPOSUW instruction to
locate the minimum from the computed SADs, as shown in the code example
in Figure 2 below.
Figure 2: Code example for the optimized integer block matching
function, finding the matching block for four 4x4 blocks in each
call. [2]
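For readers who want to see the minimum search in isolation, the following scalar C sketch models what PHMINPOSUW returns for eight unsigned 16-bit SAD values. The struct and function names are assumptions made for illustration. The model reports the lowest index when several values tie for the minimum.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Result of the horizontal minimum search (hypothetical type). */
typedef struct {
    uint16_t value;  /* smallest of the eight inputs */
    unsigned index;  /* position of that value, 0..7 */
} min_pos;

/* Scalar model (illustrative, hypothetical name) of PHMINPOSUW:
   scans eight unsigned 16-bit values and returns the minimum and its
   index. The strict '<' keeps the first (lowest) index on ties. */
static min_pos phminposuw_model(const uint16_t v[8])
{
    assert(v != NULL);
    min_pos best = { v[0], 0u };
    for (unsigned k = 1; k < 8; ++k) {
        if (v[k] < best.value) {
            best.value = v[k];
            best.index = k;
        }
    }
    return best;
}
```

In a motion search, the index identifies which of the eight candidate positions produced the smallest SAD, so it maps directly to a motion vector offset.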