Using the new Intel Streaming SIMD Extensions (SSE4) for audio, video and image apps

Jeremy Saldate, Intel - November 16, 2007

A wide range of new applications is entering the mainstream of desktop, server and portable/mobile computing, including data mining, database management, complex search and pattern matching algorithms, as well as compression algorithms for audio, video, images and data.

To perform such complex operations more efficiently, the Intel Streaming SIMD Extensions 2 (Intel SSE2) instruction set has been extended with a much more efficient means of completing tasks such as motion vector search in video processing and packed dword multiplies generated by compilers, and of improving throughput between the CPU and graphics processor, to benefit a wide scope of applications. This improved instruction set is being released as Intel SSE4.

This article provides a quick overview of the new extensions to the instruction set architecture (ISA) and includes examples of how you can use the new instructions to optimize video encoding functions. We will also look at the new streaming load instruction, which can significantly improve the performance of applications that share data between the graphics processor and the CPU.

The 45 nanometer next generation Intel Core 2 processor family (Penryn) includes support for the Intel SSE4 instruction set as an extension to the Intel 64 Instruction Set Architecture (ISA). These new instructions deliver performance gains for SIMD (single instruction, multiple data) software and will enable the new family of microprocessors to deliver improved performance and energy efficiency with a broad range of 32-bit and 64-bit software.

54 new instructions
Intel SSE4 is a set of 54 new instructions designed to improve performance and energy efficiency of media, 3D and gaming applications. The Penryn microarchitecture supports 47 of these new instructions, with the remainder being introduced on future processors. We can group them into three areas:

1) Video accelerators include instructions to accelerate 4x4 sum of absolute differences (SAD) calculations, sub-pixel filtering and horizontal minimum search. The Intel SSE4 video accelerator instructions include a sum of absolute differences engine that can perform eight SAD calculations at once.

New instructions also include an instruction for horizontal minimum search that can look at eight values and identify the minimum value and an index of that minimum value, as well as instructions for converting packed integers into wider data types, allowing for faster integer-to-float conversions in 3D applications.

2) Graphics building blocks include common graphics primitives generalized for compiler auto-vectorizations, such as packed dword multiply and floating point dot products.

3) The Streaming Load instruction enables faster reads from Uncacheable Speculative Write Combining (USWC) memory. When used in conjunction with Intel SSE2 streaming writes, Streaming Load allows for faster Memory Mapped I/O (MMIO).

Now we will take a closer look at how you can use Intel SSE4 to improve performance and energy efficiency using new instructions for Motion Vector Search and Streaming Loads.

Optimizing motion vector search
Motion estimation is one of the main bottlenecks in video encoders. It involves searching reference frames for best matches, and it can consume as much as 40 percent of the total CPU cycles used by the encoder. Search quality is also an important determinant of the quality of the encoded video.

For this reason, algorithmic and SIMD optimizations designed to improve encoding speed often target the search operation. SIMD instructions provide an ideal means of optimizing motion estimation because the required arithmetic operations are performed on blocks of pixels with a high degree of parallelism.

The Intel SSE2 instruction PSADBW is widely used to optimize this operation. PSADBW computes the sum of absolute differences from a pair of 16 unsigned byte integers. One sum is the result from the eight lower differences, while the other sum is the result from the eight upper differences.
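To make the two-sum behavior concrete, here is a minimal scalar sketch in C of what PSADBW computes on 128-bit operands. It is a reference model only, not SIMD code, and the function name psadbw_model is a hypothetical name chosen for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference model of PSADBW on 128-bit operands.
   Illustrative only: psadbw_model is a hypothetical name, and a real
   implementation would issue the instruction (or its intrinsic) instead. */
void psadbw_model(const uint8_t a[16], const uint8_t b[16], uint16_t sums[2])
{
    for (int half = 0; half < 2; ++half) {
        unsigned total = 0;
        /* Each half covers eight byte pairs. */
        for (int i = 0; i < 8; ++i)
            total += (unsigned)abs((int)a[half * 8 + i] - (int)b[half * 8 + i]);
        sums[half] = (uint16_t)total;   /* at most 8 * 255 = 2040 */
    }
}
```

In the real instruction, each sum lands in the low 16 bits of its 64-bit half of the destination register; the model simply returns the two sums in an array.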

A block-matching function built on the PSADBW instruction can find the matching blocks for four 4x4 blocks in each call. Since the width of a 4x4 block is only 4 bytes, in order to use this instruction, two rows are first unpacked to concatenate two 4-byte data sets into 8 bytes.

Since each load gets 16 consecutive bytes, data is loaded from four consecutive blocks in one load. Therefore, it makes sense to write this function to find the matching block for four blocks in each call.

The new MPSADBW instruction
The new MPSADBW instruction improves performance by computing eight sums of absolute differences in a single instruction. Each sum is computed from the absolute differences between two groups of four unsigned byte integers.

Figure 1 below shows how the eight sums are computed using MPSADBW.

MPSADBW takes an immediate as a third operand:

* Bits 0 and 1 of the immediate value are used to select one of the four groups of 4 bytes from the source operand.

* Bit 2 of the immediate value is used to select one of the two groups of 11 bytes from the destination operand.

Figure 1: Computing eight sums using MPSADBW[1]

In Figure 1 above, the box with the darkened solid outline is the selected block. The box with the darkened broken outline represents other blocks that could be selected by setting the corresponding bits in the immediate. While the ideal block size for this instruction is 4x4, other block sizes, such as 8x4, or 8x8, can also benefit from MPSADBW.
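The operand selection described above can be modeled in scalar C. The sketch below is a reference model under the assumption that the immediate is decoded exactly as the two bullets state; mpsadbw_model is a hypothetical name, and the code is meant for checking results rather than for performance.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference model of MPSADBW (illustrative; hypothetical name).
   dst  : the 16-byte destination operand (11 bytes of it are read)
   src  : the 16-byte source operand (4 bytes of it are read)
   imm8 : bits 0-1 select the 4-byte source group, bit 2 selects the
          11-byte destination group */
void mpsadbw_model(const uint8_t dst[16], const uint8_t src[16],
                   unsigned imm8, uint16_t sums[8])
{
    unsigned src_off = (imm8 & 3u) * 4u;          /* 0, 4, 8 or 12 */
    unsigned dst_off = ((imm8 >> 2) & 1u) * 4u;   /* 0 or 4 */

    for (unsigned i = 0; i < 8; ++i) {
        unsigned total = 0;
        /* Slide a 4-byte window over the destination group. */
        for (unsigned j = 0; j < 4; ++j)
            total += (unsigned)abs((int)dst[dst_off + i + j] -
                                   (int)src[src_off + j]);
        sums[i] = (uint16_t)total;                /* at most 4 * 255 */
    }
}
```

The eight window positions need bytes dst_off through dst_off + 10, which is where the 11-byte group size comes from.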

As shown in Figure 1, Bits 0 and 1 of the immediate value are used to select a different 4-pixel group for the computation. To compute sums of absolute difference for block sizes that are multiples of 4x4, we repeat the MPSADBW operation using a different immediate value each time, and then add the results from the multiple MPSADBW operations using PADDUSW to yield the final results.
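As a rough illustration of this repetition, the scalar sketch below computes the SAD of an 8-pixel-wide block at eight horizontal candidate positions. It assumes immediates 0 and 5 cover pixels 0 to 3 and 4 to 7 of each row, and it folds the PADDUSW-style saturating add into the helper. The names mpsadbw_accumulate and sad_8wide are hypothetical, and this is not the white paper's implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One modeled MPSADBW step followed by a saturating 16-bit accumulate,
   standing in for MPSADBW + PADDUSW (hypothetical helper). */
static void mpsadbw_accumulate(const uint8_t *dst, const uint8_t *src,
                               unsigned imm8, uint16_t acc[8])
{
    unsigned src_off = (imm8 & 3u) * 4u;
    unsigned dst_off = ((imm8 >> 2) & 1u) * 4u;

    for (unsigned i = 0; i < 8; ++i) {
        unsigned total = acc[i];
        for (unsigned j = 0; j < 4; ++j)
            total += (unsigned)abs((int)dst[dst_off + i + j] -
                                   (int)src[src_off + j]);
        acc[i] = (uint16_t)(total > 0xFFFFu ? 0xFFFFu : total);
    }
}

/* SAD of an 8-wide, rows-high block at 8 horizontal candidate positions.
   Each reference row must have bytes 0..14 readable; each current-block
   row must have bytes 0..7 readable. */
void sad_8wide(const uint8_t *cur, const uint8_t *ref, size_t stride,
               int rows, uint16_t sad[8])
{
    for (int i = 0; i < 8; ++i)
        sad[i] = 0;

    for (int r = 0; r < rows; ++r) {
        const uint8_t *c = cur + (size_t)r * stride;
        const uint8_t *f = ref + (size_t)r * stride;
        mpsadbw_accumulate(f, c, 0u, sad);   /* pixels 0..3 of the row */
        mpsadbw_accumulate(f, c, 5u, sad);   /* pixels 4..7 of the row */
    }
}
```

After the call, sad[i] holds the block SAD for the candidate shifted i pixels to the right of the reference pointer.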

Example: horizontal minimum search
After the sums are computed, you use the PHMINPOSUW instruction to locate the minimum from the computed SADs, as shown in the code example in Figure 2 below.

Figure 2: Code example for the Optimized Integer Block Matching function - finding the matching block for four 4x4 blocks in each call.[2]
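Independently of the white paper's code, the horizontal minimum search itself can be sketched in scalar C. The model below returns the smallest of eight unsigned 16-bit values together with its index, reporting the lowest index when values tie. The type min_pos and the function phminposuw_model are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Result of a horizontal minimum search (hypothetical type). */
typedef struct {
    uint16_t value;   /* the minimum of the eight inputs */
    unsigned index;   /* position (0..7) of that minimum */
} min_pos;

/* Scalar reference model of PHMINPOSUW (illustrative only). */
min_pos phminposuw_model(const uint16_t v[8])
{
    min_pos best = { v[0], 0u };
    for (unsigned i = 1; i < 8; ++i) {
        /* Strict comparison keeps the lowest index on ties. */
        if (v[i] < best.value) {
            best.value = v[i];
            best.index = i;
        }
    }
    return best;
}
```

Applied to the eight SADs produced by one MPSADBW pass, the index identifies which of the eight candidate positions matched best.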

Fast SAD calculations
MPSADBW is very useful for fast SAD calculations. It calculates and sums 32 absolute differences per instruction, which is double the yield of PSADBW. You can use the MPSADBW instruction to compute sums of absolute differences, not only for 4-byte wide blocks, but also for 8-byte wide and 16-byte wide blocks.

The built-in source shifts allow the developer to avoid many of the unaligned loads or ALIGN instructions which would otherwise be required, contributing to the overall efficiency and performance of the application.

Table 1 below shows the speed-ups from these optimizations. The results are expressed as cycles per block SAD that we computed. The speed-up column contains ratios that we computed using the Intel SSE2 results as the baseline. We tested three different block sizes, 4x4, 8x8 and 16x16.

Table 1: Speed-up results for motion vector search, measured in number of cycles per block SAD. Intel SSE4 can provide 1.6x to 3.8x faster performance than Intel SSE2[3]. (The Intel Compiler 10.0.018 beta was used to build the code, with the 'O2' and 'QxS' compiler flags. 'QxS' is a new flag that directs the compiler to generate code optimized specifically for Penryn.)

In addition to the Intel SSE4 instructions, multi-threaded implementations can help you achieve additional performance gains in compute-intensive applications such as video codecs.

Memory Mapped I/O devices
UC (uncacheable) memory is memory whose data is not stored in any of the processor's caches. Uncacheable Speculative Write Combining (USWC) memory is an extension of UC memory for uncacheable data that is typically subject to sequential write operations, such as writes to frame buffer memory.

As its name indicates, write-combining allows all writes to a cache line to be combined before they are written out to memory. USWC memory is often used in Memory Mapped I/O (MMIO) devices.

Intel SSE2 introduced the MOVNTDQ instruction for streaming writes. It improves write throughput by streaming uncacheable writes to devices that are memory-mapped to USWC memory. The streaming write instruction tells the processor that the target data is to be written directly to external memory and not cached into any level of the processor's cache hierarchy.

Streaming writes allow writes to the same cache line (typically 64 bytes) to be combined before going out to memory, as opposed to writing to memory in 16-byte chunks. This substantially improves bus utilization and write throughput.

The new MOVNTDQA instruction
Although USWC memory is primarily employed for memory that is subject to write operations (such as frame buffer memory), load operations may also occur.

For example, today's sophisticated graphics devices can perform rendering operations while the CPU performs other tasks. The CPU can then load in the data for further processing and display. One problem with this is that SSE load instructions operate on a maximum of 16-byte chunks and have limited throughput when accessing USWC memory. The operation requires two separate front-side bus transactions and consumes four bus clocks.

To relieve this bottleneck, Intel SSE4 introduces the new MOVNTDQA instruction for streaming loads, improving read throughput by streaming uncacheable reads from USWC memory. MOVNTDQA allows the fetching of a 16-byte chunk within an aligned cache line of USWC memory. Similar to streaming writes, the streaming load instruction allows data to be moved in full 64-byte cache line quantities as opposed to a 16-byte chunk, substantially improving bus throughput.

Loading a full cache line
Fetching a complete 64-byte cache line and temporarily holding the contents in a streaming load buffer enables the supply of subsequent loads without using the front side bus (FSB), which makes these loads much faster. Figure 3 provides a code example.

Figure 3: Code example of loading a full cache line using the MOVNTDQA instruction[4]

For streaming across multiple line addresses, loads of all four chunks of a given line should be batched together. Note that the loading of each chunk using MOVNTDQA instructions does not have to be in the order shown in the example.
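A minimal C sketch of batching the four chunks of one line is shown below. It is not the white paper's Figure 3 code. It uses the _mm_stream_load_si128 intrinsic from smmintrin.h when the compiler targets SSE4.1 and falls back to memcpy otherwise, so that it builds anywhere. The function name load_cache_line is hypothetical, and the alignment requirements in the comment are assumptions of this sketch.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>   /* SSE4.1 intrinsics, including _mm_stream_load_si128 */
#endif

/* Copy one 64-byte line from src to dst (hypothetical helper).
   Assumptions: src is 64-byte aligned and ideally mapped as USWC memory;
   dst is 16-byte aligned, ordinary cacheable (WB) memory. */
void load_cache_line(void *dst, const void *src)
{
#if defined(__SSE4_1__)
    __m128i *s = (__m128i *)src;
    __m128i *d = (__m128i *)dst;

    /* Batch all four 16-byte chunks of the line together (MOVNTDQA). */
    __m128i c0 = _mm_stream_load_si128(s + 0);
    __m128i c1 = _mm_stream_load_si128(s + 1);
    __m128i c2 = _mm_stream_load_si128(s + 2);
    __m128i c3 = _mm_stream_load_si128(s + 3);

    /* Ordinary aligned stores into the cacheable destination. */
    _mm_store_si128(d + 0, c0);
    _mm_store_si128(d + 1, c1);
    _mm_store_si128(d + 2, c2);
    _mm_store_si128(d + 3, c3);
#else
    memcpy(dst, src, 64);   /* portable fallback without streaming loads */
#endif
}
```

On ordinary write-back memory the streaming hint has no benefit and the loads behave like normal aligned loads, so any speed-up should only be expected when src really is USWC memory.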

It is also important to note that a streaming load of a given chunk will cause a new streaming load buffer allocation if one does not currently exist. Because there are a finite number of streaming load buffers within any given micro-architectural implementation, grouping the chunks together will generally improve overall utilization.

Streaming Load programming models
There are two common programming models for using streaming loads: Bulk Load and Operate, and Incremental Load and Operate.

Bulk Load and Operate. Here the application loads the data using streaming loads and copies it as a bulk load to a temporary cacheable (WB) buffer. After all the data has been loaded, the CPU operates on the temporary buffer and then sends the data back to memory.

Incremental Load and Operate. In this model the application loads a single cache line using streaming load, operates on the data, and then writes it back to memory. This model operates on the data as it is loaded, as opposed to loading a large amount of data first, as is the case with the Bulk Load and Operate model.

The Bulk Load and Operate programming model generates more consistent performance gains. In the Incremental Load and Operate programming model, the data operations performed between streaming loads could interfere with streaming load performance due to contention for buffers and other processor resources. The Bulk Load and Operate programming model reduces this possibility by performing data loading and operations in separate batches.
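The difference between the two models is mostly one of loop structure, which the following C sketch tries to show. Everything here is illustrative: the function names are hypothetical, inverting each byte stands in for the real per-line work, the write-back uses memcpy rather than streaming stores for brevity, and all buffers are assumed to be 64-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>
#endif

enum { LINE = 64 };   /* cache line size assumed by this sketch */

/* Streaming-load one line into a cacheable buffer (memcpy fallback). */
static void stream_load_line(uint8_t *dst, const uint8_t *src)
{
#if defined(__SSE4_1__)
    for (int k = 0; k < 4; ++k)
        _mm_store_si128((__m128i *)dst + k,
                        _mm_stream_load_si128((__m128i *)src + k));
#else
    memcpy(dst, src, LINE);
#endif
}

/* Hypothetical per-line operation: invert every byte. */
static void invert_line(uint8_t *line)
{
    for (int i = 0; i < LINE; ++i)
        line[i] = (uint8_t)~line[i];
}

/* Bulk Load and Operate: all loads first, then all operations. */
void bulk_load_and_operate(uint8_t *buf, size_t lines, uint8_t *tmp)
{
    for (size_t n = 0; n < lines; ++n)            /* phase 1: loads only */
        stream_load_line(tmp + n * LINE, buf + n * LINE);
    for (size_t n = 0; n < lines; ++n) {          /* phase 2: operate, write back */
        invert_line(tmp + n * LINE);
        memcpy(buf + n * LINE, tmp + n * LINE, LINE);
    }
}

/* Incremental Load and Operate: load, operate and write back per line. */
void incremental_load_and_operate(uint8_t *buf, size_t lines)
{
    uint8_t raw[2 * LINE];
    uint8_t *line = (uint8_t *)(((uintptr_t)raw + LINE - 1) &
                                ~(uintptr_t)(LINE - 1));
    for (size_t n = 0; n < lines; ++n) {
        stream_load_line(line, buf + n * LINE);
        invert_line(line);
        memcpy(buf + n * LINE, line, LINE);
    }
}
```

Both functions leave the same bytes in buf. The bulk version keeps the streaming loads back to back, which is the property the article credits for its more consistent gains, at the cost of a temporary buffer as large as the data.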

Figure 4: Memory throughput from system memory using Streaming Load - Intel SSE4 can lead to a 5.0x to 7.5x speed increase over traditional loads[5]. Specifically, Memory Throughput = (Processor Frequency * Number of Iterations * Data Copied per Iteration) / Total Execution Time (in clock cycles). For dual-threaded implementations, the memory throughput was calculated for each thread and then added together. The test system consisted of a next generation Intel Core 2 desktop processor (Wolfdale), Intel D975XBX2KR motherboard, 2 GB DDR2 RAM PC2-8000 (667 MHz), Windows* XP Professional with Service Pack 2.

Performance gains
To measure the performance of streaming loads from local system memory, we used four implementations of the Bulk Load and Operate programming model to load 4KB of data from USWC memory and copy the data to a cacheable WB buffer:

* One implementation without streaming loads using the MOVDQA instruction

* Another implementation using streaming loads with the new MOVNTDQA instruction

* Two additional implementations consisting of dual-threaded versions that split the USWC segment into two parts and performed the load and copy in two threads, each bound to a separate core.

For each implementation, the 4KB load and copy loop executed approximately 10,000 iterations, and we calculated the average memory throughput by measuring the total time required.
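The throughput calculation given with Figure 4 can be written as a small C helper. This is only a restatement of that formula; the function names are hypothetical, and any numbers passed in are up to the reader, since the article does not list the raw cycle counts.

```c
/* Memory throughput in bytes per second:
   (processor frequency * iterations * bytes per iteration) / total cycles.
   Hypothetical helper names; inputs are supplied by the caller. */
double memory_throughput(double cpu_hz, double iterations,
                         double bytes_per_iteration, double total_cycles)
{
    return cpu_hz * iterations * bytes_per_iteration / total_cycles;
}

/* Dual-threaded case: the per-thread throughputs are added together. */
double combined_throughput(double thread0, double thread1)
{
    return thread0 + thread1;
}
```

For instance, with a hypothetical 3.0 GHz clock, 10,000 iterations of 4096 bytes and a hypothetical total of 3.0e9 cycles, the helper reports 40,960,000 bytes per second.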

Figure 4 above shows the memory throughput improvements. In the single-threaded implementation, using streaming loads increased memory throughput by more than 5x. In the dual-threaded implementation, streaming loads increased memory throughput by more than 7.5x.

Achieving the benefits
Intel SSE4 can improve the performance and energy efficiency of a broad range of applications, including media editors, encoders, 3D applications and games. Of course, performance gains will vary by workload and application.

So what do you need to do to incorporate the benefits of Intel SSE4 instructions into your applications?

Intel provides a variety of tools and performance libraries optimized for Intel SSE4. One such tool is the Intel C++ Compiler 10.0, which supports automatic vectorization.

The vectorization process parallelizes code by analyzing loops to determine when it may be appropriate to execute several iterations of the loop in parallel by utilizing MMX, SSE, Intel SSE2, SSE3 and Intel SSE4 instructions. Vectorization may be useful as a way to optimize application code and take advantage of new extensions when running applications on Intel processors.

While some applications might achieve performance gains by simply recompiling with this compiler, you will obtain maximum gains by manually optimizing your applications. Some of the highest-value Intel SSE4 instructions will require careful integration using intrinsics or assembly, and may require algorithm changes. For example, achieving the benefits of streaming load will require manual integration, but the payback will be significantly improved performance.

To get started, visit the Intel Software Network. This site includes white papers and a downloadable Software Developers Kit (SDK) for Penryn and Intel SSE4. More information on Intel SSE4 instruction set innovation, and on architecture and silicon technology, is available on Intel's web site.

Jeremy Saldate is Senior Technical Marketing Engineer, Software and Solutions Group, Intel Corp.

[1], [2], [3] Motion Estimation with Intel Streaming SIMD Extensions 4 (Intel SSE4), White Paper. Intel Software Solutions Group, 2007

[4], [5] Increasing Memory Throughput With Intel Streaming SIMD Extensions 4 (Intel SSE4) Streaming Load, White Paper. Intel Software Solutions Group, 2007