Using the new Intel Streaming SIMD Extensions (SSE4) for audio, video and image apps
Performance gains
To measure the performance of streaming loads from local system memory,
we used four implementations of the Bulk Load and Operate programming
model to load 4KB of data from USWC memory and copy the data to a
cacheable WB buffer:
* One implementation without streaming loads using the MOVDQA
instruction
* Another implementation using streaming loads with the new MOVNTDQA
instruction
* Two dual-threaded versions of the implementations above, which
split the USWC segment into two parts and performed the load and
copy in two threads, each bound to a separate core.
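As a rough illustration of the first two implementations, the sketch
below copies one 4KB block with ordinary aligned loads (MOVDQA) and
with streaming loads (MOVNTDQA) through the `_mm_stream_load_si128`
intrinsic. It is a minimal sketch, not the code used for the
measurements: the function names are hypothetical, it assumes GCC or
Clang on an x86 processor with SSE4.1, and it assumes the caller has
already obtained an aligned source buffer. Mapping a USWC region is
platform-specific and is not shown; on ordinary write-back memory the
streaming hint has no special effect, so no speedup should be expected
there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_store_si128 */
#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */

#define BLOCK_BYTES 4096  /* block size used in the test described above */

/* Baseline copy using ordinary aligned 16-byte loads (MOVDQA).
   Assumes 16-byte-aligned pointers and bytes % 16 == 0. */
__attribute__((target("sse2")))
void copy_block_movdqa(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)src & 15) == 0 && ((uintptr_t)dst & 15) == 0);
    assert(bytes % 16 == 0);

    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; ++i)
        _mm_store_si128(d + i, _mm_load_si128(s + i));
}

/* Streaming-load copy (MOVNTDQA). The four loads of each iteration
   cover one 64-byte-aligned line before any store is issued.
   Assumes 64-byte-aligned src, 16-byte-aligned dst, bytes % 64 == 0. */
__attribute__((target("sse4.1")))
void copy_block_movntdqa(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)src & 63) == 0 && ((uintptr_t)dst & 15) == 0);
    assert(bytes % 64 == 0);

    /* Some compilers declare the intrinsic with a non-const pointer. */
    __m128i *s = (__m128i *)src;
    __m128i *d = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; i += 4) {
        __m128i r0 = _mm_stream_load_si128(s + i + 0);
        __m128i r1 = _mm_stream_load_si128(s + i + 1);
        __m128i r2 = _mm_stream_load_si128(s + i + 2);
        __m128i r3 = _mm_stream_load_si128(s + i + 3);
        _mm_store_si128(d + i + 0, r0);
        _mm_store_si128(d + i + 1, r1);
        _mm_store_si128(d + i + 2, r2);
        _mm_store_si128(d + i + 3, r3);
    }
}
```

Both functions produce the same destination contents; only the load
instruction differs. A dual-threaded variant would call the same
routine on each half of the source segment from two threads.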
For each implementation, the 4KB load and copy loop executed
approximately 10,000 iterations, and we calculated the average memory
throughput by measuring the total time required.
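The averaging step can be sketched as follows. This is an assumed
harness, not the one used for the published numbers: the article does
not say which timer was used, so the sketch uses the standard C
`clock()` function, reports decimal megabytes per second, and uses a
plain `memcpy` wrapper (a hypothetical stand-in) as the copy routine
under test.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define BLOCK_BYTES 4096   /* bytes copied per iteration */
#define ITERATIONS  10000  /* approximate iteration count from the text */

typedef void (*copy_fn)(void *dst, const void *src, size_t bytes);

/* Stand-in copy routine; a real test would plug in the routine
   being measured. */
void copy_with_memcpy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Average throughput in MB/s (1 MB = 10^6 bytes):
   total bytes moved divided by total elapsed time. */
double throughput_mb_per_s(size_t bytes_per_iter, size_t iterations,
                           double seconds)
{
    assert(seconds > 0.0);
    return ((double)bytes_per_iter * (double)iterations) / (seconds * 1.0e6);
}

/* Time ITERATIONS copies of one block and return the average
   throughput, or 0.0 if the timer was too coarse to register
   any elapsed time. */
double measure_copy_throughput(copy_fn copy)
{
    static unsigned char src[BLOCK_BYTES];
    static unsigned char dst[BLOCK_BYTES];

    clock_t start = clock();
    for (size_t i = 0; i < ITERATIONS; ++i)
        copy(dst, src, BLOCK_BYTES);
    clock_t end = clock();

    double seconds = (double)(end - start) / (double)CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;
    return throughput_mb_per_s(BLOCK_BYTES, ITERATIONS, seconds);
}
```

For example, 10,000 iterations of 4,096 bytes completed in 0.01
seconds correspond to about 4,096 MB/s. Dividing the throughput of one
implementation by that of another gives the kind of speedup ratio
quoted for the tests.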
Figure 4, shown earlier, illustrates the memory throughput
improvements. In the single-threaded implementation, streaming loads
increased memory throughput by more than 5x. In the dual-threaded
implementation, they increased it by more than 7.5x.
Achieving the benefits
Intel SSE4 can improve the performance and energy efficiency of a broad
range of applications, including media editors, encoders, 3D
applications and games. Of course, performance gains will vary by
workload and application.
So what do you need to do to incorporate the benefits of Intel SSE4
instructions into your applications?
Intel provides a variety of tools and performance libraries
optimized for Intel SSE4. One such tool is the Intel C++ Compiler 10.0,
which supports automatic vectorization.
The vectorization process parallelizes code by analyzing loops to
determine when it may be appropriate to execute several iterations of
the loop in parallel by utilizing MMX, SSE, Intel SSE2, SSE3 and Intel
SSE4 instructions. Vectorization may be useful as a way to optimize
application code and take advantage of new extensions when running
applications on Intel processors.
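To make this concrete, here is a minimal sketch of the kind of loop a
vectorizing compiler can often handle without source changes. The
function name and the pixel-averaging workload are illustrative
assumptions, not taken from Intel's documentation. Each iteration is
independent of the others, and the `restrict` qualifiers tell the
compiler that the arrays do not overlap, which removes two common
obstacles to vectorization.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Average two rows of 8-bit pixels, rounding up:
   out[i] = (a[i] + b[i] + 1) / 2.
   No iteration reads a value written by another, so the compiler is
   free to process several elements at once with packed instructions. */
void average_rows(uint8_t *restrict out,
                  const uint8_t *restrict a,
                  const uint8_t *restrict b,
                  size_t n)
{
    assert(out != NULL && a != NULL && b != NULL);

    for (size_t i = 0; i < n; ++i)
        out[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}
```

Whether the loop is actually vectorized, and with which instruction
set, depends on the compiler, its version and the optimization options
used (for example, `-O3` with GCC or Clang). Most compilers can emit a
vectorization report that confirms the outcome.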
While some applications might achieve performance gains by simply
recompiling with this compiler, you will obtain maximum gains by
manually optimizing your applications. Some of the highest-value Intel
SSE4 instructions require careful integration using intrinsics or
assembly code, and may require algorithm changes. For example,
achieving the benefits of streaming loads requires manual
integration, but the payoff is significantly improved performance.
To get started, visit the Intel Software Network. This site includes
white papers and a downloadable Software Developers Kit (SDK) for
Penryn and Intel SSE4. More information on Intel SSE4 instruction set
innovation, and on architecture and silicon technology, is available
on Intel's web site.
Jeremy Saldate is Senior Technical
Marketing Engineer, Software and Solutions Group, Intel Corp.
References
[1], [2], [3] Motion
Estimation with Intel Streaming SIMD Extensions 4 (Intel SSE4),
White Paper. Intel Software Solutions Group, 2007
[4], [5] Increasing
Memory Throughput With Intel Streaming SIMD Extensions 4 (Intel SSE4)
Streaming Load, White Paper. Intel Software Solutions Group, 2007